Variable-Width Transformers

paper · active · provisional · variable-width-transformers-20799a4c · 1 event · first seen 11h ago

Aliases: Variable-Width Transformers

Recent events (1)

5 · arXiv · cs.CL · 11h ago

Variable-Width Transformers: X-shaped architecture outperforms uniform-width baselines with 22% fewer FLOPs

Researchers propose the ><former (X-shaped transformer), a decoder-only architecture that uses wider early and late layers with narrower middle layers, implemented via a parameter-free residual resizing mechanism. Evaluated at scales from 200M to 2B dense parameters and 3B MoE, the architecture consistently outperforms parameter-matched uniform-width baselines on language modeling loss. Under fitted scaling curves, the design yields a 22% reduction in FLOPs and a 15% reduction in KV cache memory relative to those baselines, suggesting that nonuniform width allocation is a viable path to more compute-efficient language models.
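To make the wide-narrow-wide idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary does not say how widths are scheduled or how the residual stream is resized, so the linear schedule in `x_shaped_widths` and the truncate-or-zero-pad rule in `resize_residual` are both assumptions, and both function names are hypothetical. The sketch only shows that a residual vector can change width between layers without any learned parameters.

```python
def x_shaped_widths(n_layers, d_outer, d_inner):
    """Hypothetical schedule: widest at the first and last layer, narrowest in the middle."""
    mid = (n_layers - 1) / 2
    widths = []
    for i in range(n_layers):
        # frac is 1.0 at both ends of the stack and 0.0 at the centre layer
        frac = abs(i - mid) / mid if mid > 0 else 1.0
        widths.append(int(round(d_inner + frac * (d_outer - d_inner))))
    return widths


def resize_residual(h, d_new):
    """Assumed parameter-free resize: truncate to shrink, zero-pad to grow."""
    return h[:d_new] + [0.0] * max(0, d_new - len(h))


widths = x_shaped_widths(n_layers=5, d_outer=8, d_inner=4)

# Residual stream for a single token, starting at the first layer's width.
h = [1.0] * widths[0]
for d in widths:
    h = resize_residual(h, d)
    # A transformer block operating at width d would update h here.
```

With these toy settings the schedule narrows from 8 to 4 and widens back to 8. Under the assumed truncation rule, the dimensions dropped in the middle of the stack return as zeros, so any real design would need the later blocks to repopulate them.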