paper
Variable-Width Transformers
paper · active · provisional
variable-width-transformers-20799a4c · 1 event · first seen 11h ago
Aliases: Variable-Width Transformers
More like this (12)
Training-Free Looped Transformers
transformer architecture
Dynamic Short Convolutions Improve Transformers
Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
Swift Transformers
ViT (Vision Transformer)
autoregressive transformer
feed-forward transformer
Sparse Transformer
Conditional Diffusion Transformer
Differential Transformer
Transformers (library)
Recent events (1)
Variable-Width Transformers: X-shaped architecture outperforms uniform-width baselines with 22% fewer FLOPs
Researchers propose the ><former (X-shaped transformer), a decoder-only architecture with wider early and late layers and narrower middle layers, implemented via a parameter-free residual resizing mechanism. Evaluated on dense models from 200M to 2B parameters and at the 3B MoE scale, the architecture consistently outperforms parameter-matched uniform-width baselines on language modeling loss. Under fitted scaling curves, the design yields a 22% reduction in FLOPs and a 15% reduction in KV cache memory, suggesting that nonuniform width allocation is a viable path to more compute-efficient language models.
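
To make the wide-narrow-wide idea concrete, here is a minimal Python sketch. It is not the paper's method: the summary above does not say how the width schedule is chosen or how the residual stream is resized. The linear schedule in `x_shaped_widths`, the truncate-or-zero-pad rule in `resize_residual`, and all function names are hypothetical, chosen only because they add no learned parameters. `approx_block_params` uses the common rough estimate of 12·d² weights per standard transformer block (4·d² for attention, 8·d² for a 4x MLP).

```python
def x_shaped_widths(num_layers, d_outer, d_inner):
    """Hypothetical schedule: width falls linearly from d_outer at the first
    layer to d_inner at the middle, then rises back to d_outer."""
    if num_layers < 2:
        return [d_outer] * num_layers
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        frac = abs(i - mid) / mid  # 1.0 at both ends, 0.0 at the center
        widths.append(round(d_inner + frac * (d_outer - d_inner)))
    return widths


def resize_residual(x, new_width):
    """Assumed parameter-free resize of one residual vector:
    truncate when narrowing, zero-pad when widening."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


def approx_block_params(d):
    """Rough weight count for a standard block of width d (no biases or norms)."""
    return 12 * d * d


widths = x_shaped_widths(num_layers=5, d_outer=8, d_inner=4)
print(widths)  # [8, 6, 4, 6, 8]

# Pass a residual vector through the schedule, resizing between layers.
x = [1.0] * widths[0]
for w in widths[1:]:
    x = resize_residual(x, w)
print(len(x))  # 8

# Compare against a uniform-width stack of the same depth.
print(sum(approx_block_params(w) for w in widths))  # 2592
print(5 * approx_block_params(8))                   # 3840
```

Because per-block weights grow quadratically with width, narrowing the middle layers frees a large share of the budget in this toy count. A parameter-matched comparison like the one reported above would reinvest that budget elsewhere, which this sketch does not attempt.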