technique
Tapered Language Models
technique · active · provisional
tapered-language-models-759157f5 · 1 event · first seen 2d ago
Aliases: Tapered Language Models
Co-occurring entities
More like this (12)
Language Model Finetuning
Transformer Language Models
multi-turn language models
Language Modeling Loss
Language Models are Few-Shot Learners
1B-scale language models
Reasoning Language Models
Arithmetic Pedagogy for Language Models
Diffusion Language Models
Latent Context Language Models
encoder-only language models
Scaling Laws for Neural Language Models
Recent events (1)
Tapered Language Models: front-loading parameter capacity improves perplexity at no extra cost
Researchers introduce Tapered Language Models (TLMs), an architectural principle that allocates more parameter capacity to earlier layers and less to later layers via a cosine-scheduled MLP width taper, under a fixed total parameter budget. Controlled experiments across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans) show consistent perplexity and downstream benchmark improvements over uniform-width baselines. The finding reframes depth-uniform parameter allocation, a default inherited from the original transformer, as a suboptimal choice, and offers an architectural lever with no added parameter cost that is applicable across modern LM families.
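The summary above does not give the exact schedule, so the following is only a minimal sketch of what a cosine-scheduled MLP width taper under a fixed budget could look like. The function name `tapered_mlp_widths`, the `min_ratio` parameter, and the specific half-cosine form are assumptions made for illustration, not the authors' formula. The sketch also assumes the model dimension is the same in every layer, so that MLP parameter count is proportional to MLP width and keeping the sum of widths fixed keeps the MLP budget fixed.

```python
import math


def tapered_mlp_widths(n_layers, uniform_width, min_ratio=0.5):
    """Return per-layer MLP widths that shrink with depth along a half cosine.

    Hypothetical illustration: the widths sum to n_layers * uniform_width,
    i.e. the same total as a uniform-width baseline. min_ratio is the assumed
    ratio of the last layer's width to the first layer's width before rounding.
    """
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    if n_layers == 1:
        return [uniform_width]

    # Unnormalized cosine weights: 1.0 at the first layer, min_ratio at the last.
    raw = [
        min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * i / (n_layers - 1)))
        for i in range(n_layers)
    ]

    # Rescale so the total matches the uniform baseline's total width.
    budget = n_layers * uniform_width
    scale = budget / sum(raw)
    widths = [round(w * scale) for w in raw]

    # Absorb any rounding drift in the first (widest) layer so the sum is exact.
    widths[0] += budget - sum(widths)
    return widths


if __name__ == "__main__":
    for layer, width in enumerate(tapered_mlp_widths(12, 3072)):
        print(f"layer {layer:2d}: mlp width {width}")
```

In this sketch the early layers end up wider than the uniform baseline and the late layers narrower, while the total stays unchanged. In practice one would probably also round each width to a hardware-friendly multiple, which is omitted here for brevity.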