Almanac
technique

Tapered Language Models

techniqueactiveprovisionaltapered-language-models-759157f5·1 events·first seen 2d ago

Aliases: Tapered Language Models

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.LG·2d ago·source ↗

Tapered Language Models: front-loading parameter capacity improves perplexity at no extra cost

Researchers introduce Tapered Language Models (TLMs), an architectural principle that allocates more parameter capacity to earlier layers and less to later layers via a cosine-scheduled MLP width taper, under a fixed total budget. Controlled experiments across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans) show consistent perplexity and downstream benchmark improvements over uniform-width baselines. The finding reframes depth-uniform parameter allocation — a default inherited from the original transformer — as a suboptimal choice, offering a free architectural lever applicable across modern LM families.