technique
Hope-attention
technique · active · provisional
hope-attention-10bdb3f7 · 1 event · first seen 2d ago
Aliases: Hope-attention
Recent events (1)
Tapered Language Models: front-loading parameter capacity improves perplexity at no extra cost
Researchers introduce Tapered Language Models (TLMs), an architectural principle that allocates more parameter capacity to earlier layers and less to later layers via a cosine-scheduled MLP width taper, under a fixed total parameter budget. Controlled experiments across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans) show consistent improvements in perplexity and downstream benchmarks over uniform-width baselines. The finding reframes depth-uniform parameter allocation, a default inherited from the original transformer, as a suboptimal choice, and offers an architectural lever that adds no parameters and applies across modern LM families.
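To make the idea of a cosine-scheduled width taper concrete, here is a minimal sketch of how per-layer MLP widths could be derived under a fixed budget. It is an illustration only, not the authors' implementation: the summary does not give the exact schedule, so the function name tapered_mlp_widths, the taper parameter, and the half-cosine form are all assumptions. It also assumes the model dimension is the same in every layer, so that MLP parameter count is proportional to MLP width and keeping the sum of widths constant keeps the MLP budget constant.

```python
import math


def tapered_mlp_widths(n_layers, base_width, taper=0.5):
    """Per-layer MLP widths following a half-cosine taper, widest first.

    Hypothetical helper, not the paper's code. `taper` (0 <= taper < 1) is an
    assumed knob: layer 0 gets about (1 + taper) * base_width and the last
    layer about (1 - taper) * base_width.
    """
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    if not 0.0 <= taper < 1.0:
        raise ValueError("taper must be in [0, 1)")
    if n_layers == 1:
        return [base_width]

    widths = []
    for i in range(n_layers):
        t = i / (n_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        # cos(pi * t) runs from +1 to -1 and sums to zero over the layers,
        # so the multipliers average to 1 and the budget is preserved.
        scale = 1.0 + taper * math.cos(math.pi * t)
        widths.append(int(base_width * scale))  # round down to an integer

    # Rounding down can leave a few units unspent; hand them to the earliest
    # layers so the total matches the uniform baseline exactly.
    remainder = n_layers * base_width - sum(widths)
    for k in range(remainder):
        widths[k % n_layers] += 1
    return widths


# Example: 12 layers whose uniform baseline would use width 2048 everywhere.
widths = tapered_mlp_widths(n_layers=12, base_width=2048, taper=0.5)
print(widths)
print(sum(widths) == 12 * 2048)
```

Setting taper to 0 recovers the uniform-width baseline, which makes the sketch convenient for a like-for-like comparison at equal total width.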