technique

Hope-attention

technique · active · provisional · hope-attention-10bdb3f7 · 1 event · first seen 2d ago

Aliases: Hope-attention

More like this (12)

Lightning Attention · positional attention heads · global attention · bidirectional attention · reference attention · cross-attention · symbolic attention heads · Functional Attention · optimism bias · ProbSparse Attention · Attention-based MIL · attention head circuit

Recent events (1)

6 · arXiv · cs.LG · 2d ago

Tapered Language Models: front-loading parameter capacity improves perplexity at no extra cost

Researchers introduce Tapered Language Models (TLMs), an architectural principle that allocates more parameter capacity to earlier layers and less to later layers via a cosine-scheduled MLP width taper, under a fixed total budget. Controlled experiments across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans) show consistent perplexity and downstream benchmark improvements over uniform-width baselines. The finding reframes depth-uniform parameter allocation, a default inherited from the original transformer, as a suboptimal choice, and offers an architectural lever that adds no parameters and applies across modern LM families.
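To make the idea of a cosine-scheduled width taper under a fixed budget concrete, here is a minimal Python sketch. It is an illustration only, not the paper's method: the function name `tapered_mlp_widths`, the `taper_ratio` parameter (last-layer width relative to first-layer width), and the half-cosine shape are all assumptions, since the summary does not give the exact schedule. The sketch holds the summed MLP width equal to that of a uniform baseline, which keeps MLP parameters fixed when the model dimension is the same in every layer.

```python
import math


def tapered_mlp_widths(num_layers, uniform_width, taper_ratio=0.5):
    """Per-layer MLP widths that shrink along a half-cosine from first to last layer.

    Hypothetical helper for illustration. The summed width equals
    num_layers * uniform_width, i.e. the budget of a uniform-width baseline.
    taper_ratio is the assumed ratio of last-layer width to first-layer width.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 < taper_ratio <= 1.0:
        raise ValueError("taper_ratio must be in (0, 1]")

    # Unnormalized shape: 1.0 at the first layer, taper_ratio at the last.
    shape = []
    for i in range(num_layers):
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        cos_weight = 0.5 * (1.0 + math.cos(math.pi * t))  # goes from 1 to 0
        shape.append(taper_ratio + (1.0 - taper_ratio) * cos_weight)

    # Rescale so the widths sum to the uniform baseline's total.
    budget = num_layers * uniform_width
    scale = budget / sum(shape)
    widths = [math.floor(s * scale) for s in shape]

    # Hand the units lost to rounding to the earliest layers,
    # which keeps the sequence non-increasing.
    leftover = budget - sum(widths)
    for i in range(leftover):
        widths[i % num_layers] += 1
    return widths


if __name__ == "__main__":
    widths = tapered_mlp_widths(num_layers=12, uniform_width=3072)
    print(widths)
    print(sum(widths) == 12 * 3072)
```

Setting `taper_ratio=1.0` recovers the uniform baseline, which gives a convenient control when comparing the two allocations at the same budget.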

Training Infrastructure · Frontier Model Releases · Titans · Hope-attention · Tapered Language Models