model

GDN-2

model · active · provisional · gdn-2-a50482bd · 1 event · first seen 3d ago

Aliases: GDN-2

Co-occurring entities

WikiText-2 · CARVE · WY-form triangular chunk solver · RULER

More like this (12)

GAD-7 · Gated DeltaNet-2 · NDCG · R-NaD · SGSD · nDCG@10 · GPT-2-small · P-K-GCN · GDELT · HPDv2 · MDM-VGB · ASL-2

Recent events (1)

6 · arXiv · cs.CL · 3d ago

CARVE: Content-aware gating for linear attention recurrent models improves efficiency and quality over GDN-2

CARVE (Content-Aware Recurrent with Value Efficiency) is a new linear attention architecture that addresses three coupled defects in the GDN-2 delta-rule architecture by restricting erasure to the key axis rather than the value axis. This restriction is proven necessary and sufficient for the WY-form triangular chunk solver to apply, which keeps training throughput competitive with Transformers. At 1.3B parameters trained on 100B tokens, CARVE achieves lower perplexity than GDN-2, leads recurrent baselines on nine commonsense reasoning benchmarks, and sets a new state of the art on RULER retrieval probes, while using 13% less peak memory and 19% fewer parameters at a 0.4% throughput overhead.
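For readers unfamiliar with the terms, the following is a minimal sketch of a plain, ungated delta-rule state update and an equivalent chunk-level computation by forward substitution on a unit lower-triangular system. It is an assumption-laden illustration, not the GDN-2 or CARVE update: the summary does not give either formula, so the generic textbook form S ← S(I − βkkᵀ) + βvkᵀ is used, where the erase term acts along the key direction. The function names `delta_rule_sequential` and `delta_rule_chunk` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def delta_rule_sequential(K, V, beta, S0):
    """Token-by-token delta rule (generic form, assumed for illustration).

    K: T rows of length d_k, V: T rows of length d_v, beta: T scalars,
    S0: d_v x d_k state. Each step applies
        S <- S (I - beta k k^T) + beta v k^T
    which equals S + beta (v - S k) k^T.
    """
    S = [row[:] for row in S0]
    for k, v, b in zip(K, V, beta):
        pred = [dot(row, k) for row in S]  # S k, the value currently stored under k
        for i in range(len(S)):
            for j in range(len(k)):
                S[i][j] += b * (v[i] - pred[i]) * k[j]
    return S


def delta_rule_chunk(K, V, beta, S0):
    """Same result for a whole chunk, via a triangular solve.

    Writing S_t = S0 + sum_{s<=t} u_s k_s^T gives
        u_t = beta_t (v_t - S0 k_t - sum_{s<t} (k_t . k_s) u_s),
    a unit lower-triangular system in the rows u_t, solved here by
    forward substitution.
    """
    d_v = len(S0)
    U = []
    for t in range(len(K)):
        # Residual of the target value against the chunk's initial state.
        r = [V[t][i] - dot(S0[i], K[t]) for i in range(d_v)]
        # Subtract the contributions of earlier tokens in the chunk.
        for s in range(t):
            g = dot(K[t], K[s])
            r = [ri - g * ui for ri, ui in zip(r, U[s])]
        U.append([beta[t] * ri for ri in r])
    # Apply all rank-one writes at once: S = S0 + U^T K.
    S = [row[:] for row in S0]
    for u, k in zip(U, K):
        for i in range(d_v):
            for j in range(len(k)):
                S[i][j] += u[i] * k[j]
    return S
```

Both functions return the same state for the same inputs. The chunk form is the kind of computation that can be batched as matrix products, which is the usual motivation for a triangular chunk solver in delta-rule models.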

Training Infrastructure Long Context Evolution WikiText-2 CARVE GDN-2