Almanac
paper

End-to-End Context Compression at Scale

paperactiveprovisionalend-to-end-context-compression-at-scale-0b48ce35·1 events·first seen 8d ago

Aliases: End-to-End Context Compression at Scale

Co-occurring entities

More like this (12)

Recent events (1)

7arXiv · cs.CL·8d ago·source ↗

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.