5 · arXiv cs.CL (Computation and Language) · 8d ago

Adaptive asymmetric token compression accelerates time series language models up to 7.68×

A new arXiv preprint proposes an adaptive token budgeting framework for time series (TS) language models that compresses TS tokens using frequency-domain structure and progressively prunes prompt tokens across model layers. The authors demonstrate up to 7.68× inference acceleration with performance improvements in 78% of evaluated settings across forecasting, classification, imputation, and anomaly detection tasks. The work is motivated by the observation that TS tokens have uneven spectral contributions and prompt-token influence attenuates with model depth, making uniform token processing wasteful.
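The two ideas in the summary (ranking TS tokens by spectral contribution, and shrinking the prompt-token budget with depth) can be illustrated with a minimal sketch. This is not the authors' implementation: the importance score `spectral_energy`, the linear schedule in `prompt_budget`, and the `min_keep` floor are all hypothetical choices made for illustration, and the summary does not say how the paper actually scores or prunes tokens.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for short patches."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def spectral_energy(patch):
    """Hypothetical importance score: energy in all non-DC frequency bins."""
    return sum(abs(c) ** 2 for c in dft(patch)[1:])

def keep_top_patches(patches, budget):
    """Indices of the `budget` highest-energy patches, returned in original order."""
    ranked = sorted(range(len(patches)),
                    key=lambda i: spectral_energy(patches[i]), reverse=True)
    return sorted(ranked[:budget])

def prompt_budget(num_prompt_tokens, layer, num_layers, min_keep=0.25):
    """Assumed linear schedule: keep all prompt tokens at layer 0 and
    a `min_keep` fraction at the last layer."""
    if num_layers <= 1:
        return num_prompt_tokens
    frac = 1.0 - (1.0 - min_keep) * layer / (num_layers - 1)
    return max(1, round(num_prompt_tokens * frac))
```

Under these assumptions, flat patches score near zero and are dropped first, while oscillating patches survive; a real system would likely use an FFT library and a learned or data-driven budget rather than a fixed linear schedule.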

Long Context Evolution · Inference Economics · Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

Related guides (2)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

5 · arXiv · cs.AI · 11d ago

CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup

Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14x–1.29x speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.
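The division of labor described above (backbone head for the first token, MTP heads for the rest, a single linear layer deciding the span) can be sketched roughly as follows. This is an assumption-laden illustration, not the paper's code: the summary does not state what the linear layer's inputs or outputs are, so `predict_span_length` treating it as a classifier over span lengths, and the names `backbone_head` and `mtp_heads`, are hypothetical.

```python
def predict_span_length(hidden, weights, bias):
    """One linear layer producing a logit per candidate span length (1, 2, 3, ...)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return logits.index(max(logits)) + 1  # argmax index 0 means span length 1

def decode_step(hidden, backbone_head, mtp_heads, weights, bias):
    """Emit one span: the backbone token first, then MTP tokens up to the predicted length."""
    span = min(predict_span_length(hidden, weights, bias), 1 + len(mtp_heads))
    tokens = [backbone_head(hidden)]                      # first token: always the backbone
    tokens += [head(hidden) for head in mtp_heads[:span - 1]]  # later tokens: MTP heads only
    return tokens
```

Because the backbone head is never bypassed, a predicted span of 1 falls back to ordinary one-token decoding in this sketch.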

Inference Economics · Qwen2.5 · Alibaba · CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference · +2 more

6 · arXiv · cs.CL · 29d ago

ToaST: Tokenization with Split Trees Reduces Token Count by 11%+ Over BPE/WordPiece/UnigramLM

ToaST (Tokenization with Split Trees) is a new subword tokenization method that uses a recursive binary split-tree inference procedure and Integer Programming-based vocabulary selection to directly optimize compression. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, effectively extending context length for models using it. In 1.5B parameter LM training experiments, ToaST achieves the highest CORE benchmark score, outperforming baselines by 2.6%–7.6% across 22 tasks. The LP relaxation of the vocabulary selection IP is near-integral in practice, yielding provably near-optimal vocabularies.
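To make "recursive binary split-tree inference" concrete, here is a minimal sketch of one way such a procedure could look. It is an assumption, not ToaST's algorithm: the summary does not say how split points are chosen, so this version simply picks the binary split that yields the fewest tokens, and falls back to single characters when nothing else matches. The function name `split_tree_tokenize` is hypothetical.

```python
from functools import lru_cache

def split_tree_tokenize(word, vocab):
    """Recursively split `word` in two until every leaf is in `vocab`
    (or is a single character), preferring splits with fewer leaves."""
    @lru_cache(maxsize=None)
    def solve(piece):
        if piece in vocab or len(piece) == 1:
            return (piece,)                      # leaf of the split tree
        best = None
        for i in range(1, len(piece)):           # try every binary split point
            cand = solve(piece[:i]) + solve(piece[i:])
            if best is None or len(cand) < len(best):
                best = cand
        return best
    return list(solve(word))
```

With a hypothetical vocabulary `{"token", "ization", "iz", "ation"}`, the word `"tokenization"` resolves to two leaves rather than three, which is the kind of token-count reduction the summary describes.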

Long Context Evolution · Frontier Model Releases · Byte Pair Encoding (BPE) · UnigramLM · Renyi efficiency · +5 more

5 · arXiv · cs.CL · 9d ago

SKIM: Adaptive soft-token compression for procedural skills in LLM workflows

Researchers introduce SKIM (SKIll coMpression), a multi-resolution soft token compression framework targeting procedural knowledge (skills/workflows) rather than factual documents. SKIM compresses reusable natural language skills to 30–60% of their original token length while preserving task performance, reducing prefill cost and latency when skills are repeatedly invoked. The method adapts compression depth to skill complexity and supports offline compression for frequently updated community skills.

Inference Economics · Agent and Tool Ecosystem · Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models · SKIM (SKIll coMpression)

6 · The Batch · 19d ago

Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs

Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.
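The numbers in this summary imply a simple schedule: one weight update per 1,000-token chunk, applied only to the last quarter of the layers. The sketch below works through that arithmetic. The helper names are hypothetical, and treating a trailing partial chunk as its own update is an assumption the summary does not confirm.

```python
def trainable_layers(num_layers):
    """Indices of the last quarter of the network's layers."""
    start = num_layers - num_layers // 4
    return list(range(start, num_layers))

def chunk_bounds(context_len, chunk=1_000):
    """(start, end) token offsets per chunk; one weight update per chunk.
    Assumes a trailing partial chunk still gets an update."""
    return [(s, min(s + chunk, context_len)) for s in range(0, context_len, chunk)]
```

For a 128,000-token context this gives 128 update steps, while attention at any position still spans at most the 8,000-token sliding window; for a hypothetical 24-layer model, only layers 18 through 23 would have their fully connected weights updated.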

Training Infrastructure · Long Context Evolution · University of California San Diego · Mamba · Stanford University · +13 more

4 · arXiv · cs.CL · 2d ago

CADE framework proposes direct timestep embedding and contrastive alignment for time-series question answering

A new arXiv preprint introduces CADE (Contrastive Alignment with Direct Embedding), a framework for time-series question answering (TSQA) that bypasses the tokenization bottleneck of standard LLMs by mapping each timestep directly into the LLM embedding space via a point-wise linear encoder and MLP projector. The approach also introduces a one-directional supervised contrastive loss to align time-series embeddings with frozen class-name text anchors, bridging the semantic gap between numerical and language representations. Evaluated on the Time-MQA benchmark across six TSQA tasks, CADE outperforms both open-source and proprietary LLM baselines. The work addresses a concrete limitation of patch-based encoders (fixed granularity and poor cross-dataset transfer) with a cleaner architectural alternative.
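One plausible reading of a "one-directional supervised contrastive loss" against frozen text anchors is a softmax cross-entropy over similarities, computed only in the time-series-to-text direction. The sketch below follows that reading. It is an assumption rather than CADE's published formulation: the cosine similarity, the `temperature` value, and the function names are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def one_directional_loss(ts_embeddings, labels, anchors, temperature=0.1):
    """Mean cross-entropy of each TS embedding against the frozen class anchors.
    Only the TS -> text direction is scored; the anchors are never updated."""
    total = 0.0
    for emb, y in zip(ts_embeddings, labels):
        logits = [cosine(emb, a) / temperature for a in anchors]
        m = max(logits)                                   # stabilize log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[y]                        # -log softmax at the true class
    return total / len(labels)
```

In this sketch the loss is near zero when an embedding points at its own class anchor and grows as it drifts toward another class's anchor, which is the alignment pressure the summary describes.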

Evaluation and Benchmarking · Multimodal Progress · Time-MQA · Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering · CADE

5 · arXiv · cs.CL · 11d ago

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
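The greedy, attention-penalized selection described here might look something like the following sketch. The scoring rule is an assumption: the summary says candidates that attend strongly to already-selected uncertain positions are penalized softly, but does not give the formula, so the discount term, the `penalty` weight, and the use of `1 - confidence` as uncertainty are hypothetical.

```python
def adas_select(confidence, attention, k, penalty=1.0):
    """Greedily pick up to k positions to commit in parallel.

    confidence[i]   : model confidence for the candidate token at position i
    attention[i][j] : attention weight from position i to position j
    A candidate is discounted for attending to already-selected positions,
    weighted by how uncertain those positions are.
    """
    selected = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < k:
        def score(i):
            discount = sum(attention[i][j] * (1.0 - confidence[j]) for j in selected)
            return confidence[i] - penalty * discount
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `penalty=0` this reduces to plain top-k by confidence; a positive penalty makes the sampler skip a confident candidate that depends heavily on a shaky, already-committed neighbor, a soft penalty rather than a hard constraint.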

Frontier Model Releases · Inference Economics · LLaDA-8B-Base · MATH500 · EB-Sampler · +6 more

3 · Simon Willison's Weblog · 1mo ago

How fast is 10 tokens per second really?

Simon Willison offers commentary on the practical perception of LLM inference speed, specifically examining what 10 tokens per second means to end users. The piece contextualizes token generation rates against human reading speed and usability thresholds. This is a qualitative analysis relevant to understanding inference economics and user experience expectations for deployed language models.
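A back-of-the-envelope conversion helps put a token rate next to a reading speed. The constants below are rough assumptions, not figures from the post: about 0.75 English words per token is a common rule of thumb, and 250 words per minute is used here as a nominal adult reading speed.

```python
def words_per_minute(tokens_per_second, words_per_token=0.75):
    """Convert a generation rate to words per minute (rule-of-thumb ratio)."""
    return tokens_per_second * words_per_token * 60

def speed_ratio(tokens_per_second, reading_wpm=250, words_per_token=0.75):
    """How many times faster than the assumed reading speed the model generates."""
    return words_per_minute(tokens_per_second, words_per_token) / reading_wpm
```

Under these assumptions, 10 tokens per second works out to roughly 450 words per minute, a bit under twice the nominal reading speed. The ratio shifts noticeably for code or non-English text, where tokens per word differ.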

Inference Economics · Simon Willison

4 · arXiv · cs.AI · 12d ago

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working memory framework using conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on historical trajectory and a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves 68.3% success rate (>6% improvement) with 3.3x speedup and 2.7x memory reduction over uncompressed processing.
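The core compression step in any VQ-VAE is replacing each continuous feature vector with the index of its nearest codebook entry. The sketch below shows only that generic step. It is not COMPACT-VA's model: the conditioning on trajectory and planning intent described in the summary is omitted, and the function name `quantize` is hypothetical.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). Unconditional; conditioning is omitted here."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in features]
```

Storing one small integer per feature vector instead of the vector itself is where the memory reduction in this family of methods comes from; a conditional variant would additionally feed context into the encoder or codebook selection.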

Long Context Evolution · Inference Economics · conditional VQ-VAE · Planning-aligned Token Compression for Long-Context Autonomous Driving · COMPACT-VA