ToaST: Tokenization with Split Trees Reduces Token Count by 11%+ Over BPE/WordPiece/UnigramLM
ToaST (Tokenization with Split Trees) is a new subword tokenization method that combines a recursive binary split-tree inference procedure with integer programming (IP)-based vocabulary selection to optimize compression directly. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, which effectively extends the context length of models that use it. In 1.5B-parameter LM training experiments, ToaST achieves the highest CORE benchmark score, outperforming baselines by 2.6%–7.6% across 22 tasks. The linear programming (LP) relaxation of the vocabulary selection IP is near-integral in practice, yielding provably near-optimal vocabularies.
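To make the context-length claim concrete, the arithmetic can be sketched as follows. This is a back-of-the-envelope illustration, not something taken from the paper: the function names and the 8,192-token window are hypothetical, and the 11% figure is treated as a uniform reduction across all text, which real corpora will only approximate.

```python
def effective_context_multiplier(token_reduction: float) -> float:
    """Factor by which a fixed token window covers more text when the
    tokenizer emits `token_reduction` (a fraction) fewer tokens for it."""
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    # The same text needs (1 - r) times as many tokens, so a fixed window
    # holds 1 / (1 - r) times as much text.
    return 1.0 / (1.0 - token_reduction)


def equivalent_baseline_tokens(window: int, token_reduction: float) -> int:
    """Baseline-tokenizer tokens needed to cover the text that fits in
    `window` tokens of the more compact tokenizer."""
    return round(window * effective_context_multiplier(token_reduction))


# Assumed example values: an 11% reduction and a hypothetical 8,192-token window.
multiplier = effective_context_multiplier(0.11)
print(f"{multiplier:.3f}x more text per window")
print(equivalent_baseline_tokens(8192, 0.11), "baseline tokens for the same text")
```

Under these assumptions an 11% token reduction corresponds to roughly 12% more text per window, slightly more than the headline percentage because the saving compounds against the smaller token count.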
Related events (8)
ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation
This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.
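A lower-bound certificate of this kind is usually read as a relative optimality gap. The sketch below shows one common way to compute it, assuming the objective is a cost to be minimised (for example, total token count). The summary does not state ConvexTok's exact objective or gap definition, so the function name, the formula's normalisation, and the numbers are illustrative assumptions.

```python
def relative_optimality_gap(achieved: float, lower_bound: float) -> float:
    """Relative gap between an achieved cost and a certified lower bound,
    assuming a minimisation objective (this normalisation is an assumption)."""
    if lower_bound <= 0:
        raise ValueError("lower_bound must be positive")
    if achieved < lower_bound:
        raise ValueError("achieved cost cannot fall below a valid lower bound")
    # The true optimum lies between lower_bound and achieved, so the gap to
    # the optimum is at most the gap to the lower bound.
    return (achieved - lower_bound) / lower_bound


# Hypothetical numbers: a vocabulary that encodes a corpus in 1,005,000 tokens
# against a certified lower bound of 1,000,000 tokens.
gap = relative_optimality_gap(1_005_000, 1_000_000)
print(f"gap = {gap:.2%}")  # within the 1% threshold when gap <= 0.01
```

Because the bound sits below the unknown optimum, a gap under 1% against the bound implies the solution is also within 1% of optimal.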
Adaptive asymmetric token compression accelerates time series language models up to 7.68×
A new arXiv preprint proposes an adaptive token budgeting framework for time series (TS) language models that compresses TS tokens using frequency-domain structure and progressively prunes prompt tokens across model layers. The authors demonstrate up to 7.68× inference acceleration with performance improvements in 78% of evaluated settings across forecasting, classification, imputation, and anomaly detection tasks. The work is motivated by the observation that TS tokens have uneven spectral contributions and prompt-token influence attenuates with model depth, making uniform token processing wasteful.
CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup
Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14×–1.29× speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.
ATWU: Token-level importance learning improves LLM unlearning via retain-conflict criterion
This paper introduces Alternating Token-Weighted Unlearning (ATWU), a framework that learns which tokens in a forget sample are most relevant to unlearning by characterizing their conflict with the retain objective. Rather than relying on auxiliary models or heuristics, ATWU jointly learns token forget-specificity and model parameters using a lightweight linear scorer over hidden states. Evaluated on TOFU and RWKU benchmarks, ATWU achieves state-of-the-art forget-retain trade-offs and produces token-level scores that align with ground-truth forget-specific spans.
UniAudio-Token: Semantic Speech Tokenizer with General Audio Perception for Audio-LLMs
UniAudio-Token is a framework from Tencent that extends semantic speech tokenizers—commonly used as interfaces for Audio-LLMs—to support general audio perception without sacrificing speech quality. It introduces two mechanisms: Semantic-Acoustic Primitives (SAP) for structured supervision decomposing audio into linguistic, vocal, and auditory-scene components, and Semantic-Acoustic Equilibrium (SAE), a content-aware gating mechanism that restores fine-grained acoustic details from shallow layers. Evaluations show it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks when integrated with downstream LLMs. Code, training/inference scripts, and model checkpoints are publicly released.
TokenPilot: Dual-granularity context management cuts LLM agent inference costs by up to 87%
TokenPilot is a cache-efficient context management framework for LLM agents that addresses the trade-off between token sparsity and prompt cache continuity. It combines Ingestion-Aware Compaction (global prefix stabilization) with Lifecycle-Aware Eviction (local segment offloading) to reduce inference costs by 56–87% across benchmarks while maintaining competitive task performance. The system is evaluated on PinchBench and Claw-Eval and has been integrated into the open-source LightMem2 library.
Attention Expansion mechanism improves keyphrase extraction from long documents without full-context LLMs
Researchers propose an 'attention expansion' mechanism that augments pre-trained language model token representations with information from out-of-context chunks using static word embeddings, enabling more effective keyphrase extraction from long documents. The approach avoids the computational cost of full-document attention or LLM-based inference while expanding the effective contextual scope of PLM-based models. Evaluated across five PLM backbones and five benchmark corpora, the method consistently improves F1 scores over state-of-the-art baselines in both scientific and news domains.
Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference
PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.


