5 · arXiv cs.LG (Machine Learning) · 29d ago

ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation

This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.
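For readers unfamiliar with the bits-per-byte (BpB) metric mentioned above, here is a minimal sketch of how it is commonly computed: the total negative log-likelihood a model assigns to a text, converted to bits, divided by the text's length in UTF-8 bytes. Normalising by bytes rather than tokens is what makes the number comparable across tokenisers. The function name `bits_per_byte` and the example values are illustrative assumptions, not taken from the paper.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Convert per-token negative log-likelihoods (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    # nats -> bits: divide by ln(2)
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes


# Hypothetical example: 3 tokens, each costing ln(4) nats (2 bits),
# covering a 6-byte string, so 6 bits are spread over 6 bytes.
text = "abcdef"
nll = [math.log(4)] * 3
bpb = bits_per_byte(nll, text)
```

Under this definition, a tokeniser that covers the same bytes with fewer tokens only lowers BpB if the model's total log-likelihood does not get worse in the process.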

Evaluation and Benchmarking Byte Pair Encoding (BPE) linear programming bits-per-byte (BpB) Unigram tokenisation ConvexTok

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 29d ago

ToaST: Tokenization with Split Trees Reduces Token Count by 11%+ Over BPE/WordPiece/UnigramLM

ToaST (Tokenization with Split Trees) is a new subword tokenization method that uses a recursive binary split-tree inference procedure and Integer Programming-based vocabulary selection to directly optimize compression. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, effectively extending context length for models using it. In 1.5B-parameter LM training experiments, ToaST achieves the highest CORE benchmark score, outperforming baselines by 2.6%–7.6% across 22 tasks. The LP relaxation of the vocabulary selection IP is near-integral in practice, yielding provably near-optimal vocabularies.

Long Context Evolution Frontier Model Releases Byte Pair Encoding (BPE) UnigramLM Renyi efficiency +5 more

6 · arXiv · cs.LG · 22d ago

HullFT: Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

HullFT is a new method for test-time finetuning (TTFT) of language models that addresses the dual bottlenecks of retrieval quality and per-query finetuning cost. It represents query embeddings as sparse convex combinations of training sequences using Frank-Wolfe optimization, yielding diverse and relevant support sets without expensive diversity-aware search. A geometric integerization step converts fractional weights into integer multiplicities, enabling a Gradient Reuse scheme that amortizes forward-backward computation across repeated examples. Experiments show improved quality-efficiency tradeoffs over prior TTFT methods, measured in bits-per-byte at lower total runtime.
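To make the "sparse convex combination via Frank-Wolfe" idea concrete, the following is a generic textbook sketch, not HullFT's implementation. It approximates a query vector as a convex combination of candidate points by minimising squared reconstruction error over the probability simplex. Each iteration moves weight toward a single vertex, which is why the resulting weights tend to be sparse. All names (`frank_wolfe_simplex`, `points`, `query`) and the toy data are assumptions for illustration; the integerization and gradient-reuse steps described above are not shown.

```python
def frank_wolfe_simplex(query, points, n_iters=50):
    """Approximate `query` as a convex combination of `points` (generic Frank-Wolfe)."""
    dim = len(query)
    n = len(points)
    # Start at a vertex of the simplex: all weight on the first point.
    weights = [0.0] * n
    weights[0] = 1.0
    recon = list(points[0])
    for _ in range(n_iters):
        residual = [recon[d] - query[d] for d in range(dim)]
        # Gradient of 0.5 * ||recon - query||^2 with respect to each weight.
        grads = [sum(p[d] * residual[d] for d in range(dim)) for p in points]
        # Linear minimisation oracle over the simplex: pick the best single vertex.
        j = min(range(n), key=grads.__getitem__)
        step_dir = [points[j][d] - recon[d] for d in range(dim)]
        denom = sum(s * s for s in step_dir)
        if denom == 0.0:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        gamma = -sum(r * s for r, s in zip(residual, step_dir)) / denom
        gamma = max(0.0, min(1.0, gamma))
        if gamma == 0.0:
            break
        weights = [(1.0 - gamma) * w for w in weights]
        weights[j] += gamma
        recon = [recon[d] + gamma * step_dir[d] for d in range(dim)]
    return weights


# Toy data: three candidate points and a query inside their convex hull.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [0.25, 0.25]
weights = frank_wolfe_simplex(query, points)
recon = [sum(w * p[d] for w, p in zip(weights, points)) for d in range(2)]
```

Because the weights stay non-negative and sum to one at every step, the candidates with non-zero weight can be read directly as a support set for the query.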

Inference Economics Agent and Tool Ecosystem Test-Time Finetuning (TTFT) Gradient Reuse bits-per-byte (BpB) +2 more

5 · arXiv · cs.LG · 1mo ago

TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

TrajTok is a trajectory encoder that learns transferable GPS trace representations via multi-resolution hexagonal spatial tokenization and masked-token pretraining. It uses a factorized transformer with per-modality self-attention, cross-attention fusion, and spatiotemporal rotary position embeddings (ST-RoPE) to jointly encode geometry and kinematics. A single frozen TrajTok encoder with lightweight adapters outperforms task-specific methods on trajectory similarity search, classification, ETA, and travel-time regression on the Porto dataset. The work positions learned spatial tokenization plus masked pretraining as a viable path toward general-purpose trajectory foundation models.

Long Context Evolution Agent and Tool Ecosystem hexagonal spatial tokenization Porto dataset ST-RoPE +2 more

5 · arXiv · cs.CL · 8d ago

Adaptive asymmetric token compression accelerates time series language models up to 7.68×

A new arXiv preprint proposes an adaptive token budgeting framework for time series (TS) language models that compresses TS tokens using frequency-domain structure and progressively prunes prompt tokens across model layers. The authors demonstrate up to 7.68× inference acceleration with performance improvements in 78% of evaluated settings across forecasting, classification, imputation, and anomaly detection tasks. The work is motivated by the observation that TS tokens have uneven spectral contributions and prompt-token influence attenuates with model depth, making uniform token processing wasteful.

Long Context Evolution Inference Economics Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

5 · Hugging Face Blog · 1mo ago

Tokenization in Transformers v5: Simpler, Clearer, and More Modular

Hugging Face's Transformers v5 introduces a redesigned tokenization system aimed at being simpler, clearer, and more modular. The blog post outlines architectural changes to how tokenizers are structured and used within the library. This represents a significant API and design evolution for one of the most widely used ML frameworks in the ecosystem.

Inference Economics Agent and Tool Ecosystem Transformers Hugging Face Tokenizers

5 · arXiv · cs.LG · 26d ago

Good Token Hunting: Token Selection Framework for Visual Geometry Transformers

This paper introduces a two-stage token selection framework to address the quadratic computational scaling of global attention in visual geometry transformers used for multi-view 3D reconstruction. The approach combines diversity-based inter-frame selection (frame-level) with entropy-guided intra-frame sparsification (token-level within frames). Experiments demonstrate over 85% acceleration for 500-image scenes while maintaining or improving baseline reconstruction quality, offering a favorable speed-accuracy trade-off.

Inference Economics Agent and Tool Ecosystem inter-frame token selection visual geometry transformer global attention +5 more

5 · arXiv · cs.AI · 11d ago

CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup

Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14×–1.29× speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.

Inference Economics Qwen2.5 Alibaba CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference +2 more

6 · The Batch · 18d ago

Apple's AToken: A Unified Multimodal Tokenizer and Encoder for Images, Videos, and 3D Objects

Apple researchers introduced AToken, a transformer model with a single 4D tokenizer and encoder-decoder architecture that handles images, videos, and 3D objects in a shared token space. The model is trained to both reconstruct and classify all three media types, using a pretrained SigLIP2 vision encoder extended to four dimensions with 4D Rotary Position Embedding. AToken approaches or matches specialized models on image classification (82.2% ImageNet), image generation (0.21 rFID), and 3D reconstruction (28.28 PSNR), while remaining competitive on video tasks. The work addresses a longstanding tension between generation-focused and classification-focused encoders by forcing embeddings to retain both fine visual detail and semantic content.

Frontier Model Releases Multimodal Progress FLUX.1-dev Rotary Position Embedding (RoPE) Jiasen Lu +8 more