Entity · technique

UnigramLM

technique · active · unigramlm-74e67aca · 2 events · first seen May 22, 2026

Aliases: UnigramLM

Co-occurring entities

Global-PIQA · MultiBLiMP · Belebele · LangMAP · Byte Pair Encoding (BPE) · Renyi efficiency · WordPiece · ToaST (Tokenization with Split Trees) · CORE benchmark · Integer Programming (IP)

More like this (12)

Unigram watermarking · Unigram tokenisation · MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment · In-Place Tokenizer Expansion for Pre-trained LLMs · SmolLM · GLM · MinGram · SmolLM3 · CO-LMLM · KenLM · SmolVLM · 3LM

Recent events (2)

5 · arXiv · cs.CL · Jun 23, 2026

LangMAP: Language-adaptive tokenization from a shared vocabulary without language identification at inference

LangMAP (Language-adaptive Maximum a Posteriori Tokenization) extends the UnigramLM algorithm to produce language-specific tokenizations from a single shared vocabulary, so multilingual settings no longer require retraining models or swapping vocabularies. Language labels are required only at training time; inference proceeds without language identification. In an evaluation across 14 tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment, and it improves AST-leaf alignment for all programming languages tested. Fine-tuning results are mixed: gains are consistent on grammatical acceptability (MultiBLiMP) but less consistent on knowledge tasks (Global-PIQA, Belebele).

Evaluation and Benchmarking · Open Weights Progress · UnigramLM · Global-PIQA · MultiBLiMP · +2 more
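
For context on the base algorithm that LangMAP builds on: a unigram language model tokenizer typically picks, for each input, the segmentation whose tokens have the highest total log-probability, found with Viterbi dynamic programming. The sketch below illustrates only that standard decoding step. It is not LangMAP's language-adaptive variant, whose details the summary above does not give. The function name `viterbi_segment` and the toy vocabulary `TOY_LOG_PROBS` are hypothetical, chosen for illustration.

```python
import math

# Hypothetical toy vocabulary: token -> log-probability (values are made up).
TOY_LOG_PROBS = {
    "t": -5.0, "o": -5.0, "k": -5.0, "e": -5.0, "n": -5.0, "s": -3.0,
    "to": -3.0, "ken": -3.0, "token": -2.0, "tokens": -6.0,
}

def viterbi_segment(text: str, log_probs: dict) -> tuple:
    """Return (tokens, score) for the highest-scoring unigram segmentation."""
    n = len(text)
    max_len = max(len(tok) for tok in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best total log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] > -math.inf:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

tokens, score = viterbi_segment("tokens", TOY_LOG_PROBS)
print(tokens, score)  # ['token', 's'] -5.0
```

With these made-up scores, "token" + "s" (total -5.0) beats the single token "tokens" (-6.0), which shows that the chosen segmentation depends on the token probabilities and not only on which tokens are in the vocabulary.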

6 · arXiv · cs.CL · May 22, 2026

ToaST: Tokenization with Split Trees Reduces Token Count by 11%+ Over BPE/WordPiece/UnigramLM

ToaST (Tokenization with Split Trees) is a new subword tokenization method that uses a recursive binary split-tree inference procedure and Integer Programming-based vocabulary selection to directly optimize compression. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, which effectively extends the context length of models that use it. In 1.5B-parameter LM training experiments, ToaST achieves the highest CORE benchmark score, outperforming baselines by 2.6%–7.6% across 22 tasks. The LP relaxation of the vocabulary selection IP is near-integral in practice, yielding provably near-optimal vocabularies.

Long Context Evolution · Frontier Model Releases · Byte Pair Encoding (BPE) · UnigramLM · Renyi efficiency · +5 more
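
To make the context-length claim concrete: if a tokenizer needs a fraction r fewer tokens for the same text, a fixed token window holds roughly 1 / (1 - r) times as much text. The sketch below works through that arithmetic for r = 0.11, the lower bound reported above for English. It assumes the reduction applies uniformly to the text in question. The function name `effective_context_gain` and the 8,192-token window are hypothetical, used only for illustration.

```python
def effective_context_gain(token_reduction: float) -> float:
    """Factor by which a fixed token window covers more text.

    token_reduction is the fractional drop in token count (0.11 means 11% fewer tokens).
    """
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1.0 / (1.0 - token_reduction)

gain = effective_context_gain(0.11)
print(f"{gain:.3f}x")  # 1.124x

# Hypothetical 8,192-token window: the same text would have needed about
# this many tokens under a baseline tokenizer.
window = 8192
print(int(window * gain))
```

Under these assumptions, an 11% reduction in token count corresponds to a window that covers about 12% more text.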