Entity · technique

Unigram tokenisation

technique · active · unigram-tokenisation-57ef690d · 1 event · first seen May 22, 2026

Aliases: Unigram tokenisation

Co-occurring entities

Byte Pair Encoding (BPE) · linear programming · bits-per-byte (BpB) · ConvexTok

More like this (12)

Unigram watermarking
UnigramLM
MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment
Tokenizers
TOBA tokenizer
In-Place Tokenizer Expansion for Pre-trained LLMs
Content is What Remains: Invariant Speech Tokenization from Parallel Utterances
intra-frame token sparsification
How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
hexagonal spatial tokenization
Transformer Grammars
SearchGen-Corpus-1M

Recent events (1)

arXiv · cs.LG · May 22, 2026

ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation

This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.
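The two quantities named in this summary can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names and the numbers are hypothetical, and the gap calculation assumes the tokeniser objective is being minimised (for example, total token count), which the summary does not specify.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Convert per-token negative log-likelihoods (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


def relative_gap(objective, lower_bound):
    """Relative distance of a minimised objective from a certified lower bound."""
    return (objective - lower_bound) / lower_bound


# Hypothetical example: a 12-byte string split into two tokens
# that cost the model 8 bits and 4 bits respectively.
text = "tokenisation"
nll = [8 * math.log(2), 4 * math.log(2)]
bpb = bits_per_byte(nll, text)

# Hypothetical objective value and lower bound, giving a gap of about 0.5%.
gap = relative_gap(1005.0, 1000.0)
```

Because BpB normalises by bytes rather than by tokens, it allows comparison between models whose tokenisers split the same text into different numbers of tokens. A relative gap below 0.01 corresponds to the "within 1% of optimal" statement above.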

Evaluation and Benchmarking · Byte Pair Encoding (BPE) · linear programming · bits-per-byte (BpB)