Entity · technique

Byte Pair Encoding (BPE)

technique · active · byte-pair-encoding-bpe--f0a3f222 · 2 events · first seen May 22, 2026

Aliases: Byte Pair Encoding (BPE)

Co-occurring entities

UnigramLM · Renyi efficiency · WordPiece · ToaST (Tokenization with Split Trees) · CORE benchmark · Integer Programming (IP) · linear programming · bits-per-byte (BpB) · Unigram tokenisation · ConvexTok

More like this (12)

bits-per-byte (BpB) · Byte-Prefix Marginalization · Pair-In, Pair-Out (PIPO) · binary quantization · Parallel Box Decoding · Positional Encoding · opencode · bitsandbytes · BitNet b1.58 · UBP2 · BytePlus · BitNet

Recent events (2)

6 · arXiv · cs.CL · May 22, 2026 · source ↗

ToaST: Tokenization with Split Trees Reduces Token Count by 11%+ Over BPE/WordPiece/UnigramLM

ToaST (Tokenization with Split Trees) is a new subword tokenization method that uses a recursive binary split-tree inference procedure and Integer Programming-based vocabulary selection to directly optimize compression. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, effectively extending context length for models using it. In 1.5B parameter LM training experiments, ToaST achieves the highest CORE benchmark score, outperforming baselines by 2.6%–7.6% across 22 tasks. The LP relaxation of the vocabulary selection IP is near-integral in practice, yielding provably near-optimal vocabularies.

Long Context Evolution · Frontier Model Releases · Byte Pair Encoding (BPE) · UnigramLM · Renyi efficiency · +5 more
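The claim that fewer tokens "effectively extends context length" comes down to simple arithmetic: if the same text needs a fraction r fewer tokens, a fixed token window covers 1 / (1 - r) times as much text. The sketch below works this out for the reported 11% reduction. The function name `effective_context_gain` is hypothetical and used only for illustration; it is not taken from the ToaST paper.

```python
def effective_context_gain(token_reduction: float) -> float:
    """Multiplier on the amount of text that fits in a fixed token window.

    token_reduction is the fractional drop in token count for the same text
    (0.11 for an 11% reduction). Hypothetical helper, for illustration only.
    """
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    # The same text now costs (1 - r) times as many tokens,
    # so a fixed window holds 1 / (1 - r) times as much text.
    return 1.0 / (1.0 - token_reduction)


gain = effective_context_gain(0.11)
print(f"{gain:.3f}")  # prints 1.124
```

Under this reading, an 11% reduction in token count corresponds to roughly 12% more text per window, assuming the reduction holds uniformly across the text being encoded.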

5 · arXiv · cs.LG · May 22, 2026 · source ↗

ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation

This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.

Evaluation and Benchmarking · Byte Pair Encoding (BPE) · linear programming · bits-per-byte (BpB) · +2 more
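For context on the "greedy approach" that ConvexTok is said to replace, here is a minimal sketch of textbook BPE vocabulary training: repeatedly count adjacent symbol pairs and merge the most frequent one. It is an illustration under stated assumptions, not the implementation from either paper. It works on characters rather than raw bytes, and the tie-breaking rule (higher count, then lexicographically larger pair) is an arbitrary choice made here for determinism. The names `train_bpe` and `merge_pair` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Greedy BPE training sketch: return the learned merges in order."""
    # Each distinct word is a tuple of symbols (characters here) with a frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Greedy step: take the most frequent pair (assumed tie-break: larger pair).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the chosen merge to every word before the next round.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


print(train_bpe(["low", "low", "lower", "lowest"], 2))
# prints [('o', 'w'), ('l', 'ow')]
```

Each merge is chosen only from the current pair counts and is never revisited, which is what makes the procedure greedy. An LP-based formulation, as the summary above describes, instead selects the vocabulary through a global optimisation.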