Byte Pair Encoding (BPE)

technique · active · byte-pair-encoding-bpe--f0a3f222 · 2 events · first seen 26d ago

Aliases: Byte Pair Encoding (BPE)

Recent events (2)

5 · arXiv · cs.LG · 26d ago · source ↗

ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation

This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.
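
For context on the greedy baseline mentioned above, the following is a minimal sketch of textbook BPE vocabulary construction. It is not ConvexTok and is not taken from the paper. The function names `train_bpe` and `merge_pair`, the character-level starting alphabet, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Assumption: words start as sequences of single characters, and ties are
    broken by taking the lexicographically largest pair so runs are deterministic.
    """
    corpus = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Greedy step: commit to the locally best pair, with no look-ahead.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
        merges.append(best)
    return merges, corpus


merges, corpus = train_bpe(["low", "low", "lower", "lowest"], 2)
print(merges)  # [('o', 'w'), ('l', 'ow')]
```

Each merge is chosen only from the current pair counts and is never revisited. This is the greedy behaviour that an optimisation-based formulation aims to replace.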

6 · arXiv · cs.CL · 25d ago · source ↗

ToaST: Tokenization with Split Trees Reduces Token Count by 11%+ Over BPE/WordPiece/UnigramLM

ToaST (Tokenization with Split Trees) is a new subword tokenization method that uses a recursive binary split-tree inference procedure and Integer Programming-based vocabulary selection to directly optimize compression. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, effectively extending context length for models using it. In 1.5B parameter LM training experiments, ToaST achieves the highest CORE benchmark score, outperforming baselines by 2.6%–7.6% across 22 tasks. The LP relaxation of the vocabulary selection IP is near-integral in practice, yielding provably near-optimal vocabularies.
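
The link between token count and context length can be worked through with a short calculation. This is a sketch under one simplifying assumption: the 11% reduction applies uniformly to the text in question. The helper name `effective_context_gain` is hypothetical and does not come from the paper.

```python
def effective_context_gain(token_reduction):
    """Relative increase in text covered by a fixed-size token window.

    If the same text needs a fraction `token_reduction` fewer tokens, a window
    of N tokens covers 1 / (1 - token_reduction) times as much text.
    """
    if not 0 <= token_reduction < 1:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1 / (1 - token_reduction) - 1


# An 11% reduction in tokens lets a fixed window cover about 12.4% more text.
print(f"{effective_context_gain(0.11):.1%}")  # 12.4%
```

Since the summary reports reductions of more than 11%, this figure is a lower bound under the stated assumption.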