Almanac
technique

Unigram tokenisation

technique · active · unigram-tokenisation-57ef690d · 1 event · first seen 26d ago

Aliases: Unigram tokenisation

Recent events (1)

5 · arXiv · cs.LG · 26d ago

ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation

This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.
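As a rough illustration of two quantities the summary mentions, the sketch below computes bits-per-byte from per-token negative log-likelihoods and the relative gap between an achieved objective value and a lower bound. The function names and sample numbers are hypothetical, and the gap formula is the generic one for a minimisation problem; the summary does not state ConvexTok's actual objective or how it derives its bound.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Bits-per-byte: total negative log-likelihood in bits / UTF-8 byte count.

    token_nll_nats: per-token negative log-likelihoods in nats (hypothetical input).
    text: the string those tokens encode.
    """
    n_bytes = len(text.encode("utf-8"))
    total_bits = sum(token_nll_nats) / math.log(2)  # convert nats to bits
    return total_bits / n_bytes


def relative_gap(objective, lower_bound):
    """Relative distance of a feasible solution above a lower bound.

    Assumes a minimisation problem with a positive lower bound; this is a
    generic definition, not necessarily the one used in the paper.
    """
    return (objective - lower_bound) / lower_bound


# Illustrative numbers only: 8 tokens at 1 bit each over a 4-byte string.
bpb = bits_per_byte([math.log(2)] * 8, "abcd")
gap = relative_gap(1005.0, 1000.0)
print(f"BpB = {bpb:.3f}, gap = {gap:.2%}")
```

Because BpB normalises by bytes rather than tokens, it stays comparable across tokenisers that split the same text into different numbers of tokens. A gap below 0.01 corresponds to the "within 1% of optimal" wording, under the assumptions above.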