technique
ConvexTok
technique · active · convextok-74f30b2a · 1 event · first seen 26d ago
Aliases: ConvexTok
Recent events (1)
ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation
This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.
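The optimality certificate mentioned above rests on a simple inequality: if a lower bound L satisfies L <= optimum <= achieved cost, then (achieved - L) / L is an upper bound on how far the achieved solution is from optimal. The sketch below illustrates this on a toy corpus. It is not the paper's algorithm: the objective (token count), the function names, the candidate set, and the marginal-gain greedy baseline are all assumptions made for illustration, and an exhaustive search stands in for the linear-programming bound, which is only feasible at toy scale.

```python
from itertools import combinations


def min_tokens(text, merges):
    """Fewest tokens needed to segment `text`.

    Single characters are always allowed; `merges` is the set of
    multi-character tokens in the vocabulary (hypothetical setup).
    """
    n = len(text)
    best = [0] + [n + 1] * n  # best[i] = fewest tokens covering text[:i]
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if len(piece) == 1 or piece in merges:
                best[end] = min(best[end], best[start] + 1)
    return best[n]


def candidates(text, max_len=3):
    """All distinct substrings of length 2..max_len, in sorted order."""
    return sorted({text[i:i + k]
                   for k in range(2, max_len + 1)
                   for i in range(len(text) - k + 1)})


def greedy_cost(text, cands, budget):
    """Add one token at a time, always taking the largest immediate gain.

    This is a generic greedy baseline, not BPE or Unigram.
    """
    chosen = set()
    for _ in range(budget):
        pick = min((c for c in cands if c not in chosen),
                   key=lambda c: min_tokens(text, chosen | {c}))
        chosen.add(pick)
    return min_tokens(text, chosen)


def exhaustive_cost(text, cands, budget):
    """Exact optimum by brute force; stand-in for an LP lower bound."""
    return min(min_tokens(text, set(combo))
               for combo in combinations(cands, budget))


def certified_gap(achieved, lower_bound):
    """Relative gap; upper-bounds the true distance from optimal."""
    return (achieved - lower_bound) / lower_bound


corpus = "abcabcababbcbc"
cands = candidates(corpus)
achieved = greedy_cost(corpus, cands, budget=2)
bound = exhaustive_cost(corpus, cands, budget=2)
print(f"greedy={achieved} bound={bound} gap={certified_gap(achieved, bound):.1%}")
```

In a realistic setting the brute-force search would be replaced by the value of a relaxed optimisation problem, which is cheaper to compute and still never exceeds the true optimum, so the same gap formula applies.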