technique
Unigram tokenisation
technique · active
unigram-tokenisation-57ef690d · 1 event · first seen 26d ago
Aliases: Unigram tokenisation
Co-occurring entities
More like this (12)
UnigramLM · Tokenizers · TOBA tokenizer · intra-frame token sparsification · hexagonal spatial tokenization · Transformer Grammars · ToaST (Tokenization with Split Trees) · Alternating Token-Weighted Unlearning · UniAudio-Token · sentence embeddings · Speaker Group Encoding in Self-supervised Speech Recognition Models · zero-shot systematizer
Recent events (1)
ConvexTok: Tokeniser Construction via Linear Programming and Convex Optimisation
This paper proposes ConvexTok, a new tokenisation algorithm that formulates vocabulary construction as a linear program solved via convex optimisation, replacing the greedy approaches used by BPE and Unigram. ConvexTok consistently improves intrinsic tokenisation metrics and bits-per-byte (BpB) for language models, with less consistent gains on downstream tasks. A key feature is the ability to certify proximity to optimality via a lower bound, with empirical results showing the algorithm is within 1% of optimal at common vocabulary sizes.
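The two quantities in this summary, bits-per-byte and the certified distance from optimality, can be illustrated with a minimal sketch. It is not taken from the paper. The function names, the relative-gap definition for a minimisation objective, and all numbers are assumptions made for illustration.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


def relative_gap(achieved: float, lower_bound: float) -> float:
    """Relative distance of a minimised objective from a certified lower bound.

    Assumed definition: (achieved - lower_bound) / lower_bound.
    """
    return (achieved - lower_bound) / lower_bound


# Hypothetical values, for illustration only.
text = "tokenisation"  # 12 ASCII characters, so 12 UTF-8 bytes
bpb = bits_per_byte(total_nll_nats=12 * math.log(2), text=text)

# A hypothetical objective value and lower bound for one vocabulary size.
gap = relative_gap(achieved=1005.0, lower_bound=1000.0)
within_one_percent = gap < 0.01
```

Because BpB normalises by bytes rather than tokens, it allows comparison between tokenisers that split the same text into different numbers of tokens.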