technique
UnigramLM
techniqueactive
unigramlm-74e67aca·1 events·first seen 25d agoAliases: UnigramLM
Co-occurring entities
More like this (12)
Recent events (1)
ToaST: Tokenization with Split Trees Reduces Token Count by 11%+ Over BPE/WordPiece/UnigramLM
ToaST (Tokenization with Split Trees) is a new subword tokenization method that uses a recursive binary split-tree inference procedure and Integer Programming-based vocabulary selection to directly optimize compression. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, effectively extending context length for models using it. In 1.5B parameter LM training experiments, ToaST achieves the highest CORE benchmark score, outperforming baselines by 2.6%–7.6% across 22 tasks. The LP relaxation of the vocabulary selection IP is near-integral in practice, yielding provably near-optimal vocabularies.