Almanac
benchmark

Belebele

benchmark · active · provisional · belebele-1d1e15b8 · 1 event · first seen 9h ago

Aliases: Belebele

Recent events (1)

5 · arXiv · cs.CL · 9h ago

LangMAP: Language-adaptive tokenization from a shared vocabulary without language identification at inference

LangMAP (Language-adaptive Maximum a Posteriori Tokenization) extends the UnigramLM algorithm to produce language-specific tokenizations from a single shared vocabulary, so multilingual settings require neither retraining models nor swapping vocabularies. Language labels are required only at training time; inference proceeds without language identification. In an evaluation spanning 14 tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and improves AST-leaf alignment for all programming languages tested. Fine-tuning results are mixed: gains are consistent on grammatical acceptability (MultiBLiMP) but less consistent on knowledge tasks (Global-PIQA, Belebele).
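To make the idea of language-specific tokenizations from one shared vocabulary concrete, here is a minimal sketch of standard UnigramLM-style Viterbi segmentation in Python. It is not LangMAP's implementation: the function name `viterbi_segment`, the toy vocabulary, and the score tables `LOGP_A` and `LOGP_B` are all hypothetical, and the scores are illustrative values rather than normalized log-probabilities. The sketch only shows that two score tables over the same set of pieces can segment the same string differently; it does not model how LangMAP avoids language identification at inference.

```python
import math


def viterbi_segment(text, logp):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `logp` maps each vocabulary piece to a score (a log-probability in a real
    tokenizer; toy values here).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(piece) for piece in logp)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Follow the back-pointers from the end of the string to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Hypothetical toy data: one shared vocabulary, two per-language score tables.
CHARS = {c: -6.0 for c in "united"}
LOGP_A = {**CHARS, "un": -4.0, "unit": -3.0, "ed": -2.0, "ited": -7.0}
LOGP_B = {**CHARS, "un": -2.0, "unit": -7.0, "ed": -4.0, "ited": -3.0}

print(viterbi_segment("united", LOGP_A))  # ['unit', 'ed']
print(viterbi_segment("united", LOGP_B))  # ['un', 'ited']
```

Both tables contain exactly the same pieces, so no vocabulary swap is involved; only the scores assigned to those pieces differ.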