LangMAP
LangMAP: Language-adaptive tokenization from a shared vocabulary without language identification at inference
LangMAP (Language-adaptive Maximum a Posteriori Tokenization) extends the UnigramLM algorithm to produce language-specific tokenizations from a single shared vocabulary, so multilingual settings require neither retraining the model nor swapping vocabularies. Language labels are required only at training time; inference proceeds without language identification. Evaluated across 14 tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment, and improves AST-leaf alignment for all programming languages tested. Fine-tuning results are mixed: gains are consistent on grammatical acceptability (MultiBLiMP) but less consistent on knowledge tasks (Global-PIQA, Belebele).
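To make the idea of language-specific segmentations over one shared vocabulary concrete, here is a minimal Python sketch. It is not LangMAP's published algorithm: the summary above does not specify the training objective or the inference rule, so the sketch assumes one plausible reading, in which each language has its own unigram scores over the same pieces and inference picks the (language, segmentation) pair with the highest joint score. The names `viterbi`, `map_tokenize`, `LANG_LOGP`, and `PRIOR`, and all of the scores, are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy data: ONE shared set of pieces, with made-up
# (unnormalized) per-language log-scores. Not taken from LangMAP.
LANG_LOGP: Dict[str, Dict[str, float]] = {
    "en": {"hand": -3.0, "en": -7.0, "han": -6.0, "den": -3.0,
           "h": -9.0, "a": -9.0, "n": -9.0, "d": -9.0, "e": -9.0},
    "nl": {"hand": -3.0, "en": -2.0, "han": -7.0, "den": -5.0,
           "h": -9.0, "a": -9.0, "n": -9.0, "d": -9.0, "e": -9.0},
}
PRIOR: Dict[str, float] = {"en": 0.5, "nl": 0.5}  # assumed language prior


def viterbi(text: str, logp: Dict[str, float]) -> Tuple[float, List[str]]:
    """Best segmentation of `text` under a unigram model.

    The score of a segmentation is the sum of its pieces' log-scores,
    which is the standard UnigramLM decoding criterion.
    """
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last piece)
    best: List[Tuple[float, int]] = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start][0] > -math.inf:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow back-pointers from the end of the string to recover the pieces.
    pieces: List[str] = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return best[n][0], pieces[::-1]


def map_tokenize(
    text: str,
    lang_logp: Dict[str, Dict[str, float]],
    prior: Dict[str, float],
) -> Tuple[str, List[str]]:
    """Assumed inference rule: maximize log prior(lang) + segmentation score.

    No language label is passed in; the language is chosen jointly
    with the segmentation.
    """
    best_score, best_lang, best_pieces = -math.inf, "", []
    for lang, logp in lang_logp.items():
        score, pieces = viterbi(text, logp)
        score += math.log(prior[lang])
        if score > best_score:
            best_score, best_lang, best_pieces = score, lang, pieces
    return best_lang, best_pieces


print(viterbi("handen", LANG_LOGP["en"])[1])    # ['han', 'den']
print(viterbi("handen", LANG_LOGP["nl"])[1])    # ['hand', 'en']
print(map_tokenize("handen", LANG_LOGP, PRIOR))  # ('nl', ['hand', 'en'])
```

With these toy scores, the same string segments differently under each language's distribution, and the joint maximization selects the Dutch-style split without being told the language. Whether LangMAP resolves the language this way, or folds the language information into a single inference-time distribution, is not stated in the summary above.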