How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
Diagnostic framework decomposes LLM difficulty on historical Italian and Russian texts
A new arXiv preprint proposes a four-dimensional framework for measuring how difficult historical language is for LLMs: tokenization cost, surprisal, semantic robustness, and context sensitivity. Evaluated on 17th-century Italian, 19th-century Italian, and 18th-century Russian texts, the study finds that tokenization penalties (25-30% inflation) are similar across languages, but predictive difficulty diverges sharply: early modern Italian is 2.4x more surprising than modern Italian, while Russian shows only a modest increase. Crucially, embedding similarity remains high (>0.85) even when generation is unstable, and a simple temporal context prompt reduces historical surprisal by ~60%. The findings have practical implications for deploying LLMs in digital library and historical document workflows.
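The quantities named in the summary can be illustrated with a minimal sketch. The summary does not give the preprint's exact definitions, so everything here is an assumption: tokenization cost is taken as tokens per whitespace-separated word, surprisal as mean negative log-probability per token in bits, semantic robustness as cosine similarity between two embedding vectors, and the wording of the temporal context prompt is hypothetical. All function names are illustrative and are not taken from the paper.

```python
import math


def tokens_per_word(token_count, text):
    """Assumed tokenization cost: tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return token_count / len(words)


def inflation(historical_cost, modern_cost):
    """Relative increase of a historical cost over a modern baseline."""
    return historical_cost / modern_cost - 1.0


def mean_surprisal_bits(token_logprobs):
    """Mean surprisal per token in bits, given natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("no token log-probabilities given")
    return -sum(token_logprobs) / (len(token_logprobs) * math.log(2))


def relative_reduction(baseline, with_context):
    """Fraction by which a value drops once context is added."""
    return 1.0 - with_context / baseline


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def with_temporal_context(text, period):
    """Prepend a hypothetical temporal context line to a passage."""
    return f"The following text was written in {period}.\n\n{text}"
```

In use, the token counts and log-probabilities would come from whichever tokenizer and model are under evaluation. A surprisal ratio is then `mean_surprisal_bits(historical) / mean_surprisal_bits(modern)`, and the effect of the prompt is `relative_reduction` applied to surprisal measured without and with the output of `with_temporal_context`.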