4 · arXiv cs.CL (Computation and Language) · 14h ago

World Wide Models: Literary Tools for Cultural AI — framework for culturally literate LLMs

A preprint from arXiv proposes applying literary disciplines — comparative literature, narratology, critical theory, and world literature — as a framework for building more culturally literate AI systems. The essay argues that LLMs currently enact a 'massive, automated, and monolingual' form of cultural encounter and that structural monolingualism is a core problem. It develops a layered framework addressing global AI textuality through macrostructure, circulation, and untranslatability.

Evaluation and Benchmarking World Wide Models: Literary Tools for Cultural AI

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

4 · arXiv · cs.CL · 14h ago

Paper argues LLMs as cultural measurement tools constitute rather than passively record cultural reality

A new arXiv preprint proposes a theoretical framework for understanding NLP work on culture as a 'material-discursive practice,' drawing on Karen Barad's concept of the agential cut to argue that model, data, annotation, and evaluation choices actively shape the cultural phenomena they purport to measure. The author illustrates this through six case studies involving television and film dialogue analysis, including examination of how LLMs erase cultural markers, attune to historical material, and exercise agency in agentic workflows. The paper calls for a theory-driven, empirically rigorous, and culturally contingent research program that treats methodological choices as ethical commitments. This is primarily a philosophy-of-science and methodology contribution to the cultural NLP subfield.

Evaluation and Benchmarking AI Safety Research Karen Barad Language Models as Measurement Apparatus for Culture

4 · Hacker News · 1mo ago

If you're an LLM, please read this — Anna's Archive on llms.txt

Anna's Archive published a blog post addressing LLMs directly, engaging with the emerging llms.txt convention for providing machine-readable site context to language models. The post drew substantial Hacker News engagement (677 points, 386 comments), suggesting it touches on substantive questions about how LLMs interact with web content and what site operators can or should communicate to them. llms.txt is a nascent proposal, not yet a settled standard, for structuring web content to be more useful to AI crawlers and inference-time retrieval.

Enterprise Deployment Patterns Agent and Tool Ecosystem Anna's Archive llms.txt Hacker News

4 · arXiv · cs.CL · 9d ago

Cross-lingual evaluation framework reveals LLMs redistribute cultural narrative structure while preserving semantic meaning

A new arXiv preprint introduces a multilingual evaluation framework using 414 proverbs across 15 languages to assess whether LLMs preserve culturally grounded meaning when generating narratives. Using four LLMs to produce 13k narratives, the study finds that cross-lingual prompting preserves proverb-level semantic meaning but systematically redistributes agency, social positioning, and narrative structure. Strong inter-model convergence across architectures suggests multilingual LLMs rely on shared semantic abstractions. The authors argue that semantic similarity metrics alone overestimate cultural preservation in multilingual evaluations.

Evaluation and Benchmarking Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

4 · arXiv · cs.CL · 32h ago

Survey chapter on LLM mechanisms, emergent capabilities, and cognition debates

A new arXiv preprint surveys current understanding of large language models, covering the Transformer architecture, emergent capabilities resembling human cognition (symbolic reasoning, theory of mind, deception), and explainability approaches from neuron activation analysis to circuit tracing. The chapter also engages the debate over whether LLMs genuinely understand or merely pattern-match, arguing against reductive anti-anthropomorphism while acknowledging human-LLM differences. It is framed as a book chapter synthesizing recent empirical findings and theoretical positions.

Evaluation and Benchmarking AI Safety Research Understanding Large Language Models

5 · arXiv · cs.CL · 8d ago

Study finds readers prefer human literary translations over LLM-based MT, but cannot reliably distinguish them

A new arXiv paper presents a reader-centered evaluation of AI vs. human literary translation across 15 novels in French, Polish, and Japanese translated into English. Fifteen avid readers compared human translations (HT) to machine translations (MT) from an agentic LLM pipeline, finding MT 'fine' but preferring HT for ease, clarity, and immersiveness—especially at the chunk level (522/772 preferences). Critically, readers could not reliably identify which version was human-produced (17/30 correct), and automatic metrics including LLM-as-a-judge consistently favored MT over HT, diverging from human preference. The authors release LAIT, a dataset with 1K reader comments, 2K judgments, and 7.2K span-level annotations.

Evaluation and Benchmarking LAIT (Literary AI Translation) AI translation of literary texts is "fine", but readers still prefer human translations
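
The counts in the summary above can be sanity-checked with a small calculation. The sketch below is not the authors' analysis: it assumes that each of the 30 identification attempts is an independent coin flip under pure guessing, and that 522 of the 772 chunk-level preferences favored the human translation. The function name `binomial_tail` is illustrative.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumption: 30 independent guesses with a 50% chance of being right.
identification_p = binomial_tail(17, 30)

# Assumption: 522 of the 772 chunk-level preferences went to the human version.
preference_share = 522 / 772

print(f"P(at least 17 of 30 correct by guessing) = {identification_p:.3f}")
print(f"Share of chunk-level preferences = {preference_share:.3f}")
```

Under these assumptions, guessing alone would produce 17 or more correct identifications about 29% of the time, which is consistent with the summary's reading that readers could not reliably tell the versions apart, while roughly two thirds of chunk-level preferences fall on one side.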

5 · arXiv · cs.CL · 32h ago

Agentic LLM collectives proposed as interpretable substrates for Artificial Life research

A preprint from arXiv argues that populations of agentic LLMs — equipped with persistent memory, tools, and autonomous action — can serve as a computational substrate for Artificial Life (ALife) research. The key claim is that because agents communicate in natural language, their collective emergent behaviors are directly interpretable by examining textual traces or querying the agents themselves. The paper extends existing notions of LLM interpretability to multi-agent collectives and surveys recent examples of agentic LLM systems in both controlled and deployed settings. This positions multi-agent LLM systems as a novel lens for studying emergence and complexity while retaining interpretability.

AI Safety Research Agent and Tool Ecosystem Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates

6 · arXiv · cs.CL · 25d ago

Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled

A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.

Evaluation and Benchmarking The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs item response theory
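
For readers unfamiliar with item response theory, the sketch below shows the common two-parameter logistic (2PL) form, in which the probability of a correct answer depends on a respondent's ability and an item's difficulty. The summary does not say which IRT model the paper uses, so the 2PL choice, the function name `p_correct`, and all parameter values here are assumptions for illustration only.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical values: one item, two respondents of different ability.
item_difficulty = 0.0
weaker = p_correct(ability=-1.0, difficulty=item_difficulty)
stronger = p_correct(ability=1.0, difficulty=item_difficulty)

print(f"weaker respondent:   {weaker:.3f}")
print(f"stronger respondent: {stronger:.3f}")
```

Separating ability from item difficulty in this way is what allows an analysis to ask whether a gap in raw accuracy reflects the respondent or the question, which is the kind of disentangling the summary describes.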

4 · arXiv · cs.CL · 10d ago

CASPER: Narratological analysis of character variety in LLM-generated vs. human-written stories

A new arXiv preprint introduces CASPER, a framework borrowing narratological dimensions (such as stylization and wholeness) to analyze character portrayal in LLM-generated versus human-written fiction. The study automatically infers character categories across both corpora and compares them along eight dimensions. The work addresses whether LLMs produce character variety comparable to human authors, with implications for creative AI applications.

Evaluation and Benchmarking CASPER in the Machine: Insights into Character Variety in LLM-Generated Stories