4 · arXiv cs.CL (Computation and Language) · 13h ago

Paper argues LLMs as cultural measurement tools constitute rather than passively record cultural reality

A new arXiv preprint proposes a theoretical framework for understanding NLP work on culture as a 'material-discursive practice,' drawing on Karen Barad's concept of the agential cut to argue that model, data, annotation, and evaluation choices actively shape the cultural phenomena they purport to measure. The author illustrates this through six case studies involving television and film dialogue analysis, including examination of how LLMs erase cultural markers, attune to historical material, and exercise agency in agentic workflows. The paper calls for a theory-driven, empirically rigorous, and culturally contingent research program that treats methodological choices as ethical commitments. This is primarily a philosophy-of-science and methodology contribution to the cultural NLP subfield.

Evaluation and Benchmarking · AI Safety Research · Karen Barad · Language Models as Measurement Apparatus for Culture

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Principles to Real-World Flashpoints


Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement


Related events (8)

4 · arXiv · cs.CL · 13h ago

World Wide Models: Literary Tools for Cultural AI — framework for culturally literate LLMs

A preprint from arXiv proposes applying literary disciplines — comparative literature, narratology, critical theory, and world literature — as a framework for building more culturally literate AI systems. The essay argues that LLMs currently enact a 'massive, automated, and monolingual' form of cultural encounter and that structural monolingualism is a core problem. It develops a layered framework addressing global AI textuality through macrostructure, circulation, and untranslatability.

Evaluation and Benchmarking · World Wide Models: Literary Tools for Cultural AI

4 · arXiv · cs.CL · 9d ago

Cross-lingual evaluation framework reveals LLMs redistribute cultural narrative structure while preserving semantic meaning

A new arXiv preprint introduces a multilingual evaluation framework using 414 proverbs across 15 languages to assess whether LLMs preserve culturally grounded meaning when generating narratives. Using four LLMs to produce 13k narratives, the study finds that cross-lingual prompting preserves proverb-level semantic meaning but systematically redistributes agency, social positioning, and narrative structure. Strong inter-model convergence across architectures suggests multilingual LLMs rely on shared semantic abstractions. The authors argue that semantic similarity metrics alone overestimate cultural preservation in multilingual evaluations.

Evaluation and Benchmarking · Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

6 · arXiv · cs.CL · 25d ago

Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled

A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.
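
The summary does not say which item response model the authors fit, so the following is only a minimal sketch using the standard two-parameter logistic (2PL) form. The `access_bonus` term and every number are hypothetical, chosen to show how a local-language advantage can be hidden behind lower raw accuracy.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic (2PL) item response function."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# All values below are hypothetical and only illustrate the argument.
ability_english = 1.0   # model proficiency when prompted in English
ability_local = 0.0     # lower proficiency when prompted in the local language
item_difficulty = 0.5   # difficulty of one culture-specific question
access_bonus = 0.5      # assumed drop in difficulty when asked in the local language

p_english = p_correct(ability_english, item_difficulty)
p_local = p_correct(ability_local, item_difficulty - access_bonus)
p_local_proficiency_only = p_correct(ability_local, item_difficulty)

print(f"English prompt:                  {p_english:.3f}")
print(f"Local prompt (observed):         {p_local:.3f}")
print(f"Local prompt (proficiency only): {p_local_proficiency_only:.3f}")
```

With these made-up values the local-language prompt still scores below English, yet above what its proficiency alone would predict. That gap is the kind of advantage the summary describes as visible only once proficiency is factored out.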

Evaluation and Benchmarking · The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs · item response theory

4 · arXiv · cs.CL · 31h ago

Survey chapter on LLM mechanisms, emergent capabilities, and cognition debates

A new arXiv preprint surveys current understanding of large language models, covering the Transformer architecture, emergent capabilities resembling human cognition (symbolic reasoning, theory of mind, deception), and explainability approaches from neuron activation analysis to circuit tracing. The chapter also engages the debate over whether LLMs genuinely understand or merely pattern-match, arguing against reductive anti-anthropomorphism while acknowledging human-LLM differences. It is framed as a book chapter synthesizing recent empirical findings and theoretical positions.

Evaluation and Benchmarking · AI Safety Research · Understanding Large Language Models

5 · arXiv · cs.CL · 8d ago

LLM-based classification exposes keyword lexicon artifacts in computational social science stance measurement

A new arXiv preprint demonstrates that statistically significant findings in computational social science can be entirely measurement artifacts of keyword-based scoring instruments. Analyzing 85 interviews across four public intellectuals, the authors show that keyword-based certainty scores produce strong correlations (r=0.72–0.93) that collapse or invert when replaced with LLM zero-shot semantic classification on 32,625 sentences. The paper identifies three structural failure modes in keyword lexicons—syntactic blindness, polysemy blindness, and categorical absence—and argues that keyword counts measure lexical co-occurrence tendencies rather than rhetorical stance. The work has implications for the validity of prior NLP-based social science research and for the comparative utility of LLMs as measurement instruments.
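
As a rough illustration of two of the failure modes named above, the sketch below scores sentences with a tiny keyword lexicon. The lexicon, the scoring rule, and the example sentences are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical certainty lexicon; real instruments are much larger.
CERTAINTY_WORDS = {"certain", "certainly", "definitely", "clearly"}

def keyword_certainty(sentence):
    """Fraction of tokens that appear in the certainty lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(token in CERTAINTY_WORDS for token in tokens)
    return hits / max(len(tokens), 1)

examples = {
    "confident claim": "I am certain this will work.",
    "negated claim": "I am not certain this will work.",    # syntactic blindness
    "unrelated sense": "A certain critic disagreed.",       # polysemy blindness
}

for label, sentence in examples.items():
    print(f"{label}: {keyword_certainty(sentence):.3f}")
```

All three sentences receive a positive certainty score, although only the first expresses certainty. A count of lexicon hits cannot see negation or word sense, which is consistent with the claim that such scores track lexical co-occurrence and not rhetorical stance.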

Evaluation and Benchmarking · When Certainty Is an Artifact: Keyword Lexicon Blindness and the (Mis)Measurement of Rhetorical Stance

6 · arXiv · cs.AI · 23d ago

Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks

A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.
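
The summary does not list the benchmark's actual metrics, so the following is only a sketch of why an average can hide variance and error magnitude. The function name `summarize_errors` and both error lists are hypothetical.

```python
import statistics

def summarize_errors(errors):
    """Report average, spread, and worst case of absolute errors."""
    abs_errors = [abs(e) for e in errors]
    return {
        "mean_abs_error": statistics.fmean(abs_errors),
        "std_abs_error": statistics.pstdev(abs_errors),
        "worst_abs_error": max(abs_errors),
    }

# Hypothetical per-task errors for two respondents with the same average error.
steady = summarize_errors([1.0, -1.0, 1.0, -1.0])   # always slightly off
erratic = summarize_errors([0.0, 0.0, 0.0, 4.0])    # usually exact, once far off

print("steady: ", steady)
print("erratic:", erratic)
```

Both respondents have the same mean absolute error, but the second has a much larger spread and worst case. A leaderboard that reports only the mean would rank them as equal.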

Frontier Model Releases · Evaluation and Benchmarking · Flaws in the LLM Automation Narrative

5 · arXiv · cs.CL · 28d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking · AI Safety Research · From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation · Meta Llama-3.1-8B

5 · arXiv · cs.CL · 31h ago

Agentic LLM collectives proposed as interpretable substrates for Artificial Life research

A preprint from arXiv argues that populations of agentic LLMs — equipped with persistent memory, tools, and autonomous action — can serve as a computational substrate for Artificial Life (ALife) research. The key claim is that because agents communicate in natural language, their collective emergent behaviors are directly interpretable by examining textual traces or querying the agents themselves. The paper extends existing notions of LLM interpretability to multi-agent collectives and surveys recent examples of agentic LLM systems in both controlled and deployed settings. This positions multi-agent LLM systems as a novel lens for studying emergence and complexity while retaining interpretability.

AI Safety Research · Agent and Tool Ecosystem · Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates