5 · arXiv cs.CL (Computation and Language) · 46h ago

Paper argues LLMs learn causal structure via difference-making logic, not Pearl/Rubin frameworks

A new arXiv preprint proposes that LLMs learn causal structure through 'variational induction' — a difference-making logic — rather than through the dominant formalisms of Judea Pearl's interventionist approach or the Neyman-Rubin potential outcomes framework. The author analyzes how this logic is realized during training and maps specific architectural features (token embeddings, self-attention) to their roles in this inductive process. The argument draws a parallel between LLM causal learning and the experimental method of systematically varying circumstances. This is a theoretical contribution to understanding how LLMs represent causal and world-model structure.

Alignment and RLHF · Judea Pearl · self-attention · Words as Difference Makers: How Large Language Models Determine Causal Structure in Text

Related guides (1)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Related events (8)

4 · arXiv · cs.CL · 19d ago

Study compares human and LLM active causal reasoning, finding LLMs less efficient but near human-level on conjunctive rules

A new arXiv paper investigates whether active exploration reduces the 'conjunctive handicap' in causal learning, using a blicket detector task with adult participants who could freely intervene to identify causal objects. Results show active exploration substantially improves human conjunctive causal reasoning, though conjunctive rules still require more tests than disjunctive ones. State-of-the-art LLMs approach human-level hypothesis inference accuracy but show less efficient exploration strategies and similar conjunctive-disjunctive performance gaps, raising questions about the nature of LLM causal reasoning.

Evaluation and Benchmarking · Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

5 · arXiv · cs.CL · 15d ago

Causal evaluation framework for learnability of formal language tasks in LMs

A new arXiv preprint proposes a causal framework for evaluating how much task-specific data language models need to learn a given task. The authors use formal languages generated by probabilistic finite automata as a controlled testbed, introducing the 'binning semiring' algebraic object to control property frequency in training corpora. Experiments show that standard correlational evaluation practices produce incorrect learnability conclusions due to confounders, with implications for how natural-language task learning is studied.

Evaluation and Benchmarking · Kullback-Leibler divergence · Causally Evaluating the Learnability of Formal Language Tasks · binning semiring

6 · arXiv · cs.AI · 12d ago

Study finds shared pattern-matching mechanisms underlie both human and LLM everyday reasoning errors

A new arXiv paper evaluates human participants and 25 LLMs on commonsense causal reasoning tasks, finding similar error patterns in both groups. The authors identify specific attention heads driving LLM responses that implement pattern-matching, and show these heads can predict human reasoning errors caused by superficially irrelevant prompt details. The findings challenge the common assumption that human reasoning relies on principled abstract world models while LLMs merely pattern-match, suggesting both may share a more unified cognitive mechanism.

Evaluation and Benchmarking · AI Safety Research · Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

6 · arXiv · cs.CL · 1mo ago

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Population-Matching Experiment · Pragmatic Reasoning

3 · arXiv · cs.CL · 8d ago

Revisiting LLM systematicity in negation understanding via in-context learning

A new arXiv preprint analyzes how well large language models handle negation from two angles: behavioral systematicity (whether models correctly recognize negation expressions and scope) and representational systematicity (whether function vectors can be reliably constructed from in-context examples). Results show LLMs partially succeed at negation cue recognition via in-context learning but struggle with scope recognition, with performance varying by output format. Function vectors can be composed for cue extraction but are harder to extract for scope recognition tasks.

Evaluation and Benchmarking · Revisiting the Systematicity in Negation in the Era of In-Context Learning

5 · arXiv · cs.AI · 16d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

5 · arXiv · cs.CL · 23d ago

LLMs Show Inverted Compositional Strengths vs. Humans on Reference Resolution Task

This paper evaluates LLMs and humans on the Personal Relation Task (Paperno 2022), distinguishing between Extensional tasks (resolving what an expression refers to) and Intensional tasks (representing structured sense/formula). The study finds that humans outperform LLMs on Extensional tasks while LLMs outperform humans on Intensional tasks—an inverted pattern of strengths. The authors argue this asymmetry reflects the absence of referential grounding in LLM training as a key gap in human-like language understanding.

Evaluation and Benchmarking · Alignment and RLHF · large language models · referential grounding · compositional generalization

4 · arXiv · cs.CL · 5d ago

Mechanistic analysis of how LLMs encode essay quality in internal representations

Researchers systematically probe the hidden representations of eight LLMs across three essay datasets (ASAP++, CSEE, ENEM) to understand how automated essay scoring (AES) works internally. Using linear probing, dimensionality reduction, and neuron-level analysis, they find essay quality is encoded in a linearly accessible form that emerges progressively across layers and partially transfers across prompts. Individual 'essay scoring neurons' are identified whose activations correlate with scores and respond to targeted interventions, with longer essays relying more on deeper layers. The work contributes to mechanistic interpretability of LLM-based scoring systems.

Evaluation and Benchmarking · From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models · CSEE · ENEM