4 · arXiv cs.CL (Computation and Language) · 15d ago

Study compares human and LLM active causal reasoning, finding LLMs less efficient but near human-level on conjunctive rules

A new arXiv paper investigates whether active exploration reduces the 'conjunctive handicap' in causal learning, using a blicket detector task with adult participants who could freely intervene to identify causal objects. Results show active exploration substantially improves human conjunctive causal reasoning, though conjunctive rules still require more tests than disjunctive ones. State-of-the-art LLMs approach human-level hypothesis inference accuracy but show less efficient exploration strategies and similar conjunctive-disjunctive performance gaps, raising questions about the nature of LLM causal reasoning.
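The conjunctive/disjunctive distinction in a blicket detector task can be illustrated with a small sketch. This is not the paper's code: the function names, the object labels, and the assumption that a conjunctive rule needs at least two blickets on the detector (while a disjunctive rule needs at least one) are illustrative choices.

```python
from itertools import combinations

def detector_fires(placed, blickets, rule):
    """Return True if the detector activates for the placed objects.

    Assumption: 'disjunctive' fires with >= 1 blicket present,
    'conjunctive' fires only with >= 2 blickets present.
    """
    n_present = len(set(placed) & set(blickets))
    if rule == "disjunctive":
        return n_present >= 1
    if rule == "conjunctive":
        return n_present >= 2
    raise ValueError(f"unknown rule: {rule}")

def consistent_hypotheses(objects, observations):
    """Enumerate (blicket set, rule) pairs that match every observation.

    observations is a list of (placed_objects, detector_fired) pairs,
    i.e. the interventions a learner has tried so far.
    """
    hypotheses = []
    for rule in ("disjunctive", "conjunctive"):
        for k in range(len(objects) + 1):
            for blickets in combinations(objects, k):
                if all(detector_fires(placed, blickets, rule) == fired
                       for placed, fired in observations):
                    hypotheses.append((frozenset(blickets), rule))
    return hypotheses

# Hypothetical exploration: A alone and B alone do nothing, A+B fires.
observations = [({"A"}, False), ({"B"}, False), ({"A", "B"}, True)]
remaining = consistent_hypotheses(["A", "B", "C"], observations)
```

Under these assumptions, three tests already rule out every disjunctive hypothesis, but the status of object C stays open until a further test such as A+C is run, which is one way to see why conjunctive rules can demand more interventions.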

Evaluation and Benchmarking · Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.AI · 8d ago

Study finds shared pattern-matching mechanisms underlie both human and LLM everyday reasoning errors

A new arXiv paper evaluates human participants and 25 LLMs on commonsense causal reasoning tasks, finding similar error patterns in both groups. The authors identify specific attention heads driving LLM responses that implement pattern-matching, and show these heads can predict human reasoning errors caused by superficially irrelevant prompt details. The findings challenge the common assumption that human reasoning relies on principled abstract world models while LLMs merely pattern-match, suggesting both may share a more unified cognitive mechanism.

Evaluation and Benchmarking · AI Safety Research · Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

6 · arXiv · cs.CL · 1mo ago

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Population-Matching Experiment · Pragmatic Reasoning

5 · arXiv · cs.CL · 19d ago

LLMs Show Inverted Compositional Strengths vs. Humans on Reference Resolution Task

This paper evaluates LLMs and humans on the Personal Relation Task (Paperno 2022), distinguishing between Extensional tasks (resolving what an expression refers to) and Intensional tasks (representing structured sense/formula). The study finds that humans outperform LLMs on Extensional tasks while LLMs outperform humans on Intensional tasks—an inverted pattern of strengths. The authors argue this asymmetry reflects the absence of referential grounding in LLM training as a key gap in human-like language understanding.

Evaluation and Benchmarking · Alignment and RLHF · large language models · referential grounding · compositional generalization

5 · arXiv · cs.AI · 12d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

5 · arXiv · cs.CL · 11d ago

Causal evaluation framework for learnability of formal language tasks in LMs

A new arXiv preprint proposes a causal framework for evaluating how much task-specific data language models need to learn a given task. The authors use formal languages generated by probabilistic finite automata as a controlled testbed, introducing the 'binning semiring' algebraic object to control property frequency in training corpora. Experiments show that standard correlational evaluation practices produce incorrect learnability conclusions due to confounders, with implications for how natural-language task learning is studied.

Evaluation and Benchmarking · Kullback-Leibler divergence · Causally Evaluating the Learnability of Formal Language Tasks · binning semiring

6 · arXiv · cs.CL · 25d ago

CausaLab: Scalable Benchmark for Interactive Causal Discovery by LLM Agents

CausaLab is a new evaluation environment that tests LLM agents on interactive causal discovery tasks, requiring them to recover both causal graphs and structural equations from synthetic laboratory episodes governed by randomly sampled structural causal models (SCMs). The benchmark separates predictive accuracy from genuine causal understanding, revealing a persistent gap: GPT-5.2-high achieves 92% task accuracy in a 6-node observational setting but only 0.471 all-edge F1 for mechanism recovery. Mixed observation-intervention strategies improve structural fidelity, while pure intervention strategies underperform on both metrics. Premature stopping is identified as a key agent weakness, partially mitigated by prompting models to verify hypothesis-data consistency.
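To make the gap between task accuracy and mechanism recovery concrete, here is a minimal sketch of an edge-level F1 score over directed causal graphs. The summary does not spell out how CausaLab computes "all-edge F1", so this uses the standard precision/recall definition over directed edges as an assumption; the function name and the example graphs are hypothetical.

```python
def edge_f1(predicted_edges, true_edges):
    """F1 over directed edges, each given as a (cause, effect) tuple.

    Assumption: an edge counts as correct only if both endpoints and
    the direction match; this may differ from the benchmark's metric.
    """
    predicted, true = set(predicted_edges), set(true_edges)
    if not predicted and not true:
        return 1.0  # both graphs empty: treat as perfect agreement
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)

# Hypothetical 3-node example: one correct edge, one reversed, one missed.
true_graph = {("X", "Y"), ("Y", "Z"), ("X", "Z")}
predicted_graph = {("X", "Y"), ("Z", "Y")}
score = edge_f1(predicted_graph, true_graph)
```

In this example a reversed edge earns no credit, so an agent can predict outcomes well and still score poorly on structure, which is the kind of gap the benchmark reports.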

Evaluation and Benchmarking · AI Safety Research · all-edge F1 · GPT-5.2-high · causal discovery

5 · arXiv · cs.CL · 4d ago

Systematic study of extrinsic and intrinsic properties for effective code interpreter reasoning in LLMs

Researchers investigate what behavioral properties make LLMs effective at reasoning with a Code Interpreter (CI), identifying two axes: extrinsic 'crucial tokens' and intrinsic 'cognitive behaviors' such as verification, backtracking, and backward chaining. Stronger CI reasoning models consistently exhibit higher prevalence of these properties. The paper shows that appending code-specific crucial tokens at inference time improves performance on mathematical, ordering, and optimization tasks, while augmenting training with cognitive behaviors improves SFT and RL performance in two of three evaluated models. The work also finds these behaviors reduce overthinking in incorrect responses and improve token efficiency.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

5 · arXiv · cs.CL · 9d ago

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking · Alignment and RLHF · On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study