6 · arXiv cs.CL (Computation and Language) · 34h ago

Study finds LLM-generated research ideas cluster around synthesis and bridging, diverging from human distribution

A new arXiv paper introduces a large-scale evaluation framework for comparing LLM-generated research ideas against human-authored ones, using reverse-engineered prior-work sets as prompts. The authors develop a two-axis taxonomy of research taste (opportunity pattern and research paradigm) and find a consistent distributional gap: LLMs over-index on bridge-like opportunities and synthesis methods, while human researchers spread more broadly across framing and contribution types. The result suggests current LLMs produce reasonable but systematically narrower and shifted ideation relative to human researchers.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Measuring the Gap Between Human and LLM Research Ideas

Related guides (2)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

6 · arXiv · cs.AI · 23d ago

Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks

A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.

Frontier Model Releases · Evaluation and Benchmarking · Flaws in the LLM Automation Narrative

5 · arXiv · cs.AI · 17d ago

LLM vs. first-year PhD student on EconCS research: workflow study using stable menus of public goods

A preprint uses an open problem from EC 2025 as a testbed to evaluate AI-assisted research workflows in economics and computer science. The study examines how human intuition in prompts, multi-turn interaction, and LLM capability affect the quality of the resulting work, and compares it with a first-year PhD student's contributions on the same problem. Key findings: human intuition in prompts improves LLM 'taste', multi-turn workflows help when they encourage ambitious steps, and the LLM performs slightly below the first-year PhD student. The work contributes empirical evidence on the practical utility and limits of LLMs as research collaborators in formal theory domains.

Evaluation and Benchmarking · Stable Menus of Public Goods · EC 2025

6 · arXiv · cs.AI · 21d ago

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

An arXiv preprint reports that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for human reanalysts). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, reducing the resource burden of traditional reproducibility efforts.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Automated reproducibility assessments in the social and behavioral sciences using large language models

5 · arXiv · cs.CL · 1mo ago

LLMs Show Inverted Compositional Strengths vs. Humans on Reference Resolution Task

This paper evaluates LLMs and humans on the Personal Relation Task (Paperno 2022), distinguishing between Extensional tasks (resolving what an expression refers to) and Intensional tasks (representing structured sense/formula). The study finds that humans outperform LLMs on Extensional tasks while LLMs outperform humans on Intensional tasks, an inverted pattern of strengths. The authors argue that this asymmetry reflects the absence of referential grounding in LLM training, which they identify as a key gap in human-like language understanding.

Evaluation and Benchmarking · Alignment and RLHF · large language models · referential grounding · compositional generalization

5 · arXiv · cs.AI · 25d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

4 · arXiv · cs.CL · 9d ago

Cross-lingual evaluation framework reveals LLMs redistribute cultural narrative structure while preserving semantic meaning

A new arXiv preprint introduces a multilingual evaluation framework using 414 proverbs across 15 languages to assess whether LLMs preserve culturally grounded meaning when generating narratives. Using four LLMs to produce 13k narratives, the study finds that cross-lingual prompting preserves proverb-level semantic meaning but systematically redistributes agency, social positioning, and narrative structure. Strong inter-model convergence across architectures suggests multilingual LLMs rely on shared semantic abstractions. The authors argue that semantic similarity metrics alone overestimate cultural preservation in multilingual evaluations.

Evaluation and Benchmarking · Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

6 · arXiv · cs.CL · 4d ago

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

Evaluation and Benchmarking · Alignment and RLHF · MuSiQue · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA

4 · arXiv · cs.CL · 34h ago

Survey chapter on LLM mechanisms, emergent capabilities, and cognition debates

A new arXiv preprint surveys current understanding of large language models, covering the Transformer architecture, emergent capabilities resembling human cognition (symbolic reasoning, theory of mind, deception), and explainability approaches from neuron activation analysis to circuit tracing. The chapter also engages the debate over whether LLMs genuinely understand or merely pattern-match, arguing against reductive anti-anthropomorphism while acknowledging human-LLM differences. It is framed as a book chapter synthesizing recent empirical findings and theoretical positions.

Evaluation and Benchmarking · AI Safety Research · Understanding Large Language Models