7 · arXiv · cs.LG (Machine Learning) · 25d ago

DiscoverPhysics: Interactive Benchmark for LLM Scientific Discovery in Novel Physics Worlds

DiscoverPhysics is a new interactive benchmark that tests LLM agents on their ability to discover laws of motion in 22 simulated worlds with deliberately non-standard physics, including screened gravity, fractional-power interactions, and hidden dark-matter-like particles. Agents must propose experiments, observe N-body trajectory data, and submit both natural-language explanations and Python implementations of inferred laws. Evaluation across eleven frontier models shows the best agents pass only half the worlds, with consistent failures on latent-structure problems and a substantial gap between open-source and commercial models. The benchmark reveals that predictive accuracy and conceptual understanding are dissociable, and that genuine hypothesis refinement through well-designed experiments is required for high explanation scores.

Frontier Model Releases · Evaluation and Benchmarking · Agent and Tool Ecosystem · LLM-judged explanation score · N-body simulator · trajectory MSE · DiscoverPhysics
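As a rough illustration of the trajectory-MSE metric named in the DiscoverPhysics entry, the sketch below scores two candidate force laws against trajectories from a toy world. Everything in it is an assumption made for illustration: the summary does not give the benchmark's force laws, simulator, or scoring code, so the Yukawa form of screened gravity, the 1D setup, the integrator, and all function names are hypothetical.

```python
import math


def screened_force(r, G=1.0, m1=1.0, m2=1.0, lam=2.0):
    """Attractive force magnitude from an assumed Yukawa potential
    V(r) = -G*m1*m2*exp(-r/lam)/r (one possible reading of 'screened gravity')."""
    return G * m1 * m2 * math.exp(-r / lam) * (1.0 + r / lam) / r**2


def newton_force(r, G=1.0, m1=1.0, m2=1.0):
    """Standard inverse-square attraction, used here as the wrong hypothesis."""
    return G * m1 * m2 / r**2


def simulate(force, x0=1.0, v0=0.0, dt=0.01, steps=50):
    """Unit-mass test particle on the positive x-axis, attracted to a fixed
    unit mass at the origin. Uses semi-implicit Euler integration."""
    x, v, trajectory = x0, v0, []
    for _ in range(steps):
        v -= force(x) * dt  # force points toward the origin
        x += v * dt
        trajectory.append(x)
    return trajectory


def trajectory_mse(predicted, observed):
    """Mean squared error between two equal-length position sequences."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


# "Observed" data comes from the hidden (screened) law of the toy world.
observed = simulate(screened_force)
mse_screened = trajectory_mse(simulate(screened_force), observed)
mse_newton = trajectory_mse(simulate(newton_force), observed)
print(f"screened hypothesis MSE: {mse_screened:.3e}")
print(f"Newtonian hypothesis MSE: {mse_newton:.3e}")
```

In this toy setup the correct law reproduces the data exactly while the Newtonian guess leaves a small but nonzero error, which hints at why a predictive score alone may not separate near-miss hypotheses from the true law.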

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.CL · 25d ago

CausaLab: Scalable Benchmark for Interactive Causal Discovery by LLM Agents

CausaLab is a new evaluation environment that tests LLM agents on interactive causal discovery tasks, requiring them to recover both causal graphs and structural equations from synthetic laboratory episodes governed by randomly sampled structural causal models (SCMs). The benchmark separates predictive accuracy from genuine causal understanding, revealing a persistent gap: GPT-5.2-high achieves 92% task accuracy in a 6-node observational setting but only 0.471 all-edge F1 for mechanism recovery. Mixed observation-intervention strategies improve structural fidelity, while pure intervention strategies underperform on both metrics. Premature stopping is identified as a key agent weakness, partially mitigated by prompting models to verify hypothesis-data consistency.

Evaluation and Benchmarking · AI Safety Research · all-edge F1 · GPT-5.2-high · causal discovery
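To make the mechanism-recovery metric in the CausaLab entry concrete, here is a minimal sketch of an F1 score over directed edges. It assumes the conventional precision/recall definition over edge sets; the summary does not spell out how the paper defines "all-edge F1", and the function name and example graphs are hypothetical.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges, where each edge is a (cause, effect) tuple.
    A reversed edge counts as both a false positive and a false negative."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 3-node example: one edge recovered, one reversed, one missed.
true_graph = [("A", "B"), ("B", "C"), ("A", "C")]
predicted_graph = [("A", "B"), ("C", "B")]
score = edge_f1(true_graph, predicted_graph)
print(round(score, 3))
```

An agent can predict outcomes well from a graph like `predicted_graph` and still score low here, which is the kind of gap between task accuracy and structural fidelity the entry describes.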

6 · arXiv · cs.CL · 8d ago

EurekAgent: Environment Engineering as the Key Bottleneck for Autonomous Scientific Discovery

EurekAgent is a new LLM-based agent system that reframes autonomous scientific discovery around 'environment engineering' — designing the resources, constraints, and interfaces that shape agent behavior — rather than prescribing agent workflows. The system engineers four dimensions: permissions, artifact management (filesystem/Git), budget awareness, and human-in-the-loop oversight. It achieves state-of-the-art results on mathematics, kernel engineering, and ML tasks, including new 26-circle packing results at under $11 in API cost, and is fully open-sourced.

Evaluation and Benchmarking · Agent and Tool Ecosystem · EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery · EurekAgent

5 · arXiv · cs.AI · 12d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

4 · arXiv · cs.CL · 8d ago

SupraBench: First benchmark for evaluating LLMs on supramolecular chemistry reasoning

Researchers introduce SupraBench, the first benchmark designed to systematically evaluate LLMs on supramolecular chemistry tasks including binding affinity prediction, top-binder selection, solvent identification, and host-guest description. The work also releases SupraPMC, a 16M-token corpus of supramolecular chemistry articles from Europe PMC to support domain adaptation. Evaluation of a broad range of open and proprietary LLMs reveals substantial headroom across all tasks, with domain pretraining improving in-distribution regression but creating format compliance tradeoffs. The benchmark targets a narrow but practically important scientific domain where LLM acceleration could reduce days-long dry-lab verification cycles.

Evaluation and Benchmarking · SupraPMC · Europe PMC · SupraBench

5 · arXiv · cs.CL · 19d ago

PowerCodeBench: Knowledge Boundary Probing and Intervention for LLM-Based Power System Code Generation

This paper introduces PowerCodeBench, an execution-validated benchmark for evaluating LLMs on power-system simulation code generation using the pandapower library. The authors identify that failures are dominated by API-knowledge boundary errors (hallucinated function names, misused parameters) rather than reasoning failures, and propose a boundary-aware intervention combining API demand estimation with targeted documentation injection. Evaluated across ten open-weight models (1.5B–480B) and four commercial APIs on 2,000 tasks, the intervention yields 32–56 accuracy point improvements while using only 41% of baseline prompt-token cost. Open-weight models in the 70B–120B range match commercial mid-tier accuracy, with Llama-3.1-405B and Qwen3-Coder-480B leading.

Evaluation and Benchmarking · Open Weights Progress · pandapower · Meta Llama 3.1 405B · Alibaba
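The "hallucinated function names" failure mode in the PowerCodeBench entry can be illustrated with a small static check that flags attribute names a module does not actually export. This is only a sketch and not the paper's intervention: the helper `unknown_attributes` is hypothetical, and the standard-library `math` module stands in for pandapower so the example runs without extra dependencies.

```python
import ast
import importlib


def unknown_attributes(source, module_name, alias):
    """Return attribute names accessed as `alias.<name>` in `source`
    that the imported module does not provide."""
    module = importlib.import_module(module_name)
    missing = set()
    for node in ast.walk(ast.parse(source)):
        is_alias_access = (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == alias
        )
        if is_alias_access and not hasattr(module, node.attr):
            missing.add(node.attr)
    return sorted(missing)


# Hypothetical model output: `sqrt` exists in math, `cube_root` does not.
generated = "import math as m\nprint(m.sqrt(2) + m.cube_root(8))\n"
flagged = unknown_attributes(generated, "math", "m")
print(flagged)
```

A check like this only catches names that do not exist; misused parameters, the other error class the entry mentions, would still need execution-based validation.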

6 · arXiv · cs.CL · 11d ago

SpatialWorld benchmark evaluates interactive spatial reasoning of multimodal agents in real-world tasks

Researchers introduce SpatialWorld, a benchmark for evaluating interactive spatial understanding of multimodal agents across 760 human-annotated tasks spanning household, travel, and social domains. The benchmark integrates eight simulation backends under a shared protocol, requiring agents to operate under vision-only partial observability with egocentric inputs. Evaluation of 15 agents reveals that even the strongest model, GPT-5, achieves only 17.4% task success rate, exposing significant gaps in active exploration and long-horizon planning. The work highlights a mismatch between task success and execution efficiency as a key bottleneck for spatial agents.

Evaluation and Benchmarking · Agent and Tool Ecosystem · SpatialWorld · OpenAI · Qwen 3.5

7 · arXiv · cs.AI · 16d ago

AutoLab benchmark evaluates frontier models on ultra long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks across system optimization, puzzle-solving, model development, and CUDA kernel optimization, designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than single-turn or short-horizon settings. Evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation — not initial attempt quality — is the dominant success predictor. Claude Opus 4.6 stands out as the strongest performer, while most models including proprietary ones either terminate early or exhaust budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.

Frontier Model Releases · Evaluation and Benchmarking · Claude Opus 4.6 · AutoLab · Anthropic

8 · arXiv · cs.AI · 29d ago

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Erdős Problems · OEIS Conjectures