8 · arXiv cs.AI (Artificial Intelligence) · 29d ago

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

Frontier Model Releases Evaluation and Benchmarking Agent and Tool Ecosystem large language models Erdős Problems OEIS Conjectures Lean Formal Proof Search
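
The "simple generate-and-verify loop" that the summary above contrasts with more sophisticated agent designs can be pictured roughly as follows. This is a minimal sketch, not the paper's system: the function name `generate_and_verify`, the `budget` parameter, and the toy divisibility check standing in for a Lean proof checker are all illustrative assumptions.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def generate_and_verify(
    candidates: Iterable[T],
    verify: Callable[[T], bool],
    budget: int,
) -> Optional[T]:
    """Return the first candidate the verifier accepts, or None.

    `candidates` stands in for a model proposing proof attempts and
    `verify` stands in for a formal checker (hypothetical names).
    At most `budget` candidates are checked.
    """
    for attempt, candidate in enumerate(candidates):
        if attempt >= budget:
            break  # budget exhausted without an accepted candidate
        if verify(candidate):
            return candidate
    return None


# Toy stand-in: "prove" that 91 is composite by finding a divisor.
proposals = [2, 3, 7, 13]
found = generate_and_verify(proposals, lambda d: 91 % d == 0, budget=10)
```

The loop keeps no memory between attempts: a rejected candidate tells the generator nothing. That is one plausible reason richer designs could do better on hard problems, although the summary does not say what those designs are.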

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

7 · OpenAI Blog · 1mo ago

OpenAI Neural Theorem Prover Solves Formal Math Olympiad Problems in Lean

OpenAI developed a neural theorem prover integrated with the Lean proof assistant that can solve challenging high-school olympiad problems, including problems from AMC12, AIME, and two IMO-adapted problems. The system demonstrates automated formal mathematical reasoning at a level previously requiring human expertise. This represents a significant capability milestone in AI-assisted formal verification and mathematical problem-solving.

Frontier Model Releases Evaluation and Benchmarking AIME Neural Theorem Prover OpenAI +3 more

7 · Mistral AI News · 1mo ago

Mistral Releases Leanstral: First Open-Source Code Agent for Lean 4 Formal Verification

Mistral AI has released Leanstral, an open-source code agent built on a sparse architecture with 120B total and 6B active parameters, designed specifically for formal proof engineering in Lean 4. The model targets realistic proof engineering workflows rather than isolated math competition problems, and is benchmarked on FLTEval, a new evaluation suite tied to the Fermat's Last Theorem formalization project. Leanstral is released under Apache 2.0 with a free API endpoint and MCP support, and is reported to perform competitively with Claude Sonnet 4.6 at roughly 1/15th the cost. The release positions formal verification as a scalable alternative to human code review for high-stakes software and mathematics.

Evaluation and Benchmarking Open Weights Progress Mistral AI Claude Sonnet 4 Claude Opus 4.6 +11 more

5 · Hugging Face Blog · 1mo ago

Fixing Open LLM Leaderboard with Math-Verify

Hugging Face introduces Math-Verify, a tool designed to address evaluation reliability issues in the Open LLM Leaderboard by improving mathematical answer verification. The post describes problems with existing string-matching approaches that lead to inaccurate benchmark scores for math tasks. Math-Verify aims to provide more robust symbolic and numerical answer checking to reduce false positives and negatives in leaderboard evaluations.

Evaluation and Benchmarking Open LLM Leaderboard Hugging Face Math-Verify
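
To see why the string-matching approach described in the Math-Verify summary above can misgrade answers, consider this small sketch. It does not use Math-Verify or reproduce its API. The helper names `string_match` and `numeric_match` are hypothetical, and the standard-library `Fraction` type is only a stand-in for real symbolic and numerical checking.

```python
from fractions import Fraction


def string_match(pred: str, gold: str) -> bool:
    """Naive grader: answers must be textually identical."""
    return pred.strip() == gold.strip()


def numeric_match(pred: str, gold: str) -> bool:
    """Illustrative grader: compare as exact rationals when both parse.

    Falls back to string comparison for anything Fraction cannot
    parse (for example symbolic expressions), so it is far weaker
    than a real symbolic checker.
    """
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return string_match(pred, gold)


# "0.5" and "1/2" name the same number but are different strings.
false_negative = not string_match("0.5", "1/2")
fixed = numeric_match("0.5", "1/2")
```

Here the naive grader marks a correct answer wrong (a false negative), while the value-based comparison accepts it and still rejects `0.3` against `1/3`.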

8 · arXiv · cs.AI · 15d ago

Goedel-Architect achieves state-of-the-art formal theorem proving with blueprint-based agentic framework

Goedel-Architect is an agentic framework for formal theorem proving in Lean 4 that uses blueprint generation — a dependency graph of definitions and lemmas — rather than recursive decomposition, enabling parallel lemma closure and global refinement. Built on DeepSeek-V4-Flash (284B-A13B), it achieves 99.2% pass@1 on MiniF2F-test and 75.6% on PutnamBench, scaling to 100% on MiniF2F, 88.8% on PutnamBench, and 4/6 on IMO 2025 when seeded with natural-language proofs. The authors claim state-of-the-art performance for an open-source pipeline at up to 500x lower cost than comparable systems.

Frontier Model Releases Evaluation and Benchmarking MiniF2F DeepSeek-V4-Flash Goedel-Architect +3 more
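
The Goedel-Architect summary above describes a blueprint as a dependency graph of definitions and lemmas that enables parallel lemma closure. One way to picture this is to group the graph into batches whose members have no unresolved dependencies. This is a sketch under stated assumptions: the node names, the `parallel_batches` helper, and the use of Python's `graphlib` are illustrative, not taken from the paper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group blueprint nodes into batches that could be closed in parallel.

    `deps` maps each node to the set of nodes it depends on. Every
    node in a batch has all of its dependencies in earlier batches.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical blueprint: two lemmas share one definition.
blueprint = {
    "main_theorem": {"lemma_a", "lemma_b"},
    "lemma_a": {"def_x"},
    "lemma_b": {"def_x"},
    "def_x": set(),
}
batches = parallel_batches(blueprint)
```

In this toy blueprint, `lemma_a` and `lemma_b` land in the same batch, so independent provers could attempt them at the same time once `def_x` is in place.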

5 · arXiv · cs.AI · 12d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking AI Safety Research How reliable are LLMs when it comes to playing dice?

6 · Hugging Face Blog · 1mo ago

Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models

Kimina-Prover is a new large formal reasoning model that combines reinforcement learning with test-time search to improve mathematical theorem proving. The approach applies RL-trained search strategies at inference time, targeting formal proof generation in systems like Lean. The work is published via the AI-MO (AI for Math Olympiad) team on Hugging Face, continuing the trend of applying RL and extended compute at test time to hard reasoning tasks.

Frontier Model Releases Evaluation and Benchmarking Kimina-Prover-RL Hugging Face AI-MO +4 more

7 · arXiv · cs.CL · 17d ago

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen2.5-7B GRPO Qwen3-4B +7 more

6 · arXiv · cs.LG · 2d ago

Diffusion-Proof: First framework applying diffusion LLMs to formal theorem proving

Researchers introduce Diffusion-Proof, the first framework to train and apply diffusion language models (dLLMs) for formal theorem proving, addressing limitations of autoregressive models in long-range coherence. The framework includes dLLM-Prover-7B for whole-proof generation and dLLM-Corrector-7B for local proof correction via bidirectional infilling. Diffusion-Proof achieves absolute improvements of 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test over an AR baseline, and solves one IMO problem that DeepSeek-Prover-V2-7B could not. The result suggests dLLMs may have structural advantages over AR models for tasks requiring long-range logical coherence.

Frontier Model Releases Evaluation and Benchmarking dLLM-Prover-7B Diffusion-Proof DeepSeek-Prover-V2-7B +3 more