7 · OpenAI Blog · 1mo ago

OpenAI Neural Theorem Prover Solves Formal Math Olympiad Problems in Lean

OpenAI developed a neural theorem prover integrated with the Lean proof assistant that can solve challenging high-school olympiad problems, including problems from AMC12, AIME, and two IMO-adapted problems. The system demonstrates automated formal mathematical reasoning at a level previously requiring human expertise. This represents a significant capability milestone in AI-assisted formal verification and mathematical problem-solving.

Frontier Model Releases · Evaluation and Benchmarking · AIME · Neural Theorem Prover · OpenAI · IMO · Lean · AMC12

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · OpenAI Blog · 1mo ago

OpenAI Shares First Proof Math Challenge Submissions

OpenAI has published its AI model's proof attempts for the First Proof math challenge, a competition designed to test research-grade mathematical reasoning on expert-level problems. The submission is a capability demonstration of OpenAI's models on mathematical proof generation. It signals continued progress in AI mathematical reasoning toward the level of professional research mathematics.

Frontier Model Releases · Evaluation and Benchmarking · First Proof · OpenAI

5 · arXiv · cs.AI · 1mo ago

AI-Assisted Theorem Proving in Lean 4: Aristotle API Case Study on IMO 2009 Problem 6

This paper presents a case study of using the Aristotle API for AI-assisted formal theorem proving in Lean 4, targeting the Grasshopper problem (IMO 2009 Problem 6). The generated artifact verifies four helper lemmas but leaves the main theorem unresolved via a 'sorry' placeholder, exposing a key limitation: local proof search can succeed while global combinatorial bookkeeping remains unsolved. The study provides a reproducible Lean artifact and precise analysis distinguishing verified from unverified proof content, offering a concrete benchmark for evaluating AI formalization capabilities.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AI-assisted theorem proving · Grasshopper Problem (IMO 2009 P6) · Aristotle API · +1 more
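
The split described in this summary, verified helper lemmas next to a main theorem closed only by `sorry`, looks roughly like the following in Lean 4. The statements are toy placeholders chosen for illustration, and `helper_add_zero` and `main_goal` are hypothetical names; none of this is taken from the paper's Grasshopper artifact.

```lean
-- Helper lemma: the proof is complete, so the kernel checks it fully.
theorem helper_add_zero (n : Nat) : n + 0 = n := by
  rfl

-- Main theorem: the statement is type-checked, but the proof is only a
-- placeholder. Lean accepts the declaration and flags it as using `sorry`.
theorem main_goal (a b : Nat) : a + b = b + a := by
  sorry
```

A file like this compiles, but Lean warns that `main_goal` uses `sorry`. That warning is what separates verified from unverified content: the helper counts as proved, while the main claim does not.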

8 · arXiv · cs.AI · 29d ago

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems (about 2.5%) and proved 44 of 492 OEIS conjectures (about 8.9%), at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Erdős Problems · OEIS Conjectures · +3 more
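
For readers unfamiliar with the baseline mentioned in this summary, a simple generate-and-verify loop can be sketched as below. This is a minimal illustration, not the study's agent: `propose` stands in for an LLM call, `verify` stands in for a Lean check, and both toy implementations are hypothetical.

```python
from typing import Callable, Optional


def generate_and_verify(
    statement: str,
    propose: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 8,
) -> Optional[str]:
    """Return the first candidate proof the verifier accepts, else None."""
    for attempt in range(max_attempts):
        candidate = propose(statement, attempt)  # e.g. sample from a model
        if verify(statement, candidate):         # e.g. compile with a proof checker
            return candidate
    return None


# Toy stand-ins (assumptions for illustration only).
def toy_propose(statement: str, attempt: int) -> str:
    return f"candidate-{attempt}"


def toy_verify(statement: str, candidate: str) -> bool:
    # Pretend that only the third candidate passes the checker.
    return candidate == "candidate-2"


found = generate_and_verify("a + b = b + a", toy_propose, toy_verify)
missed = generate_and_verify("a + b = b + a", toy_propose, toy_verify, max_attempts=2)
print(found, missed)
```

Each attempt in this loop is independent, so nothing learned from a failed candidate carries over to the next one. That is one plausible reason the more sophisticated designs in the study pull ahead on the hardest problems.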

5 · OpenAI Blog · 1mo ago

OpenAI Trains System Solving Grade School Math Problems at ~55% Accuracy

OpenAI released a system for solving grade school math word problems that achieves roughly twice the accuracy of a fine-tuned GPT-3 model. The system scored 55% on a sample test where 9-12 year olds scored 60%, suggesting near-human performance on elementary math. This work represents an early milestone in neural network mathematical reasoning capabilities.

Frontier Model Releases · Evaluation and Benchmarking · GPT-3 · OpenAI · GSM8K

5 · OpenAI Blog · 1mo ago

Generative Language Modeling for Automated Theorem Proving

OpenAI published research on applying generative language models to automated theorem proving, an early exploration of using neural language models to assist formal mathematical reasoning. The work, which introduced the GPT-f prover, investigates how language models can generate proof steps or complete proofs in formal systems. This represents an early milestone in AI-assisted mathematical reasoning, predating subsequent theorem-proving systems.

Frontier Model Releases · Evaluation and Benchmarking · automated theorem proving · generative language modeling · GPT-f · +1 more

5 · Hugging Face Blog · 1mo ago

Kimina-Prover-RL: Reinforcement Learning for Formal Mathematical Proving

Hugging Face blog post introduces Kimina-Prover-RL, a model trained with reinforcement learning targeting formal mathematical theorem proving. The post appears to describe a system from the AI-MO (AI for Math Olympiad) initiative. This represents a development in applying RL to formal proof generation, a competitive area involving Lean/Mathlib-style verification environments.

Evaluation and Benchmarking · AI Safety Research · Kimina-Prover-RL · Hugging Face · AI-MO · +1 more

6 · Hugging Face Blog · 1mo ago

How NuminaMath Won the 1st AIMO Progress Prize

NuminaMath won the first AI Mathematical Olympiad (AIMO) Progress Prize, a competition focused on advancing AI capabilities in mathematical reasoning. The blog post details the technical approach and methodology used by the winning team. This represents a notable milestone in AI mathematical problem-solving, a domain considered a key frontier for reasoning capabilities.

Frontier Model Releases · Evaluation and Benchmarking · AI Mathematical Olympiad · NuminaMath · Hugging Face · +1 more

8 · arXiv · cs.AI · 15d ago

Goedel-Architect achieves state-of-the-art formal theorem proving with blueprint-based agentic framework

Goedel-Architect is an agentic framework for formal theorem proving in Lean 4 that uses blueprint generation — a dependency graph of definitions and lemmas — rather than recursive decomposition, enabling parallel lemma closure and global refinement. Built on DeepSeek-V4-Flash (284B-A13B), it achieves 99.2% pass@1 on MiniF2F-test and 75.6% on PutnamBench, scaling to 100% on MiniF2F, 88.8% on PutnamBench, and 4/6 on IMO 2025 when seeded with natural-language proofs. The authors claim state-of-the-art performance for an open-source pipeline at up to 500x lower cost than comparable systems.

Frontier Model Releases · Evaluation and Benchmarking · MiniF2F · DeepSeek-V4-Flash · Goedel-Architect · +3 more
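
To make the blueprint idea in this summary concrete, the sketch below treats a blueprint as a dependency graph and groups its nodes into batches that could be worked on in parallel. It is an assumption-laden illustration, not Goedel-Architect's scheduler: the node names are hypothetical, and the schedule is deliberately conservative, attempting a lemma only after its dependencies are closed. A real system might instead attempt all nodes at once by taking dependency statements as given.

```python
from graphlib import TopologicalSorter


def parallel_batches(blueprint: dict[str, set[str]]) -> list[list[str]]:
    """Group blueprint nodes into batches whose dependencies are already closed.

    `blueprint` maps each definition or lemma to the set of nodes it depends on.
    """
    sorter = TopologicalSorter(blueprint)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes with no open dependencies
        batches.append(ready)               # these could be attempted in parallel
        sorter.done(*ready)                 # mark them closed to unlock dependents
    return batches


# Hypothetical blueprint: two lemmas share a definition and feed one theorem.
blueprint = {
    "def_A": set(),
    "lemma_1": {"def_A"},
    "lemma_2": {"def_A"},
    "main_theorem": {"lemma_1", "lemma_2"},
}
batches = parallel_batches(blueprint)
print(batches)
```

Here `lemma_1` and `lemma_2` land in the same batch because neither depends on the other. A purely recursive decomposition would not expose that independence as directly.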