5·OpenAI Blog·1mo ago

OpenAI Trains System Solving Grade School Math Problems at ~55% Accuracy

OpenAI released a system for solving grade school math word problems that achieves roughly twice the accuracy of a fine-tuned GPT-3 model. The system scored 55% on a sample test on which 9-to-12-year-olds scored 60%, placing it close to that group's performance on elementary math. The work is an early milestone in the mathematical reasoning capabilities of neural networks.

Frontier Model Releases · Evaluation and Benchmarking · GPT-3 · OpenAI · GSM8K

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Frontier Model Releases·Topic guide

Frontier Model Releases: The Race From Language to Action

Evaluation and Benchmarking·Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

7·OpenAI Blog·1mo ago

OpenAI Neural Theorem Prover Solves Formal Math Olympiad Problems in Lean

OpenAI developed a neural theorem prover integrated with the Lean proof assistant that can solve challenging high-school olympiad problems, including problems from AMC12, AIME, and two IMO-adapted problems. The system demonstrates automated formal mathematical reasoning at a level previously requiring human expertise. This represents a significant capability milestone in AI-assisted formal verification and mathematical problem-solving.

Frontier Model Releases · Evaluation and Benchmarking · AIME · Neural Theorem Prover · OpenAI · +3 more

8·OpenAI Blog·1mo ago

Advancing science and math with GPT-5.2

OpenAI has released GPT-5.2, described as its strongest model for mathematics and science, achieving state-of-the-art results on GPQA Diamond and FrontierMath benchmarks. The announcement highlights practical research applications including solving an open theoretical problem and generating verified mathematical proofs. The post positions GPT-5.2 as a meaningful step toward AI-assisted scientific discovery.

Frontier Model Releases · Evaluation and Benchmarking · GPT-5.2 · FrontierMath · GPQA Diamond · +2 more

6·OpenAI Blog·1mo ago

OpenAI Shares First Proof Math Challenge Submissions

OpenAI has published its AI model's proof attempts for the First Proof math challenge, a competition designed to test research-grade mathematical reasoning on expert-level problems. The submissions demonstrate the capability of OpenAI's models in mathematical proof generation and signal continued progress toward reasoning at the level of professional research mathematics.

Frontier Model Releases · Evaluation and Benchmarking · First Proof · OpenAI

7·OpenAI Blog·1mo ago

Improving Mathematical Reasoning with Process Supervision

OpenAI trained a model achieving state-of-the-art mathematical problem solving by rewarding each correct reasoning step (process supervision) rather than only the final answer (outcome supervision). This approach improves performance on math benchmarks and carries an alignment benefit by training models to produce human-endorsed chain-of-thought reasoning. The work highlights a potential synergy between capability improvements and alignment techniques.

Frontier Model Releases · Evaluation and Benchmarking · process supervision · outcome supervision · Chain-of-Thought Reasoning · +3 more
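
The difference between the two reward schemes in the summary above can be illustrated with a toy sketch. This is not OpenAI's implementation: the function names, the boolean step labels, and the choice to score a whole solution as the product of its per-step rewards are all assumptions made for illustration.

```python
from typing import List


def outcome_reward(final_answer_correct: bool) -> float:
    """Outcome supervision (toy version): one reward for the final answer only."""
    return 1.0 if final_answer_correct else 0.0


def process_rewards(step_labels: List[bool]) -> List[float]:
    """Process supervision (toy version): one reward per reasoning step.

    step_labels is a hypothetical list of human judgments,
    True where a step was endorsed as correct.
    """
    return [1.0 if ok else 0.0 for ok in step_labels]


def solution_score(step_labels: List[bool]) -> float:
    """Assumed aggregation: multiply per-step rewards, so one bad step zeroes the score."""
    score = 1.0
    for reward in process_rewards(step_labels):
        score *= reward
    return score


# A solution that reaches the right answer through a flawed middle step.
labels = [True, False, True]
print(outcome_reward(final_answer_correct=True))
print(process_rewards(labels))
print(solution_score(labels))
```

In this sketch the outcome reward cannot tell a sound solution from one that reaches the right answer by a faulty step, while the per-step rewards flag exactly where the reasoning went wrong.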

5·arXiv · cs.AI·16d ago

GASING pedagogy-guided CoT training enables strong arithmetic reasoning in 86M-parameter GPT-2 model

Researchers train a small 86M-parameter GPT-2 decoder from scratch using Chain-of-Thought supervision derived from GASING, an Indonesian left-to-right arithmetic pedagogy, without any reinforcement learning. The model achieves over 80% accuracy on held-out arithmetic problems and competes with substantially larger models. Mechanistic analyses reveal two emergent capabilities: an explicit procedural pathway and a subsequent associative 'mental arithmetic' capacity that bypasses step-by-step computation. The work suggests that pedagogically structured training data can yield efficient arithmetic capability at small scale.

Evaluation and Benchmarking · Alignment and RLHF · GASING · TOBA tokenizer · GPT-2 · +1 more

8·Latent Space·1mo ago

OpenAI GPT-next Solves 80-Year-Old Erdős Planar Unit Distance Problem for Under $1000

A Latent Space AINews digest reports that OpenAI's GPT-next model disproved the Erdős planar unit distance conjecture, an 80-year-old open problem in combinatorial geometry, at a compute cost under $1000. The item is framed as a notable AI-assisted mathematics result. The brief characterizes it as a quiet day overall but highlights this as a meaningful capability demonstration at the intersection of AI and formal mathematics.

Frontier Model Releases · Evaluation and Benchmarking · GPT-next · Erdős planar unit distance problem · OpenAI · +1 more

9·OpenAI Blog·1mo ago

An OpenAI model has disproved a central conjecture in discrete geometry

An OpenAI model has disproved a major conjecture in discrete geometry by solving the 80-year-old unit distance problem. This represents a milestone in AI-driven mathematical reasoning, demonstrating that frontier AI systems can produce novel, verifiable mathematical results rather than merely verifying or assisting with known proofs. The announcement was made on OpenAI's official blog.

Frontier Model Releases · Evaluation and Benchmarking · discrete geometry · OpenAI · unit distance problem

9·OpenAI Blog·1mo ago

Introducing GPT-5

OpenAI has released GPT-5, described as its most capable AI system to date. OpenAI claims state-of-the-art performance for the model across a broad range of domains, including coding, mathematics, writing, health, and visual perception. The announcement positions GPT-5 as a significant leap in intelligence over all prior OpenAI models.

Frontier Model Releases · Evaluation and Benchmarking · OpenAI · GPT-5.5 · +2 more