5 · arXiv cs.AI (Artificial Intelligence) · 16d ago

GASING pedagogy-guided CoT training enables strong arithmetic reasoning in 86M-parameter GPT-2 model

Researchers train a small 86M-parameter GPT-2 decoder from scratch using Chain-of-Thought supervision derived from GASING, an Indonesian left-to-right arithmetic pedagogy, without any reinforcement learning. The model achieves over 80% accuracy on held-out arithmetic problems and competes with substantially larger models. Mechanistic analyses reveal two emergent capabilities: an explicit procedural pathway and a subsequent associative 'mental arithmetic' capacity that bypasses step-by-step computation. The work suggests that pedagogically structured training data can yield efficient arithmetic capability at small scale.
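To make "left-to-right arithmetic" concrete, here is a minimal sketch of place-value addition that works from the most significant digit down and records each step as a line of text, the way a Chain-of-Thought trace might. The function name `left_to_right_add` and the trace wording are assumptions made for illustration; the summary above does not specify the actual GASING trace format or training data layout.

```python
def left_to_right_add(a: int, b: int) -> tuple[list[str], int]:
    """Add two non-negative integers place by place, most significant first.

    Returns the textual steps and the final sum. The step format is a
    hypothetical illustration, not the format used in the paper.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")

    # Pad both numbers to the same number of digits.
    width = max(len(str(a)), len(str(b)))
    digits_a = str(a).zfill(width)
    digits_b = str(b).zfill(width)

    steps: list[str] = []
    running = 0
    for i, (x, y) in enumerate(zip(digits_a, digits_b)):
        place = 10 ** (width - 1 - i)          # 100, 10, 1, ...
        partial = (int(x) + int(y)) * place     # sum at this place value
        new_running = running + partial         # carries are absorbed here
        steps.append(
            f"{int(x) * place} + {int(y) * place} = {partial}; "
            f"running total {running} + {partial} = {new_running}"
        )
        running = new_running
    return steps, running


if __name__ == "__main__":
    trace, total = left_to_right_add(47, 38)
    for line in trace:
        print(line)
    print("answer:", total)
```

Working from the largest place value first means the running total is always a usable estimate of the answer, which is one plausible reason a left-to-right order suits a model that generates tokens left to right.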

Evaluation and Benchmarking · Alignment and RLHF · GASING · TOBA tokenizer · GPT-2 · Arithmetic Pedagogy for Language Models

Related guides (2)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · OpenAI Blog · 1mo ago

OpenAI Trains System Solving Grade School Math Problems at ~55% Accuracy

OpenAI released a system for solving grade school math word problems that achieves roughly twice the accuracy of a fine-tuned GPT-3 model. The system scored 55% on a sample test on which 9- to 12-year-olds scored 60%, suggesting near-human performance on elementary math. This work represents an early milestone in neural network mathematical reasoning capabilities.

Frontier Model Releases · Evaluation and Benchmarking · GPT-3 · OpenAI · GSM8K

8 · OpenAI Blog · 1mo ago

Advancing science and math with GPT-5.2

OpenAI has released GPT-5.2, described as its strongest model for mathematics and science, achieving state-of-the-art results on GPQA Diamond and FrontierMath benchmarks. The announcement highlights practical research applications including solving an open theoretical problem and generating verified mathematical proofs. The post positions GPT-5.2 as a meaningful step toward AI-assisted scientific discovery.

Frontier Model Releases · Evaluation and Benchmarking · GPT-5.2 · FrontierMath · GPQA Diamond · +2 more

7 · OpenAI Blog · 1mo ago

Improving Mathematical Reasoning with Process Supervision

OpenAI trained a model achieving state-of-the-art mathematical problem solving by rewarding each correct reasoning step (process supervision) rather than only the final answer (outcome supervision). This approach improves performance on math benchmarks and carries an alignment benefit by training models to produce human-endorsed chain-of-thought reasoning. The work highlights a potential synergy between capability improvements and alignment techniques.
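The difference between the two supervision schemes can be sketched as two ways of assigning reward to the same reasoning trace. The function names and the 0/1 reward values below are assumptions chosen for illustration; the summary above does not describe how OpenAI's reward model actually scores steps.

```python
def outcome_rewards(num_steps: int, final_correct: bool) -> list[float]:
    """Outcome supervision: only the final answer produces a signal.

    Every intermediate step gets 0.0; the last position carries the
    reward for the final answer.
    """
    if num_steps <= 0:
        return []
    return [0.0] * (num_steps - 1) + [1.0 if final_correct else 0.0]


def process_rewards(step_correct: list[bool]) -> list[float]:
    """Process supervision: each step is rewarded on its own correctness.

    `step_correct` is a hypothetical list of per-step labels, for example
    from human annotators.
    """
    return [1.0 if ok else 0.0 for ok in step_correct]


if __name__ == "__main__":
    labels = [True, False, True]  # the second step contains an error
    print(outcome_rewards(len(labels), final_correct=True))
    print(process_rewards(labels))
```

With the example labels, outcome supervision cannot see that the second step was wrong as long as the final answer happens to be right, whereas process supervision marks that step directly.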

Frontier Model Releases · Evaluation and Benchmarking · process supervision · outcome supervision · Chain-of-Thought Reasoning · +3 more

7 · OpenAI Blog · 1mo ago

Reasoning models struggle to control their chains of thought, and that's good

OpenAI introduces CoT-Control, a framework for evaluating how well reasoning models can deliberately manipulate or suppress their chain-of-thought outputs. The finding that models struggle to control their CoT is framed as a positive safety property, reinforcing the argument that visible reasoning traces serve as a meaningful monitorability safeguard. This contributes to ongoing research on whether chain-of-thought transparency is a reliable alignment and oversight tool.

Frontier Model Releases · Evaluation and Benchmarking · CoT-Control · monitorability · Chain-of-Thought Reasoning · +3 more

6 · arXiv · cs.CL · 2d ago

DreamReasoner-8B: Block-size curriculum learning enables long-CoT reasoning in diffusion language models

Researchers introduce DreamReasoner-8B, an open-source block diffusion language model trained with a block-size curriculum learning strategy that gradually transitions from fine-grained to coarse-grained block sizes during training. The work identifies a critical failure mode: training with large block sizes severely degrades reasoning, while small block sizes preserve it. The proposed curriculum bridges this gap, achieving math and code reasoning performance competitive with Qwen3-8B while retaining the parallel decoding efficiency of block diffusion models. The model and code are publicly released.
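A block-size curriculum of the kind described can be sketched as a schedule that maps a training step to a block size, starting small and ending large. The staged schedule, the function name `block_size_at`, and the sizes `(4, 8, 16, 32)` are all assumptions for illustration; the summary above does not state the actual schedule or block sizes used for DreamReasoner-8B.

```python
def block_size_at(
    step: int,
    total_steps: int,
    sizes: tuple[int, ...] = (4, 8, 16, 32),  # hypothetical block sizes
) -> int:
    """Return the block size to use at a given training step.

    Training is split into equal-length stages, one per entry in `sizes`,
    so the block size grows from fine-grained to coarse-grained.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not sizes:
        raise ValueError("sizes must not be empty")

    # Clamp the step into the valid range [0, total_steps - 1].
    step = min(max(step, 0), total_steps - 1)
    stage = step * len(sizes) // total_steps
    return sizes[stage]


if __name__ == "__main__":
    for s in (0, 24, 25, 50, 75, 99):
        print(s, block_size_at(s, total_steps=100))
```

A smooth or loss-dependent schedule would be equally compatible with the description; equal-length stages are simply the easiest version to read.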

Frontier Model Releases · Open Weights Progress · Qwen3-4B · Block-Size Curriculum Learning for Diffusion Reasoning Models · Block-Size Curriculum Learning · +3 more

8 · Latent Space · 1mo ago

OpenAI GPT-next Solves 80-Year-Old Erdős Planar Unit Distance Problem for Under $1000

A Latent Space AINews digest reports that OpenAI's GPT-next model disproved the Erdős planar unit distance conjecture, an 80-year-old open problem in combinatorial geometry, at a compute cost under $1000. The item is framed as a notable AI-assisted mathematics result. The brief characterizes it as a quiet day overall but highlights this as a meaningful capability demonstration at the intersection of AI and formal mathematics.

Frontier Model Releases · Evaluation and Benchmarking · GPT-next · Erdős planar unit distance problem · OpenAI · +1 more

4 · OpenAI Blog · 1mo ago

New ways to learn math and science in ChatGPT

OpenAI is adding interactive visual explanations for math and science topics to ChatGPT, allowing users to explore formulas and variables in real time. The feature targets students and learners, representing an expansion of ChatGPT's educational capabilities. This is a product-layer enhancement rather than a new model or core capability release.

Enterprise Deployment Patterns · ChatGPT · OpenAI

7 · OpenAI Blog · 1mo ago

GPT-5 and the future of mathematical discovery

UCLA Professor Ernest Ryu collaborated with GPT-5 to solve an open problem in optimization theory, representing a concrete example of AI-assisted mathematical research. The announcement highlights GPT-5's capability in formal reasoning and scientific discovery beyond standard benchmarks. This is an OpenAI blog post showcasing a real-world research outcome involving a frontier model.

Frontier Model Releases · Evaluation and Benchmarking · UCLA · optimization theory · OpenAI · +2 more