6 OpenAI Blog·1mo ago

Prover-Verifier Games improve legibility of language model outputs

OpenAI presents research on prover-verifier games as a mechanism to improve the legibility and verifiability of language model outputs. The approach frames output generation as a game between a prover (the model producing solutions) and a verifier (checking correctness), incentivizing clearer, more human-auditable reasoning. The work targets a core alignment challenge: ensuring AI-generated solutions are interpretable and trustworthy to both humans and automated systems.

Evaluation and Benchmarking · AI Safety Research · Alignment and RLHF · Prover-Verifier Games · OpenAI · scalable oversight

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

5 OpenAI Blog·1mo ago

Generative Language Modeling for Automated Theorem Proving

OpenAI published research on applying generative language models to automated theorem proving, an early exploration of using neural language models for formal mathematical reasoning. The work investigates how language models can generate proof steps or complete proofs in formal systems, and introduces GPT-f, an automated prover and proof assistant for the Metamath formalization language. It is an early milestone in AI-assisted mathematical reasoning that preceded later theorem-proving systems.

Frontier Model Releases · Evaluation and Benchmarking · automated theorem proving · generative language modeling · GPT-f

6 arXiv · cs.AI·3d ago

VERITAS: Visual verification enables inference-time steering and autonomous improvement for robot policies

Researchers introduce VERITAS, a generator-verifier framework pairing a pre-trained generalist robot policy with a gradient-free visual verifier to steer actions at inference time without additional training. Verified rollouts are also used for offline self-improvement via fine-tuning, achieving performance gains comparable to expert demonstrations but without human intervention. The work demonstrates that inference-time verification is a scalable mechanism for autonomous policy improvement during deployment.
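
The generator-verifier pattern described above can be illustrated with a minimal sketch. This is not the VERITAS implementation: the function names, the 2-D action type, the toy policy, and the distance-based verifier are all hypothetical stand-ins. It only shows the general idea of sampling several candidates from a frozen policy, keeping the one a verifier scores highest, and filtering verified rollouts for later fine-tuning.

```python
import random
from typing import Callable, List, Tuple

Action = Tuple[float, float]  # hypothetical 2-D action, for illustration only


def steer_with_verifier(
    propose: Callable[[], Action],
    score: Callable[[Action], float],
    num_candidates: int = 8,
) -> Tuple[Action, float]:
    """Sample candidates from a frozen policy; return the one the verifier scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [propose() for _ in range(num_candidates)]
    best = max(candidates, key=score)
    return best, score(best)


def keep_verified(
    rollouts: List[Action],
    score: Callable[[Action], float],
    threshold: float,
) -> List[Action]:
    """Keep only rollouts the verifier accepts, e.g. as data for later fine-tuning."""
    return [r for r in rollouts if score(r) >= threshold]


# Toy stand-ins (assumptions): a noisy "policy" and a verifier that
# prefers actions close to a fixed target.
rng = random.Random(0)
TARGET: Action = (1.0, -1.0)


def toy_policy() -> Action:
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


def toy_verifier(action: Action) -> float:
    # Negative squared distance to the target: higher is better.
    return -((action[0] - TARGET[0]) ** 2 + (action[1] - TARGET[1]) ** 2)


action, verifier_score = steer_with_verifier(toy_policy, toy_verifier, num_candidates=16)
accepted = keep_verified([toy_policy() for _ in range(32)], toy_verifier, threshold=-1.0)
```

In this sketch the policy is never updated at inference time; only the choice among its samples changes, which is what makes the steering training-free.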

Inference Economics · Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement · VERITAS

6 arXiv · cs.AI·5d ago

Self-improving VLMs can silently regress when verifier quality is task-mismatched

A new arXiv paper demonstrates that verifier-driven self-DPO, a common recipe for self-improving visual-language models, can silently degrade student model performance when the verifier's task-rubric accuracy is insufficient for the target task. Experiments on Qwen-3-VL-2B and Qwen-2.5-VL-3B across MathVista, MMMU, and BLINK show regressions of 3.4–10.9 percentage points below frozen baselines, with the counterintuitive finding that more accurate-but-still-wrong verifiers cause larger regressions than near-random ones. The authors provide a mechanistic explanation via a variance theorem for progress-gated replay and offer operational guidance: measure target-task rubric accuracy before running any verifier-driven loop and rank verifiers by task-specific quality rather than parameter count.
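
The operational guidance above (measure target-task rubric accuracy first, then rank verifiers by that number rather than by size) could be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the labeled examples, the toy verifiers, and especially the `min_accuracy` cutoff, which the summary does not specify.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (response text, ground-truth "meets the rubric" label); hypothetical format.
Example = Tuple[str, bool]
Verifier = Callable[[str], bool]


def rubric_accuracy(verifier: Verifier, labeled: Sequence[Example]) -> float:
    """Fraction of labeled target-task examples the verifier judges correctly."""
    if not labeled:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for response, label in labeled if verifier(response) == label)
    return correct / len(labeled)


def rank_verifiers(
    verifiers: Dict[str, Verifier], labeled: Sequence[Example]
) -> List[Tuple[str, float]]:
    """Rank verifiers by measured target-task accuracy, not by parameter count."""
    scored = [(name, rubric_accuracy(fn, labeled)) for name, fn in verifiers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def should_run_loop(accuracy: float, min_accuracy: float) -> bool:
    """Gate the self-improvement loop on an assumed, user-chosen accuracy cutoff."""
    return accuracy >= min_accuracy


# Toy data and verifiers, purely illustrative.
labeled_examples: List[Example] = [
    ("answer: 42", True),
    ("answer: 41", False),
    ("42, I think", True),
    ("no idea", False),
]
candidate_verifiers: Dict[str, Verifier] = {
    "always_yes": lambda response: True,
    "contains_42": lambda response: "42" in response,
}

ranking = rank_verifiers(candidate_verifiers, labeled_examples)
best_name, best_accuracy = ranking[0]
run_loop = should_run_loop(best_accuracy, min_accuracy=0.9)
```

The labeled set should come from the target task itself, since the paper's point is that a verifier's quality on one task does not transfer automatically to another.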

Evaluation and Benchmarking · Alignment and RLHF · MathVista · When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks · BLINK

6 OpenAI Blog·1mo ago

How Confessions Can Keep Language Models Honest

OpenAI researchers are developing a training method called 'confessions' that teaches language models to explicitly admit when they have made mistakes or behaved undesirably. The approach aims to improve honesty, transparency, and user trust in model outputs. This represents an alignment-oriented intervention targeting self-reporting of model failures.

AI Safety Research · Alignment and RLHF · Confessions (training method) · OpenAI

6 arXiv · cs.AI·23d ago

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

OmniVerifier-M1 is a generalist visual verifier trained using symbolic meta-verification rationales (e.g., bounding boxes) and decoupled reinforcement learning objectives for binary judgment versus meta-verification. The paper finds that symbolic verifier outputs outperform textual explanations as rationales, enabling rule-based RL rewards without auxiliary judge models, and that decoupling RL objectives substantially improves performance over joint optimization. The system further enables M1-TTS, a verifier-driven agentic generation pipeline supporting dynamic region-level self-correction in multimodal outputs.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Large Language Models · multimodal meta-verification · decoupled reinforcement learning

5 OpenAI Blog·1mo ago

Improving Verifiability in AI Development: Multi-Stakeholder Report

OpenAI contributed to a multi-stakeholder report co-authored by 58 researchers across 30 organizations, including Mila, CSET, and the Schwartz Reisman Institute. The report identifies 10 mechanisms for improving the verifiability of claims about AI systems. These tools are intended to help developers demonstrate safety, security, fairness, and privacy properties, while enabling policymakers and civil society to evaluate AI development processes.

Evaluation and Benchmarking · AI Safety Research · Centre for the Future of Intelligence · Center for Security and Emerging Technology · Mila

7 arXiv · cs.AI·26d ago

Agentic Proving for Program Verification: Claude Code Achieves 98.1% on CLEVER Benchmark

Researchers evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation, achieving 98.1% end-to-end success on program generation and verification over self-consistent entries. The system generates valid specifications for 98.8% of problems and certifies implementations against ground-truth specifications for 87.5% of problems. The results reveal a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, motivating calls for more rigorous evaluation methodologies. The findings support compiler-in-the-loop agentic paradigms as the current state of the art for foundational program verification.

Evaluation and Benchmarking · AI Safety Research · CLEVER · isomorphism-based scoring · agentic proving

7 arXiv · cs.CL·22d ago

Self-Trained Verification (STV) Unlocks Training- and Test-Time Self-Improvement for Reasoning Models

This paper introduces Self-Trained Verification (STV), a method that trains a verifier to imitate a more informed version of itself by using reference solutions as a supervision signal, addressing the core bottleneck in both test-time verification-refinement loops and self-training pipelines. At test time, STV roughly doubles accuracy on hard math and achieves a 14x lift on scientific reasoning tasks. At training time, the authors combine STV with RL in a procedure called Verifier-in-the-Loop (ViL) training, yielding a 33% further gain in pass@1 over an already RL-converged generator, with standalone pass@1 climbing 30% relative past standard RL convergence. The work argues that verification quality, not generation, is the primary bottleneck for scaling reasoning on hard problems.

Frontier Model Releases · Evaluation and Benchmarking · self-training · Verifier-in-the-Loop Training (ViL) · Self-Trained Verification (STV)