6 · OpenAI Blog · 1mo ago

AI-Written Critiques Help Humans Notice Flaws in Summaries

OpenAI trained critique-writing models to identify flaws in AI-generated summaries, finding that human evaluators catch significantly more errors when assisted by model-generated critiques. A key finding is that scale improves critique-writing ability more than summary-writing ability. The work is framed as a step toward using AI to assist human oversight of AI systems on difficult tasks, relevant to scalable oversight research.

Evaluation and Benchmarking, AI Safety Research, Alignment and RLHF, AI-assisted human evaluation, critique-writing model, OpenAI, scalable oversight

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

5 · OpenAI Blog · 1mo ago

Summarizing Books with Human Feedback

OpenAI published research on using human feedback to train models to summarize entire books, addressing the challenge of scaling human oversight to tasks that are difficult for humans to evaluate directly. The work explores recursive task decomposition, where models summarize smaller chunks and then summarize those summaries, with humans providing feedback at each level. This represents an early concrete application of scalable oversight techniques to long-document understanding.

Long Context Evolution, AI Safety Research, Recursive Summarization, Reinforcement Learning from Human Feedback, OpenAI, +2 more
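
The recursive decomposition described in this summary can be illustrated with a minimal sketch. It is not OpenAI's implementation: the names `chunk` and `recursive_summarize` are hypothetical, fixed-size character chunking is an assumption, and the `summarize` callable stands in for a trained model. The human feedback collected at each level is omitted; in this sketch it would attach to each `summarize` call.

```python
from typing import Callable, List


def chunk(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_summarize(
    text: str,
    summarize: Callable[[str], str],  # hypothetical stand-in for a summarization model
    chunk_size: int,
) -> str:
    """Summarize chunks, then summarize the joined summaries, until one chunk remains."""
    if len(text) <= chunk_size:
        # Base case: the text fits in a single chunk, so summarize it directly.
        return summarize(text)

    # Summarize each chunk independently (one level of the tree).
    partials = [summarize(piece) for piece in chunk(text, chunk_size)]
    combined = " ".join(partials)

    # Guard against a summarizer that does not shrink its input,
    # which would otherwise recurse forever.
    if len(combined) >= len(text):
        raise ValueError("summarize() must shorten its input")

    # Recurse on the concatenated summaries (the next level up).
    return recursive_summarize(combined, summarize, chunk_size)
```

With a toy summarizer such as `lambda t: t[:20]` and `chunk_size=100`, a 1000-character input is reduced over three levels to a single summary of at most 20 characters.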

6 · OpenAI Blog · 1mo ago

Learning to Summarize with Human Feedback

OpenAI published research applying reinforcement learning from human feedback (RLHF) to train language models for improved summarization quality. The work demonstrated that models trained with human preference signals outperform those trained purely on supervised objectives for summarization tasks. This paper is an early foundational contribution to the RLHF methodology that later became central to aligning large language models.

Evaluation and Benchmarking, Alignment and RLHF, Reinforcement Learning from Human Feedback, OpenAI, Learning to Summarize with Human Feedback

7 · OpenAI Blog · 1mo ago

Finding GPT-4's Mistakes with GPT-4: CriticGPT

OpenAI has developed CriticGPT, a GPT-4-based model trained to write critiques of ChatGPT outputs, helping human trainers identify errors during RLHF. The system is designed to address a core scalable oversight challenge: human raters often miss subtle mistakes in long or complex model outputs. CriticGPT-assisted trainers outperformed unassisted trainers in catching model errors, suggesting a path toward more reliable RLHF pipelines.

Evaluation and Benchmarking, AI Safety Research, ChatGPT, CriticGPT, Reinforcement Learning from Human Feedback, +4 more

6 · OpenAI Blog · 1mo ago

AI Safety via Debate

OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than themselves. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.

Evaluation and Benchmarking, AI Safety Research, AI Safety via Debate, Debate (AI safety technique), OpenAI, +2 more

4 · One Useful Thing · 1mo ago

Giving your AI a Job Interview

This commentary piece argues that as AI-generated advice becomes more consequential, users need systematic methods to evaluate AI reliability and quality—analogous to a job interview process. The author proposes frameworks for assessing AI outputs before trusting them for important decisions. The piece addresses the practical challenge of calibrating trust in AI systems across different use cases.

Evaluation and Benchmarking, Enterprise Deployment Patterns, Ethan Mollick, One Useful Thing

6 · arXiv · cs.CL · 25d ago

AI-Assisted Systematization for Evaluating GenAI Systems

This paper addresses a foundational gap in GenAI evaluation: the underspecification of broad, contested concepts like 'reasoning,' 'fairness,' or 'creativity.' The authors introduce a structured artifact called a 'concept spec' and a validation worksheet, then build two AI-assisted systematizers—a zero-shot approach and a multi-agent approach—to convert vague evaluation targets into measurable, structured accounts. They apply these tools to hate-based rhetoric and digital empathy, assessing the resulting specs on content validity and information recoverability. The work positions AI assistance as a scalable aid for the cognitively demanding process of evaluation design.

Evaluation and Benchmarking, AI Safety Research, hate-based rhetoric, concept spec, digital empathy, +4 more

5 · arXiv · cs.LG · 15d ago

OpAI-Bench: Benchmark for detecting AI text across progressive human-AI co-editing workflows

Researchers introduce OpAI-Bench, a benchmark for studying AI-text detection across progressive human-to-AI document revision workflows, covering document, sentence, token, and span granularities. Starting from human-written documents, the benchmark constructs nine sequentially revised versions per sample under five AI edit operations and varying AI coverage levels across four domains. A key finding is that mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints, a non-monotonic detection pattern that existing benchmarks do not capture. The work addresses a gap in AI-text detection research, as real-world documents increasingly result from iterative human-AI co-editing rather than pure generation.

Evaluation and Benchmarking, AI Safety Research, VILA-Lab, OpAI-Bench

5 · OpenAI Blog · 1mo ago

New AI classifier for indicating AI-written text

OpenAI launched a classifier designed to distinguish between AI-generated and human-written text. The tool was positioned as an aid for detecting content produced by large language models. OpenAI acknowledged limitations including unreliability on short texts and non-English content, and noted the classifier should not be used as a sole decision-making tool.

Evaluation and Benchmarking, AI Safety Research, OpenAI AI Text Classifier, OpenAI