OpAI-Bench: Benchmark for detecting AI text across progressive human-AI co-editing workflows
Researchers introduce OpAI-Bench, a benchmark for studying AI-text detection across progressive human-to-AI document revision workflows at document, sentence, token, and span granularities. Starting from human-written documents, the benchmark constructs nine sequentially revised versions per sample under five AI edit operations and varying AI coverage levels across four domains. A key finding is that intermediate versions with mixed authorship are often harder to detect than the fully human or heavily AI-edited endpoints, a non-monotonic detection pattern absent from existing benchmarks. The work addresses a gap in AI-text detection research: real-world documents increasingly result from iterative human-AI co-editing rather than pure generation.
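As a rough illustration of how the four granularities relate, the sketch below derives AI coverage, span boundaries, and sentence labels from a single token-level authorship mask. It is not the benchmark's code: the function names, the boolean-mask representation, and the 0.5 sentence threshold are all assumptions made for this example.

```python
from typing import List, Tuple


def ai_coverage(token_is_ai: List[bool]) -> float:
    """Fraction of tokens marked as AI-written (0.0 for an empty input)."""
    if not token_is_ai:
        return 0.0
    return sum(token_is_ai) / len(token_is_ai)


def ai_spans(token_is_ai: List[bool]) -> List[Tuple[int, int]]:
    """Maximal runs of AI tokens as half-open (start, end) index pairs."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(token_is_ai):
        if flag and start is None:
            start = i                      # a new AI run begins here
        elif not flag and start is not None:
            spans.append((start, i))       # the run ended at the previous token
            start = None
    if start is not None:                  # a run that reaches the end of the text
        spans.append((start, len(token_is_ai)))
    return spans


def sentence_labels(
    token_is_ai: List[bool],
    sentence_bounds: List[Tuple[int, int]],
    threshold: float = 0.5,                # hypothetical cut-off, not from the source
) -> List[bool]:
    """Mark a sentence as AI when its share of AI tokens reaches the threshold."""
    return [ai_coverage(token_is_ai[s:e]) >= threshold for s, e in sentence_bounds]


# Toy mask for an 8-token document split into two sentences (made-up data).
mask = [False, False, True, True, True, False, True, False]
bounds = [(0, 3), (3, 8)]

coverage = ai_coverage(mask)
spans = ai_spans(mask)
labels = sentence_labels(mask, bounds)
```

On this toy mask, half of the tokens are AI-written, yet they fall into two separate spans and only the second sentence crosses the assumed threshold, which is the kind of mixed-authorship case a single document-level label cannot express.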
Related events (8)
PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication
OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.
Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)
The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.
New AI classifier for indicating AI-written text
OpenAI launched a classifier designed to distinguish between AI-generated and human-written text. The tool was positioned as an aid for detecting content produced by large language models. OpenAI acknowledged limitations including unreliability on short texts and non-English content, and noted the classifier should not be used as a sole decision-making tool.
Multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2
Researchers introduce a new benchmark of 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) specifically for detecting AI-generated text-rich images produced by OpenAI's GPT-Image-2. Five zero-shot detectors are evaluated, revealing highly domain-dependent performance and severe sensitivity to JPEG compression even in the strongest conventional detector. A multimodal VLM is also explored as a detector, showing promise but limitations on structured formats. The work highlights a gap in existing benchmarks that focus on object-centric rather than text-layout-centric images.
OpenAI introduces LifeSciBench, a life sciences AI evaluation benchmark
OpenAI has released LifeSciBench, a benchmark designed to evaluate AI systems on real-world life science research tasks and decisions. The benchmark is described as expert-authored and expert-reviewed, targeting domain-specific evaluation in biology and related fields. This addresses a gap in specialized scientific benchmarking for AI systems.
Opus 4.6, Codex 5.3, and the post-benchmark era
An Interconnects commentary piece examining how to compare frontier AI models in 2026, using Anthropic's Opus 4.6 and OpenAI's Codex 5.3 as case studies. The piece appears to argue that traditional benchmarks are no longer sufficient for distinguishing model capabilities at the frontier. This reflects a broader industry shift toward more nuanced, task-specific evaluation methods.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.
AI-Written Critiques Help Humans Notice Flaws in Summaries
OpenAI trained critique-writing models to identify flaws in AI-generated summaries, finding that human evaluators catch significantly more errors when assisted by model-generated critiques. A key finding is that scale improves critique-writing ability more than summary-writing ability. The work is framed as a step toward using AI to assist human oversight of AI systems on difficult tasks, relevant to scalable oversight research.

