Entity · benchmark

GPQA

benchmark · active · gpqa-c570c1f1 · 8 events · first seen May 18, 2026

Aliases: GPQA

More like this (12)

GPQA Diamond · GQA · AutoGPTQ · Global-PIQA · GPTQ · tcGP · PubMedQA · Protocol QA · MedQA · GRPO · IndQA · SimpleQA

Recent events (8)

7 · The Batch · Jul 17, 2026 · source ↗

OpenAI releases GPT-Live-1 voice models with full-duplex audio and background reasoning via GPT-5.5

OpenAI released GPT-Live-1 and GPT-Live-1 mini on July 8, 2026, replacing Advanced Voice Mode (AVM) with a full-duplex voice system that processes audio input and output simultaneously. When deeper reasoning is needed, the voice model delegates to GPT-5.5 or GPT-5.5 Thinking in the background while continuing to speak. GPT-Live-1 at high reasoning scored 84.2% on GPQA versus 45.3% for its predecessor AVM, and human raters preferred it 75.7% of the time. The same issue of The Batch also covers Andrew Ng's editorial on AI's labor market effects and a segment on detecting manipulative model behavior.

Frontier Model Releases Inference Economics DeepLearning.AI GPT-Live ChatGPT +7 more

7 · The Batch · Jul 17, 2026 · source ↗

OpenAI GPT-Live Pairs Full-Duplex Voice Models with GPT-5.5 Reasoning Backend

OpenAI released GPT-Live-1 and GPT-Live-1 mini on July 8, 2026, replacing Advanced Voice Mode with a full-duplex voice system that processes audio continuously and delegates harder queries to GPT-5.5 in the background. The architecture separates a real-time conversational voice model from a reasoning model, with user-selectable reasoning effort levels (Instant, Medium, High) routing to GPT-5.5 Instant or GPT-5.5 Thinking accordingly. Performance gains are substantial: GPQA scores jumped from 45.3% (AVM) to 84.2% (GPT-Live-1 at high reasoning), and BrowseComp improved from 0.7% to 75.2%. The system is live globally on iOS, Android, and ChatGPT.com for paid plans, though no developer API has shipped yet.

Frontier Model Releases Agent and Tool Ecosystem Thinking Machines GPT-Live ChatGPT +18 more

5 · arXiv · cs.LG · Jul 3, 2026 · source ↗

DemoPSD: Disagreement-modulated policy self-distillation to fix privileged information leakage in LLM reasoning training

DemoPSD is a new training framework for LLMs that addresses two failure modes in on-policy self-distillation (OPSD): overfitting to in-domain patterns and privileged information leakage, where the student model learns answer-dependent shortcuts unavailable at test time. The method steers the student toward a reverse-KL barycenter target (a weighted geometric blend of the teacher and student distributions), with token-level blending weights derived from the disagreement between the two distributions. Experiments on SciKnowEval across four scientific domains show DemoPSD outperforms GRPO and SDPO while maintaining higher training entropy and generalizing to out-of-distribution GPQA benchmarks.
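The weighted geometric blend in the summary above can be illustrated with a small sketch. It relies on a standard result: the distribution q that minimizes w·KL(q‖teacher) + (1−w)·KL(q‖student) is the normalized geometric mean teacher^w · student^(1−w). The summary does not say how DemoPSD turns disagreement into a weight, so `blend_weight` below (one minus total variation distance) is a hypothetical stand-in, and all function names are illustrative rather than taken from the paper. Probabilities are assumed to be strictly positive.

```python
import math

def geometric_blend(teacher, student, w):
    """Normalized geometric mean teacher^w * student^(1-w) over one token's vocabulary.

    This is the minimizer of w*KL(q||teacher) + (1-w)*KL(q||student) over q.
    Assumes every probability is strictly positive.
    """
    logs = [w * math.log(t) + (1.0 - w) * math.log(s)
            for t, s in zip(teacher, student)]
    peak = max(logs)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(x - peak) for x in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def total_variation(p, q):
    """Total variation distance between two distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def blend_weight(teacher, student):
    """HYPOTHETICAL weighting rule, not from the paper: trust the teacher
    less as its disagreement with the student grows."""
    return 1.0 - total_variation(teacher, student)

# One token position with a three-word vocabulary (made-up numbers).
teacher = [0.7, 0.2, 0.1]
student = [0.1, 0.2, 0.7]
w = blend_weight(teacher, student)
target = geometric_blend(teacher, student, w)
```

With w = 1 the target is the teacher distribution and with w = 0 it is the student's own distribution, so a disagreement-dependent weight interpolates between ordinary distillation and no teacher signal at that token.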

Evaluation and Benchmarking Alignment and RLHF SciKnowEval GRPO SDPO +2 more

6 · Anthropic News · Jun 4, 2026 · source ↗

Anthropic publishes policy brief calling for targeted AI regulation within 18 months

Anthropic published a policy position paper arguing that governments have an 18-month window to enact narrowly targeted AI regulation before risks in cyber and CBRN domains become acute. The post cites rapid capability gains (SWE-bench scores rising from 1.96% to 49% in a year, GPQA scores approaching human expert level) as evidence that frontier models are nearing meaningful misuse thresholds. Anthropic also reviews its Responsible Scaling Policy as a model for adaptive, proportionate risk governance and calls for similar frameworks to be adopted industry-wide and codified in law.

AI Safety Research Regulatory Developments Anthropic Policy Frontier Red Team Claude 3.5 Sonnet UK AI Security Institute +5 more

9 · Anthropic News · Jun 3, 2026 · source ↗

Anthropic launches Claude 3 model family: Haiku, Sonnet, and Opus

Anthropic announced the Claude 3 model family on March 4, 2024, comprising three models (Haiku, Sonnet, and Opus) in ascending order of capability. Anthropic claims that Claude 3 Opus leads on major benchmarks including MMLU, GPQA, and GSM8K, with near-perfect recall on long-context evaluations (200K context window, 99%+ NIAH accuracy) and new multimodal vision capabilities. The release also highlights fewer unnecessary refusals, a twofold accuracy improvement over Claude 2.1 on challenging open-ended questions, and Constitutional AI-based safety tuning. Opus and Sonnet launched immediately via claude.ai and the Claude API across 159 countries, with Haiku to follow.

Long Context Evolution Frontier Model Releases Claude Opus 4.6 Constitutional AI Claude Haiku 4.5 +8 more

8 · Anthropic News · Jun 2, 2026 · source ↗

Introducing Claude 3.5 Sonnet

Anthropic launches Claude 3.5 Sonnet, the first model in its Claude 3.5 family, claiming it outperforms Claude 3 Opus and competitor models on GPQA, MMLU, and HumanEval benchmarks while operating at twice the speed of Claude 3 Opus and at mid-tier pricing ($3 per million input tokens, $15 per million output tokens). The model features a 200K context window, improved vision capabilities, and an internal agentic coding evaluation score of 64% versus 38% for Claude 3 Opus. Alongside the model, Anthropic introduces Artifacts on Claude.ai, a dedicated workspace for real-time editing of AI-generated content. The UK AI Safety Institute evaluated the model before deployment, and it was assessed at ASL-2.

Long Context Evolution Frontier Model Releases claude.ai Thorn Amazon Bedrock +16 more

6 · arXiv · cs.CL · May 19, 2026 · source ↗

GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs

The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model on more than 200k prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.
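For readers unfamiliar with the 2-parameter logistic (2PL) IRT model mentioned above, here is a minimal sketch of its standard textbook form: the probability that a test-taker of ability θ answers an item with discrimination a and difficulty b correctly is 1 / (1 + exp(−a(θ − b))). The grid-search ability estimate, the item parameters, and the function names are illustrative assumptions, not the authors' calibration procedure.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta).

    a is the item's discrimination, b its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of 0/1 responses given ability theta.

    items is a list of (a, b) pairs; responses is a parallel list of 0 or 1.
    """
    total = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if r == 1 else math.log(1.0 - p)
    return total

def estimate_ability(items, responses):
    """Crude maximum-likelihood ability estimate by grid search on [-4, 4].

    Illustrative only: real calibrations fit item and ability parameters jointly.
    """
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items: equal discrimination, increasing difficulty.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
stronger = estimate_ability(items, [1, 1, 0])  # two of three correct
weaker = estimate_ability(items, [1, 0, 0])    # one of three correct
```

When ability equals an item's difficulty the predicted success probability is exactly 0.5, and a larger discrimination makes the curve steeper around that point.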

Frontier Model Releases Evaluation and Benchmarking 2-Parameter Logistic IRT Model GIM (Grounded Integration Measure) test-time compute +4 more

6 · The Batch · May 18, 2026 · source ↗

Data Points: Thinking Machines Interaction Model, ERNIE 5.1, Co-Mathematician, RL Conductor, and More

This edition of The Batch covers five notable AI developments: Thinking Machines' research preview of an 'interaction model' with a 200ms micro-turn multimodal architecture; Baidu's ERNIE 5.1, a compressed derivative of ERNIE 5.0 using only 6% of typical pre-training compute; Google DeepMind's Co-Mathematician collaborative workbench reaching 48% on FrontierMath Tier 4; a 7B RL Conductor model that orchestrates multi-agent workflows via reinforcement learning; and Google's Magic Pointer cursor system powered by Gemini. Secondary items include GitHub Copilot pricing restructuring ahead of usage-based billing.

Training Infrastructure Frontier Model Releases Thinking Machines SGLang GitHub +21 more