Entity · benchmark

GPQA Diamond

benchmark · active · gpqa-diamond-aa4edc85 · 13 events · first seen May 18, 2026

Aliases: GPQA Diamond, GPQA-Diamond

Co-occurring entities

Google · GPT-5.5 · Anthropic · Gemini 3.1 Pro · LiveCodeBench · OpenAI · Claude Opus 4.6 · Qwen3-4B · GRPO · Humanity's Last Exam · SWE-bench · RL Conductor · Terminal-Bench · ARC-AGI · HLE · CyberGym · AIME26 · FrontierMath · Fugu · Sakana AI

More like this (12)

GPQA · GQA · AutoGPTQ · GPTQ · Global-PIQA · CXR-VQA · tcGP · MedMCQA · DPG Benchmark · IndQA · Protocol QA · Gemini Advanced

Recent events (13)

7 · The Batch · Jul 3, 2026

Sakana AI releases Fugu and Fugu-Ultra orchestrator models that spawn Claude, Gemini, and GPT agents

Sakana AI, a Tokyo-based research lab, released two dedicated orchestrator models—Fugu and Fugu-Ultra—that dynamically delegate tasks to a pool of underlying LLMs including Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 under a single API. Fugu-Ultra achieves state-of-the-art results on SWE-Bench Pro, Humanity's Last Exam, LiveCodeBench Pro, and GPQA-Diamond, outperforming individual frontier models on several benchmarks. The models are trained via supervised fine-tuning plus sep-CMA-ES evolutionary optimization and GRPO reinforcement learning to select the best worker model per subtask, with Fugu-Ultra using a sub-component called Conductor to coordinate parallel agentic workflows. The approach represents a commercially available alternative to dependence on any single frontier model, with pricing available via Sakana API, OpenRouter, and Vercel.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.1 Pro · Fugu · GRPO · +17 more

6 · arXiv · cs.LG · Jul 2, 2026

Theoria: Structured verification architecture for auditable AI reasoning via typed state transitions

Theoria is a verification architecture that rewrites candidate AI solutions into sequences of typed state transitions, each requiring an explicit justification (citation, computation, or given fact), making every reasoning step independently auditable. On HLE-Verified Gold (185 expert problems), Theoria certifies 105 at 91.4% strict precision, and on adversarial poisoned proofs catches 94.7% of errors versus 83.2% for holistic LLM judges — a gap concentrated in hidden premises and fabricated citations. The approach is complementary to scalar LLM judges (Jaccard overlap 0.14–0.36), suggesting ensemble use. On GPQA Diamond, certified precision reaches 97.1%.

Evaluation and Benchmarking · AI Safety Research · HLE-Verified Gold · GPQA Diamond · Theoria
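The two metrics quoted in the Theoria summary above, certified precision and Jaccard overlap, can be illustrated with a minimal Python sketch. The item IDs are made up, the function names are hypothetical, and the summary does not say which sets the reported overlap was computed over, so treating them as "flawed proofs flagged by each checker" is an assumption.

```python
def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two collections of item IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def certified_precision(certified, truly_correct):
    """Fraction of certified items that are actually correct."""
    certified = set(certified)
    if not certified:
        return 0.0
    return len(certified & set(truly_correct)) / len(certified)


# Toy, made-up IDs of flawed proofs flagged by each checker (assumption).
flagged_by_stepwise = {1, 2, 3, 4, 5}
flagged_by_judge = {4, 5, 6, 7}

overlap = jaccard(flagged_by_stepwise, flagged_by_judge)
combined = flagged_by_stepwise | flagged_by_judge

# Toy precision check: 4 items certified, 3 of them truly correct.
toy_precision = certified_precision({1, 2, 3, 4}, {1, 2, 3})
```

A low overlap means the two checkers flag largely different items, so their union covers more than either alone. That is the intuition behind the summary's suggestion of ensemble use.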

7 · arXiv · cs.AI · Jun 26, 2026

Co-failure ceiling theorem bounds maximum gains from LLM routing, voting, and mixture-of-agents across 67 frontier models

A new arXiv paper introduces the 'co-failure ceiling': if β is the rate at which all models in an ensemble fail on the same query, the paper proves that no routing, voting, or cascade policy can exceed an accuracy of 1 − β. Evaluated empirically across 67 models from 21 providers, the paper finds that standard pairwise error-correlation metrics systematically underprice the co-failure tail by ~2.5x on open-ended mathematics, and that combining models rarely beats the single best model without strong query-level routing signals. The work provides a finite-sample certificate (via Clopper-Pearson bounds) for the maximum achievable gain from a multi-model system, computable before training a router, and identifies answer format rather than subject matter as a key driver of co-failure on GPQA-Diamond.

Evaluation and Benchmarking · Inference Economics · Mixture-of-Agents · When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models · Clopper-Pearson · +2 more
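The ceiling described in the summary above can be sketched as follows. This is a minimal illustration, not the paper's code: the correctness matrix is toy data, the function names are hypothetical, and the one-sided exact (Clopper-Pearson) lower bound on the all-wrong rate is one plausible reading of the "finite-sample certificate".

```python
from math import comb


def co_failure_rate(correct):
    """correct[m][q] is 1 if model m answered query q correctly, else 0."""
    n_queries = len(correct[0])
    all_wrong = sum(
        1 for q in range(n_queries) if not any(row[q] for row in correct)
    )
    return all_wrong / n_queries


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_lower(k, n, alpha=0.05):
    """One-sided exact lower bound: the p solving P(X >= k | p) = alpha."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; P(X >= k | p) increases with p
        mid = (lo + hi) / 2
        if 1 - binom_cdf(k - 1, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Toy data (assumption): 3 models, 8 queries.
correct = [
    [1, 1, 1, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 1],
]
beta = co_failure_rate(correct)            # all-wrong rate
ceiling = 1 - beta                         # best accuracy any policy can reach
best_single = max(sum(row) / len(row) for row in correct)
max_gain = ceiling - best_single           # headroom over the best model

# With confidence 1 - alpha the true all-wrong rate is at least beta_lower,
# so the true ceiling is at most 1 - beta_lower.
n = len(correct[0])
beta_lower = clopper_pearson_lower(round(beta * n), n)
ceiling_upper = 1 - beta_lower
```

On this toy matrix two of eight queries are missed by every model, so no combination policy can score above 0.75, only 0.125 above the best single model.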

7 · The Batch · Jun 19, 2026

Independent evaluators struggle to benchmark Claude Fable 5 due to Anthropic's safety classifiers and data retention policies

Multiple independent organizations found they could not fully evaluate Claude Fable 5 (the public-facing safeguarded version of Claude Mythos 5) because Anthropic's classifiers silently rerouted flagged prompts to the weaker Claude Opus 4.8 or refused them outright. Evaluators including Artificial Analysis, Vals AI, and ARC Prize Foundation each adopted different scoring strategies — blended, pure, or abstaining entirely — producing widely divergent rankings depending on how refusals were handled. On GPQA Diamond, Claude Fable 5's score swung from 93.18% (2nd place) to 55.56% (94th place) depending on whether refusals were counted as failures. The episode surfaces a structural tension between safety-oriented deployment constraints and the ability of the field to independently measure frontier model capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Artificial Analysis · ARC Prize Foundation · Claude Mythos · +11 more
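How much the handling of refusals can move a score, as in the summary above, is easy to see in a small Python sketch. The outcome labels and counts are toy values, not the evaluators' data, and the two policies shown are simplified assumptions rather than the exact strategies each organization used.

```python
def accuracy(outcomes, refusals_count_as_failures):
    """Score a list of per-question outcomes: 'correct', 'wrong', or 'refused'."""
    if refusals_count_as_failures:
        scored = outcomes                      # refusals stay in the denominator
    else:
        scored = [o for o in outcomes if o != "refused"]  # refusals dropped
    if not scored:
        return 0.0
    return sum(o == "correct" for o in scored) / len(scored)


# Toy run (assumption): 6 correct, 1 wrong, 3 refused.
outcomes = ["correct"] * 6 + ["wrong"] + ["refused"] * 3

strict = accuracy(outcomes, refusals_count_as_failures=True)    # 6 / 10
lenient = accuracy(outcomes, refusals_count_as_failures=False)  # 6 / 7
```

The same set of answers yields 0.6 under the strict policy and about 0.857 under the lenient one, which is why leaderboards that pick different policies can rank the same model very differently.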

6 · The Batch · Jun 3, 2026

Qwen3.5 Small tops mobile-sized open models; GPT-5.3 Instant, Gemini 3.1 Flash-Lite, Claude memory import, and LLM deanonymization research

Alibaba released the Qwen3.5 Small model series (0.8B–9B parameters) with a hybrid Gated Delta Networks + sparse MoE architecture, with the 9B model outperforming OpenAI's gpt-oss-120B on GPQA Diamond despite being 13.5x smaller; all weights are Apache 2.0 licensed. Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model at $0.25/M input tokens with 2.5x faster TTFT than Gemini 2.5 Flash. OpenAI released GPT-5.3 Instant targeting conversational quality improvements and hallucination reduction, while Anthropic added memory import/export functionality across all Claude tiers. Separately, researchers from MATS, Anthropic, and ETH Zurich demonstrated that LLM-based pipelines can deanonymize pseudonymous online users at 68% recall/90% precision for $1–4 per profile.

Frontier Model Releases · Open Weights Progress · Claude · Google · Alibaba · +14 more

7 · The Batch · Jun 3, 2026

Google's Aletheia agent uses Gemini 3 Deep Think to generate novel solutions to unsolved Erdős problems

Google researchers introduced Aletheia, an agentic workflow using Gemini 3 Deep Think that generates, verifies, and revises solutions to previously unsolved mathematical problems. Applied to Erdős problems, Aletheia produced 13 correct solutions out of 200 evaluated, with 4 being genuinely novel contributions not found in existing literature. The announcement also reveals Gemini 3 Deep Think's benchmark performance: 48.4% on HLE, 84.6% on ARC-AGI-2, and 93.8% on GPQA Diamond. The system demonstrates both the promise and current limitations of AI-assisted mathematical research, with a 6.5% correct-under-intended-interpretation rate on a hard problem set.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · Gemini Deep Think · Tony Feng · +9 more

8 · The Batch · Jun 2, 2026

Claude Mythos Preview: Limited-Release Frontier Model with Exceptional Cybersecurity Capabilities

Anthropic has published a 244-page model card for Claude Mythos Preview, a frontier model not yet commercially available, which autonomously discovered thousands of high-severity vulnerabilities in popular operating systems and browsers during testing. To mitigate risks before potential deployment, Anthropic assembled Project Glasswing, a consortium of over 40 organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in model credits to patch vulnerabilities proactively. The model substantially outperforms Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across multiple benchmarks including CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), HLE (64.7%), and GraphWalks long-context (80%). The Batch notes parallels to OpenAI's GPT-2 limited-release strategy and characterizes the announcement as having elements of a publicity stunt alongside genuine safety concerns.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.1 Pro · GraphWalks · Linux Foundation · +18 more

7 · The Batch · Jun 1, 2026

Z.ai's GLM-5.1 Open-Weights Model Targets Multi-Hour Agentic Coding Tasks with Iterative Self-Evaluation

Z.ai released GLM-5.1, a 754B-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, capable of cycling through planning, execution, and strategy revision hundreds of times over sessions lasting up to eight hours. The model achieves the top open-weights score on the Artificial Analysis Intelligence Index and third place on Arena's Code leaderboard, while leading SWE-Bench Pro in Z.ai's own evaluations at 58.4 percent. Weights are available on HuggingFace under the MIT license, with API pricing roughly 40 percent higher than its predecessor but still below comparable proprietary models. No technical report has been published, leaving architecture and training details undisclosed.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.1 Pro · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +14 more

7 · arXiv · cs.LG · May 29, 2026

Entropy-Cut Metropolis-Hastings: Sampling-Based Reasoning Without RL Training

This paper introduces Entropy-Cut Metropolis-Hastings (ECMH), an algorithm that samples from a 'power distribution' over base language model outputs to elicit strong reasoning without reinforcement learning post-training. Rather than cutting reasoning traces at uniformly random positions, ECMH uses next-token entropy as a proxy to identify consequential decision points (e.g., choice of proof strategy), then resamples from those positions. The authors prove that mixing time scales with the number of decisions rather than the number of tokens, and demonstrate consistent improvements over RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.

Frontier Model Releases · Evaluation and Benchmarking · power distribution · MATH500 · Entropy-Cut Metropolis-Hastings · +6 more
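The idea of using next-token entropy to locate a cut point, as described in the summary above, can be sketched in a few lines of Python. This is an illustration only: the per-step distributions are toy values, the function names are hypothetical, and taking the single highest-entropy position is an assumption, since the summary does not state how ECMH turns entropies into cut positions or how the Metropolis-Hastings accept step is computed.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_cut_position(step_distributions):
    """Return the index of the highest-entropy step, plus all entropies."""
    entropies = [entropy(d) for d in step_distributions]
    cut = max(range(len(entropies)), key=entropies.__getitem__)
    return cut, entropies


# Toy trace (assumption): next-token distributions at three positions.
trace = [
    [0.97, 0.01, 0.01, 0.01],   # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],   # genuine fork, e.g. a choice of strategy
    [0.90, 0.05, 0.03, 0.02],   # mostly determined again
]
cut, entropies = pick_cut_position(trace)
```

Here the second position has the maximum possible entropy for four options, so it is selected as the place to resample from. Positions where the model is nearly certain are left alone.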

6 · arXiv · cs.CL · May 27, 2026

Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference

PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.

Long Context Evolution · Inference Economics · On-Policy Distillation (OPD) · Multi-Token Prediction (MTP) · speculative decoding · +7 more

5 · arXiv · cs.CL · May 21, 2026

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

Frontier Model Releases · Evaluation and Benchmarking · RLVR · ROUGE-L · AIME24 · +10 more
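To make the contrast in the summary above concrete, the Python sketch below compares a GRPO-style group-relative advantage with a simple pairwise-gap aggregate. Both functions are illustrative assumptions: the GRPO version uses the population standard deviation, and the pairwise version is an unweighted average of reward gaps that omits LamPO's confidence-aware log-probability weighting, which the summary does not specify. The function names and rewards are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: each reward standardized within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pairwise_gap_advantages(rewards):
    """Unweighted mean reward gap of each response against the others.

    Illustrative stand-in only; not the LamPO objective.
    """
    g = len(rewards)
    return [
        sum(r_i - r_j for j, r_j in enumerate(rewards) if j != i) / (g - 1)
        for i, r_i in enumerate(rewards)
    ]


# Toy group of four sampled responses with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
grpo = grpo_advantages(rewards)
pairwise = pairwise_gap_advantages(rewards)
```

Without weighting, the pairwise aggregate is just a rescaled centered reward, so any behavioral difference from GRPO would have to come from how individual comparisons are weighted.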

8 · OpenAI Blog · May 20, 2026

Advancing science and math with GPT-5.2

OpenAI has released GPT-5.2, described as its strongest model for mathematics and science, achieving state-of-the-art results on GPQA Diamond and FrontierMath benchmarks. The announcement highlights practical research applications including solving an open theoretical problem and generating verified mathematical proofs. The post positions GPT-5.2 as a meaningful step toward AI-assisted scientific discovery.

Frontier Model Releases · Evaluation and Benchmarking · GPT-5.2 · FrontierMath · GPQA Diamond · +2 more

6 · The Batch · May 18, 2026

Data Points: Thinking Machines Interaction Model, ERNIE 5.1, Co-Mathematician, RL Conductor, and More

This edition of The Batch covers five notable AI developments: Thinking Machines' research preview of an 'interaction model' with a 200ms micro-turn multimodal architecture; Baidu's ERNIE 5.1, a compressed derivative of ERNIE 5.0 using only 6% of typical pre-training compute; Google DeepMind's Co-Mathematician collaborative workbench reaching 48% on FrontierMath Tier 4; a 7B RL Conductor model that orchestrates multi-agent workflows via reinforcement learning; and Google's Magic Pointer cursor system powered by Gemini. Secondary items include GitHub Copilot pricing restructuring ahead of usage-based billing.

Training Infrastructure · Frontier Model Releases · Thinking Machines · SGLang · GitHub · +21 more