
GPQA Diamond
gpqa-diamond-aa4edc85 · 9 events · first seen 1mo ago
Aliases: GPQA Diamond, GPQA-Diamond
Recent events (9)
Advancing science and math with GPT-5.2
OpenAI has released GPT-5.2, described as its strongest model for mathematics and science, achieving state-of-the-art results on GPQA Diamond and FrontierMath benchmarks. The announcement highlights practical research applications including solving an open theoretical problem and generating verified mathematical proofs. The post positions GPT-5.2 as a meaningful step toward AI-assisted scientific discovery.
Entropy-Cut Metropolis-Hastings: Sampling-Based Reasoning Without RL Training
This paper introduces Entropy-Cut Metropolis-Hastings (ECMH), an algorithm that samples from a 'power distribution' over base language model outputs to elicit strong reasoning without reinforcement learning post-training. Rather than cutting reasoning traces at uniformly random positions, ECMH uses next-token entropy as a proxy to identify consequential decision points (e.g., choice of proof strategy), then resamples from those positions. The authors prove that mixing time scales with the number of decisions rather than tokens, and demonstrate consistent improvements over RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.
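As a rough illustration of the entropy-based cut selection described above, the sketch below computes the Shannon entropy of each next-token distribution in a trace and picks a resampling position with probability proportional to that entropy. The function names, the toy distributions, and the proportional weighting are all assumptions made for illustration; the summary does not specify how ECMH turns entropies into cut points, and the Metropolis-Hastings accept/reject step is not shown.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_cut_position(step_distributions, rng=random):
    """Pick a position to resample from, weighted by next-token entropy.

    Assumption: entropy-proportional weighting is an illustrative choice,
    not necessarily the rule used in the paper.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    if sum(entropies) == 0.0:
        # Every step is deterministic, so fall back to a uniform cut.
        return rng.randrange(len(step_distributions))
    return rng.choices(range(len(entropies)), weights=entropies, k=1)[0]


# Toy trace (hypothetical): one next-token distribution per position.
trace = [
    [1.0, 0.0, 0.0],     # forced token, zero entropy
    [0.4, 0.3, 0.3],     # high entropy, e.g. a choice between strategies
    [0.98, 0.01, 0.01],  # nearly forced token
]
cut = pick_cut_position(trace, random.Random(0))
```

In this toy trace the forced first position has zero weight and is never chosen, while the high-entropy second position is chosen most often, which is the intuition behind cutting at decisions rather than at arbitrary tokens.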
LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs
LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.
Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference
PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.
Claude Mythos Preview: Limited-Release Frontier Model with Exceptional Cybersecurity Capabilities
Anthropic has published a 244-page model card for Claude Mythos Preview, a frontier model not yet commercially available, which autonomously discovered thousands of high-severity vulnerabilities in popular operating systems and browsers during testing. To mitigate risks before potential deployment, Anthropic assembled Project Glasswing, a consortium of over 40 organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in model credits to patch vulnerabilities proactively. The model substantially outperforms Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across multiple benchmarks including CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), HLE (64.7%), and GraphWalks long-context (80%). The Batch notes parallels to OpenAI's GPT-2 limited-release strategy and characterizes the announcement as having elements of a publicity stunt alongside genuine safety concerns.
Qwen3.5 Small tops mobile-sized open models; GPT-5.3 Instant, Gemini 3.1 Flash-Lite, Claude memory import, and LLM deanonymization research
Alibaba released the Qwen3.5 Small model series (0.8B–9B parameters) with a hybrid Gated Delta Networks + sparse MoE architecture, with the 9B model outperforming OpenAI's gpt-oss-120B on GPQA Diamond despite being 13.5x smaller; all weights are Apache 2.0 licensed. Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model at $0.25/M input tokens with 2.5x faster TTFT than Gemini 2.5 Flash. OpenAI released GPT-5.3 Instant targeting conversational quality improvements and hallucination reduction, while Anthropic added memory import/export functionality across all Claude tiers. Separately, researchers from MATS, Anthropic, and ETH Zurich demonstrated that LLM-based pipelines can deanonymize pseudonymous online users at 68% recall/90% precision for $1–4 per profile.
Google's Aletheia agent uses Gemini 3 Deep Think to generate novel solutions to unsolved Erdős problems
Google researchers introduced Aletheia, an agentic workflow using Gemini 3 Deep Think that generates, verifies, and revises solutions to previously unsolved mathematical problems. Applied to Erdős problems, Aletheia produced 13 correct solutions out of 200 evaluated, with 4 being genuinely novel contributions not found in existing literature. The announcement also reveals Gemini 3 Deep Think's benchmark performance: 48.4% on HLE, 84.6% on ARC-AGI-2, and 93.8% on GPQA Diamond. The system demonstrates both the promise and current limitations of AI-assisted mathematical research, with a 6.5% correct-under-intended-interpretation rate on a hard problem set.
Data Points: Thinking Machines Interaction Model, ERNIE 5.1, Co-Mathematician, RL Conductor, and More
This edition of The Batch covers five notable AI developments: Thinking Machines' research preview of an 'interaction model' with a 200ms micro-turn multimodal architecture; Baidu's ERNIE 5.1, a compressed derivative of ERNIE 5.0 using only 6% of typical pre-training compute; Google DeepMind's Co-Mathematician collaborative workbench reaching 48% on FrontierMath Tier 4; a 7B RL Conductor model that orchestrates multi-agent workflows via reinforcement learning; and Google's Magic Pointer cursor system powered by Gemini. Secondary items include GitHub Copilot pricing restructuring ahead of usage-based billing.
Z.ai's GLM-5.1 Open-Weights Model Targets Multi-Hour Agentic Coding Tasks with Iterative Self-Evaluation
Z.ai released GLM-5.1, a 754B parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, capable of cycling through planning, execution, and strategy revision hundreds of times over sessions lasting up to eight hours. The model achieves top open-weights scores on the Artificial Analysis Intelligence Index and third place on Arena's Code leaderboard, while leading SWE-Bench Pro in Z.ai's own evaluations at 58.4 percent. Weights are available on HuggingFace under MIT license, with API pricing roughly 40 percent higher than its predecessor but still below comparable proprietary models. No technical report has been published, leaving architecture and training details undisclosed.