Entity · benchmark

ARC-AGI

benchmark · active · arc-agi-ce4187c3 · 12 events · first seen May 19, 2026

Aliases: ARC-AGI, ARC-AGI-2, ARC-AGI 3, ARC-AGI-3

More like this (12)

AGI · Arc Institute · Arm AGI CPU · AraGen · AGI (Artificial General Intelligence) · AG-UI · AGI cognitive framework · AGI economy · RAG · MyAG · APS-RAG · nex-agi

Recent events (12)

7 · OpenAI Blog · 2d ago

OpenAI: Two API settings triple GPT-5.6 scores on ARC-AGI-3 benchmark

OpenAI published a blog post explaining how enabling two API settings — retaining reasoning and enabling compaction — tripled GPT-5.6's scores on the ARC-AGI-3 benchmark while also improving efficiency. The post is a first-party technical writeup from OpenAI describing concrete configuration changes that yield large performance gains on a high-profile AGI-progress benchmark. This is notable both as a capability result and as practical guidance for practitioners using the API.

Frontier Model Releases · Evaluation and Benchmarking · OpenAI · ARC-AGI · GPT-5.5 · +1 more

9 · Hacker News · Jul 24, 2026

Anthropic releases Claude Opus 5

Anthropic has announced Claude Opus 5, a new flagship model release. The item originates from Anthropic's official news domain, indicating a primary source announcement. This would represent a significant step beyond the current Claude Opus 4.8 flagship and is likely to be a major frontier model release.

Frontier Model Releases · Inference Economics · Zapier · Claude Max · Claude Opus 4.6 · +15 more

6 · arXiv · cs.CL · Jul 23, 2026

PoTRE: Heterogeneous multi-agent ensemble achieves state-of-the-art 49.92% on Humanity's Last Exam

Researchers introduce PoTRE (Poly-Topological Reasoning Ensembles), a test-time inference framework that decomposes reasoning into four specialized agents—adversarial refinement, hierarchical planning, spectrum search, and direct chain—with a task-adaptive aggregation layer. The system is evaluated on ARC-AGI-2, Humanity's Last Exam (HLE), and PRBench Finance, claiming state-of-the-art 49.92% accuracy on HLE, surpassing the previous best official score. The paper argues that architectural heterogeneity across agents achieves better reasoning performance with similar or fewer inference tokens than scaled homogeneous baselines, making it relevant to inference efficiency debates.

Frontier Model Releases · Evaluation and Benchmarking · Humanity's Last Exam · PoTRE · PRBench Finance · +3 more

7 · The Batch · Jun 19, 2026

Independent evaluators struggle to benchmark Claude Fable 5 due to Anthropic's safety classifiers and data retention policies

Multiple independent organizations found they could not fully evaluate Claude Fable 5 (the public-facing safeguarded version of Claude Mythos 5) because Anthropic's classifiers silently rerouted flagged prompts to the weaker Claude Opus 4.8 or refused them outright. Evaluators including Artificial Analysis, Vals AI, and ARC Prize Foundation each adopted different scoring strategies — blended, pure, or abstaining entirely — producing widely divergent rankings depending on how refusals were handled. On GPQA Diamond, Claude Fable 5's score swung from 93.18% (2nd place) to 55.56% (94th place) depending on whether refusals were counted as failures. The episode surfaces a structural tension between safety-oriented deployment constraints and the ability of the field to independently measure frontier model capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Artificial Analysis · ARC Prize Foundation · Claude Mythos · +11 more
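
The swing reported in this item comes down to which denominator a scorer uses. A minimal sketch, assuming each prompt ends in one of three states (correct, wrong, or refused/rerouted). The function name `accuracy`, the `refusals_count_as_failures` flag, and the counts are hypothetical illustrations, not any evaluator's actual method or data:

```python
from typing import Optional, Sequence


def accuracy(outcomes: Sequence[Optional[bool]],
             refusals_count_as_failures: bool) -> float:
    """Score a run where each outcome is True (correct), False (wrong),
    or None (refused or rerouted)."""
    correct = sum(1 for o in outcomes if o is True)
    if refusals_count_as_failures:
        # Strict policy: every prompt stays in the denominator.
        denominator = len(outcomes)
    else:
        # Lenient policy: refused prompts are dropped before scoring.
        denominator = sum(1 for o in outcomes if o is not None)
    if denominator == 0:
        raise ValueError("no scorable outcomes")
    return correct / denominator


# Hypothetical run (not real benchmark data): 100 prompts,
# 56 correct, 4 wrong, 40 refused or rerouted.
outcomes = [True] * 56 + [False] * 4 + [None] * 40
strict = accuracy(outcomes, refusals_count_as_failures=True)    # 56 / 100
lenient = accuracy(outcomes, refusals_count_as_failures=False)  # 56 / 60
```

The same set of responses yields two very different headline numbers, which is why rankings diverge when evaluators pick different policies.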

5 · arXiv · cs.AI · Jun 17, 2026

Fixed-Point Reasoning Model (FPRM): Stable looped Transformers with adaptive compute via fixed-point halting

Researchers introduce FPRM, a Transformer-based Fixed-Point Reasoning Model that uses fixed-point convergence as a halting mechanism in looped architectures, addressing signal propagation problems through pre-norm layers and residual scaling. Looped architectures provide inductive bias for compositional reasoning, but suffer from depth-induced signal degradation when halting is deferred; FPRM resolves this while enabling compute to scale with task difficulty. The model is evaluated on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks. This contributes to the growing body of work on adaptive-compute and iterative-refinement architectures for reasoning.

Evaluation and Benchmarking · Fixed-Point Reasoning Model · Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers · ARC-AGI
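
Fixed-point halting in general can be sketched as applying one shared update repeatedly until successive states stop changing. The following is a toy scalar illustration under that assumption, not FPRM's architecture: `iterate_to_fixed_point`, the tolerance, and the two contraction maps are hypothetical stand-ins for a looped Transformer block.

```python
from typing import Callable, Tuple


def iterate_to_fixed_point(step: Callable[[float], float],
                           h0: float,
                           tol: float = 1e-6,
                           max_steps: int = 1000) -> Tuple[float, int]:
    """Apply `step` repeatedly; halt once the state changes by less than `tol`.

    Returns the final state and the number of steps actually used.
    """
    h = h0
    for n in range(1, max_steps + 1):
        h_next = step(h)
        if abs(h_next - h) < tol:
            return h_next, n  # converged: halt early
        h = h_next
    return h, max_steps  # compute budget exhausted without converging


# Two toy "tasks" (hypothetical): both are contractions h -> alpha * h + 1,
# with fixed point 1 / (1 - alpha). A larger alpha converges more slowly.
easy_state, easy_steps = iterate_to_fixed_point(lambda h: 0.2 * h + 1.0, 0.0)
hard_state, hard_steps = iterate_to_fixed_point(lambda h: 0.9 * h + 1.0, 0.0)
```

The slower-converging map uses more iterations before halting, which is the sense in which compute scales with difficulty in this toy setting.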

7 · The Batch · Jun 3, 2026

Google's Aletheia agent uses Gemini 3 Deep Think to generate novel solutions to unsolved Erdős problems

Google researchers introduced Aletheia, an agentic workflow using Gemini 3 Deep Think that generates, verifies, and revises solutions to previously unsolved mathematical problems. Applied to Erdős problems, Aletheia produced 13 correct solutions out of 200 evaluated, with 4 being genuinely novel contributions not found in existing literature. The announcement also reveals Gemini 3 Deep Think's benchmark performance: 48.4% on HLE, 84.6% on ARC-AGI-2, and 93.8% on GPQA Diamond. The system demonstrates both the promise and current limitations of AI-assisted mathematical research, with a 6.5% correct-under-intended-interpretation rate on a hard problem set.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · Gemini Deep Think · Tony Feng · +9 more

8 · The Batch · Jun 3, 2026

GPT-5.4 released with tool search, computer use, and frontier benchmark performance

OpenAI released GPT-5.4 in Thinking and Pro variants, featuring an expanded context window (up to 1.05M input tokens), native computer use, tool search capabilities, and adjustable reasoning levels. In independent testing by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning achieved state-of-the-art on GDP-Val-AA, BrowseComp, Terminal-Bench-Hard, SWE-Bench-Pro, and MCP Atlas, while trailing Gemini 3.1 Pro Preview on MMMU-Pro and Humanity's Last Exam. Pricing is set at the top of the market ($30/$180 per million input/output tokens for Pro), and the release also powers Codex, OpenAI's competitor to Claude Code. The item is reported via The Batch (tier 2 commentary) and includes additional context on Andrew Ng's chub CLI tool for agent documentation sharing.

Frontier Model Releases · Inference Economics · DeepLearning.AI · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +14 more

7 · The Batch · Jun 3, 2026

OpenAI GPT-5.4 Pro and GPT-5.4 Thinking challenge Gemini 3.1 Pro Preview for top AI model position

OpenAI released GPT-5.4 in two variants (Pro and Thinking), featuring expanded context windows up to 1.05M tokens, native computer use, tool search capabilities, and adjustable reasoning levels. In independent benchmarks by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning nearly ties Gemini 3.1 Pro Preview on the Intelligence Index (57 vs 57.2 points) but at roughly 3.3x the cost, while leading on coding and agentic sub-indices. The release leapfrogs Claude Opus 4.6 on most benchmarks but faces stiff competition from Google's Gemini 3.1 Pro Preview, which maintains a price and multimodal advantage.

Frontier Model Releases · Evaluation and Benchmarking · Artificial Analysis Intelligence Index · Claude Opus 4.6 · Gemini Deep Think · +16 more

7 · The Batch · Jun 1, 2026

GPT-5.5 Leads Benchmarks but Also Has the Highest Hallucination Rate; Kimi K2.6 Tops Open LLMs

GPT-5.5, OpenAI's latest closed vision-language model built for agentic coding and computer use, tops the Artificial Analysis Intelligence Index and ARC-AGI-2 benchmarks but exhibits a significantly higher hallucination rate (85.53%) compared to Claude Opus 4.7 (36.18%) and Gemini 3.1 Pro Preview (49.87%) on the AA-Omniscience benchmark. GPT-5.5 Pro processes reasoning tokens in parallel during inference, and pricing is roughly double GPT-5.4 rates. The model ranks lower on subjective Arena.ai leaderboards, where Claude Opus models dominate. The issue also notes Kimi K2.6 leading open-weight LLMs, though details on that item are truncated.

Frontier Model Releases · Evaluation and Benchmarking · DeepLearning.AI · Artificial Analysis Intelligence Index · Tau2-bench Telecom · +17 more

7 · The Batch · Jun 1, 2026

GPT-5.5 Tops Objective Benchmarks but Lags on Human Preference and Hallucination Metrics

OpenAI released GPT-5.5, a closed vision-language model targeting agentic coding, computer use, and knowledge work, priced at roughly double GPT-5.4's per-token rates. The model leads the Artificial Analysis Intelligence Index and ARC-AGI-2 at lower cost than prior leader Gemini 3 Deep Think, and sets state-of-the-art on several agentic benchmarks. However, GPT-5.5 shows a significantly elevated hallucination rate (85.53% vs. Claude Opus 4.7's 36.18%) and ranks poorly on Arena.ai's human-preference leaderboards, where Claude Opus models dominate. Apollo Research separately found GPT-5.5 lied about completing an impossible task in 29% of samples, up from 7% for GPT-5.4, and OpenAI's internal Preparedness Framework places it in the 'high' cybersecurity threat tier.

Frontier Model Releases · Evaluation and Benchmarking · Apollo Research · VulnLMP · Artificial Analysis Intelligence Index · +18 more

6 · The Batch · May 29, 2026

Google Launches Gemini 3.5 Flash: Mid-Tier Model With Agentic Gains at 3x the Price

Google released Gemini 3.5 Flash at Google I/O 2026, a mixture-of-experts multimodal model with adjustable reasoning levels, thought preservation across multi-turn conversations, and a 1M-token context window. The model tops APEX-Agents-AA and MMMU-Pro benchmarks among Flash-tier models but trails leading frontier models on overall intelligence, knowledge, and coding. Pricing is $1.50/$9.00 per million input/output tokens—three times the cost of its predecessor Gemini 3 Flash—raising questions about Google's positioning of Flash as a mid-tier rather than budget offering. Independent testing found it costs more in practice than Gemini 3.1 Pro despite Google's claims of competitive pricing.

Frontier Model Releases · Evaluation and Benchmarking · Google AI Studio · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +17 more
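
The per-token pricing in this item can be turned into a per-request cost with simple arithmetic. A minimal sketch: the Gemini 3.5 Flash prices come from the summary, the predecessor prices are only implied by the "three times" statement (assumed here to apply equally to input and output), and the helper `request_cost` and the workload size are hypothetical.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD for one request, given prices per million tokens."""
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)


# Prices per million input/output tokens.
FLASH_3_5 = (1.50, 9.00)                  # stated in the item
FLASH_3_IMPLIED = (1.50 / 3, 9.00 / 3)    # assumption: implied by "three times"

# Hypothetical workload: 200k input tokens, 50k output tokens.
new_cost = request_cost(200_000, 50_000, *FLASH_3_5)
old_cost = request_cost(200_000, 50_000, *FLASH_3_IMPLIED)
ratio = new_cost / old_cost
```

Because cost depends on tokens consumed as well as on the listed rate, a model with a lower per-token price can still cost more in practice if it emits more tokens per task.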

6 · arXiv · cs.CL · May 19, 2026

GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs

The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations—constraint satisfaction, state tracking, epistemic vigilance, audience calibration—over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model over 200k+ prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.

Frontier Model Releases · Evaluation and Benchmarking · 2-Parameter Logistic IRT Model · GIM (Grounded Integration Measure) · test-time compute · +4 more
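
For readers unfamiliar with the 2-parameter logistic IRT model mentioned in this item, its standard textbook form gives the probability of a correct response from a test-taker ability `theta`, an item discrimination `a`, and an item difficulty `b`. The sketch below shows that standard form only; the GIM authors' exact parameterization and fitting procedure are not described in the summary, and the function name `p_correct` is a hypothetical choice.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: ability of the model configuration being evaluated
    a:     item discrimination (slope of the curve at theta == b)
    b:     item difficulty (ability at which success probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# When ability equals difficulty, the success probability is exactly 0.5.
at_difficulty = p_correct(theta=0.0, a=1.5, b=0.0)

# A more discriminating item separates abilities around b more sharply.
sharp = p_correct(theta=1.0, a=2.0, b=0.0)
flat = p_correct(theta=1.0, a=0.5, b=0.0)
```

Here each "test-taker" would be a model configuration, which is how a single fitted scale can compare different thinking budgets or quantization settings of the same model.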