Entity · benchmark

BrowseComp

benchmark · active · browsecomp-e908fd4e · 10 events · first seen May 20, 2026

Aliases: BrowseComp, BrowseComp+, K-BrowseComp, BrowseComp-zh

More like this (12)

BrowseComp-Plus · CompVis · BigCode · BigCodeBench · Cursor Composer 2.5 · Browser-Use · CLIP · context compaction · ChipBench · Common Crawl · USearch · Cursor

Recent events (10)

7 · The Batch · Jul 17, 2026

OpenAI GPT-Live Pairs Full-Duplex Voice Models with GPT-5.5 Reasoning Backend

OpenAI released GPT-Live-1 and GPT-Live-1 mini on July 8, 2026, replacing Advanced Voice Mode with a full-duplex voice system that processes audio continuously and delegates harder queries to GPT-5.5 in the background. The architecture separates a real-time conversational voice model from a reasoning model, with user-selectable reasoning effort levels (Instant, Medium, High) routing to GPT-5.5 Instant or GPT-5.5 Thinking accordingly. Performance gains are substantial: GPQA scores jumped from 45.3% (AVM) to 84.2% (GPT-Live-1 at high reasoning), and BrowseComp improved from 0.7% to 75.2%. The system is live globally on iOS, Android, and ChatGPT.com for paid plans, though no developer API has shipped yet.

Frontier Model Releases · Agent and Tool Ecosystem · Thinking Machines · GPT-Live · ChatGPT · +18 more

7 · arXiv · cs.CL · Jul 13, 2026

Mach-Mind-4-Flash: 35B MoE agentic model matching 100B-class performance via post-training optimization

Mach-Mind-4-Flash is a 35B-parameter Mixture-of-Experts model with only 3B activated parameters that achieves performance comparable to 100B-class models through post-training techniques alone. The pipeline combines a unified RL/OPD training infrastructure with multi-teacher scheduling, parallel domain-specific RL experts fused via Multi-Teacher On-Policy Distillation (MOPD), and Hybrid Median-length Policy Optimization (HMPO), which compresses reasoning chains by 19–46% with minimal accuracy loss. Benchmark results include 92.70 on AIME'26, 82.82 on IFBench, and 75.80 on BFCL-v4; the authors claim the model leads or matches models 10–30x its activated size at a fraction of the inference cost. The work is notable for demonstrating that post-training optimization can close large gaps in activated parameter count for agentic tasks.

Inference Economics · Agent and Tool Ecosystem · IFBench · Behavioral-SafetyBench · AIME 2026 · +7 more

7 · Hacker News · Jun 30, 2026

Anthropic releases Claude Sonnet 5

Anthropic has released Claude Sonnet 5, a new mid-tier model in their Claude lineup. The announcement comes via the official Anthropic news page and generated significant community engagement on Hacker News with 714 points and 386 comments. As a new named model release from a frontier lab, this is a notable update to the Claude model family.

Frontier Model Releases · Inference Economics · Claude Sonnet 3.5 · Claude Sonnet 4 · BrowseComp · +5 more

7 · arXiv · cs.CL · Jun 30, 2026

Agents-A1: 35B MoE agent matches trillion-parameter models via horizon scaling

Researchers introduce Agents-A1, a 35B Mixture-of-Experts model that claims to match or exceed trillion-parameter models like Kimi-K2 and DeepSeek V4 on long-horizon agentic benchmarks. The approach scales agent trajectory length (averaging 45K tokens) and heterogeneous agent abilities rather than raw parameter count, using a three-stage training recipe including multi-teacher domain-routed distillation. On benchmarks such as SEAL-0, IFBench, HiPhO, and FrontierScience-Olympiad, Agents-A1 achieves leading or competitive results against models with roughly 30x more parameters. The work proposes a practical efficiency path for agentic capability scaling without proportional compute scaling.

Frontier Model Releases · Inference Economics · IFBench · Kimi K2 · DeepSeek V4 · +8 more

8 · The Batch · Jun 3, 2026

GPT-5.4 released with tool search, computer use, and frontier benchmark performance

OpenAI released GPT-5.4 in Thinking and Pro variants, featuring an expanded context window (up to 1.05M input tokens), native computer use, tool search capabilities, and adjustable reasoning levels. In independent testing by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning achieved state-of-the-art on GDP-Val-AA, BrowseComp, Terminal-Bench-Hard, SWE-Bench-Pro, and MCP Atlas, while trailing Gemini 3.1 Pro Preview on MMMU-Pro and Humanity's Last Exam. Pricing is set at the top of the market ($30/$180 per million input/output tokens for Pro), and the release also powers Codex, OpenAI's competitor to Claude Code. The item is reported via The Batch (tier 2 commentary) and includes additional context on Andrew Ng's chub CLI tool for agent documentation sharing.

Frontier Model Releases · Inference Economics · DeepLearning.AI · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +14 more

7 · The Batch · Jun 3, 2026

OpenAI GPT-5.4 Pro and GPT-5.4 Thinking challenge Gemini 3.1 Pro Preview for top AI model position

OpenAI released GPT-5.4 in two variants (Pro and Thinking), featuring expanded context windows up to 1.05M tokens, native computer use, tool search capabilities, and adjustable reasoning levels. In independent benchmarks by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning nearly ties Gemini 3.1 Pro Preview on the Intelligence Index (57 vs 57.2 points) but at roughly 3.3x the cost, while leading on coding and agentic sub-indices. The release leapfrogs Claude Opus 4.6 on most benchmarks but faces stiff competition from Google's Gemini 3.1 Pro Preview, which maintains a price and multimodal advantage.

Frontier Model Releases · Evaluation and Benchmarking · Artificial Analysis Intelligence Index · Claude Opus 4.6 · Gemini Deep Think · +16 more

5 · arXiv · cs.CL · Jun 2, 2026

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

K-BrowseComp is a new 400-problem benchmark for evaluating web-browsing agents in Korean-language contexts, with a 300-problem manually validated subset and a 100-problem adversarially constructed synthetic split. Frontier models including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 achieve only 30–46% on the verified subset, a significant drop from English BrowseComp performance, while Korean proprietary models score 0–10%. The benchmark exploits the asymmetry between problem creation and solving difficulty, and the adversarial synthetic split caps the strongest model at 26%, positioning it as a targeted stress test for agentic web-browsing capability.

Frontier Model Releases · Evaluation and Benchmarking · Korea Proprietary AI Foundation Model Program · DeepSeek V4 · BrowseComp · +3 more

7 · The Batch · Jun 2, 2026

Recursive Language Models Offer Path To Dramatically Expand Beyond the Context Window

MIT researchers Alex L. Zhang, Tim Kraska, and Omar Khattab propose Recursive Language Models (RLMs), a framework that offloads long-context processing to an external Python REPL environment, allowing models to programmatically fetch and manage text chunks via code generation. The root model spawns submodel instances to handle subtasks, aggregating their outputs recursively. Evaluated on benchmarks requiring reasoning over documents up to 11 million tokens, RLMs substantially outperform both base models and competing agentic strategies such as retrieval and summarization agents. For example, RLM-GPT-5 achieved 91.3% on BrowseComp+ versus GPT-5's inability to produce an answer, and ~50% accuracy on OOLONG-PAIRS at 1 million tokens versus near-zero for baseline approaches.
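
The recursive pattern in this summary (a root call that delegates pieces of an oversized context to sub-calls and then aggregates their outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the summary describes the model itself writing code in a Python REPL to decide how to fetch chunks, whereas the sketch below swaps in a fixed character-based chunking rule. The names `LLM`, `split_chunks`, `recursive_answer`, `max_chars`, and `max_depth` are all hypothetical.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption): prompt string in, answer text out.
LLM = Callable[[str], str]


def split_chunks(text: str, max_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def recursive_answer(query: str, context: str, llm: LLM,
                     max_chars: int = 2000, depth: int = 0,
                     max_depth: int = 8) -> str:
    """Answer a query over a context, recursing when the context is too long."""
    # Base case: the context fits the budget. The depth guard also lands here
    # and truncates, so the recursion always terminates.
    if len(context) <= max_chars or depth >= max_depth:
        return llm(f"Question: {query}\nContext: {context[:max_chars]}")
    # Recursive case: one sub-call per chunk, then aggregate the partial
    # answers by treating their concatenation as a new, shorter context.
    partials = [
        recursive_answer(query, chunk, llm, max_chars, depth + 1, max_depth)
        for chunk in split_chunks(context, max_chars)
    ]
    merged = "\n".join(partials)
    return recursive_answer(query, merged, llm, max_chars, depth + 1, max_depth)
```

Fixed-size chunking can split a relevant passage across two chunks, which is one reason a model-driven fetch strategy, as the summary describes, may be preferable in practice.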

Long Context Evolution · Evaluation and Benchmarking · MIT · OOLONG-PAIRS · Tim Kraska · +9 more

9 · Anthropic News · Jun 1, 2026

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. The model claims top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.
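
For a sense of scale, a 144-point Elo gap can be converted into an expected head-to-head score. The sketch below assumes the conventional logistic Elo model with a 400-point scale; the summary does not say which Elo variant GDPval-AA uses, so the result is illustrative only. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard logistic Elo model.

    Assumes the conventional 400-point scale; this is not confirmed by the source.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 144-point gap gives an expected score of about 0.70 for the higher-rated model.
print(round(elo_expected_score(144), 2))
```

Under that assumption, the higher-rated model would be expected to be preferred in roughly 70% of pairwise comparisons.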

Long Context Evolution · Frontier Model Releases · GPT-5.2 · Claude Opus 4.6 · adaptive thinking · +13 more

6 · OpenAI Blog · May 20, 2026

BrowseComp: a benchmark for browsing agents

OpenAI has released BrowseComp, a benchmark designed to evaluate the capabilities of web-browsing AI agents. The benchmark appears to target the ability of agents to navigate and retrieve information from the web. As a Tier 1 source announcement, this represents OpenAI's effort to establish evaluation standards for agentic browsing behavior. Details on task structure, difficulty, and baseline results are not provided in the body text.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BrowseComp · OpenAI