BrowseComp

benchmark·active·browsecomp-e908fd4e·6 events·first seen 28d ago

Aliases: BrowseComp, BrowseComp+, K-BrowseComp

Recent events (6)

6·OpenAI Blog·28d ago

BrowseComp: a benchmark for browsing agents

OpenAI has released BrowseComp, a benchmark for evaluating web-browsing AI agents. It appears to target agents' ability to navigate the web and retrieve information. As a Tier 1 source announcement, it represents OpenAI's effort to establish evaluation standards for agentic browsing. The body text gives no details on task structure, difficulty, or baseline results.

5·arXiv · cs.CL·15d ago

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

K-BrowseComp is a new 400-problem benchmark for evaluating web-browsing agents in Korean-language contexts, with a 300-problem manually validated subset and a 100-problem adversarially constructed synthetic split. Frontier models including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 achieve only 30–46% on the verified subset, a significant drop from English BrowseComp performance, while Korean proprietary models score 0–10%. The benchmark exploits the asymmetry between problem creation and solving difficulty, and the adversarial synthetic split caps the strongest model at 26%, positioning it as a targeted stress test for agentic web-browsing capability.

9·Anthropic News·16d ago

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. Anthropic claims top scores for the model on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, including a 144 Elo point lead over OpenAI's GPT-5.2 on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.

7·The Batch·15d ago

Recursive Language Models Offer Path To Dramatically Expand Beyond the Context Window

MIT researchers Alex L. Zhang, Tim Kraska, and Omar Khattab propose Recursive Language Models (RLMs), a framework that offloads long-context processing to an external Python REPL environment, allowing models to programmatically fetch and manage text chunks via code generation. The root model spawns submodel instances to handle subtasks and aggregates their outputs recursively. Evaluated on benchmarks requiring reasoning over documents of up to 11 million tokens, RLMs substantially outperform both base models and competing agentic strategies such as retrieval and summarization agents. For example, RLM-GPT-5 achieved 91.3% on BrowseComp+, whereas GPT-5 on its own could not produce an answer, and it reached ~50% accuracy on OOLONG-PAIRS at 1 million tokens versus near-zero for baseline approaches.
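
The split-and-aggregate idea described above can be illustrated with a minimal Python sketch. This is not the researchers' implementation: in the framework as summarized, the model itself writes code in a REPL to decide how to fetch chunks, whereas here a fixed line-based split stands in for that step. The names `recursive_answer`, `keyword_model`, `window_lines`, and `max_depth` are hypothetical, and `keyword_model` is a toy stand-in for a real language-model call.

```python
from typing import Callable

# Hypothetical model interface: (query, text) -> answer text.
Model = Callable[[str, str], str]


def keyword_model(query: str, text: str) -> str:
    """Toy stand-in for a model: return the lines that mention the query."""
    return "\n".join(line for line in text.splitlines() if query in line)


def recursive_answer(
    query: str,
    context: str,
    model: Model,
    window_lines: int = 50,
    depth: int = 0,
    max_depth: int = 5,
) -> str:
    """Answer a query over a context that may exceed the model's window.

    The context lives outside the model as an ordinary Python string.
    Sub-calls each see one chunk; their outputs are merged and the
    merged text is processed again until it fits in a single window.
    """
    lines = context.splitlines()

    # Base case: the context fits in one window, or the depth budget is spent
    # (in which case only the first window of lines is passed on).
    if len(lines) <= window_lines or depth >= max_depth:
        return model(query, "\n".join(lines[:window_lines]))

    # Split into window-sized chunks and delegate each to a sub-call.
    chunks = [
        "\n".join(lines[i:i + window_lines])
        for i in range(0, len(lines), window_lines)
    ]
    partials = [model(query, chunk) for chunk in chunks]

    # Aggregate the non-empty partial answers and recurse on the result.
    merged = "\n".join(p for p in partials if p)
    return recursive_answer(
        query, merged, model, window_lines, depth + 1, max_depth
    )
```

With a real model, `keyword_model` would be replaced by an API call, and the fixed chunking would presumably give way to model-generated code that chooses what to read; the depth limit is only a safeguard added for this sketch.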

7·The Batch·14d ago

OpenAI GPT-5.4 Pro and GPT-5.4 Thinking challenge Gemini 3.1 Pro Preview for top AI model position

OpenAI released GPT-5.4 in two variants (Pro and Thinking), featuring expanded context windows up to 1.05M tokens, native computer use, tool search capabilities, and adjustable reasoning levels. In independent benchmarks by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning nearly ties Gemini 3.1 Pro Preview on the Intelligence Index (57 vs 57.2 points) but at roughly 3.3x the cost, while leading on coding and agentic sub-indices. The release leapfrogs Claude Opus 4.6 on most benchmarks but faces stiff competition from Google's Gemini 3.1 Pro Preview, which maintains a price and multimodal advantage.

8·The Batch·14d ago

GPT-5.4 released with tool search, computer use, and frontier benchmark performance

OpenAI released GPT-5.4 in Thinking and Pro variants, featuring an expanded context window (up to 1.05M input tokens), native computer use, tool search capabilities, and adjustable reasoning levels. In independent testing by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning achieved state-of-the-art results on GDPval-AA, BrowseComp, Terminal-Bench-Hard, SWE-Bench-Pro, and MCP Atlas, while trailing Gemini 3.1 Pro Preview on MMMU-Pro and Humanity's Last Exam. Pricing is set at the top of the market ($30/$180 per million input/output tokens for Pro), and the release also powers Codex, OpenAI's competitor to Claude Code. The item is reported via The Batch (tier 2 commentary) and includes additional context on Andrew Ng's chub CLI tool for agent documentation sharing.