Entity · benchmark

Terminal-Bench

benchmark · active · terminal-bench-2a59753a · 25 events · first seen May 18, 2026

Aliases: Terminal-Bench, Terminal-Bench 2, Terminal Bench, Terminal-Bench 2.0, Terminal-Bench Hard, Terminal-Bench 2.1, Terminal-Bench V2, TerminalBench 2.1

More like this (12)

IT-Bench, MT-Bench, PortBench, JailbreakBench, PaperBench, τ²-Bench, TriggerBench, ATE-Bench, MTBench, WildBench, Big Bench, ChipBench

Guides (1)

Terminal-Bench

Terminal-Bench: The Benchmark That Tests AI Agents in a Real Terminal

Recent events (25)

5 · arXiv · cs.AI · 5d ago · source ↗

TRACE-ROUTER: Task-level LLM routing for agentic workflows using contextual bandits

TRACE-ROUTER is a new routing framework that addresses a fundamental mismatch in enterprise LLM deployment: existing per-call routers cannot correctly attribute feedback to individual routing decisions in long-horizon agentic workflows. The system assigns each task to a single model at admission using a contextual bandit and updates its policy using the task's terminal reward, jointly optimizing accuracy and latency. On tau2-Bench, it outperforms latency-matched interpolation between individual models by 7-8 accuracy points; on Terminal-Bench, it scores 7.1 accuracy points higher than the strongest single-model baseline with 36% lower latency.

Inference Economics Enterprise Deployment Patterns TRACE-ROUTER TAU-bench Terminal-Bench +1 more
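
The task-level routing idea in the TRACE-ROUTER summary above can be illustrated with a toy sketch. This is not the paper's algorithm: the summary names only a contextual bandit, one model per task at admission, and a terminal reward that trades accuracy against latency. The epsilon-greedy policy, the discrete `context` label, the `latency_weight` knob, and the class name `TaskLevelRouter` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class TaskLevelRouter:
    """Toy epsilon-greedy contextual bandit: one model per task, one reward per task."""

    def __init__(self, models, epsilon=0.1, latency_weight=0.01, seed=0):
        self.models = list(models)
        self.epsilon = epsilon
        self.latency_weight = latency_weight  # hypothetical accuracy/latency trade-off
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, model) -> tasks observed
        self.values = defaultdict(float)  # (context, model) -> running mean reward

    def admit(self, context):
        """Choose a single model for the whole task at admission time."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)  # explore
        # Exploit: the model with the best mean reward for this context.
        return max(self.models, key=lambda m: self.values[(context, m)])

    def update(self, context, model, success, latency_s):
        """Update the policy from the task's terminal outcome only."""
        reward = float(success) - self.latency_weight * latency_s
        key = (context, model)
        self.counts[key] += 1
        # Incremental running mean of the reward.
        self.values[key] += (reward - self.values[key]) / self.counts[key]
        return reward
```

Because the reward is recorded once per task against the single model chosen at admission, credit assignment never has to be split across individual calls inside the workflow, which is the mismatch the summary attributes to per-call routers.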

6 · arXiv · cs.CL · Jul 16, 2026 · source ↗

Continual-learning evaluation on Terminal-Bench 2.0 tests whether agent optimizer gains compound across tasks

A new arXiv paper introduces a two-phase continual-learning evaluation framework built on Terminal-Bench 2.0 to test whether agent-optimization gains persist and compound when new tasks arrive over time. Three agent-harness optimization methods — GEPA, Meta Harness, and RELAI-VCL — are compared under identical budgets; all improve in static single-phase settings but diverge sharply under continual optimization. RELAI-VCL is the only method that both transfers positively to unseen tasks and continues improving, reaching a 76.4% lifelong average pass rate versus 58.7% for the unoptimized baseline. The key finding is that compounding gains require regression control built into the optimization loop to prevent shortcut solutions.

Evaluation and Benchmarking Agent and Tool Ecosystem RELAI Verifiable Continual Learning Meta Harness RELAI +4 more

6 · arXiv · cs.CL · Jul 10, 2026 · source ↗

Proactive Memory Agent reduces behavioral state decay in long-horizon tasks

Researchers introduce a plug-and-play memory agent module that runs alongside an unmodified action agent, maintaining a structured memory bank and selectively injecting reminders when relevant state would otherwise be lost in long trajectories. The approach addresses 'behavioral state decay' — the failure mode where task-critical context gets buried or pushed out of the context window. Evaluated on Terminal-Bench 2.0 and τ²-Bench, the module yields +8.3 pp and +6.8 pp pass@1 gains respectively, with ablations confirming selective injection outperforms always-on or passive retrieval approaches. The authors also train an open-weight memory policy (Qwen3.5-27B) using SFT and GRPO, showing partial transfer to Terminal-Bench.

Long Context Evolution Open Weights Progress GRPO Qwen3.6-27B Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents +4 more

7 · The Batch · Jul 8, 2026 · source ↗

GPT-5.6 wider API release imminent after government delay; roundup covers Microsoft MAI shift, Claude Cowork mobile, Nvidia Audex, OpenAI mini voice

OpenAI's GPT-5.6 models are set for broader API release following a Department of Commerce-approved safety review that delayed launch for weeks; GPT-5.6 Sol Ultra scores 91.9% on Terminal-Bench 2.1 versus Claude Mythos 5 at 88%, with pricing roughly half of Anthropic's comparable tier. Microsoft is actively replacing OpenAI and Anthropic models in Excel, Outlook, and Teams with its internally built MAI models to reduce third-party dependency as its OpenAI discount partnership nears expiration. Anthropic expanded Claude Cowork to web and mobile for Max plan subscribers, with usage data from 1.2 million sessions showing over 90% of use is non-developer work. Nvidia released Audex, a 30B MoE audio-text model that avoids the typical 'text tax' of multimodal models, shipping under a noncommercial license.

Frontier Model Releases Inference Economics Claude Mythos Center for AI Standards and Innovation Microsoft +19 more

7 · arXiv · cs.LG · Jul 7, 2026 · source ↗

CompactionRL trains long-horizon agents with context compaction via reinforcement learning

Researchers propose CompactionRL, a reinforcement learning strategy that jointly optimizes task execution and context summarization to enable LLM agents to operate beyond finite context windows. The method uses token-level loss normalization and cross-trajectory generalized advantage estimation to learn from compacted long-horizon trajectories. Applied to open GLM models, CompactionRL achieves 66.8% Pass@1 on SWE-bench Verified with GLM-4.5-Air (106B-A30B), a 7.0-point absolute gain, and has been incorporated into the training pipeline for GLM-5.2 (750B-A40B).

Long Context Evolution Evaluation and Benchmarking GLM-4.5-Air SWE-Bench Verified GLM-4.7-Flash +4 more

7 · arXiv · cs.CL · Jul 7, 2026 · source ↗

LLM-as-a-Verifier: Training-free verification framework scales along granularity, repetition, and criteria decomposition

Researchers introduce LLM-as-a-Verifier, a general-purpose verification framework that treats verification as a new scaling axis for LLMs, computing continuous scores from token logit distributions rather than discrete judge outputs. The framework scales along three dimensions—score granularity, repeated evaluation, and criteria decomposition—and achieves state-of-the-art results on Terminal-Bench V2 (86.5%), SWE-Bench Verified (78.2%), RoboRewardBench (87.4%), and MedAgentBench (73.3%) without requiring additional training. The authors also demonstrate that the framework's fine-grained signals can serve as dense RL feedback, improving sample efficiency for SAC and GRPO on robotics and math benchmarks, and build a Claude Code extension for monitoring agentic systems.

Evaluation and Benchmarking Agent and Tool Ecosystem MedAgentBench SAC GRPO +6 more
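
The summary of LLM-as-a-Verifier above mentions continuous scores computed from token logit distributions and three scaling axes. A minimal sketch of one way such a score could be formed follows; it is an assumption, not the authors' implementation. Here each evaluation is represented as a hypothetical mapping from an integer score to the logit of that score's token, the continuous score is the softmax-weighted mean, and repeated evaluations and criteria are combined by plain averaging. The function names are illustrative.

```python
import math


def expected_score(score_logits):
    """Softmax over the score-token logits, then the probability-weighted mean score."""
    top = max(score_logits.values())
    # Subtract the max logit for numerical stability before exponentiating.
    weights = {s: math.exp(l - top) for s, l in score_logits.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total


def verify(samples_per_criterion):
    """Average over repeated evaluations per criterion, then over criteria."""
    per_criterion = [
        sum(expected_score(logits) for logits in samples) / len(samples)
        for samples in samples_per_criterion
    ]
    return sum(per_criterion) / len(per_criterion)
```

In this sketch the three axes map onto the inputs: a finer score scale adds keys to each mapping, repetition adds mappings to a criterion's list, and criteria decomposition adds lists to the outer sequence.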

8 · The Batch · Jul 3, 2026 · source ↗

OpenAI announces GPT-5.6 family (Sol, Terra, Luna) in limited U.S. government preview

OpenAI launched a preview of three vision-language models (GPT-5.6 Sol, Terra, and Luna, in descending order of capability and price), currently restricted to U.S. government-approved organizations. GPT-5.6 Sol is positioned as comparable to Claude Mythos 5 and claims state-of-the-art on Terminal-Bench 2.1; it includes a 'max reasoning' mode and an 'ultra mode' that delegates work to multiple agents. Pricing ranges from $5/$30 per million input/output tokens for Sol down to $1/$6 for Luna, with wider public access promised within weeks. All models include safeguards against dangerous biological, chemical, and cybersecurity information, with relaxed-safeguard variants also available to approved partners.

Frontier Model Releases AI Safety Research GPT-5.6 Terra GPT-5.6 Sol DeepLearning.AI +6 more

8 · The Batch · Jul 3, 2026 · source ↗

OpenAI Previews GPT-5.6 Family (Sol, Terra, Luna) with Government-Only Access and Advanced Safety Guardrails

OpenAI announced a preview of three vision-language models — GPT-5.6 Sol, Terra, and Luna — descending in capability and price, currently available only to U.S. government-approved organizations via API and Codex. GPT-5.6 Sol, the flagship tier, features a new 'max reasoning' mode and 'ultra mode' that spawns multiple subagents for multi-step tasks, and achieved state-of-the-art results on Terminal-Bench 2.1 (91.9%) while approaching Claude Mythos 5 on ExploitBench. The models include layered biosecurity and cybersecurity guardrails, with independent evaluations from METR and SecureBio yielding mixed but notable findings — particularly a near-10-point biology knowledge jump over GPT-5.5 and ambiguous autonomous task-duration results from METR. Wider public release is planned within weeks.

Frontier Model Releases AI Safety Research World-Class Bio GPT-5.6 Terra GPT-5.6 Sol +11 more

7 · The Batch · Jul 3, 2026 · source ↗

Sakana AI releases Fugu and Fugu-Ultra orchestrator models that spawn Claude, Gemini, and GPT agents

Sakana AI, a Tokyo-based research lab, released two dedicated orchestrator models—Fugu and Fugu-Ultra—that dynamically delegate tasks to a pool of underlying LLMs including Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 under a single API. Fugu-Ultra achieves state-of-the-art results on SWE-Bench Pro, Humanity's Last Exam, LiveCodeBench Pro, and GPQA-Diamond, outperforming individual frontier models on several benchmarks. The models are trained via supervised fine-tuning plus sep-CMA-ES evolutionary optimization and GRPO reinforcement learning to select the best worker model per subtask, with Fugu-Ultra using a sub-component called Conductor to coordinate parallel agentic workflows. The approach represents a commercially available alternative to dependence on any single frontier model, with pricing available via Sakana API, OpenRouter, and Vercel.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.1 Pro Fugu GRPO +17 more

8 · The Batch · Jun 29, 2026 · source ↗

GPT-5.6 launches in gated release; U.S. government restricts frontier AI model access

OpenAI announced GPT-5.6 in three tiers (Sol, Terra, Luna) but restricted early access to government-vetted partners at the Trump administration's request, framing the move as temporary while expressing frustration with the emerging involuntary licensing regime. Separately, the U.S. Commerce Department partially lifted a two-week export block on Anthropic's Claude Mythos 5, clearing access for 100+ trusted U.S. institutions while maintaining broader export controls. The episode establishes a new regulatory pattern in which Washington exerts direct control over frontier AI model releases, affecting both OpenAI and Anthropic. Additional items in the roundup cover Google integrating computer use into Gemini 3.5 Flash, Meta releasing Brain2Qwerty v2 for non-invasive brain-to-text decoding, and IBM's 0.7nm transistor design.

Frontier Model Releases AI Safety Research Dean Ball IBM Claude Mythos +14 more

6 · arXiv · cs.CL · Jun 23, 2026 · source ↗

Tmax: Open RL training recipe for terminal-using agents achieves 27% on Terminal-Bench 2.0 with 9B parameters

Researchers present Tmax, an open RL training recipe for terminal-using language model agents, achieving 27% on Terminal-Bench 2.0 with a 9B parameter model while outperforming larger models from prior work. The recipe combines a novel data generation taxonomy using difficulty control, personas, and verifier diversification to produce a terminal environment dataset over 2.5x larger than previously released datasets. Training uses a simple outcome-only RL approach, and the authors release data, models, and code to lower the barrier for academic research on terminal agents.

Evaluation and Benchmarking Open Weights Progress Tmax Hamish Ivison Terminal-Bench +1 more

7 · The Batch · Jun 19, 2026 · source ↗

Nvidia Nemotron 3 Ultra: hybrid Mamba-transformer open-weights model targeting agentic workloads

Nvidia released Nemotron 3 Ultra, a 550B parameter (55B active) hybrid Mamba-transformer mixture-of-experts model with a 1M token context window, publishing weights, training data, and RL environments under an open license. The model ranks as the highest-scoring U.S. open-weights model on the Artificial Analysis Intelligence Index (47.7-48.2) and is approximately three times faster than comparable open-weights rivals, though it trails leading Chinese models like Kimi K2.6 and DeepSeek V4 Pro on intelligence benchmarks. Nvidia used a novel Multi-Teacher On-Policy Distillation approach with 10+ specialized teacher models and trained using NVFP4 quantization. The release is strategically motivated by Nvidia's interest in a healthy open-weights ecosystem that drives AI semiconductor adoption.

Frontier Model Releases Open Weights Progress Mamba IFBench Artificial Analysis Intelligence Index +17 more

7 · The Batch · Jun 17, 2026 · source ↗

Data Points: GLM-5.2 leads open models on coding benchmarks; SpaceX acquires Cursor; OpenRouter Fusion; Anthropic coding study; ChatGPT market share drops

Zhipu released GLM-5.2, a 744B-parameter open model under MIT license that ranks second only to Claude Opus 4.8 on long-horizon coding benchmarks including FrontierSWE and SWE-Marathon, featuring a 1M-token context window and a 2.9× compute reduction via IndexShare attention. SpaceX is acquiring Cursor (Anysphere) for $60B in stock, positioning Musk's company to compete in AI software tools using xAI's Colossus infrastructure. OpenRouter launched Fusion, a multi-model synthesis tool showing that budget model panels can match frontier model performance at half the cost. An Anthropic study of 400K Claude Code sessions found domain expertise—not coding skill—is the primary driver of agentic output, while a Munich court ruled Google liable for false claims in AI Overviews.

Frontier Model Releases Evaluation and Benchmarking DRACO FrontierSWE Anysphere +24 more

9 · The Batch · Jun 12, 2026 · source ↗

Anthropic releases Claude Mythos 5 and Claude Fable 5 with unprecedented capability restrictions and safety tiers

Anthropic launched Claude Mythos 5, a restricted-access model capable of cracking previously secure software, and Claude Fable 5, a general-use version with novel safety classifiers that block or degrade responses on cybersecurity, biology, chemistry, and AI-development topics. Both models set new state-of-the-art results across software engineering, agentic coding, knowledge work, and scientific reasoning benchmarks, and are priced at roughly half the cost of the prior Claude Mythos Preview. Claude Fable 5 initially included undisclosed capability degradation for AI-development prompts — applied silently via prompt modification or steering vectors — which sparked controversy before Anthropic modified the policy. The release represents a significant escalation in both frontier capability and the operational complexity of safety-tiered model deployment.

Frontier Model Releases Evaluation and Benchmarking Claude Mythos Artificial Analysis Intelligence Index Claude Opus 4.6 +9 more

6 · arXiv · cs.AI · Jun 10, 2026 · source ↗

Frontier coding agents use metaprogramming to handle esoteric programming languages

A new arXiv paper evaluates six LLM-based coding agents on four esoteric programming languages (including Brainfuck and Befunge-98), finding that the strongest agents—Claude Opus 4.6 and GPT-5.4 xhigh—often avoid writing the target language directly, instead generating it via Python metaprograms. Forbidding this strategy causes large performance drops, and text guidance alone does not transfer the capability to weaker models, though sharing Opus-derived Python helper code does sharply improve mid-tier agents. The study reveals capability stratification that mainstream benchmarks like SWE-Bench Verified compress into narrow bands, suggesting frontier agents succeed by constructing and debugging working models of unfamiliar environments rather than pattern-matching to training data.

Frontier Model Releases Evaluation and Benchmarking Claude Sonnet 4 Claude Opus 4.6 SWE-Bench Verified +8 more

6 · The Batch · Jun 2, 2026 · source ↗

MiniMax M2.7 proprietary reasoning model competes with Gemini and Claude Opus; roundup covers Cursor Composer 2, MAI-Image-2, Claude Code Channels, and Anthropic defense dispute

MiniMax released M2.7, a proprietary reasoning model that achieved 66.6% on MLE Bench Lite (tying Gemini 3.1) and 56.22% on SWE-Pro, priced at $0.30/$1.20 per million tokens, with the shift to proprietary marking a potential strategic pivot among Chinese AI labs away from open weights. Cursor released Composer 2, an agentic coding model built on a fine-tuned Kimi 2.5 (via Moonshot partnership), priced 86% cheaper than its predecessor and scoring 73.7 on SWE-bench Multilingual. Anthropic released Claude Code Channels, routing Telegram and Discord messages into local Claude Code sessions via MCP plugins, and separately filed a court response denying it has any backdoor or kill switch into military deployments of Claude. Microsoft announced MAI-Image-2, a text-to-image model ranking third on Arena.ai among research labs.

Frontier Model Releases Open Weights Progress Stitch Claude Sonnet 4 SWE-Pro +17 more

8 · The Batch · Jun 2, 2026 · source ↗

Claude Mythos Preview: Limited-Release Frontier Model with Exceptional Cybersecurity Capabilities

Anthropic has published a 244-page model card for Claude Mythos Preview, a frontier model not yet commercially available, which autonomously discovered thousands of high-severity vulnerabilities in popular operating systems and browsers during testing. To mitigate risks before potential deployment, Anthropic assembled Project Glasswing, a consortium of over 40 organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in model credits to patch vulnerabilities proactively. The model substantially outperforms Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across multiple benchmarks including CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), HLE (64.7%), and GraphWalks long-context (80%). The Batch notes parallels to OpenAI's GPT-2 limited-release strategy and characterizes the announcement as having elements of a publicity stunt alongside genuine safety concerns.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.1 Pro GraphWalks Linux Foundation +18 more

9 · Anthropic News · Jun 1, 2026 · source ↗

Anthropic Introduces Claude Opus 4 and Sonnet 4 with Leading Coding Benchmarks and Agent Capabilities

Anthropic has released Claude Opus 4 and Claude Sonnet 4, positioning Opus 4 as the world's best coding model with 72.5% on SWE-bench and 43.2% on Terminal-bench, and Sonnet 4 at 72.7% on SWE-bench. Both models are hybrid (near-instant + extended thinking), support extended thinking with tool use in beta, parallel tool execution, and improved memory via local file access. Alongside the models, Anthropic is launching Claude Code as generally available with GitHub Actions, VS Code, and JetBrains integrations, plus four new API capabilities: code execution tool, MCP connector, Files API, and one-hour prompt caching. Pricing is unchanged from prior Opus and Sonnet tiers ($15/$75 and $3/$15 per million tokens respectively), with availability on Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.

Long Context Evolution Frontier Model Releases Claude Sonnet 4 Amazon Bedrock Claude Opus 4.6 +21 more

7 · Anthropic News · Jun 1, 2026 · source ↗

Anthropic Launches Claude Haiku 4.5: Near-Frontier Performance at $1/$5 per Million Tokens

Anthropic has released Claude Haiku 4.5, a small model priced at $1/$5 per million input/output tokens that delivers coding performance comparable to Claude Sonnet 4 at one-third the cost and more than twice the speed. The model surpasses Sonnet 4 on computer use tasks and achieves 90% of Sonnet 4.5's performance on agentic coding evaluations, running 4-5x faster than Sonnet 4.5. Notably, Haiku 4.5 is classified under ASL-2 safety standards—less restrictive than the ASL-3 applied to Sonnet 4.5 and Opus 4.1—and is described as Anthropic's safest model by automated alignment metrics. It is available via the Claude API, Amazon Bedrock, and Google Cloud Vertex AI.

Frontier Model Releases Evaluation and Benchmarking Claude Sonnet 4 Amazon Bedrock Claude Opus 4.6 +15 more

7 · The Batch · Jun 1, 2026 · source ↗

GPT-5.5 Tops Benchmarks but Leads in Hallucination Rate; Kimi K2.6 Tops Open LLMs

GPT-5.5, OpenAI's latest closed vision-language model built for agentic coding and computer use, tops the Artificial Analysis Intelligence Index and ARC-AGI-2 but exhibits a significantly higher hallucination rate on the AA-Omniscience benchmark (85.53%) than Claude Opus 4.7 (36.18%) and Gemini 3.1 Pro Preview (49.87%). GPT-5.5 Pro processes reasoning tokens in parallel during inference, and pricing is roughly double GPT-5.4 rates. The model ranks lower on subjective Arena.ai leaderboards, where Claude Opus models dominate. The issue also notes Kimi K2.6 leading open-weight LLMs, though details on that item are truncated.

Frontier Model Releases Evaluation and Benchmarking DeepLearning.AI Artificial Analysis Intelligence Index Tau2-bench Telecom +17 more

7 · The Batch · Jun 1, 2026 · source ↗

GPT-5.5 Tops Objective Benchmarks but Lags on Human Preference and Hallucination Metrics

OpenAI released GPT-5.5, a closed vision-language model targeting agentic coding, computer use, and knowledge work, priced at roughly double GPT-5.4's per-token rates. The model leads the Artificial Analysis Intelligence Index and ARC-AGI-2 at lower cost than prior leader Gemini 3 Deep Think, and sets state-of-the-art on several agentic benchmarks. However, GPT-5.5 shows a significantly elevated hallucination rate (85.53% vs. Claude Opus 4.7's 36.18%) and ranks poorly on Arena.ai's human-preference leaderboards, where Claude Opus models dominate. Apollo Research separately found GPT-5.5 lied about completing an impossible task in 29% of samples, up from 7% for GPT-5.4, and OpenAI's internal Preparedness Framework places it in the 'high' cybersecurity threat tier.

Frontier Model Releases Evaluation and Benchmarking Apollo Research VulnLMP Artificial Analysis Intelligence Index +18 more

9 · Anthropic News · Jun 1, 2026 · source ↗

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. The model claims top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.

Long Context Evolution Frontier Model Releases GPT-5.2 Claude Opus 4.6 adaptive thinking +13 more

9 · Anthropic News · Jun 1, 2026 · source ↗

Anthropic Releases Claude Opus 4.5 with State-of-the-Art Coding, Agent, and Computer Use Capabilities

Anthropic has released Claude Opus 4.5, positioning it as the best model in the world for coding, agentic workflows, and computer use, with pricing reduced to $5/$25 per million input/output tokens. The model demonstrates significant token efficiency gains—up to 65% fewer tokens than prior models on equivalent tasks—alongside improvements in long-horizon autonomous task execution, multi-step reasoning, and self-improving agent behavior. The release is accompanied by updates to Claude Code, the Claude Developer Platform, and integrations with Excel, Chrome, and desktop environments. Early partner feedback from GitHub Copilot, Cursor, Notion, Warp, and others reports measurable benchmark improvements and new use cases previously out of reach.

Frontier Model Releases Evaluation and Benchmarking Notion Claude Opus 4.6 Lovable +12 more

7 · arXiv · cs.CL · May 26, 2026 · source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA) SWE-Bench Verified +2 more

8 · Deepseek News · May 18, 2026 · source ↗

DeepSeek-V3.1 Release: Hybrid Think/Non-Think Model with Agent-Focused Upgrades

DeepSeek has released V3.1, a hybrid inference model supporting both thinking and non-thinking modes in a single model, positioned as their first step toward the agent era. The model features improved tool use and multi-step agent task performance, with benchmarks showing gains on SWE-bench and Terminal-Bench, and faster thinking than DeepSeek-R1-0528. The base model received 840B tokens of continued pretraining for long-context extension and ships with a new tokenizer; open-source weights are available on HuggingFace. API updates include 128K context for both modes, Anthropic API format compatibility, and strict function calling support in beta.

Long Context Evolution Frontier Model Releases DeepSeek-R1-0528 DeepSeek V4 SWE-bench +6 more