Almanac
benchmark

SWE-Bench Verified

benchmarkactiveswe-bench-verified-fd690fdf·18 events·first seen 1mo ago

Aliases: SWE-Bench Verified, SWE-bench-Verified-Mini

Co-occurring entities

More like this (12)

Recent events (18)

7Openai Blog·28d ago·source ↗

OpenAI Abandons SWE-bench Verified Over Contamination and Measurement Flaws

OpenAI has announced it will no longer evaluate models on SWE-bench Verified, citing benchmark contamination and flawed test cases that cause it to mismeasure frontier coding capabilities. Their analysis identified both problematic test design and training data leakage as sources of unreliability. OpenAI recommends SWE-bench Pro as a replacement benchmark for evaluating coding progress.

6Openai Blog·28d ago·source ↗

Introducing SWE-bench Verified

OpenAI is releasing SWE-bench Verified, a human-validated subset of the SWE-bench benchmark designed to more reliably evaluate AI models on real-world software engineering tasks. The original SWE-bench contained issues that were ambiguous or unsolvable, leading to unreliable scores; the Verified subset addresses this by having human annotators confirm task solvability and clarity. This provides a cleaner signal for comparing coding agent performance across labs.

8Mistral Ai News·1mo ago·source ↗

Mistral Releases Devstral 2 (123B) and Devstral Small 2 (24B) Coding Models Plus Vibe CLI Agent

Mistral AI has released Devstral 2, a 123B-parameter open-weight coding model scoring 72.2% on SWE-bench Verified, and Devstral Small 2, a 24B model scoring 68.0% on the same benchmark and deployable on consumer hardware. Both models support a 256K context window and are permissively licensed (modified MIT and Apache 2.0 respectively). Mistral also launched Vibe CLI, an open-source terminal-based coding agent powered by Devstral that supports multi-file orchestration, natural language code editing, and IDE integration via Agent Communication Protocol. Devstral 2 is currently free via API with post-free pricing of $0.40/$2.00 per million tokens input/output.

7Mistral Ai News·15d ago·source ↗

Mistral AI Releases Devstral: Apache 2.0 Agentic Coding Model with SWE-Bench SOTA

Mistral AI, in collaboration with All Hands AI, releases Devstral, an agentic LLM specialized for software engineering tasks under the Apache 2.0 license. The model achieves 46.8% on SWE-Bench Verified, surpassing prior open-source state-of-the-art by over 6 percentage points and outperforming larger models like DeepSeek-V3-0324 (671B) and Qwen3 232B-A22B under the same OpenHands scaffold. Devstral is small enough to run on a single RTX 4090 or a Mac with 32GB RAM, and is available via Mistral's API at $0.1/M input tokens, as well as on HuggingFace, Ollama, and other platforms. Mistral indicates a larger agentic coding model is in development.

7arXiv · cs.CL·22d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

9Anthropic News·15d ago·source ↗

Anthropic Introduces Claude Opus 4 and Sonnet 4 with Leading Coding Benchmarks and Agent Capabilities

Anthropic has released Claude Opus 4 and Claude Sonnet 4, positioning Opus 4 as the world's best coding model with 72.5% on SWE-bench and 43.2% on Terminal-bench, and Sonnet 4 at 72.7% on SWE-bench. Both models are hybrid (near-instant + extended thinking), support extended thinking with tool use in beta, parallel tool execution, and improved memory via local file access. Alongside the models, Anthropic is launching Claude Code as generally available with GitHub Actions, VS Code, and JetBrains integrations, plus four new API capabilities: code execution tool, MCP connector, Files API, and one-hour prompt caching. Pricing is unchanged from prior Opus and Sonnet tiers ($15/$75 and $3/$15 per million tokens respectively), with availability on Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.

7Mistral Ai News·15d ago·source ↗

Mistral Announces Codestral 25.08 and Integrated Enterprise Coding Stack

Mistral AI has released Codestral 25.08, a code generation model update claiming +30% accepted completions, 50% fewer runaway generations, and improved FIM benchmark performance. The announcement also frames a full enterprise coding stack comprising Codestral (completion), Codestral Embed (code-specific retrieval), and Devstral (agentic workflows via OpenHands), all deployable on-prem or in VPC environments. Devstral Medium is reported to achieve 61.6% on SWE-Bench Verified, while Devstral Small (24B, Apache-2.0) reaches 53.6%. The pitch targets regulated industries blocked by SaaS-only competitors through self-hostable, air-gapped deployment options.

7Mistral Ai News·15d ago·source ↗

Mistral AI Releases Devstral Medium and Devstral Small 1.1 for Agentic Coding

Mistral AI, in collaboration with All Hands AI, has released two new agentic coding models: Devstral Small 1.1 (24B parameters, Apache 2.0, 53.6% on SWE-Bench Verified) and Devstral Medium (61.6% on SWE-Bench Verified, API-only). Devstral Medium is positioned as a cost-performance leader, claiming to surpass Gemini 2.5 Pro and GPT-4.1 at roughly one-quarter the price, priced at $0.4/M input and $2/M output tokens. Devstral Small 1.1 sets a new state-of-the-art among open models for code agents without test-time scaling, and supports both Mistral function calling and XML formats for broad agentic scaffold compatibility.

7Anthropic News·15d ago·source ↗

Claude Opus 4.1 Released with 74.5% SWE-bench Verified Score

Anthropic has released Claude Opus 4.1, an incremental upgrade to Claude Opus 4 focused on agentic tasks, coding, and reasoning. The model achieves 74.5% on SWE-bench Verified (without extended thinking) and shows notable gains in multi-file code refactoring and large-codebase debugging. It is available to paid Claude users, Claude Code, and via API on Anthropic, Amazon Bedrock, and Google Cloud Vertex AI at the same price as Opus 4. Anthropic notes substantially larger model improvements are planned for the coming weeks.

8Mistral Ai News·1mo ago·source ↗

Mistral Launches Medium 3.5 (128B Open Weights), Remote Cloud Coding Agents in Vibe, and Work Mode in Le Chat

Mistral AI has released Mistral Medium 3.5, a 128B dense open-weights model with a 256k context window, configurable reasoning effort, and a vision encoder trained from scratch, scoring 77.6% on SWE-Bench Verified. Alongside the model, Mistral is launching remote cloud-based coding agents in its Vibe CLI and Le Chat interface, enabling async parallel coding sessions that run independently and notify users on completion. A new Work mode in Le Chat provides a multi-step agentic interface for cross-tool workflows including email, calendar, research, and issue tracking. Mistral Medium 3.5 replaces Devstral 2 as the default model in both Le Chat and the Vibe CLI, and is available for self-hosting on as few as four GPUs under a modified MIT license.

9Anthropic News·15d ago·source ↗

Claude 3.7 Sonnet and Claude Code: Anthropic's First Hybrid Reasoning Model and Agentic Coding Tool

Anthropic has released Claude 3.7 Sonnet, described as their most capable model to date and the first hybrid reasoning model on the market, capable of operating in both standard and extended thinking modes within a single unified model. The model achieves state-of-the-art results on SWE-bench Verified and TAU-bench, with particular strength in coding and front-end web development. Alongside the model, Anthropic is launching Claude Code in limited research preview, a command-line agentic coding tool that can read/edit files, run tests, and push to GitHub. Pricing remains unchanged at $3/M input and $15/M output tokens, with availability across Claude.ai plans, Amazon Bedrock, and Google Cloud Vertex AI.

9Anthropic News·15d ago·source ↗

Anthropic Releases Claude Sonnet 4.5: Top Coding and Computer-Use Model with Agent SDK

Anthropic has released Claude Sonnet 4.5, claiming it is the best coding model and strongest model for building complex agents, with a 61.4% score on OSWorld (up from 42.2% for Sonnet 4) and state-of-the-art performance on SWE-bench Verified. The release is accompanied by major product upgrades including checkpoints in Claude Code, a native VS Code extension, a Claude Agent SDK giving developers access to the same infrastructure powering Claude Code, and new context editing and memory tools in the Claude API. Pricing is unchanged from Sonnet 4 at $3/$15 per million input/output tokens. Early enterprise customers including Cursor, GitHub Copilot, Devin, Canva, and Figma report significant gains in coding, agentic, and long-context tasks.

9Anthropic News·14d ago·source ↗

Anthropic introduces computer use capability, upgraded Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic announced three major developments: an upgraded Claude 3.5 Sonnet with significant coding improvements (SWE-bench Verified rising from 33.4% to 49.0%, surpassing all publicly available models including reasoning models), a new Claude 3.5 Haiku that matches Claude 3 Opus performance at Haiku-tier speed, and a public beta of 'computer use' — a capability allowing Claude to control computers by viewing screens, moving cursors, clicking, and typing. Computer use is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI, with early adopters including Replit, The Browser Company, and Cognition. Both safety institutes (US AISI and UK AISI) conducted pre-deployment testing, and the model was assessed as remaining within ASL-2 under Anthropic's Responsible Scaling Policy.

7Anthropic News·15d ago·source ↗

Anthropic Launches Claude Haiku 4.5: Near-Frontier Performance at $1/$5 per Million Tokens

Anthropic has released Claude Haiku 4.5, a small model priced at $1/$5 per million input/output tokens that delivers coding performance comparable to Claude Sonnet 4 at one-third the cost and more than twice the speed. The model surpasses Sonnet 4 on computer use tasks and achieves 90% of Sonnet 4.5's performance on agentic coding evaluations, running 4-5x faster than Sonnet 4.5. Notably, Haiku 4.5 is classified under ASL-2 safety standards—less restrictive than the ASL-3 applied to Sonnet 4.5 and Opus 4.1—and is described as Anthropic's safest model by automated alignment metrics. It is available via the Claude API, Amazon Bedrock, and Google Cloud Vertex AI.

7Anthropic News·14d ago·source ↗

Claude 3.5 Sonnet begins rollout on GitHub Copilot via Amazon Bedrock

Anthropic's Claude 3.5 Sonnet is now rolling out on GitHub Copilot, available in public preview for all Copilot Chat users in Visual Studio Code and GitHub.com. The model claims top performance on SWE-bench Verified among publicly available models and 93.7% on HumanEval. The integration runs via Amazon Bedrock's cross-region inference and reaches GitHub's community of over 100 million developers, representing a significant distribution milestone for Claude.

6arXiv · cs.AI·7d ago·source ↗

Frontier coding agents use metaprogramming to handle esoteric programming languages

A new arXiv paper evaluates six LLM-based coding agents on four esoteric programming languages (including Brainfuck and Befunge-98), finding that the strongest agents—Claude Opus 4.6 and GPT-5.4 xhigh—often avoid writing the target language directly, instead generating it via Python metaprograms. Forbidding this strategy causes large performance drops, and text guidance alone does not transfer the capability to weaker models, though sharing Opus-derived Python helper code does sharply improve mid-tier agents. The study reveals capability stratification that mainstream benchmarks like SWE-Bench Verified compress into narrow bands, suggesting frontier agents succeed by constructing and debugging working models of unfamiliar environments rather than pattern-matching to training data.

6Deepseek News·1mo ago·source ↗

DeepSeek-V2.5: Merged Open-Source Model Combining General and Coding Capabilities

DeepSeek has released DeepSeek-V2.5, an open-source model that merges DeepSeek-V2-Chat-0628 and DeepSeek-Coder-V2-0724 into a single unified model. The release improves general conversational capabilities, coding performance, instruction-following, and writing tasks while also strengthening safety properties—raising the overall safety score from 74.4% to 82.6% and reducing safety spillover rate from 11.3% to 4.6%. The model is available via backward-compatible API endpoints (deepseek-chat and deepseek-coder) and on HuggingFace, retaining features like Function Calling, FIM completion, and JSON output. Benchmark results show improvements on HumanEval Python and LiveCodeBench, though SWE-verified performance remains an acknowledged weak area.

5arXiv · cs.CL·6d ago·source ↗

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.