Entity · benchmark

SWE-Bench Verified

benchmark · active · swe-bench-verified-fd690fdf · 27 events · first seen May 18, 2026

Aliases: SWE-Bench Verified, SWE-bench-Verified-Mini

More like this (12)

SWE-bench, SWE-Bench Lite, SWE-Bench Multilingual, SWE-Bench-Pro-Hard-AA, WildBench, SpecBench, Claw-SWE-Bench, SWE-Pro, SorryBench, ESI-Bench, SWE-Gym, SWE-Perf

Guides (1)

SWE-Bench Verified

SWE-Bench Verified: The Coding AI Report Card

Recent events (27)

6 · arXiv · cs.AI · 30h ago

PAIChecker finds 13.6% misalignment in SWE-bench Verified instances, proposes multi-agent fix

A new arXiv paper systematically audits SWE-bench Verified and finds that 13.6% of PR-Issue pairings exhibit misalignment across five patterns and eleven fine-grained scenarios, undermining the benchmark's validity as an LLM evaluation tool. The authors introduce PAIChecker, a three-phase multi-agent system for detecting such misalignment, achieving up to 92.12% binary accuracy on SWE-Gym and 91.67% on SWE-bench Multilingual. The finding is significant because SWE-bench is one of the most widely cited benchmarks for agentic coding capability, and systematic data quality issues could distort leaderboard rankings and capability claims.

Evaluation and Benchmarking Agent and Tool Ecosystem SWE-Gym SWE-Bench Multilingual SWE-Bench Verified +1 more

6 · arXiv · cs.LG · 2d ago

MindForge pipeline fine-tunes small models for whole-life-cycle software engineering via source-free program synthesis

MindForge is an automated pipeline that converts open-source command-line programs into source-free training environments exposing only compiled executables and documentation, enabling training data generation for from-scratch program synthesis. Using GLM-5.2 as a teacher agent, the authors fine-tune Qwen3.6-27B on synthesized trajectories, raising its ProgramBench pass rate from 37.98% to 49.51% and achieving gains across seven held-out benchmarks including SWE-bench Verified (+5.04) and RepoZero-C2Rust (+31.00). The work addresses a gap in coding agent training infrastructure by spanning the full software engineering life cycle rather than single-phase tasks. The result is notable for reaching frontier-comparable performance with a 27B model through targeted data curation.

Evaluation and Benchmarking Open Weights Progress FeatBench MindForge NL2Repo-Bench +9 more

5 · arXiv · cs.CL · Jul 21, 2026

SWE-Pruner Pro uses agent's internal representations for efficient context pruning in coding agents

SWE-Pruner Pro is a new context pruning method for coding agents that leverages the agent's own internal representations to decide which lines of tool output to keep or discard, rather than relying on a separate classifier. A lightweight head converts these internal representations into keep-or-prune labels per line, augmented with a length-aware embedding. Evaluated across two open-weight backbones and four multi-turn benchmarks, the method saves up to 39% of prompt and completion tokens while maintaining task quality, and on MiMo-V2-Flash additionally improves SWE-Bench Verified resolve rate by +3.8%.

Long Context Evolution Inference Economics SWE-Pruner Pro MiMo-V2-Flash SWE-Bench Verified +2 more

6 · arXiv · cs.CL · Jul 15, 2026

Function-Aware Fill-in-the-Middle Mid-Training Improves Coding Agent Foundation Models

Researchers propose a self-supervised mid-training objective called function-aware fill-in-the-middle (FIM) that exploits the structural isomorphism between a coding agent's action-observation-continuation loop and function call sites in ordinary code. Applied to Qwen2.5-Coder-Instruct (7B/14B) and Qwen3-8B on a 2.6B-token GitHub corpus, the method yields +2.8 to +5.4 point gains on SWE-Bench-Verified and SWE-Bench-Lite across multiple post-training pipelines. Notably, the technique also mitigates capability erosion on non-agent coding and tool-use benchmarks, suggesting the function-call inductive bias generalizes beyond the training domain.

Frontier Model Releases Evaluation and Benchmarking SWE-Smith SWE-Bench Lite Qwen2.5-Coder-32B-Instruct +8 more

7 · arXiv · cs.LG · Jul 7, 2026

CompactionRL trains long-horizon agents with context compaction via reinforcement learning

Researchers propose CompactionRL, a reinforcement learning strategy that jointly optimizes task execution and context summarization to enable LLM agents to operate beyond finite context windows. The method uses token-level loss normalization and cross-trajectory generalized advantage estimation to learn from compacted long-horizon trajectories. Applied to open GLM models, CompactionRL achieves 66.8% Pass@1 on SWE-bench Verified with GLM-4.5-Air (106B-A30B), a 7.0-point absolute gain, and has been incorporated into the training pipeline for GLM-5.2 (750B-A40B).

Long Context Evolution Evaluation and Benchmarking GLM-4.5-Air SWE-Bench Verified GLM-4.7-Flash +4 more

7 · arXiv · cs.CL · Jul 7, 2026

LLM-as-a-Verifier: Training-free verification framework scales along granularity, repetition, and criteria decomposition

Researchers introduce LLM-as-a-Verifier, a general-purpose verification framework that treats verification as a new scaling axis for LLMs, computing continuous scores from token logit distributions rather than discrete judge outputs. The framework scales along three dimensions—score granularity, repeated evaluation, and criteria decomposition—and achieves state-of-the-art results on Terminal-Bench V2 (86.5%), SWE-Bench Verified (78.2%), RoboRewardBench (87.4%), and MedAgentBench (73.3%) without requiring additional training. The authors also demonstrate that the framework's fine-grained signals can serve as dense RL feedback, improving sample efficiency for SAC and GRPO on robotics and math benchmarks, and build a Claude Code extension for monitoring agentic systems.

Evaluation and Benchmarking Agent and Tool Ecosystem MedAgentBench SAC GRPO +6 more
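
The logit-based scoring mentioned in the LLM-as-a-Verifier summary above can be illustrated with a minimal sketch. It assumes the continuous score is a softmax-weighted mean over candidate score tokens, averaged over repeated evaluations and then over decomposed criteria; the summary does not give the paper's exact formula, and the names `expected_score`, `verify`, and the sample logits are hypothetical.

```python
import math
from statistics import mean


def expected_score(score_logits):
    """Turn logits over candidate score tokens into one continuous score.

    score_logits maps each numeric score (e.g. 1..5) to the logit the
    verifier model assigned to that score's token. (Assumed input format.)
    """
    top = max(score_logits.values())  # subtract the max for numerical stability
    weights = {score: math.exp(logit - top) for score, logit in score_logits.items()}
    total = sum(weights.values())
    # Probability-weighted mean of the scores (softmax expectation).
    return sum(score * weight for score, weight in weights.items()) / total


def verify(samples_per_criterion):
    """Average over repeated evaluations, then over decomposed criteria."""
    return mean(
        mean(expected_score(sample) for sample in samples)
        for samples in samples_per_criterion.values()
    )


# Hypothetical logits for two criteria on a 1-5 scale.
candidate = {
    "fixes_issue": [{1: -2.0, 3: 0.0, 5: 2.0}, {1: -1.0, 3: 0.0, 5: 1.0}],
    "tests_pass": [{1: 0.0, 3: 0.0, 5: 0.0}],
}
print(round(verify(candidate), 3))
```

A finer score scale, more repeated samples, or more criteria each add resolution to the final number, which is one plausible reading of the three scaling dimensions named in the summary.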

7 · arXiv · cs.LG · Jun 30, 2026

SWE-Interact benchmark evaluates coding agents on multi-turn, user-driven software engineering tasks

SWE-Interact is a new benchmark testbed that evaluates coding agents in realistic multi-turn developer workflows, where a user simulator starts with vague instructions and progressively reveals requirements. Unlike existing SWE benchmarks that provide complete specs upfront, SWE-Interact tests interactive goal discovery and iterative refinement. Frontier models including Claude Opus 4.8 and GPT-5.5 solve ~50% of single-turn baseline tasks but only ~25% of SWE-Interact tasks, revealing a significant capability gap. The benchmark is grounded in large-scale studies of real coding-agent interactions and identifies failure modes like over-agentic coding, requirement forgetting, and early abandonment under ambiguity.

Evaluation and Benchmarking Agent and Tool Ecosystem SWE-Interact SWE-Bench Verified OpenAI +3 more

6 · arXiv · cs.CL · Jun 24, 2026

SHERLOC: Training-free structured fault localization framework boosts code repair agent performance on SWE-Bench

SHERLOC is a training-free localization framework that pairs a reasoning LLM with compact repository tools to produce structured diagnostic context for code repair agents, rather than bare file pointers. It achieves 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified at ~30B parameters, matching or outperforming larger agentic methods. Injecting SHERLOC's diagnostic output into downstream repair agents yields an average +5.95 percentage point resolve rate improvement on SWE-Bench Verified while reducing localization tokens by 36.7% and total tokens by 23.1%. The work addresses a concrete inefficiency in agentic coding pipelines where roughly half the inference budget is spent on fault localization before any editing begins.

Evaluation and Benchmarking Agent and Tool Ecosystem SWE-Bench Lite SWE-Bench Verified SHERLOC

5 · arXiv · cs.LG · Jun 19, 2026

Probe-and-Refine Tuning improves coding agent performance via iterative repository guidance refinement

A new arXiv paper introduces probe-and-refine tuning, a procedure that uses synthetic bug-fix probes to iteratively improve AGENTS.md repository guidance files for LLM-based coding agents without requiring an agent loop during tuning. Evaluated on SWE-bench Verified with Qwen3.5-35B-A3B, the method achieves 33.0% mean resolve rate versus 28.3% for a static knowledge base baseline and 25.5% for an unguided baseline. The improvement is attributed to coverage gains—refined guidance helps agents locate the correct files rather than improving patch quality—and a step-budget experiment shows guidance is necessary for agents to productively use larger compute budgets.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen3.5-35B-A3B SWE-Bench Verified NVIDIA Nemotron-3-Nano-30B-A3B +2 more

5 · arXiv · cs.CL · Jun 11, 2026

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.

Evaluation and Benchmarking Inference Economics SWE-Bench Multilingual OpenClaw SWE-Bench Verified +4 more

6 · arXiv · cs.AI · Jun 10, 2026

Frontier coding agents use metaprogramming to handle esoteric programming languages

A new arXiv paper evaluates six LLM-based coding agents on four esoteric programming languages (including Brainfuck and Befunge-98), finding that the strongest agents—Claude Opus 4.6 and GPT-5.4 xhigh—often avoid writing the target language directly, instead generating it via Python metaprograms. Forbidding this strategy causes large performance drops, and text guidance alone does not transfer the capability to weaker models, though sharing Opus-derived Python helper code does sharply improve mid-tier agents. The study reveals capability stratification that mainstream benchmarks like SWE-Bench Verified compress into narrow bands, suggesting frontier agents succeed by constructing and debugging working models of unfamiliar environments rather than pattern-matching to training data.

Frontier Model Releases Evaluation and Benchmarking Claude Sonnet 4 Claude Opus 4.6 SWE-Bench Verified +8 more

7 · Anthropic News · Jun 3, 2026

Claude 3.5 Sonnet begins rollout on GitHub Copilot via Amazon Bedrock

Anthropic's Claude 3.5 Sonnet is now rolling out on GitHub Copilot, available in public preview for all Copilot Chat users in Visual Studio Code and GitHub.com. The model is reported to lead SWE-bench Verified among publicly available models and to score 93.7% on HumanEval. The integration runs via Amazon Bedrock's cross-region inference and reaches GitHub's community of over 100 million developers, representing a significant distribution milestone for Claude.

Frontier Model Releases Enterprise Deployment Patterns Amazon Bedrock Microsoft GitHub +7 more

9 · Anthropic News · Jun 3, 2026

Anthropic introduces computer use capability, upgraded Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic announced three major developments: an upgraded Claude 3.5 Sonnet with significant coding improvements (SWE-bench Verified rising from 33.4% to 49.0%, surpassing all publicly available models including reasoning models), a new Claude 3.5 Haiku that matches Claude 3 Opus performance at Haiku-tier speed, and a public beta of 'computer use', a capability allowing Claude to control computers by viewing screens, moving cursors, clicking, and typing. Computer use is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI, with early adopters including Replit, The Browser Company, and Cognition. The US and UK AI Safety Institutes both conducted pre-deployment testing, and the model was assessed as remaining within ASL-2 under Anthropic's Responsible Scaling Policy.

Frontier Model Releases Evaluation and Benchmarking OpenAI o1-preview Amazon Bedrock Claude 3.5 Sonnet +15 more

7 · Anthropic News · Jun 2, 2026

Claude Opus 4.1 Released with 74.5% SWE-bench Verified Score

Anthropic has released Claude Opus 4.1, an incremental upgrade to Claude Opus 4 focused on agentic tasks, coding, and reasoning. The model achieves 74.5% on SWE-bench Verified (without extended thinking) and shows notable gains in multi-file code refactoring and large-codebase debugging. It is available to paid Claude users, Claude Code, and via API on Anthropic, Amazon Bedrock, and Google Cloud Vertex AI at the same price as Opus 4. Anthropic notes substantially larger model improvements are planned for the coming weeks.

Frontier Model Releases Evaluation and Benchmarking Rakuten Group Amazon Bedrock Claude Opus 4.6 +9 more

9 · Anthropic News · Jun 1, 2026

Anthropic Introduces Claude Opus 4 and Sonnet 4 with Leading Coding Benchmarks and Agent Capabilities

Anthropic has released Claude Opus 4 and Claude Sonnet 4, positioning Opus 4 as the world's best coding model with 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, and Sonnet 4 at 72.7% on SWE-bench Verified. Both are hybrid models that offer near-instant responses and extended thinking, and both support extended thinking with tool use in beta, parallel tool execution, and improved memory via local file access. Alongside the models, Anthropic is launching Claude Code as generally available with GitHub Actions, VS Code, and JetBrains integrations, plus four new API capabilities: code execution tool, MCP connector, Files API, and one-hour prompt caching. Pricing is unchanged from prior Opus and Sonnet tiers ($15/$75 and $3/$15 per million input/output tokens respectively), with availability on Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.

Long Context Evolution Frontier Model Releases Claude Sonnet 4 Amazon Bedrock Claude Opus 4.6 +21 more

9 · Anthropic News · Jun 1, 2026

Claude 3.7 Sonnet and Claude Code: Anthropic's First Hybrid Reasoning Model and Agentic Coding Tool

Anthropic has released Claude 3.7 Sonnet, described as their most capable model to date and the first hybrid reasoning model on the market, capable of operating in both standard and extended thinking modes within a single unified model. The model achieves state-of-the-art results on SWE-bench Verified and TAU-bench, with particular strength in coding and front-end web development. Alongside the model, Anthropic is launching Claude Code in limited research preview, a command-line agentic coding tool that can read/edit files, run tests, and push to GitHub. Pricing remains unchanged at $3/M input and $15/M output tokens, with availability across Claude.ai plans, Amazon Bedrock, and Google Cloud Vertex AI.

Frontier Model Releases Evaluation and Benchmarking Canva Amazon Bedrock GitHub +14 more

7 · Anthropic News · Jun 1, 2026

Anthropic Launches Claude Haiku 4.5: Near-Frontier Performance at $1/$5 per Million Tokens

Anthropic has released Claude Haiku 4.5, a small model priced at $1/$5 per million input/output tokens that delivers coding performance comparable to Claude Sonnet 4 at one-third the cost and more than twice the speed. The model surpasses Sonnet 4 on computer use tasks and achieves 90% of Sonnet 4.5's performance on agentic coding evaluations, running 4-5x faster than Sonnet 4.5. Notably, Haiku 4.5 is classified under ASL-2 safety standards—less restrictive than the ASL-3 applied to Sonnet 4.5 and Opus 4.1—and is described as Anthropic's safest model by automated alignment metrics. It is available via the Claude API, Amazon Bedrock, and Google Cloud Vertex AI.

Frontier Model Releases Evaluation and Benchmarking Claude Sonnet 4 Amazon Bedrock Claude Opus 4.6 +15 more

9 · Anthropic News · Jun 1, 2026

Anthropic Releases Claude Sonnet 4.5: Top Coding and Computer-Use Model with Agent SDK

Anthropic has released Claude Sonnet 4.5, claiming it is the best coding model and strongest model for building complex agents, with a 61.4% score on OSWorld (up from 42.2% for Sonnet 4) and state-of-the-art performance on SWE-bench Verified. The release is accompanied by major product upgrades including checkpoints in Claude Code, a native VS Code extension, a Claude Agent SDK giving developers access to the same infrastructure powering Claude Code, and new context editing and memory tools in the Claude API. Pricing is unchanged from Sonnet 4 at $3/$15 per million input/output tokens. Early enterprise customers including Cursor, GitHub Copilot, Devin, Canva, and Figma report significant gains in coding, agentic, and long-context tasks.

Frontier Model Releases Evaluation and Benchmarking Canva Claude for Chrome Figma +13 more

7 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Devstral: Apache 2.0 Agentic Coding Model with SWE-Bench SOTA

Mistral AI, in collaboration with All Hands AI, has released Devstral, an agentic LLM specialized for software engineering tasks, under the Apache 2.0 license. The model achieves 46.8% on SWE-Bench Verified, surpassing the prior open-source state of the art by over 6 percentage points and outperforming larger models like DeepSeek-V3-0324 (671B) and Qwen3 235B-A22B under the same OpenHands scaffold. Devstral is small enough to run on a single RTX 4090 or a Mac with 32GB RAM, and is available via Mistral's API at $0.1/M input tokens, as well as on HuggingFace, Ollama, and other platforms. Mistral indicates a larger agentic coding model is in development.

Frontier Model Releases Evaluation and Benchmarking DeepSeek-V3-0324 Mistral AI GPT-4.1 mini +10 more

7 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Devstral Medium and Devstral Small 1.1 for Agentic Coding

Mistral AI, in collaboration with All Hands AI, has released two new agentic coding models: Devstral Small 1.1 (24B parameters, Apache 2.0, 53.6% on SWE-Bench Verified) and Devstral Medium (61.6% on SWE-Bench Verified, API-only). Devstral Medium is positioned as a cost-performance leader, claiming to surpass Gemini 2.5 Pro and GPT-4.1 at roughly one-quarter the price, priced at $0.4/M input and $2/M output tokens. Devstral Small 1.1 sets a new state-of-the-art among open models for code agents without test-time scaling, and supports both Mistral function calling and XML formats for broad agentic scaffold compatibility.

Frontier Model Releases Evaluation and Benchmarking Devstral 2 Small Mistral AI All Hands AI +10 more

7 · Mistral AI News · Jun 1, 2026

Mistral Announces Codestral 25.08 and Integrated Enterprise Coding Stack

Mistral AI has released Codestral 25.08, a code generation model update claiming +30% accepted completions, 50% fewer runaway generations, and improved FIM benchmark performance. The announcement also frames a full enterprise coding stack comprising Codestral (completion), Codestral Embed (code-specific retrieval), and Devstral (agentic workflows via OpenHands), all deployable on-prem or in VPC environments. Devstral Medium is reported to achieve 61.6% on SWE-Bench Verified, while Devstral Small (24B, Apache-2.0) reaches 53.6%. The pitch targets regulated industries blocked by SaaS-only competitors through self-hostable, air-gapped deployment options.

Frontier Model Releases Evaluation and Benchmarking Devstral 2 Small Fill-in-the-Middle (FIM) Mistral AI +13 more

7 · arXiv · cs.CL · May 26, 2026

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA) SWE-Bench Verified +2 more
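
The ABA summary above reports that dropping flawed tasks shifts both scores and rankings. The sketch below shows the arithmetic behind that effect on made-up data: a resolve rate is recomputed after excluding a set of flagged task IDs. The model names, task IDs, outcomes, and the `flawed` set are all hypothetical, and this is not the ABA tool itself.

```python
def resolve_rate(outcomes, exclude=frozenset()):
    """Fraction of tasks resolved, ignoring any task IDs in `exclude`."""
    kept = [resolved for task_id, resolved in outcomes.items() if task_id not in exclude]
    if not kept:
        raise ValueError("no tasks left after filtering")
    return sum(kept) / len(kept)


def rank(results, exclude=frozenset()):
    """Model names ordered from highest to lowest resolve rate."""
    return sorted(results, key=lambda m: resolve_rate(results[m], exclude), reverse=True)


# Hypothetical per-task outcomes (True = issue resolved).
results = {
    "model_a": {"t1": True, "t2": True, "t3": False, "t4": True, "t5": True},
    "model_b": {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False},
}
# Hypothetical audit verdict: t4 and t5 have faulty specifications or tests.
flawed = {"t4", "t5"}

print(rank(results))          # ranking on the full task set
print(rank(results, flawed))  # ranking after removing flagged tasks
```

In this toy data, model_a leads on the full set only because it passes the two flagged tasks; once they are excluded, model_b resolves every remaining task and moves ahead.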

6 · OpenAI Blog · May 20, 2026

Introducing SWE-bench Verified

OpenAI is releasing SWE-bench Verified, a human-validated subset of the SWE-bench benchmark designed to more reliably evaluate AI models on real-world software engineering tasks. The original SWE-bench contained issues that were ambiguous or unsolvable, leading to unreliable scores; the Verified subset addresses this by having human annotators confirm task solvability and clarity. This provides a cleaner signal for comparing coding agent performance across labs.

Evaluation and Benchmarking Agent and Tool Ecosystem SWE-Bench Verified SWE-bench OpenAI

7 · OpenAI Blog · May 20, 2026

OpenAI Abandons SWE-bench Verified Over Contamination and Measurement Flaws

OpenAI has announced it will no longer evaluate models on SWE-bench Verified, citing benchmark contamination and flawed test cases that cause it to mismeasure frontier coding capabilities. Their analysis identified both problematic test design and training data leakage as sources of unreliability. OpenAI recommends SWE-bench Pro as a replacement benchmark for evaluating coding progress.

Frontier Model Releases Evaluation and Benchmarking SWE-Bench Verified SWE-bench OpenAI +1 more

8 · Mistral AI News · May 18, 2026

Mistral Releases Devstral 2 (123B) and Devstral Small 2 (24B) Coding Models Plus Vibe CLI Agent

Mistral AI has released Devstral 2, a 123B-parameter open-weight coding model scoring 72.2% on SWE-bench Verified, and Devstral Small 2, a 24B model scoring 68.0% on the same benchmark and deployable on consumer hardware. Both models support a 256K context window and are permissively licensed (modified MIT and Apache 2.0 respectively). Mistral also launched Vibe CLI, an open-source terminal-based coding agent powered by Devstral that supports multi-file orchestration, natural language code editing, and IDE integration via Agent Communication Protocol. Devstral 2 is currently free via API; once the free period ends, it will be priced at $0.40/$2.00 per million input/output tokens.

Long Context Evolution Frontier Model Releases Devstral 2 Small Mistral AI Kimi K2 +13 more

6 · DeepSeek News · May 18, 2026

DeepSeek-V2.5: Merged Open-Source Model Combining General and Coding Capabilities

DeepSeek has released DeepSeek-V2.5, an open-source model that merges DeepSeek-V2-Chat-0628 and DeepSeek-Coder-V2-0724 into a single unified model. The release improves general conversational capabilities, coding performance, instruction-following, and writing tasks while also strengthening safety properties, raising the overall safety score from 74.4% to 82.6% and reducing the safety spillover rate from 11.3% to 4.6%. The model is available via backward-compatible API endpoints (deepseek-chat and deepseek-coder) and on HuggingFace, retaining features like Function Calling, FIM completion, and JSON output. Benchmark results show improvements on HumanEval Python and LiveCodeBench, though SWE-bench Verified performance remains an acknowledged weak area.

Frontier Model Releases Evaluation and Benchmarking DeepSeek-V2-Chat-0628 DeepSeek V4 SWE-Bench Verified +8 more

8 · Mistral AI News · May 18, 2026

Mistral Launches Medium 3.5 (128B Open Weights), Remote Cloud Coding Agents in Vibe, and Work Mode in Le Chat

Mistral AI has released Mistral Medium 3.5, a 128B dense open-weights model with a 256k context window, configurable reasoning effort, and a vision encoder trained from scratch, scoring 77.6% on SWE-Bench Verified. Alongside the model, Mistral is launching remote cloud-based coding agents in its Vibe CLI and Le Chat interface, enabling async parallel coding sessions that run independently and notify users on completion. A new Work mode in Le Chat provides a multi-step agentic interface for cross-tool workflows including email, calendar, research, and issue tracking. Mistral Medium 3.5 replaces Devstral 2 as the default model in both Le Chat and the Vibe CLI, and is available for self-hosting on as few as four GPUs under a modified MIT license.

Long Context Evolution Frontier Model Releases Mistral AI Qwen3.5 397B A17B Devstral 2 +10 more