SWE-Bench Multilingual

benchmark · active · swe-bench-multilingual-7c8e25af · 5 events · first seen May 20, 2026

Aliases: SWE-Bench Multilingual, SWE-bench-Multilingual

More like this (12)

SWE-bench · SWE-Bench Lite · SWE-Bench Verified · SWE-Bench-Pro-Hard-AA · SpecBench · ESI-Bench · MSE-Bench · Claw-SWE-Bench · SkillsBench · SpatialBench · multilingual mathematical benchmarks · MT-Bench

Recent events (5)

6 · arXiv · cs.AI · 30h ago

PAIChecker finds 13.6% misalignment in SWE-bench Verified instances, proposes multi-agent fix

A new arXiv paper systematically audits SWE-bench Verified and finds that 13.6% of PR-Issue pairings exhibit misalignment across five patterns and eleven fine-grained scenarios, undermining the benchmark's validity as an LLM evaluation tool. The authors introduce PAIChecker, a three-phase multi-agent system for detecting such misalignment, achieving up to 92.12% binary accuracy on SWE-Gym and 91.67% on SWE-bench Multilingual. The finding is significant because SWE-bench is one of the most widely cited benchmarks for agentic coding capability, and systematic data quality issues could distort leaderboard rankings and capability claims.

Evaluation and Benchmarking · Agent and Tool Ecosystem · SWE-Gym · SWE-Bench Multilingual · SWE-Bench Verified

6 · arXiv · cs.LG · 2d ago

MindForge pipeline fine-tunes small models for whole-life-cycle software engineering via source-free program synthesis

MindForge is an automated pipeline that converts open-source command-line programs into source-free training environments exposing only compiled executables and documentation, enabling training data generation for from-scratch program synthesis. Using GLM-5.2 as a teacher agent, the authors fine-tune Qwen3.6-27B on synthesized trajectories, raising its ProgramBench pass rate from 37.98% to 49.51% and achieving gains across seven held-out benchmarks including SWE-bench Verified (+5.04) and RepoZero-C2Rust (+31.00). The work addresses a gap in coding agent training infrastructure by spanning the full software engineering life cycle rather than single-phase tasks. The result is notable for achieving frontier-comparable performance on a 27B model through targeted data curation.

Evaluation and Benchmarking · Open Weights Progress · FeatBench · MindForge · NL2Repo-Bench

5 · arXiv · cs.CL · Jun 11, 2026

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.

Evaluation and Benchmarking · Inference Economics · SWE-Bench Multilingual · OpenClaw · SWE-Bench Verified

6 · The Batch · Jun 2, 2026

MiniMax M2.7 proprietary reasoning model competes with Gemini and Claude Opus; roundup covers Cursor Composer 2, MAI-Image-2, Claude Code Channels, and Anthropic defense dispute

MiniMax released M2.7, a proprietary reasoning model that scored 66.6% on MLE Bench Lite (tying Gemini 3.1) and 56.22% on SWE-Pro, priced at $0.30/$1.20 per million tokens. The move to a proprietary release may mark a strategic pivot among Chinese AI labs away from open weights. Cursor released Composer 2, an agentic coding model built on a fine-tuned Kimi K2.5 (via a Moonshot partnership), priced 86% cheaper than its predecessor and scoring 73.7 on SWE-bench Multilingual. Anthropic released Claude Code Channels, which routes Telegram and Discord messages into local Claude Code sessions via MCP plugins, and separately filed a court response denying that it has any backdoor or kill switch into military deployments of Claude. Microsoft announced MAI-Image-2, a text-to-image model ranking third on Arena.ai among research labs.

Frontier Model Releases · Open Weights Progress · Stitch · Claude Sonnet 4 · SWE-Pro

6 · The Batch · May 20, 2026

Data Points: Cursor Composer 2.5, Gemini 3.5 Flash, Antigravity 2.0, Omni Flash, AI Search, and Corti Symphony

This edition covers several notable AI product and model releases. Cursor shipped Composer 2.5 (built on Kimi K2.5), which scores 79.8% on SWE-Bench Multilingual at significantly lower cost than frontier competitors. Google released Gemini 3.5 Flash with a claimed 4x speed advantage and launched Antigravity 2.0, an agent-first desktop app that replaces its IDE; it also introduced Gemini Omni Flash for multimodal video generation and overhauled its search interface with Gemini 3.5. Additionally, Copenhagen-based Corti launched Symphony for Speech-to-Text, which achieves a 1.4% word error rate on medical terminology versus 17-19% for generalist models.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · Gemini Spark · Cursor Composer 2.5