SWE-Bench Multilingual
swe-bench-multilingual-7c8e25af · 3 events · first seen 27d ago
Aliases: SWE-Bench Multilingual, SWE-bench-Multilingual
Recent events (3)
Data Points: Cursor Composer 2.5, Gemini 3.5 Flash, Antigravity 2.0, Omni Flash, AI Search, and Corti Symphony
This edition covers several notable AI product and model releases. Cursor shipped Composer 2.5 (built on Kimi K2.5), which scores 79.8% on SWE-Bench Multilingual at significantly lower cost than frontier competitors. Google released Gemini 3.5 Flash with a claimed 4x speed advantage and launched Antigravity 2.0, an agent-first desktop app that replaces its IDE. Google also introduced Gemini Omni Flash for multimodal video generation and overhauled its search interface with Gemini 3.5. Additionally, Copenhagen-based Corti launched Symphony for Speech-to-Text, which achieves a 1.4% word error rate on medical terminology versus 17-19% for generalist models.
MiniMax M2.7 proprietary reasoning model competes with Gemini and Claude Opus; roundup covers Cursor Composer 2, MAI-Image-2, Claude Code Channels, and Anthropic defense dispute
MiniMax released M2.7, a proprietary reasoning model that achieved 66.6% on MLE Bench Lite (tying Gemini 3.1) and 56.22% on SWE-Pro, priced at $0.30/$1.20 per million tokens. The shift to a proprietary release marks a potential strategic pivot among Chinese AI labs away from open weights. Cursor released Composer 2, an agentic coding model built on a fine-tuned Kimi 2.5 (via a Moonshot partnership), priced 86% cheaper than its predecessor and scoring 73.7 on SWE-bench Multilingual. Anthropic released Claude Code Channels, which routes Telegram and Discord messages into local Claude Code sessions via MCP plugins, and separately filed a court response denying that it has any backdoor or kill switch into military deployments of Claude. Microsoft announced MAI-Image-2, a text-to-image model ranking third on Arena.ai among research labs.
Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks
Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.
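To make the adapter comparison concrete, the arithmetic behind the quoted figures can be sketched as follows. This is only an illustration: it assumes Pass@1 is the percentage of instances resolved on a single attempt, and the `pass_at_1` helper and the instance counts in the usage line are hypothetical, not taken from the benchmark.

```python
# Pass@1 figures quoted in the summary (percent, same GLM 5.1 backbone).
MINIMAL_ADAPTER_PASS_AT_1 = 19.1
FULL_ADAPTER_PASS_AT_1 = 73.4

# Reported range for the shift attributed to harness choice or model choice.
REPORTED_SHIFT_RANGE_PP = (27.0, 29.0)


def pass_at_1(resolved: int, total: int) -> float:
    """Hypothetical helper: percentage of instances resolved in one attempt."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Gap between the full and minimal adapter, in percentage points.
adapter_gap_pp = round(FULL_ADAPTER_PASS_AT_1 - MINIMAL_ADAPTER_PASS_AT_1, 1)

# True when the adapter gap exceeds the upper end of the reported range.
adapter_gap_exceeds_range = adapter_gap_pp > REPORTED_SHIFT_RANGE_PP[1]

if __name__ == "__main__":
    print(f"adapter gap: {adapter_gap_pp} percentage points")
    print(f"exceeds reported 27-29 pp range: {adapter_gap_exceeds_range}")
    # Illustrative counts only: 40 of 80 instances resolved.
    print(f"example Pass@1: {pass_at_1(40, 80)}%")
```

The difference between the two adapter scores is 54.3 percentage points, which is larger than the 27-29 point shifts quoted for harness choice and model choice.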