Entity · benchmark

SWE-Bench-Pro-Hard-AA

benchmark · active · swe-bench-pro-hard-aa-8d34028a · 1 event · first seen Jun 12, 2026

Aliases: SWE-Bench-Pro-Hard-AA

Co-occurring entities

Claude Opus 4.6 · SpaceX · Artificial Analysis Coding Agent Index · Cursor · CursorBench · Claude Code · OpenAI · Composer 2 · Kimi K2.5 · Moonshot AI · Codex · GPT-5.5 · Anthropic

More like this (12)

SWE-Bench Lite · ITBench-AA · SWE-bench · SWE-Bench Verified · Claw-SWE-Bench · ATE-Bench · SWE-Bench Multilingual · SorryBench · AutomationBench-AA · SWE-Pro · FeatBench · AdvBench

Recent events (1)

6 · The Batch · Jun 12, 2026

Cursor's Composer 2.5 rivals GPT-5.5 and Claude Opus 4.7 on coding benchmarks at lower cost

Cursor released Composer 2.5, a specialized agentic coding model built on Moonshot's Kimi K2.5 open weights, with additional pretraining and reinforcement learning fine-tuning tailored to Cursor's own CLI harness. The model ranks third on the Artificial Analysis Coding Agent Index, behind Claude Opus 4.7 and GPT-5.5 at max reasoning, but it is far cheaper ($0.44 vs. $4.14 per task) and faster (6.7 vs. 17.7 minutes). The training approach co-optimizes the model and harness together, using synthetic tasks, text feedback during RL, and 25x more synthetic data than Composer 2. It illustrates a specialist-model strategy that challenges the dominance of generalist frontier models in coding workflows.

Frontier Model Releases · Inference Economics · SWE-Bench-Pro-Hard-AA · Claude Opus 4.6 · SpaceX · +12 more