Almanac

SWE-Bench-Pro-Hard-AA

benchmark·active·provisional·swe-bench-pro-hard-aa-8d34028a·1 event·first seen 4d ago

Aliases: SWE-Bench-Pro-Hard-AA

Recent events (1)

6·The Batch·4d ago

Cursor's Composer 2.5 rivals GPT-5.5 and Claude Opus 4.7 on coding benchmarks at lower cost

Cursor released Composer 2.5, a specialized agentic coding model built on Moonshot's Kimi K2.5 open weights, with additional pretraining and reinforcement learning fine-tuning tailored to Cursor's own CLI harness. The model ranks third on the Artificial Analysis Coding Agent Index, behind Claude Opus 4.7 and GPT-5.5 at max reasoning, but it is substantially cheaper ($0.44 vs. $4.14 per task) and faster (6.7 vs. 17.7 minutes). The training approach (co-optimizing model and harness together using synthetic tasks, text feedback during RL, and 25x more synthetic data than Composer 2) illustrates a specialist-model strategy that challenges the dominance of generalist frontier models in coding workflows.