benchmark
Claw-SWE-Bench
benchmark · active · provisional
claw-swe-bench-a683135a · 1 event · first seen 6d ago
Aliases: Claw-SWE-Bench
Recent events (1)
Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks
Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to compare heterogeneous agent harnesses ("claws") fairly on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show that adapter design dominates raw model choice: with the same GLM 5.1 backbone, a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter, a gap of 54.3 percentage points. Separately, harness choice and model choice each shift Pass@1 by roughly 27 to 29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.
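To make the two evaluation axes concrete, here is a minimal sketch of how Pass@1 and a simple cost figure could be computed from per-instance results. It is not the benchmark's own code: the `RunResult` record, its `cost_usd` field, and the "cost per resolved instance" metric are assumptions chosen for illustration, and the summary does not say how the authors actually account for cost.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One harness attempt on one instance (hypothetical record layout)."""
    instance_id: str
    resolved: bool    # True if the single attempt resolved the issue
    cost_usd: float   # assumed cost field; not taken from the source


def pass_at_1(results: list[RunResult]) -> float:
    """Percentage of instances resolved on a single attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def cost_per_resolved(results: list[RunResult]) -> float:
    """Total spend divided by resolved instances (an assumed cost metric)."""
    resolved = sum(r.resolved for r in results)
    total = sum(r.cost_usd for r in results)
    return total / resolved if resolved else float("inf")


def point_gap(low_pct: float, high_pct: float) -> float:
    """Difference between two scores in percentage points."""
    return round(high_pct - low_pct, 1)
```

With the figures reported above, `point_gap(19.1, 73.4)` gives the 54.3-point gap between the minimal and full adapters on the same backbone.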