Entity · benchmark

Claw-SWE-Bench

benchmark · active · claw-swe-bench-a683135a · 1 event · first seen Jun 11, 2026

Aliases: Claw-SWE-Bench

Co-occurring entities

SWE-Bench Multilingual · OpenClaw · SWE-Bench Verified · SWE-bench · GLM-5.1

More like this (12)

ClawBench · RealClawBench · UniClawBench · QwenClawBench · SWE-bench · SWE-Bench Lite · Claw-Anything · EnterpriseClawBench · Claw-Eval · SWE-Bench-Pro-Hard-AA · SWE-Bench Verified · ScarfBench

Recent events (1)

5 · arXiv · cs.CL · Jun 11, 2026

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to compare heterogeneous agent harnesses ("claws") fairly on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show that adapter design dominates raw model choice: with the same GLM-5.1 backbone, a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter, while harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.
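To make the arithmetic behind these figures concrete, here is a minimal Python sketch. It computes the percentage-point gap between the two reported adapter scores and shows one possible way to put cost next to accuracy. The `RunSummary` structure, the `cost_per_resolved` metric, and the example run are assumptions made for illustration. The summary does not describe how the benchmark actually accounts for cost.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Aggregate result of one harness/model run (hypothetical structure)."""
    name: str
    resolved: int    # instances resolved on the first attempt
    total: int       # instances attempted
    cost_usd: float  # total spend for the run (illustrative unit)

    def pass_at_1(self) -> float:
        """Pass@1 as a percentage of attempted instances."""
        return 100.0 * self.resolved / self.total

    def cost_per_resolved(self) -> float:
        """Spend per resolved instance; infinite if nothing was resolved."""
        if self.resolved == 0:
            return float("inf")
        return self.cost_usd / self.resolved


def point_gap(a_pct: float, b_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(a_pct - b_pct, 1)


# Reported figures: full adapter vs. minimal adapter on the same backbone.
adapter_gap = point_gap(73.4, 19.1)
print(f"adapter gap: {adapter_gap} percentage points")

# Made-up run sized like the 80-instance Lite subset (NOT a reported result).
example = RunSummary(name="example-harness", resolved=40, total=80, cost_usd=20.0)
print(f"{example.name}: Pass@1 = {example.pass_at_1():.1f}%, "
      f"cost per resolved = ${example.cost_per_resolved():.2f}")
```

The reported adapter gap works out to 54.3 percentage points, roughly twice the 27-29 point shift attributed to harness choice or model choice.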

Evaluation and Benchmarking · Inference Economics · SWE-Bench Multilingual · OpenClaw · SWE-Bench Verified