benchmark
RealClawBench
benchmark · active · provisional
realclawbench-9c381d0d · 1 event · first seen 13d ago
Aliases: RealClawBench
Recent events (1)
RealClawBench: Live benchmark framework built from real developer-agent sessions
RealClawBench is a new benchmark framework that converts real OpenClaw developer-agent sessions into reproducible, automatically scored evaluation tasks. It addresses realism gaps in existing agent benchmarks by reconstructing execution environments and using deterministic, verifiable scorers. The release contains 281 executable tasks, sampled to preserve the distribution of the source sessions. In an evaluation of 14 contemporary models, the best system solves only 65.8% of tasks, indicating substantial headroom on realistic developer-agent workloads.