RealClawBench

benchmark · active · provisional · realclawbench-9c381d0d · 1 event · first seen 13d ago

Aliases: RealClawBench

Recent events (1)

5 · arXiv · cs.CL · 13d ago

RealClawBench: Live benchmark framework built from real developer-agent sessions

RealClawBench is a new benchmark framework that converts real OpenClaw developer-agent sessions into reproducible, automatically scored evaluation tasks. It addresses realism gaps in existing agent benchmarks by reconstructing execution environments and scoring with deterministic, verifiable scorers. The release contains 281 executable tasks, sampled to preserve the distribution of the source sessions. An evaluation of 14 contemporary models shows that the best system solves only 65.8% of tasks, indicating substantial headroom on realistic developer-agent workloads.
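
One way to read "sampled to preserve the source session distribution" is proportional stratified sampling: each session category receives a share of the sample matching its share of the pool. The sketch below illustrates that idea only. It is not taken from RealClawBench, and the category names, pool sizes, and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def proportional_quotas(counts, k):
    """Split k sample slots across categories in proportion to their counts.

    Uses largest-remainder rounding so the quotas always sum to exactly k.
    """
    total = sum(counts.values())
    if not 0 <= k <= total:
        raise ValueError("k must be between 0 and the pool size")
    # Integer arithmetic: floor share plus the remainder used for tie-breaking.
    quotas = {c: (k * n) // total for c, n in counts.items()}
    remainders = {c: (k * n) % total for c, n in counts.items()}
    leftover = k - sum(quotas.values())
    # Hand the remaining slots to the categories with the largest remainders.
    for c in sorted(counts, key=lambda c: remainders[c], reverse=True)[:leftover]:
        quotas[c] += 1
    return quotas


def stratified_sample(sessions, key, k, seed=0):
    """Draw k sessions so that category proportions mirror the full pool."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    groups = defaultdict(list)
    for session in sessions:
        groups[key(session)].append(session)
    quotas = proportional_quotas({c: len(g) for c, g in groups.items()}, k)
    sample = []
    for category in sorted(groups):
        sample.extend(rng.sample(groups[category], quotas[category]))
    return sample


# Hypothetical toy pool: (category, session id) pairs, not real benchmark data.
pool = (
    [("debug", i) for i in range(50)]
    + [("refactor", i) for i in range(30)]
    + [("docs", i) for i in range(21)]
)
sample = stratified_sample(pool, key=lambda s: s[0], k=20)
print(Counter(category for category, _ in sample))
```

Largest-remainder rounding is used here so that the per-category quotas sum to exactly the requested sample size, and the fixed seed makes the draw repeatable. Whether the benchmark's own sampling works this way is an assumption.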