Almanac
benchmark

EnterpriseClawBench

benchmarkactiveprovisionalenterpriseclawbench-eb88aa54·1 events·first seen 4h ago

Aliases: EnterpriseClawBench

Co-occurring entities

More like this (12)

Recent events (1)

5arXiv · cs.CL·4h ago·source ↗

EnterpriseClawBench: A benchmark for enterprise agents derived from real workplace sessions

Researchers introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary real-world workplace sessions, yielding 852 reproducible tasks with fixtures, prompts, role classes, skill subclasses, and semantic rubrics. Because the sessions contain internal enterprise content, the benchmark data is not publicly released, but the construction and evaluation protocol is the reusable contribution. The best evaluated configuration (Codex with GPT-5.5) achieves only 0.663, indicating substantial headroom. The paper argues enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than collapsing to a single score.