benchmark
MacAgentBench
benchmark · active · provisional
macagentbench-adb3825a · 1 event · first seen 11h ago
Aliases: MacAgentBench
Recent events (1)
MacAgentBench: New benchmark for AI agents on real-world macOS desktop tasks
MacAgentBench is a 676-task benchmark spanning 25 macOS applications, designed to evaluate computer use agents (CUAs). It evaluates them with framework augmentation and fine-grained, multi-checkpoint scoring, addressing a gap in existing benchmarks that rely on binary pass/fail evaluation. Nearly 60% of tasks involve both GUI and CLI interaction, and the benchmark tests 16 models across three agent frameworks. The best result, Claude Opus 4.6 on the OpenClaw framework, reaches 73.7% Pass@1, with the performance gains attributed primarily to the skill library rather than to framework design. The fine-grained metrics show that models with similar Pass@1 scores can differ substantially in sub-goal completion, which highlights the limits of coarse evaluation.
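To make the contrast between binary and checkpoint-level scoring concrete, here is a minimal sketch in Python. It is not MacAgentBench's actual scoring code: the `TaskResult` type, the `checkpoint_rate` formula (mean fraction of checkpoints passed per task), and the toy results for two models are all assumptions made for illustration. Pass@1 is computed from a single attempt per task, counting a task as solved only when every checkpoint passes.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one attempt at one task."""
    checkpoints_passed: int
    checkpoints_total: int

    @property
    def solved(self) -> bool:
        # Binary view: the task counts only if every checkpoint passed.
        return self.checkpoints_passed == self.checkpoints_total


def pass_at_1(results: list[TaskResult]) -> float:
    """Fraction of tasks fully solved on a single attempt."""
    return sum(r.solved for r in results) / len(results)


def checkpoint_rate(results: list[TaskResult]) -> float:
    """Assumed fine-grained metric: mean fraction of checkpoints passed."""
    return sum(r.checkpoints_passed / r.checkpoints_total for r in results) / len(results)


# Invented toy data: both models fully solve one task out of four.
model_a = [TaskResult(4, 4), TaskResult(3, 4), TaskResult(3, 4), TaskResult(3, 4)]
model_b = [TaskResult(4, 4), TaskResult(0, 4), TaskResult(1, 4), TaskResult(0, 4)]

for name, results in [("model_a", model_a), ("model_b", model_b)]:
    print(name, pass_at_1(results), checkpoint_rate(results))
```

In this toy data both models have the same Pass@1, but the first gets most of the way through its failed tasks while the second barely starts them. A binary metric hides that difference and a checkpoint-level metric exposes it.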