MacAgentBench

benchmark · active · provisional · macagentbench-adb3825a · 1 event · first seen 11h ago

Aliases: MacAgentBench

Recent events (1)

arXiv · cs.CL · 11h ago

MacAgentBench: New benchmark for AI agents on real-world macOS desktop tasks

MacAgentBench is a 676-task benchmark spanning 25 macOS applications, designed to evaluate computer use agents (CUAs) with framework augmentation and fine-grained multi-checkpoint scoring. It targets gaps in existing benchmarks that rely on binary evaluation. Nearly 60% of tasks involve both GUI and CLI interaction, and the benchmark tests 16 models across three agent frameworks. The best result, Claude Opus 4.6 on the OpenClaw framework, reaches 73.7% Pass@1, with performance gains attributed primarily to the skill library rather than to framework design. The fine-grained metrics show that models with similar Pass@1 scores can differ substantially in sub-goal completion, which highlights the limits of coarse pass/fail evaluation.
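
To illustrate why multi-checkpoint scoring can separate models that binary scoring cannot, here is a minimal Python sketch. It is not MacAgentBench's actual scoring code: the summary does not give the benchmark's formula, so the `TaskResult` type, the `checkpoint_rate` metric, and the toy numbers are all hypothetical. A task counts toward Pass@1 only if every checkpoint passes, while the checkpoint rate gives partial credit.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one attempt at one task (hypothetical structure)."""
    checkpoints_passed: int
    checkpoints_total: int

    @property
    def passed(self) -> bool:
        # Binary view: the task passes only if every checkpoint passes.
        return self.checkpoints_passed == self.checkpoints_total


def pass_at_1(results: list[TaskResult]) -> float:
    """Fraction of tasks fully solved on the first attempt."""
    return sum(r.passed for r in results) / len(results)


def checkpoint_rate(results: list[TaskResult]) -> float:
    """Mean fraction of checkpoints completed per task (assumed metric)."""
    return sum(r.checkpoints_passed / r.checkpoints_total for r in results) / len(results)


# Toy data, not benchmark results: both models fully solve 2 of 4 tasks.
model_a = [TaskResult(4, 4), TaskResult(3, 4), TaskResult(3, 4), TaskResult(4, 4)]
model_b = [TaskResult(4, 4), TaskResult(0, 4), TaskResult(1, 4), TaskResult(4, 4)]

for name, results in (("model_a", model_a), ("model_b", model_b)):
    print(name, pass_at_1(results), checkpoint_rate(results))
```

In this toy case both models score 0.5 on Pass@1, but the first completes most checkpoints on the tasks it fails while the second completes almost none. A binary metric hides that difference.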