CoCoA-1MCA

technique · active · provisional · cocoa-1mca-f9e0b014 · 1 event · first seen 9h ago

Aliases: CoCoA-1MCA

Recent events (1)

6 · arXiv · cs.CL · 9h ago

Argus benchmark evaluates uncertainty quantification methods for computer-use GUI agents across VLMs and datasets

Researchers introduce Argus, a cross-regime benchmark for post-hoc uncertainty quantification (UQ) in single-step GUI grounding agents, covering 27 methods across 4 open-weight VLMs and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors. The central finding is 'selective transfer': UQ rankings are stable across datasets for a fixed model but degrade across model classes and observable interfaces, with cross-tier transfer to closed-source vendors averaging only +0.08 Spearman correlation. Hidden-state and density methods prove most stable for open-weight models, while conformal click regions reveal that score-level discrimination alone is insufficient for deployment safety. The benchmark releases per-item records and analysis scripts to support regime-aware UQ selection in GUI agents.
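
To make the "Spearman correlation" figure concrete, here is a minimal sketch of how rank agreement between two regimes could be measured. It is an illustration only, not the benchmark's released analysis code: the function names, the method labels, and the scores are all hypothetical, and the summary does not state how Argus computes its correlations. The idea is to score the same UQ methods under two regimes, rank them within each regime, and correlate the ranks.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical quality scores for four UQ methods under two regimes.
# These numbers are invented for illustration and are NOT from the paper.
methods = ["method_a", "method_b", "method_c", "method_d"]
scores_regime_1 = [0.81, 0.74, 0.66, 0.59]
scores_regime_2 = [0.62, 0.70, 0.55, 0.58]

rho = spearman(scores_regime_1, scores_regime_2)
print(f"Spearman correlation between regimes: {rho:+.2f}")
```

A value near +1 means the two regimes order the methods the same way, while a value near 0, such as the +0.08 cross-tier average reported above, means a ranking obtained in one regime says little about the other.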