multimodal agents

other · active · provisional · multimodal-agents-66fb1ae5 · 1 event · first seen 15d ago

Aliases: multimodal agents

Recent events (1)

6 · arXiv · cs.CL · 15d ago

HLL: Benchmark for Evaluating Multimodal Agents on CAPTCHA Human-Verification Boundaries

The paper introduces Humanity's Last Line of Verification (HLL), a controlled benchmark that tests whether multimodal agents can solve CAPTCHA challenges through grounded, human-like GUI interaction rather than mere recognition. Eight frontier multimodal agents are evaluated in a closed-loop environment across diverse CAPTCHA types with realism stressors including cluttered interfaces, harder variants, and trace-conditioned validation. Results show current agents remain brittle at this human-substitution boundary, with performance degrading under realistic conditions and when action traces must be consistent with correct answers. The benchmark exposes specific gaps in localization, action calibration, state tracking, and process consistency.