XinhaoS0101
xinhaos0101-c39ad925 · 1 event · first seen 15d ago
Aliases: XinhaoS0101
Recent events (1)
HLL: Benchmark for Evaluating Multimodal Agents on CAPTCHA Human-Verification Boundaries
The paper introduces Humanity's Last Line of Verification (HLL), a controlled benchmark that tests whether multimodal agents can solve CAPTCHA challenges through grounded, human-like GUI interaction rather than mere recognition. Eight frontier multimodal agents are evaluated in a closed-loop environment across diverse CAPTCHA types with realism stressors including cluttered interfaces, harder variants, and trace-conditioned validation. Results show current agents remain brittle at this human-substitution boundary, with performance degrading under realistic conditions and when action traces must be consistent with correct answers. The benchmark exposes specific gaps in localization, action calibration, state tracking, and process consistency.