6arXiv cs.CL (Computation and Language)·10d ago

HiViG: History-aware visually grounded critic improves computer use agents across GUI benchmarks

Researchers introduce HiViG, a test-time framework for Computer Use Agents that addresses two weaknesses in existing critic models: short-sighted decision loops and lack of visual grounding. The system trains a multimodal critic on real GUI trajectories to maintain a compact macro-action history and verify execution coordinates against live screenshots before action execution. Evaluated on web, mobile, and desktop benchmarks, HiViG improves average success rates by 5.8% over the strongest baseline with Qwen3-VL-32B and 9.0% with Gemini-3-Flash, with both history and grounding components shown to be independently necessary.

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Progress HiViG A History-Aware Visually Grounded Critic for Computer Use Agents Gemini 3 Flash Qwen3 32B

Related guides (3)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.CL·10d ago·source ↗

VISTA: Hybrid user simulation toolkit for interactive agent evaluation

Researchers introduce VISTA, a user simulation framework designed to address limitations in current agent evaluation methods, which rely on static benchmarks that miss dynamic, multi-step failure modes. VISTA provides six metrics for measuring realism, capability coverage, and interaction effectiveness, and combines UI-based and API-based interactions in a hybrid simulator. The toolkit is evaluated in e-commerce and education customer service settings, showing more realistic and comprehensive coverage than existing approaches.

Evaluation and Benchmarking Agent and Tool Ecosystem VISTA

5Hugging Face Blog·1mo ago·source ↗

ScreenSuite: Comprehensive Evaluation Suite for GUI Agents

Hugging Face has released ScreenSuite, described as the most comprehensive evaluation suite for GUI (Graphical User Interface) agents. The suite aims to standardize and broaden benchmarking for agents that interact with visual interfaces. This addresses a gap in the evaluation ecosystem for screen-based AI agents, which are increasingly relevant as agentic systems expand into desktop and web automation tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem GUI Agents ScreenSuite Hugging Face

5Hugging Face Blog·1mo ago·source ↗

Holo1: New family of GUI automation VLMs powering GUI agent Surfer-H

H Company has released Holo1, a new family of vision-language models specifically designed for GUI automation tasks. These models power Surfer-H, a GUI agent capable of interacting with graphical interfaces. The release represents a specialized VLM family targeting the agent-tool ecosystem for desktop/web automation. Details on architecture, training data, and benchmarks are expected in the accompanying blog post.

Agent and Tool Ecosystem Multimodal Progress Surfer-H Hugging Face Holo1 +1 more

6arXiv · cs.CL·18d ago·source ↗

HLL: Benchmark for Evaluating Multimodal Agents on CAPTCHA Human-Verification Boundaries

The paper introduces Humanity's Last Line of Verification (HLL), a controlled benchmark that tests whether multimodal agents can solve CAPTCHA challenges through grounded, human-like GUI interaction rather than mere recognition. Eight frontier multimodal agents are evaluated in a closed-loop environment across diverse CAPTCHA types with realism stressors including cluttered interfaces, harder variants, and trace-conditioned validation. Results show current agents remain brittle at this human-substitution boundary, with performance degrading under realistic conditions and when action traces must be consistent with correct answers. The benchmark exposes specific gaps in localization, action calibration, state tracking, and process consistency.

Evaluation and Benchmarking Enterprise Deployment Patterns multimodal agents GUI grounding XinhaoS0101 +4 more

6arXiv · cs.CL·23d ago·source ↗

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution Evaluation and Benchmarking VisualMem long-term memory Personal Visual Memory Benchmark +3 more

7arXiv · cs.CL·25d ago·source ↗

MobileGym: Verifiable Parallel Simulation Platform for Mobile GUI Agent Training

MobileGym is a browser-hosted simulation environment for mobile GUI agent research that enables deterministic outcome verification via structured JSON state and scalable online RL through hundreds of parallel instances (~400 MB/instance, ~3s cold start). The accompanying MobileGym-Bench provides 416 parameterized task templates across 28 apps with deterministic judges. A sim-to-real case study using GRPO on Qwen3-VL-4B-Instruct achieves +12.8 percentage points on the 256-task test set, with real-device execution retaining 95.1% of simulation-side training gains.

Evaluation and Benchmarking Inference Economics MobileGym-Bench GRPO MobileGym +6 more

6arXiv · cs.AI·25d ago·source ↗

VeriTrace: Cognitive-Graph Framework with Explicit Regulatory Loops for Deep Research Agents

VeriTrace introduces a cognitive-graph framework for deep research agents that replaces implicit LLM reasoning over intermediate representations with three explicit regulatory loops: interpretive update, deviation feedback, and schema revision. The system addresses contamination and error propagation in evolving mental models during complex multi-step research tasks. Using Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench Insight and 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DeepResearch Bench.

Frontier Model Releases Evaluation and Benchmarking DeepSeek V4 cognitive-graph DeepResearch Bench +4 more

5Hugging Face Blog·1mo ago·source ↗

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. The piece appears to address the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. As a tier-2 commentary piece, it likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.

Evaluation and Benchmarking Agent and Tool Ecosystem Hugging Face OpenEnv