5 · arXiv cs.CL (Computation and Language) · 10d ago

VISTA: Hybrid user simulation toolkit for interactive agent evaluation

Researchers introduce VISTA, a user simulation framework designed to address limitations in current agent evaluation methods, which rely on static benchmarks that miss dynamic, multi-step failure modes. VISTA provides six metrics for measuring realism, capability coverage, and interaction effectiveness, and combines UI-based and API-based interactions in a hybrid simulator. The toolkit is evaluated in e-commerce and education customer service settings, showing more realistic and comprehensive coverage than existing approaches.

Evaluation and Benchmarking · Agent and Tool Ecosystem · VISTA

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 10d ago

HiViG: History-aware visually grounded critic improves computer use agents across GUI benchmarks

Researchers introduce HiViG, a test-time framework for Computer Use Agents that addresses two weaknesses in existing critic models: short-sighted decision loops and lack of visual grounding. The system trains a multimodal critic on real GUI trajectories to maintain a compact macro-action history and verify execution coordinates against live screenshots before action execution. Evaluated on web, mobile, and desktop benchmarks, HiViG improves average success rates by 5.8% over the strongest baseline with Qwen3-VL-32B and 9.0% with Gemini-3-Flash, with both history and grounding components shown to be independently necessary.

Evaluation and Benchmarking · Agent and Tool Ecosystem · HiViG · A History-Aware Visually Grounded Critic for Computer Use Agents · Gemini 3 Flash · +2 more

5 · Hugging Face Blog · 1mo ago

ScreenSuite: Comprehensive Evaluation Suite for GUI Agents

Hugging Face has released ScreenSuite, described as the most comprehensive evaluation suite for GUI (Graphical User Interface) agents. The suite aims to standardize and broaden benchmarking for agents that interact with visual interfaces. This addresses a gap in the evaluation ecosystem for screen-based AI agents, which are increasingly relevant as agentic systems expand into desktop and web automation tasks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GUI Agents · ScreenSuite · Hugging Face

5 · Hugging Face Blog · 1mo ago

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM Research presents an analysis of VAKRA, a benchmark designed to evaluate agentic AI systems on reasoning and tool use capabilities. The post examines how agents fail across different task categories, surfacing systematic failure modes in multi-step reasoning and tool invocation. The analysis provides diagnostic insights into where current agent architectures break down under realistic task conditions.

Evaluation and Benchmarking · AI Safety Research · IBM Research · Hugging Face · VAKRA · +1 more

5 · Hugging Face Blog · 1mo ago

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. It appears to address the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. The post likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · OpenEnv

5 · arXiv · cs.CL · 11d ago

AGENTSERVESIM: Hardware-aware simulator for multi-turn LLM agent serving policies

Researchers introduce AGENTSERVESIM, a simulation framework designed to evaluate serving policies for multi-turn LLM agents without requiring dedicated accelerator hardware. The simulator models program-level execution including turn dependencies, tool-induced gaps, and KV-cache residency across HBM, host DRAM, and CXL memory hierarchies. It reproduces real-system behavior within 6% error on key performance metrics while running on commodity CPUs, enabling cost-effective exploration of scheduling, routing, and cache management policies for agentic workloads.

Training Infrastructure · Inference Economics · AGENTSERVESIM · +1 more

6 · arXiv · cs.CL · 22d ago

VideoFDB: First Benchmark for Full-Duplex Audio-Visual Conversational Agent Evaluation

VideoFDB is introduced as the first benchmark targeting full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, filling a gap where existing full-duplex benchmarks evaluate only speech. It provides 237 dyadic video-call clips covering 11 nonverbal conversational dynamics, a perception/generation taxonomy, and an LM-as-judge rubric framework. Evaluation across open- and closed-source vision-speech agents reveals systematic failure modes including captioning collapse and visual-stream ignorance, and shows current systems cannot perform the streaming joint audiovisual grounding required for natural conversation. Cascaded speech-to-avatar architectures are found to be architecturally incapable of producing full-duplex nonverbal cues.

Evaluation and Benchmarking · Agent and Tool Ecosystem · VideoFDB · speech-to-avatar systems · conversational agents · +2 more

5 · arXiv · cs.LG · 17d ago

VLESA: Vision-Language Embodied Safety Agent for Real-Time Human Activity Monitoring

Researchers introduce VLESA, a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. The system addresses intent-dependent safety — where identical actions can be safe or dangerous depending on context — using a goal-conditioned safety Q-filter trained via GRPO and an intent-action prediction agent. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy than baselines, with the Q-filter improving action safety by over 41 percentage points through goal-conditioned constrained decoding.

AI Safety Research · Multimodal Progress · GRPO · ASIMOV-2.0 · VLESA · +1 more

5 · arXiv · cs.AI · 11d ago

OmniGameArena: UE5 benchmark for VLM game agents with multi-round improvement dynamics

Researchers introduce OmniGameArena, a real-time benchmark of twelve Unreal Engine 5 games spanning solo, PvP, and cooperative play, designed to evaluate vision-language model agents under unified protocols across commercial VLMs, open-weight VLMs, and specialized game policies. The benchmark introduces the Improvement Dynamics Curve (IDC), an agentic-reflection harness where a tool-using LLM autonomously refines skill prompts across multiple rounds, exposing how agent performance evolves and generalizes beyond a single cold-start score. Twelve VLM agents are evaluated on the leaderboard, with four top agents further analyzed under IDC. The work addresses gaps in existing game benchmarks that report only single-attempt scores and lack multi-agent or cooperative evaluation modes.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OmniGameArena · Unreal Engine 5 · Improvement Dynamics Curve