6 · arXiv cs.CL (Computation and Language) · 9h ago

Argus benchmark evaluates uncertainty quantification methods for computer-use GUI agents across VLMs and datasets

Researchers introduce Argus, a cross-regime benchmark for post-hoc uncertainty quantification (UQ) in single-step GUI grounding agents, covering 27 methods across 4 open-weight VLMs and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors. The central finding is 'selective transfer': UQ rankings are stable across datasets for a fixed model but degrade across model classes and observable interfaces, with cross-tier transfer to closed-source vendors averaging only +0.08 Spearman correlation. Hidden-state and density methods prove most stable for open-weight models, while conformal click regions reveal that score-level discrimination alone is insufficient for deployment safety. The benchmark releases per-item records and analysis scripts to support regime-aware UQ selection in GUI agents.
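As a rough illustration of how rank transfer of this kind can be measured, the sketch below computes a Spearman correlation between the scores a set of UQ methods obtains in two regimes. The method names and score values are hypothetical placeholders, not figures from the paper, and the paper's own aggregation procedure may differ.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical discrimination scores for five UQ methods in two regimes.
methods = ["method_a", "method_b", "method_c", "method_d", "method_e"]
regime_1 = [0.81, 0.74, 0.69, 0.66, 0.60]
regime_2 = [0.62, 0.71, 0.58, 0.70, 0.65]

rho_same = spearman(regime_1, regime_1)    # identical rankings
rho_cross = spearman(regime_1, regime_2)   # rankings mostly reshuffled
print(f"same regime: {rho_same:+.2f}, cross regime: {rho_cross:+.2f}")
```

A value near +1 means the method ranking carries over between regimes; a value near 0, such as the +0.08 reported for cross-tier transfer, means the ranking from one regime says little about the other.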

Evaluation and Benchmarking · AI Safety Research · Agent and Tool Ecosystem · Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets · CoCoA-1MCA · Argus · Mahalanobis distance · SAPLMA

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 9h ago

ToolBench-X benchmarks LLM agents under tool-environment unreliability

A new arXiv preprint introduces ToolBench-X, a benchmark for evaluating LLM agents under five structured hazard types including Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Each injected hazard remains solvable via recovery paths such as retrying, fallback, or cross-checking, enabling measurement of agent resilience rather than just function-call accuracy. Experiments reveal a substantial reliability gap: agents that perform well in clean environments frequently fail under recoverable hazards, with failures driven by poor hazard diagnosis rather than insufficient tool-use volume or inference budget. The findings argue for shifting tool-use evaluation toward task completion under realistic, unreliable conditions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ToolBench-X · Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

5 · arXiv · cs.CL · 16d ago

Three-axis uncertainty estimation framework for code generation outperforms NL-derived baselines

A new arXiv preprint argues that uncertainty estimation (UE) for code generation requires code-specific design rather than methods ported from natural language. The authors propose three orthogonal uncertainty axes—lexical (token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency)—grounded in properties unique to code: token fragility, intent-code gap, and executability. Evaluated across five code LLMs, their ensemble improves average AUROC from 0.696 to 0.776 (+8.1 points) over the strongest NL-derived baseline, with a single-pass token entropy method on Qwen3-14B matching multi-pass baselines at 3x lower cost. The work is directly relevant to safe deployment of LLMs in agentic coding pipelines.
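To make the lexical axis concrete, here is a minimal sketch of a token-entropy score, assuming access to the model's next-token probability distribution at each generation step. Averaging over steps is an assumption made for illustration; the preprint's exact aggregation is not stated in this summary, and the distributions below are invented.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_token_entropy(step_distributions):
    """Average per-step entropy over a generated sequence (assumed aggregation)."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# Hypothetical per-step distributions over a four-token vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

print(f"confident generation: {mean_token_entropy(confident):.3f} nats")
print(f"uncertain generation: {mean_token_entropy(uncertain):.3f} nats")
```

Because the score needs only the probabilities from a single forward pass, it avoids the repeated sampling that consistency-based methods require, which is the cost advantage the summary refers to.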

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen3-14B · Code Is More Than Text: Uncertainty Estimation for Code Generation

5 · arXiv · cs.CL · 2d ago

MacAgentBench: New benchmark for AI agents on real-world macOS desktop tasks

MacAgentBench introduces a 676-task benchmark across 25 macOS applications designed to evaluate computer use agents (CUAs) with framework augmentation and fine-grained multi-checkpoint scoring, addressing gaps in existing binary-evaluation benchmarks. Nearly 60% of tasks involve both GUI and CLI interaction, and the benchmark tests 16 models across three agent frameworks. The best result — Claude Opus 4.6 on the OpenClaw framework — achieves 73.7% Pass@1, with performance gains attributed primarily to the skill library rather than framework design. Fine-grained metrics reveal that models with similar Pass@1 scores can differ substantially in sub-goal completion, highlighting limitations of coarse evaluation.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · OpenClaw · MacAgentBench

6 · arXiv · cs.AI · 17d ago

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · SWE-bench · AARRI-Bench

6 · arXiv · cs.CL · 15d ago

HiViG: History-aware visually grounded critic improves computer use agents across GUI benchmarks

Researchers introduce HiViG, a test-time framework for Computer Use Agents that addresses two weaknesses in existing critic models: short-sighted decision loops and lack of visual grounding. The system trains a multimodal critic on real GUI trajectories to maintain a compact macro-action history and verify execution coordinates against live screenshots before action execution. Evaluated on web, mobile, and desktop benchmarks, HiViG improves average success rates by 5.8% over the strongest baseline with Qwen3-VL-32B and 9.0% with Gemini-3-Flash, with both history and grounding components shown to be independently necessary.

Evaluation and Benchmarking · Agent and Tool Ecosystem · HiViG · A History-Aware Visually Grounded Critic for Computer Use Agents · Gemini 3 Flash

6 · arXiv · cs.AI · 20d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

7 · arXiv · cs.AI · 21d ago

AutoLab benchmark evaluates frontier models on ultra long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks across system optimization, puzzle-solving, model development, and CUDA kernel optimization, designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than single-turn or short-horizon settings. Evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation — not initial attempt quality — is the dominant success predictor. Claude Opus 4.6 stands out as the strongest performer, while most models including proprietary ones either terminate early or exhaust budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.

Frontier Model Releases · Evaluation and Benchmarking · Claude Opus 4.6 · AutoLab · Anthropic

6 · arXiv · cs.CL · 2d ago

BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories

BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.
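One plausible reading of order consistency is the fraction of response pairs for which a judge prefers the same response regardless of presentation order. The sketch below implements that reading with two toy judges; the function names, the "first"/"second" return convention, and the example pairs are all hypothetical and are not taken from the BabelJudge package.

```python
def order_consistency(judge, pairs):
    """Fraction of pairs where the judge picks the same response in both orders.

    `judge(x, y)` is assumed to return "first" or "second".
    Each pair is assumed to contain two distinct responses.
    """
    consistent = 0
    for a, b in pairs:
        winner_original = a if judge(a, b) == "first" else b
        winner_swapped = b if judge(b, a) == "first" else a
        consistent += winner_original == winner_swapped
    return consistent / len(pairs)

def length_judge(x, y):
    """Toy judge that prefers the longer response, independent of position."""
    return "first" if len(x) > len(y) else "second"

def position_biased_judge(x, y):
    """Toy judge that always prefers whichever response is shown first."""
    return "first"

# Hypothetical response pairs with distinct lengths.
pairs = [("short", "a longer answer"), ("detailed reply here", "ok"), ("abc", "abcdef")]

print(order_consistency(length_judge, pairs))           # position-independent judge
print(order_consistency(position_biased_judge, pairs))  # fully position-driven judge
```

Under this reading, a judge that answers at random would agree with itself about half the time, which is why a score of 0.480 is described as near-random.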

Evaluation and Benchmarking · Agent and Tool Ecosystem · BabelJudge · Qwen2.5-7B-Instruct-1M · Shreyaskc