6 · arXiv cs.LG (Machine Learning) · 2d ago

QVal: Training-free benchmark for evaluating dense supervision signals in long-horizon LLM agents

QVal is a new training-free testbed for evaluating dense supervision signals used to guide LLM agents over long-horizon trajectories, where outcome-only rewards are too sparse. The framework measures 'Q-alignment' — whether a method's step scores match Q-values from a strong reference policy — enabling comparison of 21 methods across 4 environments and 7 methodological families without running full training pipelines. A key finding is that simple prompting baselines consistently outperform more sophisticated dense supervision methods from recent literature. The benchmark covers over 1,200 evaluation experiments across six open-weight model backbones.

Evaluation and Benchmarking Agent and Tool Ecosystem Alignment and RLHF QVal QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
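The Q-alignment idea in the QVal summary above (do a method's step scores match Q-values from a strong reference policy?) can be illustrated with a rank correlation. The summary does not say which agreement statistic QVal uses, so Spearman correlation is an assumption here, and the function names and toy numbers are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no ordering information
    return cov / (var_x * var_y) ** 0.5


# Toy trajectory: one value per step (illustrative numbers, not from the paper).
reference_q = [0.10, 0.25, 0.20, 0.60, 0.90]  # Q-values from a reference policy
method_scores = [0.0, 0.3, 0.1, 0.5, 1.0]     # step scores from the method under test

alignment = spearman(method_scores, reference_q)
print(f"alignment = {alignment:.3f}")
```

Because only the ordering of steps matters to a rank correlation, a method is not penalized for using a different score scale than the reference Q-values.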

Related guides (3)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI to Do What We Actually Want

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

6 · arXiv · cs.LG · 7d ago

RiVER framework enables RL training of LLMs on tasks without ground-truth solutions

Researchers introduce RiVER (Ranking-induced VERifiable framework), a reinforcement learning approach that trains LLMs on score-based optimization tasks using deterministic execution feedback as continuous rewards, without requiring ground-truth answers. The method addresses two failure modes in group-relative RL with continuous rewards—scale dominance and frequency dominance—via calibrated, instance-wise reward shaping. Applied to Qwen3-8B and GLM-Z1-9B-0414 on competitive programming tasks, RiVER improves ALE rating rank by ~9% and also transfers to exact-solution benchmarks (LiveCodeBench, USACO) with 2-4% absolute gains, unlike raw-score baselines. The result suggests score-based heuristic tasks can serve as general-purpose RL training environments for coding ability.

Evaluation and Benchmarking Alignment and RLHF USACO Qwen3-4B LiveCodeBench +3 more

5 · arXiv · cs.CL · 7d ago

Training framework reduces calibration error 60%+ in Medical VQA multimodal LLMs

A new arXiv preprint proposes a finetuning framework to improve verbalized uncertainty calibration in multimodal LLMs applied to Medical Visual Question Answering. The composite loss function combines Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL-based stabilization, evaluated on MedGemma 4B IT and Qwen2-VL 7B Instruct across three medical VQA benchmarks. The method reduces calibration error by 60% or more and improves discrimination by 26% or more while preserving predictive accuracy, outperforming prompting-, sampling-, and training-based baselines.

Evaluation and Benchmarking AI Safety Research Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA Qwen2.5-7B-Instruct-1M MedGemma 4B IT +1 more
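To make the "Brier-style calibration" term in the summary above concrete, here is a minimal sketch of the standard Brier score for verbalized confidence. The preprint's actual loss is not given in the summary, so this shows only the textbook quantity; the function name and sample numbers are hypothetical.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and correctness (0 or 1)."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


# Four answers, two correct and two wrong (illustrative data).
outcomes = [1, 0, 1, 0]

# A model that always says "90% sure" is overconfident at 50% accuracy.
overconfident = brier_score([0.9, 0.9, 0.9, 0.9], outcomes)

# A model that says "50% sure" matches its actual accuracy.
matched = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)

print(overconfident, matched)
```

Lower is better: the overconfident model scores worse than the one whose stated confidence matches its accuracy, which is the behavior a calibration term is meant to reward.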

7 · arXiv · cs.AI · 8d ago

Progress Advantage: Annotation-Free Step-Level Scoring for LLM Agents via RL Post-Training

Researchers introduce 'progress advantage,' a method that derives implicit step-level reward signals for LLM agents directly from the log-probability ratio between an RL-trained policy and its reference policy, without requiring dedicated process reward model training. The approach is shown to recover the optimal advantage function under a general stochastic MDP formulation, making it annotation-free and domain-agnostic. Validated across five benchmarks and four model families on tasks including test-time scaling, uncertainty quantification, and failure attribution, it outperforms confidence-based baselines and even dedicated trained reward models. The result is practically significant because building process reward models for agentic settings is currently a major bottleneck.

Evaluation and Benchmarking Agent and Tool Ecosystem progress advantage Progress Advantage for LLM Agents +1 more
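The log-probability ratio described in the progress advantage summary above can be sketched as follows. Summing token-level ratios over the tokens of one step and the optional `beta` scale are assumptions made for illustration; the summary specifies only that the signal comes from the ratio between the RL-trained policy and its reference policy. All names and numbers are hypothetical.

```python
def step_log_ratio(rl_logprobs, ref_logprobs, beta=1.0):
    """Score one agent step as beta * sum(log pi_rl - log pi_ref) over its tokens.

    rl_logprobs / ref_logprobs: per-token log-probabilities of the same
    generated tokens under the RL-trained policy and the reference policy.
    """
    if len(rl_logprobs) != len(ref_logprobs):
        raise ValueError("both policies must score the same tokens")
    return beta * sum(p - q for p, q in zip(rl_logprobs, ref_logprobs))


# Step A: the RL policy finds these tokens more likely than the reference does.
score_a = step_log_ratio([-0.2, -0.5], [-0.7, -1.0])

# Step B: the RL policy finds these tokens less likely than the reference does.
score_b = step_log_ratio([-1.5, -2.0], [-1.0, -1.2])

print(score_a, score_b)
```

A positive score means RL post-training raised the probability of the step relative to the reference policy, and a negative score means it lowered it. No separate process reward model is involved, which matches the annotation-free framing in the summary.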

5 · arXiv · cs.AI · 8d ago

FORCE: Efficient RL fine-tuning for Vision-Language-Action models via value-calibrated warm-up and self-distillation

Researchers introduce FORCE, a 3-stage reinforcement learning fine-tuning framework for Vision-Language-Action (VLA) models that addresses sample inefficiency caused by unstable Q-functions and low-quality exploration data. The framework uses a Value-Calibrated Warm-Up phase followed by Q-function-filtered policy updates, eliminating the need for costly human interventions during training. Evaluated on simulation and real-world robotic tasks, FORCE achieves a 79% absolute improvement in task success rates, outperforms prior RL methods by 10%, and accelerates training by 32.5%.

Agent and Tool Ecosystem Alignment and RLHF FORCE FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation

4 · Hugging Face Blog · 1mo ago

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.

Evaluation and Benchmarking Multimodal Progress Visual Question Answering LAVE Hugging Face +1 more

5 · arXiv · cs.LG · 15d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking Multimodal Progress Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models Act2Answer

7 · arXiv · cs.CL · 17d ago

Language models linearly encode a 'value axis' tracking expected goal success, study finds

Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.

AI Safety Research Alignment and RLHF The Value Axis: Language Models Encode Whether They're on the Right Track Direct Preference Optimization (DPO) Qwen3-4B

6 · arXiv · cs.CL · 8d ago

Argus benchmark evaluates uncertainty quantification methods for computer-use GUI agents across VLMs and datasets

Researchers introduce Argus, a cross-regime benchmark for post-hoc uncertainty quantification (UQ) in single-step GUI grounding agents, covering 27 methods across 4 open-weight VLMs and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors. The central finding is 'selective transfer': UQ rankings are stable across datasets for a fixed model but degrade across model classes and observable interfaces, with cross-tier transfer to closed-source vendors averaging only +0.08 Spearman correlation. Hidden-state and density methods prove most stable for open-weight models, while conformal click regions reveal that score-level discrimination alone is insufficient for deployment safety. The benchmark releases per-item records and analysis scripts to support regime-aware UQ selection in GUI agents.

Evaluation and Benchmarking AI Safety Research Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets CoCoA-1MCA Argus +3 more