QVal: Training-free benchmark for evaluating dense supervision signals in long-horizon LLM agents
QVal is a new training-free testbed for evaluating dense supervision signals used to guide LLM agents over long-horizon trajectories, where outcome-only rewards are too sparse. The framework measures 'Q-alignment', meaning whether a method's step scores match Q-values from a strong reference policy. This enables comparison of 21 methods across 4 environments and 7 methodological families without running full training pipelines. A key finding is that simple prompting baselines consistently outperform more sophisticated dense supervision methods from recent literature. The benchmark covers over 1,200 evaluation experiments across six open-weight model backbones.
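To make the idea of Q-alignment concrete, here is a minimal sketch that scores one trajectory by the rank correlation between a method's step scores and reference Q-values. The summary does not say which agreement statistic QVal uses, so Spearman correlation is an assumption here, and the function names `average_ranks`, `pearson`, and `q_alignment` are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def q_alignment(step_scores: Sequence[float], reference_q: Sequence[float]) -> float:
    """Spearman correlation between step scores and reference Q-values.

    Assumption: rank agreement stands in for the benchmark's own metric.
    """
    if len(step_scores) != len(reference_q):
        raise ValueError("step_scores and reference_q must have equal length")
    if not step_scores:
        raise ValueError("need at least one step")
    return pearson(average_ranks(step_scores), average_ranks(reference_q))
```

A method whose scores order the steps the same way as the reference Q-values gets a value near 1, and one that orders them in reverse gets a value near -1. No training run is needed, only the scored steps and the reference values.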
Related events (8)
RiVER framework enables RL training of LLMs on tasks without ground-truth solutions
Researchers introduce RiVER (Ranking-induced VERifiable framework), a reinforcement learning approach that trains LLMs on score-based optimization tasks using deterministic execution feedback as continuous rewards, without requiring ground-truth answers. The method addresses two failure modes in group-relative RL with continuous rewards (scale dominance and frequency dominance) via calibrated, instance-wise reward shaping. Applied to Qwen3-8B and GLM-Z1-9B-0414 on competitive programming tasks, RiVER improves ALE rating rank by ~9% and also transfers to exact-solution benchmarks (LiveCodeBench, USACO) with 2-4% absolute gains, unlike raw-score baselines. The result suggests score-based heuristic tasks can serve as general-purpose RL training environments for coding ability.
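As a rough illustration of why raw continuous scores can be awkward in group-relative RL, the sketch below converts each group's raw scores into within-group rank rewards before centering them. This is not RiVER's actual reward shaping, which the summary does not specify. The functions `rank_rewards` and `centered_advantages` are hypothetical, and reading "scale dominance" as instances with larger score ranges producing larger advantages is an assumption.

```python
from typing import List, Sequence


def rank_rewards(scores: Sequence[float]) -> List[float]:
    """Map raw scores to [0, 1] by within-group rank (ties share credit).

    Each sample's reward is the fraction of other samples it beats,
    counting a tie as half a win.
    """
    n = len(scores)
    if n == 1:
        return [0.5]
    rewards = []
    for i, s in enumerate(scores):
        wins = sum(1.0 for j, t in enumerate(scores) if j != i and t < s)
        ties = sum(0.5 for j, t in enumerate(scores) if j != i and t == s)
        rewards.append((wins + ties) / (n - 1))
    return rewards


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Group-relative advantage: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Two hypothetical task instances whose raw scores live on very different scales.
large_scale = [10.0, 1000.0, 500.0]
small_scale = [0.1, 0.3, 0.2]

raw_large = centered_advantages(large_scale)
raw_small = centered_advantages(small_scale)
ranked_large = centered_advantages(rank_rewards(large_scale))
ranked_small = centered_advantages(rank_rewards(small_scale))
```

With raw scores, the first instance yields advantages hundreds of times larger than the second, even though both groups have the same ordering. After rank conversion, both instances yield the same advantages.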
Training framework reduces calibration error 60%+ in Medical VQA multimodal LLMs
A new arXiv preprint proposes a finetuning framework to improve verbalized uncertainty calibration in multimodal LLMs applied to Medical Visual Question Answering. The composite loss function combines Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL-based stabilization, evaluated on MedGemma 4B IT and Qwen2-VL 7B Instruct across three medical VQA benchmarks. The method reduces calibration error by 60% or more and improves discrimination by 26% or more while preserving predictive accuracy, outperforming prompting-, sampling-, and training-based baselines.
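For readers unfamiliar with the Brier-style term, the sketch below computes the standard Brier score over verbalized confidences. It covers only that one ingredient. The preprint's composite loss, its weighting, and its other three terms are not reproduced here, and the name `brier_score` is hypothetical.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome.

    confidences: the model's verbalized probability of being right, in [0, 1].
    correct: 1 if the answer was right, 0 otherwise.
    Lower is better; 0.0 means perfectly confident and always right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    if not confidences:
        raise ValueError("need at least one prediction")
    total = sum((c - y) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)
```

A model that says 0.5 on every question scores 0.25 regardless of accuracy, so reducing the score requires confidences that actually track correctness.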
Progress Advantage: Annotation-Free Step-Level Scoring for LLM Agents via RL Post-Training
Researchers introduce 'progress advantage,' a method that derives implicit step-level reward signals for LLM agents directly from the log-probability ratio between an RL-trained policy and its reference policy, without requiring dedicated process reward model training. The approach is shown to recover the optimal advantage function under a general stochastic MDP formulation, making it annotation-free and domain-agnostic. Validated across five benchmarks and four model families on tasks including test-time scaling, uncertainty quantification, and failure attribution, it outperforms confidence-based baselines and even dedicated trained reward models. The result is practically significant because building process reward models for agentic settings is currently a major bottleneck.
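The log-probability ratio described above can be sketched as follows, assuming per-token log-probabilities from both policies are already available for each step. The summation over tokens within a step, the `beta` scale factor, and the name `progress_scores` are assumptions made for illustration. The summary does not give the exact aggregation.

```python
from typing import List, Sequence


def progress_scores(
    policy_logprobs: Sequence[Sequence[float]],
    reference_logprobs: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> List[float]:
    """Score each step by beta * sum over tokens of (log pi_rl - log pi_ref).

    policy_logprobs[i][t]: log-probability of token t in step i under the
        RL-trained policy.
    reference_logprobs[i][t]: the same token's log-probability under the
        reference policy.
    A positive score means the RL-trained policy prefers the step more than
    the reference policy does.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both inputs must cover the same steps")
    scores = []
    for step_pi, step_ref in zip(policy_logprobs, reference_logprobs):
        if len(step_pi) != len(step_ref):
            raise ValueError("each step must have matching token counts")
        scores.append(beta * sum(p - r for p, r in zip(step_pi, step_ref)))
    return scores
```

Because the score needs only two forward passes over an existing trajectory, no separate process reward model has to be trained or annotated.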
FORCE: Efficient RL fine-tuning for Vision-Language-Action models via value-calibrated warm-up and self-distillation
Researchers introduce FORCE, a 3-stage reinforcement learning fine-tuning framework for Vision-Language-Action (VLA) models that addresses sample inefficiency caused by unstable Q-functions and low-quality exploration data. The framework uses a Value-Calibrated Warm-Up phase followed by Q-function-filtered policy updates, eliminating the need for costly human interventions during training. Evaluated on simulation and real-world robotic tasks, FORCE achieves a 79% absolute improvement in task success rates, outperforms prior RL methods by 10%, and accelerates training by 32.5%.
LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?
This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.
Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models
Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.
Language models linearly encode a 'value axis' tracking expected goal success, study finds
Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.
Argus benchmark evaluates uncertainty quantification methods for computer-use GUI agents across VLMs and datasets
Researchers introduce Argus, a cross-regime benchmark for post-hoc uncertainty quantification (UQ) in single-step GUI grounding agents, covering 27 methods across 4 open-weight VLMs and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors. The central finding is 'selective transfer': UQ rankings are stable across datasets for a fixed model but degrade across model classes and observable interfaces, with cross-tier transfer to closed-source vendors averaging only +0.08 Spearman correlation. Hidden-state and density methods prove most stable for open-weight models, while conformal click regions reveal that score-level discrimination alone is insufficient for deployment safety. The benchmark releases per-item records and analysis scripts to support regime-aware UQ selection in GUI agents.


