benchmark

QVal

benchmark · active · provisional · qval-dfb97dc6 · 1 event · first seen 2d ago

Aliases: QVal

Co-occurring entities

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

More like this (12)

QVal QAGS MedQADE VQA-RAD QVQ-Max GQA IQL Omega-QVLA QVTo PQuAD EG-VQA CXR-VQA

Recent events (1)

6 · arXiv · cs.LG · 2d ago

QVal: Training-free benchmark for evaluating dense supervision signals in long-horizon LLM agents

QVal is a new training-free testbed for evaluating dense supervision signals used to guide LLM agents over long-horizon trajectories, where outcome-only rewards are too sparse. The framework measures 'Q-alignment', that is, whether a method's step scores match Q-values from a strong reference policy. This enables comparison of 21 methods across 4 environments and 7 methodological families without running full training pipelines. A key finding is that simple prompting baselines consistently outperform more sophisticated dense supervision methods from recent literature. The benchmark covers over 1,200 evaluation experiments across six open-weight model backbones.
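To make the idea of Q-alignment concrete, here is a minimal sketch of one way to score agreement between a method's step scores and reference Q-values along a single trajectory. The summary above does not say which agreement statistic QVal uses, so the Spearman rank correlation here is an assumption chosen for illustration, and the names `q_alignment`, `step_scores`, and `reference_q` are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def q_alignment(step_scores, reference_q):
    """Illustrative alignment score (assumption: Spearman rank correlation).

    step_scores: per-step scores from the dense supervision method under test.
    reference_q: per-step Q-values from a strong reference policy.
    """
    if len(step_scores) != len(reference_q):
        raise ValueError("both sequences must cover the same steps")
    return pearson(average_ranks(step_scores), average_ranks(reference_q))


# Hypothetical numbers: the method orders the three steps like the reference.
print(q_alignment([0.1, 0.4, 0.9], [1.0, 2.0, 5.0]))
```

A rank-based statistic only asks whether the method orders steps the same way the reference does, so it does not depend on the two scales matching. Because both inputs are computed offline, no training run is involved in this comparison.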

Evaluation and Benchmarking · Agent and Tool Ecosystem · QVal