QVal
benchmark · active · provisional
qval-dfb97dc6 · 1 event · first seen 2d ago · Aliases: QVal
QVal: Training-free benchmark for evaluating dense supervision signals in long-horizon LLM agents
QVal is a new training-free testbed for evaluating the dense supervision signals used to guide LLM agents over long-horizon trajectories, where outcome-only rewards are too sparse. The framework measures 'Q-alignment': whether the step-level scores a method assigns match the Q-values of a strong reference policy. This allows 21 methods from 7 methodological families to be compared across 4 environments without running full training pipelines. A key finding is that simple prompting baselines consistently outperform the more sophisticated dense supervision methods from recent literature. The benchmark comprises over 1,200 evaluation experiments across six open-weight model backbones.
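To make the idea of Q-alignment concrete, here is a minimal Python sketch of one way such a score could be computed for a single trajectory. It is an illustration only: the summary above does not say which statistic QVal uses, so the choice of Spearman rank correlation, the function name `q_alignment`, and the toy numbers are all assumptions rather than the benchmark's actual procedure.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def q_alignment(step_scores, reference_q):
    """Hypothetical alignment score: Spearman correlation between a method's
    per-step scores and reference Q-values (an assumed metric, not QVal's)."""
    if len(step_scores) != len(reference_q):
        raise ValueError("one score and one Q-value are required per step")
    return pearson(ranks(step_scores), ranks(reference_q))


# Toy data (made up for illustration): Q-values from a reference policy
# and step scores from two hypothetical supervision methods.
reference_q = [0.1, 0.4, 0.3, 0.9]
scores_same_order = [1, 4, 3, 9]      # ranks steps exactly like the reference
scores_reverse_order = [9, 6, 7, 1]   # ranks steps in the opposite order

aligned = q_alignment(scores_same_order, reference_q)
misaligned = q_alignment(scores_reverse_order, reference_q)
```

A rank-based statistic is used in this sketch because it depends only on how steps are ordered, not on the scale of the scores, which makes methods with different output ranges comparable. No model is trained at any point, which is the sense in which such an evaluation is training-free.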