QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
paper · active · provisional
qval-cheaply-evaluating-dense-supervision-signals-for-long-horizon-llm-agents-89784419 · 1 event · first seen 2d ago
Aliases: QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
More like this (12)
QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
ExpRL: Exploratory RL for LLM Mid-Training
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Learning from the Self-future: On-policy Self-distillation for dLLMs
Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
Efficient and Sound Probabilistic Verification for AI Agents
Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
Forecasting With LLMs: Improved Generalization Through Feature Steering
Recent events (1)
QVal: Training-free benchmark for evaluating dense supervision signals in long-horizon LLM agents
QVal is a new training-free testbed for evaluating dense supervision signals used to guide LLM agents over long-horizon trajectories, where outcome-only rewards are too sparse. The framework measures "Q-alignment", that is, whether a method's step scores match Q-values from a strong reference policy. This enables comparison of 21 methods across 4 environments and 7 methodological families without running full training pipelines. A key finding is that simple prompting baselines consistently outperform more sophisticated dense supervision methods from recent literature. The benchmark covers over 1,200 evaluation experiments across six open-weight model backbones.