Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
paper · active · provisional
why-multi-step-tool-use-reinforcement-learning-collapses-and-how-supervisory-signals-fix-it-9b8ef879 · 1 event · first seen 5d ago
Aliases: Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
More like this (12)
decoupled reinforcement learning
UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
shielded reinforcement learning
Multi-Task Learning
Reinforcement Learning from Rich Feedback with Distributional DAgger
Hierarchical Reinforcement Learning
sim-to-real reinforcement learning
Alternating Token-Weighted Unlearning
Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
ExpRL: Exploratory RL for LLM Mid-Training
rule-based reinforcement learning rewards
Reinforcement Learning for Code
Recent events (1)
Paper diagnoses RL collapse in multi-step tool-use training and proposes supervisory signal fixes
A new arXiv preprint identifies a failure mode in reinforcement learning for LLM tool use: catastrophic collapse caused by probability spikes in control tokens that disrupt structured execution while leaving underlying tool-use capability intact. The authors systematically evaluate supervisory signals—including off-policy supervision, hint-based guidance, and erroneous example supervision—under synchronous and interleaved training schemes. Interleaving SFT with RL improves stability but degrades performance under out-of-distribution format and content evaluation. Code is released as Tool-RL-Box.