paper

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

paperactiveprovisionalwhy-multi-step-tool-use-reinforcement-learning-collapses-and-how-supervisory-signals-fix-it-9b8ef879·1 events·first seen 5d ago

Aliases: Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

Co-occurring entities

Tool-RL-Box

More like this (12)

decoupled reinforcement learning UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning shielded reinforcement learning Multi-Task Learning Reinforcement Learning from Rich Feedback with Distributional DAgger Hierarchical Reinforcement Learning sim-to-real reinforcement learning Alternating Token-Weighted Unlearning Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning ExpRL: Exploratory RL for LLM Mid-Training rule-based reinforcement learning rewards Reinforcement Learning for Code

Recent events (1)

6arXiv · cs.LG·5d ago·source ↗

Paper diagnoses RL collapse in multi-step tool-use training and proposes supervisory signal fixes

A new arXiv preprint identifies a failure mode in reinforcement learning for LLM tool use: catastrophic collapse caused by probability spikes in control tokens that disrupt structured execution while leaving underlying tool-use capability intact. The authors systematically evaluate supervisory signals—including off-policy supervision, hint-based guidance, and erroneous example supervision—under synchronous and interleaved training schemes. Interleaving SFT with RL improves stability but degrades performance under out-of-distribution format and content evaluation. Code is released as Tool-RL-Box.

Agent and Tool Ecosystem Alignment and RLHF Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It Tool-RL-Box