Almanac
paper

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

paperactiveprovisionalwhy-multi-step-tool-use-reinforcement-learning-collapses-and-how-supervisory-signals-fix-it-9b8ef879·1 events·first seen 5d ago

Aliases: Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.LG·5d ago·source ↗

Paper diagnoses RL collapse in multi-step tool-use training and proposes supervisory signal fixes

A new arXiv preprint identifies a failure mode in reinforcement learning for LLM tool use: catastrophic collapse caused by probability spikes in control tokens that disrupt structured execution while leaving underlying tool-use capability intact. The authors systematically evaluate supervisory signals—including off-policy supervision, hint-based guidance, and erroneous example supervision—under synchronous and interleaved training schemes. Interleaving SFT with RL improves stability but degrades performance under out-of-distribution format and content evaluation. Code is released as Tool-RL-Box.