Almanac
product

Tool-RL-Box

product · active · provisional · tool-rl-box-7a60e53d · 1 event · first seen 5d ago

Aliases: Tool-RL-Box

Recent events (1)

6 · arXiv · cs.LG · 5d ago

Paper diagnoses RL collapse in multi-step tool-use training and proposes supervisory signal fixes

A new arXiv preprint identifies a failure mode in reinforcement learning (RL) for LLM tool use: catastrophic collapse caused by probability spikes in control tokens, which disrupt structured execution while leaving the underlying tool-use capability intact. The authors systematically evaluate supervisory signals (off-policy supervision, hint-based guidance, and erroneous-example supervision) under synchronous and interleaved training schemes. Interleaving supervised fine-tuning (SFT) with RL improves stability but degrades performance under out-of-distribution format and content evaluation. Code is released as Tool-RL-Box.
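To make the two ideas in the summary concrete, here is a minimal Python sketch. It is not taken from Tool-RL-Box or the paper. The function names, the 1-SFT-per-N-RL ratio, and the spike rule (a probability exceeding a multiple of its recent average) are all hypothetical choices for illustration, and reading "interleaved" as alternating SFT and RL steps is an assumption.

```python
from typing import List


def interleaved_schedule(total_steps: int, rl_steps_per_sft: int) -> List[str]:
    """Label each training step "rl" or "sft".

    Assumption (not from the paper): interleaving means one SFT step
    follows every `rl_steps_per_sft` RL steps.
    """
    if rl_steps_per_sft < 1:
        raise ValueError("rl_steps_per_sft must be >= 1")
    schedule = []
    for step in range(total_steps):
        # Every (rl_steps_per_sft + 1)-th step is an SFT step.
        if (step + 1) % (rl_steps_per_sft + 1) == 0:
            schedule.append("sft")
        else:
            schedule.append("rl")
    return schedule


def find_spikes(probs: List[float], window: int, ratio: float) -> List[int]:
    """Return step indices where a logged control-token probability
    exceeds `ratio` times the mean of the previous `window` steps.

    Hypothetical heuristic for spotting the kind of probability spike
    the summary describes; the paper's own diagnostic may differ.
    """
    spikes = []
    for i in range(window, len(probs)):
        baseline = sum(probs[i - window:i]) / window
        if baseline > 0 and probs[i] > ratio * baseline:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    print(interleaved_schedule(6, 2))
    print(find_spikes([0.1, 0.1, 0.1, 0.9, 0.1], window=3, ratio=3.0))
```

In a real run, `probs` would come from logging the policy's probability for a chosen control token at each step. The window and ratio would need tuning against known-stable runs.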