Almanac
product

PROVE

productactiveprovisionalprove-4c32b187·1 events·first seen 13d ago

Aliases: PROVE

Co-occurring entities

More like this (12)

Recent events (1)

7arXiv · cs.CL·13d ago·source ↗

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.