Almanac
technique

TV loss

technique · active · provisional · tv-loss-00ea6401 · 1 event · first seen 6d ago

Aliases: TV loss

Recent events (1)

6 · arXiv · cs.LG · 6d ago

Bebop: MTP with rejection sampling and TV loss achieves 1.8x RL training speedup

Researchers introduce Bebop, a framework for integrating Multi-Token Prediction (MTP) into large-scale RL training pipelines for LLMs. The work finds that MTP acceptance rates degrade during RL because of entropy fluctuations. To counter this, it proposes probabilistic rejection sampling together with a novel end-to-end Total Variation (TV) loss that directly optimizes multi-step acceptance rates, reaching acceptance rates of up to 95% and 25% additional inference throughput. Applied to Qwen3.5, Qwen3.6, and Qwen3.7 models, the method yields up to 1.8x end-to-end acceleration in async RL training. Because the MTP module is trained with the proposed objectives before RL, the approach avoids costly online MTP updates.
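
The summary does not give Bebop's exact loss, so the following is only a minimal sketch of why a TV-based objective lines up with acceptance rates. It relies on the standard speculative-sampling result: under probabilistic rejection sampling, a token drafted from q is accepted against target p with probability sum_x min(p(x), q(x)) = 1 - TV(p, q). The function names, the toy distributions, and the multi-step form (which treats positions as independent) are assumptions made for illustration, not the paper's formulation.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def acceptance_rate(p, q):
    """Expected acceptance probability of a draft token sampled from q
    when verified against target p with probabilistic rejection sampling."""
    return sum(min(a, b) for a, b in zip(p, q))


def multi_step_tv_loss(target_steps, draft_steps):
    """Hypothetical multi-step objective: 1 - prod_k (1 - TV_k).

    Assumption: each draft position is treated as independent, so the
    product approximates the chance that every drafted token is accepted.
    """
    accept_all = 1.0
    for p, q in zip(target_steps, draft_steps):
        accept_all *= 1.0 - tv_distance(p, q)
    return 1.0 - accept_all


# Toy next-token distributions over a 3-token vocabulary (made-up values).
target = [0.5, 0.3, 0.2]  # main model
draft = [0.4, 0.4, 0.2]   # MTP head

single_step_accept = acceptance_rate(target, draft)
single_step_tv = tv_distance(target, draft)
two_step_loss = multi_step_tv_loss([target, target], [draft, draft])
```

With these toy values the acceptance rate is approximately 0.9 and the TV distance approximately 0.1, so driving the TV term down raises the acceptance rate directly. A training implementation would compute the same quantity on model logits with a differentiable tensor library.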