Entity · technique

Multi-Token Prediction (MTP)

technique · active · multi-token-prediction-mtp--6041989e · 2 events · first seen May 27, 2026

Aliases: Multi-Token Prediction (MTP), Multi-Token Prediction

Co-occurring entities

speculative decoding · TV loss · Bebop · Qwen3 · On-Policy Distillation (OPD) · Pair-In, Pair-Out (PIPO) · Qwen3-4B · LiveCodeBench · AIME 2025 · GPQA Diamond · LongBench v2

More like this (12)

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference · 1M-token context · Match Task to Objective (MTO) · P-tokens · masked-token modeling · FastMCP · MAML · M-estimators · M1-TTS · ToaST (Tokenization with Split Trees) · next-channel prediction · Mondrian Conformal Prediction

Recent events (2)

6 · arXiv · cs.LG · Jun 11, 2026

Bebop: MTP with rejection sampling and TV loss achieves 1.8x RL training speedup

Researchers introduce Bebop, a framework for integrating Multi-Token Prediction (MTP) into large-scale RL training pipelines for LLMs. The work finds that MTP acceptance rates degrade during RL because of entropy fluctuations. It proposes probabilistic rejection sampling together with a novel end-to-end Total Variation (TV) loss that directly optimizes multi-step acceptance rates, reaching up to 95% acceptance and 25% additional inference throughput. Applied to Qwen3.5, Qwen3.6, and Qwen3.7 models, the method yields up to 1.8x end-to-end acceleration in asynchronous RL training. Because MTP is trained with the proposed objectives before RL begins, the approach avoids costly online MTP updates.

Training Infrastructure · Inference Economics · Multi-Token Prediction (MTP) · speculative decoding · TV loss · +3 more
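
The link between rejection sampling and a TV objective in the Bebop summary can be illustrated with the standard speculative-sampling rule, in which a draft token x drawn from q is accepted with probability min(1, p(x)/q(x)). Under that rule the per-token acceptance rate equals 1 minus the total variation distance between p and q, which is one plausible reason a TV loss targets acceptance directly. This is a minimal sketch of the textbook single-step rule, not Bebop's multi-step loss, which the summary does not specify; the toy distributions and all function names are assumptions made for illustration.

```python
import random

def tv_distance(p, q):
    """Total variation distance between two distributions over one vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def expected_acceptance(p, q):
    """Chance that a draft token sampled from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def speculative_step(p, q, rng):
    """One accept/reject step of standard speculative sampling.

    Returns (token, accepted). On rejection the token is resampled from
    the residual max(p - q, 0), which keeps the output distributed as p.
    """
    vocab = range(len(q))
    draft = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False

# Toy distributions, made up for illustration (not from the paper).
target = [0.5, 0.3, 0.2]  # p: main model's next-token distribution
draft = [0.4, 0.4, 0.2]   # q: MTP head's distribution

rng = random.Random(0)
trials = 20000
accepted = sum(speculative_step(target, draft, rng)[1] for _ in range(trials))
empirical_rate = accepted / trials

print("analytic acceptance:", expected_acceptance(target, draft))
print("1 - TV(p, q):       ", 1.0 - tv_distance(target, draft))
print("empirical acceptance:", empirical_rate)
```

Lowering the TV distance between the draft and target distributions raises the acceptance rate one for one in this single-token setting; how Bebop extends this to multi-step acceptance is left to the paper.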

6 · arXiv · cs.CL · May 27, 2026

Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference

PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.

Long Context Evolution · Inference Economics · On-Policy Distillation (OPD) · Multi-Token Prediction (MTP) · speculative decoding · +7 more
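
The mirror structure described in the PIPO summary can be sketched at the level of data flow: two input embeddings fold into one latent, and one hidden state unfolds into a second output token when a confidence head allows it. The sketch below is a toy under stated assumptions. The names `fold_pairs` and `pair_out_step`, the threshold value, the handling of a trailing odd token, and the stand-in heads are all hypothetical; the summary does not describe PIPO's actual modules.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def fold_pairs(
    embeddings: Sequence[Vector],
    compress: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Pair-in: fold each adjacent pair of token embeddings into one latent.

    Assumption: a trailing unpaired token is passed through unchanged.
    """
    latents = [
        compress(embeddings[i], embeddings[i + 1])
        for i in range(0, len(embeddings) - 1, 2)
    ]
    if len(embeddings) % 2 == 1:
        latents.append(list(embeddings[-1]))
    return latents

def pair_out_step(
    hidden: Vector,
    lm_head: Callable[[Vector], int],
    mtp_head: Callable[[Vector], int],
    confidence_head: Callable[[Vector], float],
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[int]:
    """Pair-out: emit the next token, plus one extra token if confident.

    The confidence head stands in for a verifier pass: the extra token is
    kept only when its predicted acceptance clears the threshold.
    """
    tokens = [lm_head(hidden)]
    if confidence_head(hidden) >= threshold:
        tokens.append(mtp_head(hidden))
    return tokens

# Toy stand-ins so the sketch runs; real heads would be learned modules.
def mean_compress(a: Vector, b: Vector) -> Vector:
    return [(x + y) / 2.0 for x, y in zip(a, b)]

latents = fold_pairs([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]], mean_compress)
confident = pair_out_step([0.9], lambda h: 7, lambda h: 8, lambda h: h[0])
unsure = pair_out_step([0.1], lambda h: 7, lambda h: 8, lambda h: h[0])
print(latents, confident, unsure)
```

With pairs folded on the input side and up to two tokens emitted per step on the output side, both sequence length and decode steps shrink, which is the symmetry the summary points to.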