What it is
Speculative decoding is an inference-time acceleration technique for generative models. The core idea exploits an asymmetry: scoring a sequence of already-written tokens in parallel is much cheaper than generating them one at a time. A small, fast draft model proposes a block of candidate tokens; the large target model then evaluates all of them in a single parallel forward pass, keeping the candidates that pass its acceptance test and discarding everything from the first mismatch onward. Because the target model can emit several tokens per forward pass instead of one, wall-clock latency drops, and with the standard acceptance rule the output distribution is unchanged.
How it works
The canonical loop has three steps:
1. Draft: the small model autoregressively generates k candidate tokens.
2. Verify: the target model runs one forward pass over the input plus all k candidates, producing logits for each position in parallel.
3. Accept/reject: a rejection-sampling criterion compares draft and target probabilities token by token. All tokens before the first rejection are accepted; at the rejection point the drafted token is discarded and the target model samples a corrected token in its place, and the loop restarts. If all k candidates are accepted, the same forward pass also supplies one extra token from the target.
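The accept/reject step can be sketched as follows. This is a minimal illustration of the standard rule (accept a drafted token with probability min(1, p/q), otherwise resample from the renormalised residual max(0, p - q)), not a production implementation. The function name `speculative_step` and its array arguments are assumptions made for this example; a real system would obtain the probabilities from the two models.

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One accept/reject pass (illustrative sketch).

    draft_probs:  (k, V) draft distributions q_i at each drafted position
    target_probs: (k+1, V) target distributions p_i from the single verify pass
    draft_tokens: k token ids that were sampled from the draft model
    Returns the tokens emitted this step (between 1 and k+1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # Rejected: resample from the residual max(0, p - q), renormalised.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # All k accepted: the verify pass also yields a token for position k+1.
    out.append(int(rng.choice(target_probs.shape[1], p=target_probs[-1])))
    return out
```

Resampling from the residual rather than from p directly is what keeps the emitted tokens distributed exactly as if the target model had sampled them on its own.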
The speedup comes from step 2: one target-model forward pass can yield anywhere from 1 to k+1 tokens (the accepted draft tokens plus one token sampled by the target itself). When the draft model is well-aligned with the target, acceptance rates are high and the number of tokens produced per pass approaches k+1. When alignment is poor, as happens during reinforcement learning when the policy drifts, acceptance rates collapse and the overhead of drafting dominates.
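To make the dependence on acceptance rate concrete, here is a small sketch of the expected number of tokens per target pass. It assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification (real acceptance rates vary by position and context); under that assumption the expectation is the geometric sum (1 - alpha^(k+1)) / (1 - alpha). The function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each drafted
    token is accepted independently with probability alpha (a simplification)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Compare a poorly aligned, a decent, and a well-aligned drafter.
for alpha in (0.5, 0.8, 0.95):
    print(alpha, [round(expected_tokens_per_pass(alpha, k), 2) for k in (2, 4, 8)])
```

With a low `alpha` the curve flattens quickly, so drafting more tokens buys almost nothing; with a high `alpha` longer drafts keep paying off.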
Why it matters
Speculative decoding is one of the few inference optimizations that reduces latency (time-to-last-token) rather than just improving throughput under batching. For interactive applications, agentic loops, and streaming use cases, this distinction is critical. It also composes with other techniques: KV caching, quantization, and blockwise decoding can all be layered on top.
Hugging Face shipped the technique as "assisted generation" in its Transformers library in May 2023, making it accessible across the open-weights ecosystem. The same year, it was applied to Whisper for approximately 2× ASR speedup, demonstrating that the technique generalizes beyond text LLMs.
Variants and extensions
The field has moved rapidly from the two-model baseline toward a richer design space:
Self-speculative decoding (LayerSkip)
LayerSkip eliminates the separate draft model entirely by using early exit from transformer layers as the drafter. The same model generates a draft by exiting partway through, then verifies with the full stack. This removes the memory overhead of a second model and simplifies deployment, at the cost of requiring early-exit training.
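The idea can be illustrated with a toy model. Everything below is an assumption made for the sketch: a random residual network stands in for the transformer, the next token depends only on the last token, decoding is greedy, and verification recomputes the full stack token by token. A real LayerSkip deployment would verify in one batched pass and reuse the early-layer computation from the draft stage.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, L, EXIT = 16, 8, 6, 2           # vocab, width, layers, exit layer (toy values)
emb = rng.normal(size=(V, D))
layers = [rng.normal(scale=0.3, size=(D, D)) for _ in range(L)]
head = rng.normal(size=(D, V))        # one LM head shared by draft and verify paths

def next_token(last_tok: int, n_layers: int) -> int:
    """Greedy next token using only the first n_layers (toy: last token only)."""
    h = emb[last_tok]
    for W in layers[:n_layers]:
        h = h + np.tanh(h @ W)        # residual block
    return int(np.argmax(h @ head))

def self_speculative_generate(prompt_tok: int, n_new: int, k: int = 4):
    out = [prompt_tok]
    while len(out) - 1 < n_new:
        # Draft k tokens by exiting after EXIT layers.
        draft, t = [], out[-1]
        for _ in range(k):
            t = next_token(t, EXIT)
            draft.append(t)
        # Verify with the full stack: keep matches, stop at the first correction.
        t = out[-1]
        for d in draft:
            full = next_token(t, L)
            out.append(full)
            if full != d:
                break
            t = full
    return out[: n_new + 1]
```

Because every emitted token comes from the full stack, the output is identical to plain greedy decoding with all L layers; the early-exit draft only determines how many tokens each verification round can confirm.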
Removing the tokenizer constraint (Universal Assisted Generation)
Standard speculative decoding requires draft and target models to share a tokenizer. Hugging Face's Universal Assisted Generation removes this constraint, letting any smaller open-weights model serve as a drafter — significantly widening the practical pairing space.
Dynamic speculation depth
Fixed lookahead k is suboptimal: easy tokens warrant longer drafts; hard tokens waste budget. Hugging Face's dynamic speculation adjusts k at runtime, improving throughput over fixed-lookahead baselines.
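One simple way to vary the draft length at runtime is to stop drafting as soon as the drafter becomes unsure of its own prediction. The sketch below shows that rule only; it is not claimed to be the exact policy Hugging Face ships. The callback `draft_step`, the cap `max_k`, and the `threshold` value are all hypothetical.

```python
from typing import Callable, List, Tuple

def draft_until_unsure(
    draft_step: Callable[[List[int]], Tuple[int, float]],
    context: List[int],
    max_k: int = 8,
    threshold: float = 0.4,
) -> List[int]:
    """Draft up to max_k tokens, stopping early once the drafter's confidence
    (the probability of its chosen token) drops below threshold."""
    drafted: List[int] = []
    for _ in range(max_k):
        tok, conf = draft_step(context + drafted)
        if conf < threshold:
            break                     # hand over to the target model early
        drafted.append(tok)
    return drafted
```

Easy stretches then produce drafts close to `max_k`, while hard positions trigger verification after only a token or two, so less draft compute is spent on candidates likely to be rejected.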
Tree-based drafting (Graft)
Rather than a linear draft sequence, tree-based methods like EAGLE and its successors maintain a draft tree of branching candidate continuations, increasing the probability that at least one branch is accepted. Graft improves on this with a prune-then-retrieve strategy: dynamic-depth pruning reduces VRAM and compute overhead, while retrieval-based token compensation fills topological gaps in the tree at near-zero cost. On Qwen3-235B, Graft improves average speedup over EAGLE-3 by up to 21.8% and achieves up to 5.41× on short-context benchmarks — with no training required.
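The benefit of branching can be seen in a toy verifier. This is a generic sketch of tree verification under greedy decoding, not Graft's or EAGLE's algorithm: the nested-dict tree format and the `target_next` callback are assumptions, and the target is queried one position at a time, whereas real systems score every tree node in a single pass using a tree-shaped attention mask.

```python
from typing import Callable, Dict, List

Tree = Dict[int, "Tree"]   # token id -> subtree of candidate continuations

def verify_tree(tree: Tree, prefix: List[int],
                target_next: Callable[[List[int]], int]) -> List[int]:
    """Walk the draft tree along the target's greedy choices.
    Returns the accepted tokens plus the target's own token where the tree runs out."""
    accepted: List[int] = []
    node = tree
    while True:
        tok = target_next(prefix + accepted)
        accepted.append(tok)          # the target's token is always kept
        if tok not in node:
            return accepted           # no branch matched: stop after this token
        node = node[tok]
```

A linear draft is the special case where every node has one child; adding siblings raises the chance that the target's choice is present at each depth, at the cost of verifying more nodes.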
Multi-tier verification (VIA-SD)
The binary accept/reject step is a bottleneck: uncertain tokens that would be rejected by the full model still consume a full verification pass. VIA-SD introduces a three-tier scheme — a lightweight "slim verifier" submodel handles medium-confidence tokens, reserving full-model verification only for the most uncertain cases. This reduces rejection rates by 0.10–0.22 and adds 10–20% speedup over strong speculative decoding baselines, composing with existing frameworks without retraining.
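A confidence-based router of this general shape might look like the sketch below. It is only an illustration of three-way routing: the text above does not say how high-confidence tokens are handled, so accepting them outright is an assumption here, and the thresholds `hi` and `lo` are illustrative values, not taken from VIA-SD.

```python
from enum import Enum

class Tier(Enum):
    ACCEPT = "accept"          # assumed first tier: trust the draft outright
    SLIM = "slim_verifier"     # medium confidence: cheap submodel check
    FULL = "full_model"        # low confidence: full target verification

def route(confidence: float, hi: float = 0.9, lo: float = 0.5) -> Tier:
    """Pick a verification tier from the drafter's confidence in a token.
    The thresholds hi and lo are illustrative, not taken from VIA-SD."""
    if confidence >= hi:
        return Tier.ACCEPT
    if confidence >= lo:
        return Tier.SLIM
    return Tier.FULL

def routing_summary(confidences):
    """Count how many drafted tokens land in each tier."""
    counts = {t: 0 for t in Tier}
    for c in confidences:
        counts[route(c)] += 1
    return counts
```

Any tier that bypasses the full model gives up the exact-distribution guarantee of standard speculative decoding, so thresholds like these would need to be tuned against quality metrics.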
Diffusion LLMs (SimSD)
Diffusion language models use bidirectional attention and masked language modeling, making standard token-level speculative decoding inapplicable. SimSD introduces a plug-and-play masking strategy that injects reference tokens from a draft model and applies a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs, SimSD achieves up to 7.46× decoding throughput while maintaining or improving generation quality, and is compatible with KV cache and blockwise decoding.
RL training pipelines (Bebop)
Speculative decoding via Multi-Token Prediction (MTP) degrades during reinforcement learning because policy entropy fluctuates, collapsing acceptance rates. Bebop addresses this with probabilistic rejection sampling and a Total Variation (TV) loss that directly optimizes multi-step acceptance rates end-to-end, reaching 95% acceptance rates and 25% extra inference throughput gains. Applied to Qwen3.5/3.6/3.7 in async RL training, Bebop achieves up to 1.8× end-to-end acceleration without requiring costly online MTP updating.
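The link between total variation and acceptance is a standard identity: under rejection sampling, the probability that a drafted token is accepted equals the sum over tokens of min(p, q), which is 1 - TV(p, q). The sketch below checks this on toy distributions. It illustrates only the single-step identity; how Bebop builds its multi-step loss is not reproduced here, and the function names and numbers are made up for the example.

```python
import numpy as np

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """TV(p, q) = 0.5 * sum_x |p(x) - q(x)|."""
    return 0.5 * float(np.abs(p - q).sum())

def acceptance_rate(p: np.ndarray, q: np.ndarray) -> float:
    """Probability that a token sampled from draft q survives rejection
    sampling against target p: sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())

p = np.array([0.6, 0.3, 0.1])   # target distribution (toy values)
q = np.array([0.4, 0.4, 0.2])   # draft distribution (toy values)
print(acceptance_rate(p, q), 1.0 - total_variation(p, q))
```

Because the two quantities coincide, driving the TV distance between drafter and target toward zero raises the acceptance rate directly, which is why TV is a natural training objective for a drafter.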
Hardware-specific deployments
The technique has been validated beyond GPU clusters: Hugging Face and Intel demonstrate speculative decoding for Qwen3-8B on Intel Core Ultra client hardware using depth-pruned draft models, and for StarCoder on Intel Xeon CPUs via the Optimum Intel library — showing the approach is viable for on-device and commodity-server inference, not just data-center workloads.
Tradeoffs and when not to use it
Speculative decoding is most effective when:
- The draft model has high acceptance rate (well-aligned with the target on the task distribution).
- The workload is latency-sensitive rather than throughput-maximizing under heavy batching (where continuous batching already amortizes forward-pass cost).
- The target model is large enough that the per-token cost dominates over the drafting overhead.
It is less effective — or actively harmful — when:
- The prompt distribution is highly entropic or out-of-distribution for the drafter (acceptance rates collapse).
- The serving system is already heavily batched (the parallel verification advantage shrinks).
- The model is being served inside an RL training loop and acceptance rates are not explicitly stabilized (the problem Bebop targets).
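These conditions can be captured in a rough cost model. The sketch below assumes an i.i.d. per-token acceptance rate `alpha`, a draft step costing a fraction `c` of one target forward pass, and a verification pass over k tokens costing the same as a single-token pass (reasonable when decoding is memory-bound, much less so under heavy batching). Under those assumptions the modelled speedup is (1 - alpha^(k+1)) / ((1 - alpha)(c*k + 1)). The function name and the sample values are hypothetical.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Modelled wall-clock speedup over plain autoregressive decoding.

    alpha: per-token acceptance rate (assumed i.i.d.)
    k:     draft length
    c:     cost of one draft step relative to one target forward pass
    """
    tokens = float(k + 1) if alpha >= 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)
    cost = c * k + 1.0                # k draft steps plus one verify pass
    return tokens / cost

# For each acceptance rate, find the draft length that maximises the model.
for alpha in (0.3, 0.6, 0.9):
    best_k = max(range(1, 11), key=lambda k: expected_speedup(alpha, k, c=0.1))
    print(alpha, best_k, round(expected_speedup(alpha, best_k, 0.1), 2))
```

A value below 1 means speculative decoding is slower than plain decoding: with `alpha` near zero every round still pays for k draft steps but yields only one token.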
Where it's heading
The active research frontier is moving in three directions simultaneously: better draft construction (tree pruning, retrieval, self-speculation), smarter verification (multi-tier routing, confidence-gated passes), and new modalities (diffusion LLMs, speech models, RL training loops). The technique has also crossed from research into production infrastructure across multiple hardware vendors and serving frameworks, suggesting it is becoming a standard layer in the inference stack rather than an optional optimization.




