Almanac
Concept guide · In-depth

Speculative Decoding: Draft-Then-Verify Inference Acceleration

TL;DR: Speculative decoding exploits the asymmetry between generating tokens and verifying them: a cheap draft model proposes several tokens at once, and the target model checks them all in a single parallel forward pass, accepting correct guesses and discarding the rest. What began as a two-model trick for autoregressive LLMs has since branched into self-speculative variants, diffusion-model adaptations, multi-tier verification schemes, and tight integration with RL training pipelines, making it one of the most actively extended inference primitives in the field.

Key takeaways

  • The core mechanism lets a target model accept multiple tokens per forward pass, cutting wall-clock latency without changing the output distribution.
  • LayerSkip eliminates the need for a separate draft model by using early-exit transformer layers as the drafter, reducing memory overhead.
  • SimSD extends speculative decoding to diffusion LLMs — previously incompatible due to bidirectional attention — achieving up to 7.46× throughput improvement.
  • Graft's prune-then-retrieve tree construction improves average speedup over EAGLE-3 by up to 21.8% on Qwen3-235B with no training required.
  • VIA-SD's three-tier verification routes medium-confidence tokens to a slim verifier submodel, reducing rejection rates by 0.10–0.22 and adding a 10–20% speedup over strong speculative decoding baselines.
  • Bebop applies Multi-Token Prediction with TV loss to RL training pipelines, reaching 95% acceptance rates and 1.8× end-to-end acceleration on async RL runs.

What it is

Speculative decoding is an inference-time acceleration technique for generative models. The core idea exploits a fundamental asymmetry: the target model can score a whole block of candidate tokens in one parallel forward pass, whereas generating the same tokens autoregressively takes one pass per token. A small, fast draft model proposes a block of candidate tokens; the large target model then evaluates all of them in a single parallel forward pass, accepting the tokens that pass verification and discarding everything from the first mismatch onward. Because the target model can emit several tokens per forward pass instead of one, wall-clock latency drops without any change to the output distribution.

How it works

The canonical loop has three steps:

  1. Draft: the small model autoregressively generates k candidate tokens.
  2. Verify: the target model runs one forward pass over the input plus all k candidates, producing logits for each position in parallel.
  3. Accept/reject: a rejection-sampling criterion compares draft and target probabilities token by token. Tokens before the first rejection are accepted; at the rejection point the target model samples a corrected token in place of the rejected draft, and the loop restarts. If all k candidates are accepted, the target's logits for the next position supply one extra token.
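The accept/reject step can be sketched in a few lines of Python. This is a minimal illustration of standard speculative sampling, not any particular library's implementation; the function name `speculative_step` and its argument layout are assumptions made for this example, and distributions are plain lists of probabilities.

```python
import random


def speculative_step(draft_tokens, q_dists, p_dists, rng):
    """One accept/reject pass over k drafted tokens (illustrative sketch).

    draft_tokens: k token ids proposed by the draft model.
    q_dists: k draft distributions, one per drafted position.
    p_dists: k + 1 target distributions from the single verification pass.
    Returns the tokens emitted by this pass (between 1 and k + 1 of them).
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every drafted token was accepted: take one bonus token from the target.
    bonus = p_dists[len(draft_tokens)]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted
```

Resampling from the residual at the first rejection is what keeps the emitted tokens distributed according to the target model even when the draft distribution differs from it.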

The speedup comes from step 2: one target-model forward pass yields between 1 and k + 1 tokens (the accepted drafts plus one token sampled by the target itself). When the draft model is well-aligned with the target, acceptance rates are high and the number of tokens per pass approaches k + 1. When alignment is poor, as happens during reinforcement learning when the policy drifts, acceptance rates collapse and the overhead of drafting dominates.
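To make the dependence on acceptance rate concrete, the sketch below assumes each drafted token is accepted independently with the same probability `alpha`. That independence is a simplifying assumption of this example, not a claim from the sources; under it, the expected number of tokens per target pass has a closed form.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    alpha: assumed per-token acceptance probability (0 <= alpha <= 1).
    k: number of drafted tokens per pass.
    """
    if alpha >= 1.0:
        # Every draft is accepted, plus the bonus token from the target.
        return float(k + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

For alpha = 0.8 and k = 4 this gives about 3.36 tokens per pass, against a ceiling of 5; at alpha = 0 it falls back to exactly 1, the same as ordinary decoding.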

Why it matters

Speculative decoding is one of the few inference optimizations that reduces latency (time-to-last-token) rather than just improving throughput under batching. For interactive applications, agentic loops, and streaming use cases, this distinction is critical. It also composes with other techniques: KV caching, quantization, and blockwise decoding can all be layered on top.

Hugging Face shipped the technique as "assisted generation" in its Transformers library in May 2023, making it accessible across the open-weights ecosystem. The same year, it was applied to Whisper for approximately 2× ASR speedup, demonstrating that the technique generalizes beyond text LLMs.

Variants and extensions

The field has moved rapidly from the two-model baseline toward a richer design space:

Self-speculative decoding (LayerSkip)

LayerSkip eliminates the separate draft model entirely by using early exit from transformer layers as the drafter. The same model generates a draft by exiting partway through, then verifies with the full stack. This removes the memory overhead of a second model and simplifies deployment, at the cost of requiring early-exit training.
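In Hugging Face Transformers, self-speculative decoding is exposed through the `assistant_early_exit` argument of `generate()`. The sketch below is a hedged illustration: the wrapper name `generate_with_early_exit`, the default exit layer, and the token budget are assumptions for this example, and the checkpoint passed in must have been trained for early exit.

```python
def generate_with_early_exit(checkpoint, prompt, exit_layer=4, max_new_tokens=64):
    """Draft with the first `exit_layer` layers, verify with the full stack."""
    # Imported lazily so the sketch loads even without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        assistant_early_exit=exit_layer,  # layer index used as the drafter
        do_sample=False,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Only one set of weights is loaded, which is where the memory saving over a two-model setup comes from. The best exit layer depends on the checkpoint and is worth tuning empirically.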

Removing the tokenizer constraint (Universal Assisted Generation)

Standard speculative decoding requires draft and target models to share a tokenizer. Hugging Face's Universal Assisted Generation removes this constraint, letting any smaller open-weights model serve as a drafter — significantly widening the practical pairing space.
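A minimal sketch of how this looks with Transformers' `generate()`: passing both tokenizers alongside `assistant_model` lets the library translate between vocabularies. The wrapper name `generate_with_any_assistant` and the token budget are assumptions for this example, and the checkpoint names are left to the caller.

```python
def generate_with_any_assistant(target_name, assistant_name, prompt, max_new_tokens=64):
    """Assisted generation with a drafter whose tokenizer differs from the target's."""
    # Imported lazily so the sketch loads even without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_name)
    assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_name)
    model = AutoModelForCausalLM.from_pretrained(target_name)
    assistant_model = AutoModelForCausalLM.from_pretrained(assistant_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        assistant_model=assistant_model,
        tokenizer=tokenizer,                      # target tokenizer
        assistant_tokenizer=assistant_tokenizer,  # drafter tokenizer
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

When the two models already share a tokenizer, the two tokenizer arguments can be dropped and only `assistant_model` is needed.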

Dynamic speculation depth

Fixed lookahead k is suboptimal: easy tokens warrant longer drafts; hard tokens waste budget. Hugging Face's dynamic speculation adjusts k at runtime, improving throughput over fixed-lookahead baselines.
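Two simple ways to adapt k are sketched below. The first grows or shrinks the lookahead based on how the previous draft fared; the second stops drafting as soon as the draft model's confidence drops below a threshold. Both function names, the step sizes, and the 0.4 threshold are illustrative assumptions, not the library's documented behavior.

```python
def next_lookahead(k, num_accepted, k_min=1, k_max=20):
    """Grow k after a fully accepted draft, shrink it after any rejection."""
    if num_accepted == k:
        return min(k_max, k + 2)
    return max(k_min, k - 1)


def draft_until_unsure(confidences, threshold=0.4, k_max=20):
    """Count the drafted tokens kept before the first low-confidence one.

    confidences: the draft model's probability for each token it proposed.
    """
    kept = 0
    for conf in confidences[:k_max]:
        if conf < threshold:
            break
        kept += 1
    return kept
```

For example, `next_lookahead(5, 5)` returns 7 and `next_lookahead(5, 2)` returns 4, while `draft_until_unsure([0.9, 0.8, 0.3, 0.95])` keeps only the first two tokens.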

Tree-based drafting (Graft)

Rather than a linear draft sequence, tree-based methods like EAGLE and its successors maintain a draft tree of branching candidate continuations, increasing the probability that at least one branch is accepted. Graft improves on this with a prune-then-retrieve strategy: dynamic-depth pruning reduces VRAM and compute overhead, while retrieval-based token compensation fills topological gaps in the tree at near-zero cost. On Qwen3-235B, Graft improves average speedup over EAGLE-3 by up to 21.8% and achieves up to 5.41× on short-context benchmarks — with no training required.
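The benefit of branching can be seen in a toy greedy verifier. This sketch is generic tree verification, not Graft's or EAGLE's algorithm: it checks each root-to-leaf path against the target's greedy choice and keeps the longest accepted one. The function name and the callable `target_next_token` are assumptions for this example, and real systems score all tree nodes in one forward pass with a tree attention mask rather than calling the target per node.

```python
def longest_accepted_path(draft_paths, target_next_token, prefix):
    """Pick the draft branch that agrees with the target for the most tokens.

    draft_paths: candidate continuations (lists of token ids), one per branch.
    target_next_token: callable mapping a token list to the target's greedy next token.
    prefix: tokens generated so far.
    Returns the accepted tokens followed by the target's own next token.
    """
    best = []
    for path in draft_paths:
        accepted = []
        for tok in path:
            if target_next_token(prefix + accepted) != tok:
                break  # first mismatch ends this branch
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # The target always contributes one token after the accepted prefix.
    return best + [target_next_token(prefix + best)]
```

With a single linear draft, one early mismatch wastes the rest of the draft; with several branches, a mismatch on one path still leaves the others in play.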

Multi-tier verification (VIA-SD)

The binary accept/reject step is a bottleneck: uncertain tokens that would be rejected by the full model still consume a full verification pass. VIA-SD introduces a three-tier scheme — a lightweight "slim verifier" submodel handles medium-confidence tokens, reserving full-model verification only for the most uncertain cases. This reduces rejection rates by 0.10–0.22 and adds 10–20% speedup over strong speculative decoding baselines, composing with existing frameworks without retraining.
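The routing idea can be sketched as a threshold rule on the draft model's confidence. This is an illustration only: the thresholds, the names `route_token` and `tier_usage`, and the choice to accept high-confidence tokens outright are assumptions of this example, since the summary here does not spell out how VIA-SD handles its top tier.

```python
from collections import Counter


def route_token(draft_confidence, accept_at=0.9, slim_at=0.5):
    """Choose a verification tier for one drafted token (hypothetical thresholds)."""
    if draft_confidence >= accept_at:
        return "accept"         # assumed top tier: no extra verification
    if draft_confidence >= slim_at:
        return "slim_verifier"  # medium confidence: cheap submodel check
    return "full_model"         # low confidence: full verification


def tier_usage(confidences, **thresholds):
    """Count how many drafted tokens land in each tier."""
    return Counter(route_token(c, **thresholds) for c in confidences)
```

Running `tier_usage` over a trace of draft confidences gives a quick estimate of how much full-model verification a given pair of thresholds would save.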

Diffusion LLMs (SimSD)

Diffusion language models use bidirectional attention and masked language modeling, making standard token-level speculative decoding inapplicable. SimSD introduces a plug-and-play masking strategy that injects reference tokens from a draft model and applies a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs, SimSD achieves up to 7.46× decoding throughput while maintaining or improving generation quality, and is compatible with KV cache and blockwise decoding.

RL training pipelines (Bebop)

Speculative decoding via Multi-Token Prediction (MTP) degrades during reinforcement learning because policy entropy fluctuates, collapsing acceptance rates. Bebop addresses this with probabilistic rejection sampling and a Total Variation (TV) loss that directly optimizes multi-step acceptance rates end-to-end, reaching 95% acceptance rates and a further 25% gain in inference throughput. Applied to Qwen3.5/3.6/3.7 in async RL training, Bebop achieves up to 1.8× end-to-end acceleration without requiring costly online MTP updating.
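One way to see why a TV loss targets acceptance directly: under standard speculative sampling, the probability that a drafted token is accepted equals one minus the total variation distance between the draft and target distributions. The sketch below checks that identity numerically. It assumes the standard acceptance rule; Bebop's own rejection-sampling scheme is not detailed here and may differ.

```python
def acceptance_probability(p, q):
    """Chance a token drafted from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


target = [0.6, 0.3, 0.1]
draft = [0.3, 0.3, 0.4]
# Acceptance probability and TV distance always sum to 1.
gap = acceptance_probability(target, draft) + total_variation(target, draft) - 1.0
```

Here the acceptance probability is 0.7 and the TV distance is 0.3, so lowering the TV distance between the MTP head and the policy raises acceptance by the same amount.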

Hardware-specific deployments

The technique has been validated beyond GPU clusters: Hugging Face and Intel demonstrate speculative decoding for Qwen3-8B on Intel Core Ultra client hardware using depth-pruned draft models, and for StarCoder on Intel Xeon CPUs via the Optimum Intel library — showing the approach is viable for on-device and commodity-server inference, not just data-center workloads.

Tradeoffs and when not to use it

Speculative decoding is most effective when:

  • The draft model has high acceptance rate (well-aligned with the target on the task distribution).
  • The workload is latency-sensitive rather than throughput-maximizing under heavy batching (where continuous batching already amortizes forward-pass cost).
  • The target model is large enough that the per-token cost dominates over the drafting overhead.

It is less effective — or actively harmful — when:

  • The prompt distribution is highly entropic or out-of-distribution for the drafter (acceptance rates collapse).
  • The serving system is already heavily batched (the parallel verification advantage shrinks).
  • Decoding happens inside an RL training loop without explicit stabilization of acceptance rates (the problem Bebop targets).
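These conditions can be turned into a rough back-of-the-envelope estimate. The sketch below assumes independent per-token acceptance with probability `alpha`, a draft step costing `cost_ratio` of one target pass, and a verification pass that costs the same as a normal decoding step. All three are simplifying assumptions, and the last one weakens under heavy batching; the function names are made up for this example.

```python
def estimated_speedup(alpha, k, cost_ratio):
    """Estimated wall-clock speedup over plain autoregressive decoding.

    alpha: assumed per-token acceptance probability.
    k: drafted tokens per pass (k = 0 means no speculation).
    cost_ratio: cost of one draft step relative to one target pass.
    """
    if alpha >= 1.0:
        tokens = float(k + 1)
    else:
        tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Each round costs k draft steps plus one target pass.
    return tokens / (k * cost_ratio + 1.0)


def best_draft_length(alpha, cost_ratio, max_k=16):
    """Draft length with the highest estimated speedup; 0 means do not speculate."""
    return max(range(max_k + 1), key=lambda k: estimated_speedup(alpha, k, cost_ratio))
```

With alpha = 0.8, k = 4, and a draft costing 5% of a target pass, the estimate is about 2.8×. With alpha = 0.2 and a draft costing 20%, it falls to about 0.69, meaning speculation would slow decoding down.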

Where it's heading

The active research frontier is moving in three directions simultaneously: better draft construction (tree pruning, retrieval, self-speculation), smarter verification (multi-tier routing, confidence-gated passes), and new modalities (diffusion LLMs, speech models, RL training loops). The technique has also crossed from research into production infrastructure across multiple hardware vendors and serving frameworks, suggesting it is becoming a standard layer in the inference stack rather than an optional optimization.

Speculative decoding: core loop and major variants

Speculative decoding variants and extensions

| Variant | Draft source | Training required? | Reported speedup | Key constraint lifted |
|---|---|---|---|---|
| Standard (two-model) | Separate smaller model | No (inference-time) | ~2–3× | Baseline |
| LayerSkip (self-speculative) | Early-exit layers of target model | Yes (early-exit training) | Inference speedup; no extra memory | No separate draft model needed |
| Universal Assisted Generation | Any mismatched-vocab assistant | No | Not specified | Draft/target tokenizer mismatch |
| Dynamic speculation | Separate smaller model | No | Adaptive; beats fixed lookahead | Fixed speculation depth |
| Graft (tree-based) | Pruned draft tree + retrieval | No | Up to 5.41×; +21.8% over EAGLE-3 | VRAM/compute overhead of deep trees |
| VIA-SD (multi-tier) | Slim verifier submodel | No | 2.5–3× over standard decoding; +10–20% over SD baselines | Binary accept/reject bottleneck |
| SimSD (diffusion LLMs) | Draft dLLM + masking strategy | No | Up to 7.46× | Bidirectional attention incompatibility |
| Bebop (RL training) | MTP head + TV loss | Pre-RL MTP training | 1.8× end-to-end RL training | MTP acceptance degradation during RL |

Speedups are as reported in the respective events; "Not specified" marks a value that is absent from the bundle.

Timeline

  1. Hugging Face ships assisted generation (speculative decoding) in Transformers

  2. Speculative decoding applied to Whisper for ~2× ASR speedup

  3. Dynamic speculation lookahead ships; Universal Assisted Generation removes tokenizer constraint

  4. LayerSkip self-speculative decoding covered; no separate draft model required

  5. Graft prune-then-retrieve tree construction beats EAGLE-3 by up to 21.8% on Qwen3-235B

  6. SimSD extends speculative decoding to diffusion LLMs (7.46× throughput); Bebop achieves 1.8× RL training speedup with MTP + TV loss

Related topics

Hugging Face · Hugging Face Transformers · Assisted Generation · Whisper · Graft · KV Cache · Intel · Intel Xeon · Optimum-Intel

FAQ

Does speculative decoding change the model's output distribution?

No. The accept/reject rule, combined with resampling a corrected token at the first rejection, makes the output distribution identical to the target model's. Under greedy decoding, the output matches what the target alone would produce, token for token.

When does speculative decoding NOT help?

When the draft model's acceptance rate is low (e.g., highly entropic or adversarial prompts), the overhead of drafting and rejecting can negate the gains — a problem Bebop specifically addresses during RL training.

Do the draft and target models need to share a tokenizer?

Not anymore — Universal Assisted Generation from Hugging Face removes that constraint, enabling any smaller model to serve as a drafter regardless of vocabulary differences.

Can speculative decoding be used without a second model?

Yes — LayerSkip uses early-exit layers of the target model itself as the drafter, eliminating the memory cost of a separate draft model.

Does it work for non-autoregressive models like diffusion LLMs?

SimSD shows it can: a plug-and-play masking strategy adapts the mechanism for bidirectional-attention diffusion LLMs, achieving up to 7.46× throughput improvement.


More on speculative decoding (6)

Hugging Face Blog · 1mo ago

Speculative Decoding for 2x Faster Whisper Inference

Hugging Face demonstrates applying speculative decoding to OpenAI's Whisper speech recognition model, achieving approximately 2x inference speedup. The technique uses a smaller draft model to propose token sequences that the larger target model then verifies, reducing the number of full forward passes required. This post covers implementation details using the Hugging Face Transformers library and benchmarks the approach across different hardware configurations.

Hugging Face Blog · 1mo ago

Faster Text Generation with Self-Speculative Decoding via LayerSkip

This Hugging Face blog post covers LayerSkip, a self-speculative decoding technique that accelerates text generation by using early exit from transformer layers to draft tokens, then verifying them with the full model. Unlike standard speculative decoding, LayerSkip requires no separate draft model, reducing memory overhead while still achieving inference speedups. The post likely covers integration with the Hugging Face ecosystem and practical performance benchmarks.

Hugging Face Blog · 1mo ago

Assisted Generation: a new direction toward low-latency text generation

Hugging Face introduces assisted generation (speculative decoding) as a practical technique for reducing LLM inference latency. The approach uses a smaller draft model to propose token candidates that a larger model then verifies in parallel, enabling multiple tokens to be accepted per forward pass. The blog post explains the mechanism and demonstrates integration into the Hugging Face Transformers library.

arXiv · cs.AI · 1mo ago

Graft: Hybrid Tree Construction for Speculative Decoding via Prune-Then-Retrieve

Graft is a training-free framework that improves speculative decoding by coupling dynamic-depth pruning with retrieval-based token compensation. Pruning reduces VRAM and compute overhead while freeing budget for retrieval, which fills topological gaps in the draft tree with near-zero additional cost. On short-context benchmarks, Graft achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on Qwen3-235B. The method is evaluated across short- and long-context settings and extended to block-drafting paradigms.

arXiv · cs.AI · 18d ago

SimSD: Speculative Decoding Adapted for Diffusion Language Models

SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.

arXiv · cs.CL · 9d ago

VIA-SD: Multi-tier speculative decoding via intra-model routing cuts rejection rates and boosts inference speed

VIA-SD introduces a three-tier verification framework for speculative decoding that routes draft tokens to a lightweight 'slim verifier' submodel for medium-confidence cases, reserving full-model verification only for uncertain tokens. Across four tasks and multiple model families, the method reduces rejection rates by 0.10–0.22 and achieves 10–20% speedups over strong speculative decoding baselines, with 2.5–3x acceleration over standard decoding. The approach is compatible with existing speculative decoding frameworks without retraining. The work proposes multi-tier speculative decoding as a general paradigm for scalable LLM inference.