Almanac
Concept guide · Beginner

Speculative Decoding: Making AI Faster Without Changing the Answer

TL;DR: Speculative decoding makes AI text generation faster without changing the quality of the output. A small, cheap model guesses ahead, then the full model checks those guesses in one efficient step, accepting the right ones and discarding the rest. The technique has spread from large language models to speech recognition and diffusion language models, and researchers keep finding new ways to push it further.

Key takeaways

  • The core trick: a small "draft" model proposes several tokens at once; the big model verifies them all in a single forward pass, accepting correct guesses and skipping wasted work.
  • Applied to Whisper speech recognition, speculative decoding achieved roughly 2x inference speedup with no quality loss.
  • LayerSkip removes the need for a separate draft model entirely — it drafts using early exits from the main model's own layers, saving memory.
  • SimSD extended speculative decoding to diffusion language models, a class that previously couldn't use it, reaching up to 7.46x throughput improvement.
  • Hugging Face's Universal Assisted Generation removed the requirement that the draft and target models share the same vocabulary, opening the technique to far more model pairings.
  • Dynamic speculation adjusts how many tokens the draft model guesses at runtime, squeezing out more speed than a fixed lookahead.

What speculative decoding is

Imagine you're proofreading a long document. Instead of reading every word yourself from scratch, a fast assistant pre-reads it and highlights the parts they're confident about. You glance over those sections in bulk, accept the ones that look right, and only slow down where the assistant was uncertain. You finish much faster, and the result is exactly what you would have produced on your own.

That's speculative decoding in a nutshell. A small, fast "draft" model proposes several tokens (words or word-pieces) ahead. The large, accurate "target" model then checks all those proposals in a single step — accepting the ones it agrees with and stopping at the first one it doesn't. Because the big model can verify a batch of guesses in roughly the same time it would take to generate just one token on its own, you get multiple tokens of output for the price of one verification pass.

With standard verification, the output matches what the big model would have produced alone: token for token under greedy decoding, and in distribution when tokens are sampled. There's no quality tradeoff, only a speed gain.
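Under greedy decoding the match is immediate, because every kept token is the target model's own choice. When tokens are sampled, the guarantee concerns the output distribution and depends on the accept/reject rule. The sketch below assumes the standard rule from the speculative sampling literature: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q. It computes the exact distribution of the emitted token. The function name and the probabilities are made up for illustration.

```python
from typing import Dict

Dist = Dict[str, float]


def one_step_output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the token emitted by one accept/reject step.

    p: target model's next-token probabilities
    q: draft model's next-token probabilities
    """
    vocab = sorted(set(p) | set(q))
    # The draft proposes x with probability q(x) and it is accepted with
    # probability min(1, p(x)/q(x)), so the accepted mass is min(p(x), q(x)).
    accepted = {x: min(p.get(x, 0.0), q.get(x, 0.0)) for x in vocab}
    rejected_mass = 1.0 - sum(accepted.values())
    # On rejection, resample from the normalized positive part of p - q.
    residual = {x: max(0.0, p.get(x, 0.0) - q.get(x, 0.0)) for x in vocab}
    norm = sum(residual.values())
    out: Dist = {}
    for x in vocab:
        resampled = rejected_mass * residual[x] / norm if norm > 0 else 0.0
        out[x] = accepted[x] + resampled
    return out


target_probs = {"cat": 0.6, "dog": 0.3, "eel": 0.1}  # hypothetical values
draft_probs = {"cat": 0.3, "dog": 0.5, "eel": 0.2}   # hypothetical values
result = one_step_output_distribution(target_probs, draft_probs)
```

Up to floating-point rounding, `result` equals `target_probs`: the draft's preferences cancel out, which is why a poor draft model costs speed but not quality.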

Why it matters

Running large AI models is expensive and slow. Every word a chatbot types, every line of code it suggests, every sentence it transcribes from audio requires a full trip through billions of parameters. Speculative decoding attacks this bottleneck directly, without replacing the large model with a smaller, less capable one.

Hugging Face first integrated the technique into its Transformers library in 2023, making it accessible to anyone using open-weight models. Shortly after, the same approach was applied to OpenAI's Whisper speech recognition model, delivering roughly 2x faster transcription with no change in accuracy.

How it works (the plain version)

  1. Draft: A small, cheap model generates a short sequence of candidate tokens (say, five words it thinks come next).
  2. Verify: The big model looks at all five candidates at once in a single forward pass.
  3. Accept or reject: Any candidate the big model agrees with gets accepted. The first disagreement triggers a stop, and the big model supplies the correct token there.
  4. Repeat: The process starts again from wherever it left off.
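The four steps above can be sketched with toy stand-ins for the two models. This is a minimal illustration under stated assumptions, not a production implementation: `make_toy_model` is a hypothetical helper that replays a fixed sentence, decoding is greedy, and the "single forward pass" is simulated by counting one pass per verification round.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[str]], str]  # greedy toy model: context -> next token


def make_toy_model(sentence: str) -> Model:
    """Hypothetical stand-in for a model: replays a fixed sentence by position."""
    words = sentence.split()

    def next_token(context: List[str]) -> str:
        return words[len(context)] if len(context) < len(words) else "<eos>"

    return next_token


def greedy_decode(target: Model, prompt: List[str],
                  max_new_tokens: int = 32) -> Tuple[List[str], int]:
    """Baseline: one target pass per generated token."""
    tokens, passes = list(prompt), 0
    for _ in range(max_new_tokens):
        passes += 1
        token = target(tokens)
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens, passes


def speculative_decode(target: Model, draft: Model, prompt: List[str],
                       lookahead: int = 4,
                       max_new_tokens: int = 32) -> Tuple[List[str], int]:
    tokens, passes = list(prompt), 0
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit and "<eos>" not in tokens:
        # 1. Draft: the cheap model proposes `lookahead` tokens.
        proposal: List[str] = []
        for _ in range(lookahead):
            proposal.append(draft(tokens + proposal))
        # 2. Verify: a real target scores every position in one forward
        #    pass; the toy calls it per position but counts a single pass.
        passes += 1
        kept: List[str] = []
        for guess in proposal:
            expected = target(tokens + kept)
            kept.append(expected)  # always the target's own token
            if guess != expected:
                break              # 3. first disagreement: stop here
        else:
            # Every guess matched, so the same pass yields one extra token.
            kept.append(target(tokens + kept))
        tokens.extend(kept)        # 4. repeat from the new position
    if "<eos>" in tokens:
        tokens = tokens[: tokens.index("<eos>")]
    return tokens[:limit], passes


target = make_toy_model("the cat sat on the mat today")
draft = make_toy_model("the cat sat on a mat today")  # one wrong guess
baseline_tokens, baseline_passes = greedy_decode(target, ["the"])
spec_tokens, spec_passes = speculative_decode(target, draft, ["the"])
```

Both decoders return the same tokens; the difference is in how many target passes each one needed, which you can inspect via `baseline_passes` and `spec_passes`.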

The more often the draft model guesses correctly, the bigger the speedup. A good draft model on a predictable task can achieve acceptance rates as high as 95%, according to research on the Bebop framework.
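To see how acceptance rate turns into speed, here is a rough back-of-the-envelope model. It rests on a simplifying assumption: each draft token is accepted independently with the same probability, which real workloads only approximate. The function name is illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float, lookahead: int) -> float:
    """Expected tokens produced per target verification pass.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`. The count includes the one token the target always
    contributes (the correction, or the bonus token when all guesses match).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    if acceptance_rate == 1.0:
        return float(lookahead + 1)
    # Geometric series: 1 + a + a^2 + ... + a^lookahead
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)


for rate in (0.5, 0.8, 0.95):
    print(rate, round(expected_tokens_per_pass(rate, lookahead=5), 2))
```

Under this model, a draft that is right half the time yields just under two tokens per pass with a lookahead of five, while a 95% acceptance rate yields more than five. The real speedup is lower than these ratios because drafting itself takes time.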

Variants: one idea, many shapes

The core idea has branched in several directions:

Self-speculative decoding (LayerSkip) eliminates the need for a separate draft model entirely. Instead, the main model "exits early" from its own layers to produce a draft, then completes the full pass to verify. This saves memory — you only load one model — while still achieving meaningful speedups.

Universal Assisted Generation, introduced by Hugging Face, removed a practical barrier: previously, the draft and target models had to share the same vocabulary. Now a smaller model can serve as a draft regardless of which tokenizer it uses, opening the technique to far more combinations of open-weight models.
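In the Transformers library, assisted generation is requested by passing a draft model to `generate`. The sketch below assumes a recent Transformers release in which `generate` accepts `assistant_model`, `tokenizer`, and `assistant_tokenizer`; argument names may differ across versions, so check the documentation for the one you have installed. The function name and the checkpoint parameters are placeholders, not recommendations.

```python
def generate_with_assistant(target_name: str, draft_name: str, prompt: str,
                            max_new_tokens: int = 64) -> str:
    """Assisted generation with a draft model that may use another tokenizer.

    `target_name` and `draft_name` are placeholders for checkpoint ids
    of your choosing.
    """
    # Imported lazily so that defining this function does not require
    # the library to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_name)
    assistant_tokenizer = AutoTokenizer.from_pretrained(draft_name)
    model = AutoModelForCausalLM.from_pretrained(target_name)
    assistant_model = AutoModelForCausalLM.from_pretrained(draft_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        assistant_model=assistant_model,
        # Supplying both tokenizers is what allows mismatched vocabularies.
        tokenizer=tokenizer,
        assistant_tokenizer=assistant_tokenizer,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```

When the two models already share a vocabulary, passing `assistant_model` alone is the classic assisted-generation setup.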

Dynamic speculation goes one step further by adjusting at runtime how many tokens the draft model attempts before verification. Fixed lookahead wastes time when the draft is likely to be wrong; dynamic lookahead tunes itself to the difficulty of each moment in the generation.
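Two simple ways to make the lookahead dynamic are sketched below. Both are illustrative heuristics with hypothetical names and thresholds, not the schedule any particular library is confirmed to use: one stops drafting when the draft model's own confidence drops, the other resizes the lookahead based on how the previous round went.

```python
from typing import List, Tuple


def draft_until_unsure(candidates: List[Tuple[str, float]],
                       threshold: float = 0.4,
                       max_tokens: int = 8) -> List[str]:
    """Keep draft tokens while the draft's confidence stays above threshold.

    `candidates` is a list of (token, draft_confidence) pairs in order.
    """
    kept: List[str] = []
    for token, confidence in candidates[:max_tokens]:
        if confidence < threshold:
            break  # hand over to the target model early
        kept.append(token)
    return kept


def next_lookahead(current: int, num_accepted: int,
                   minimum: int = 1, maximum: int = 16) -> int:
    """Grow the lookahead after a fully accepted round, shrink it otherwise."""
    if num_accepted >= current:
        return min(maximum, current + 2)
    return max(minimum, current - 1)


easy_stretch = [("of", 0.97), ("the", 0.95), ("United", 0.91), ("States", 0.99)]
hard_stretch = [("was", 0.72), ("seventeen", 0.18), ("years", 0.80)]
print(draft_until_unsure(easy_stretch))  # all four tokens are kept
print(draft_until_unsure(hard_stretch))  # stops before the 0.18 guess
```

The design trade-off is the same in both: a longer draft pays off only when the guesses are likely to be accepted, so the lookahead should shrink where the text is hard to predict.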

Multi-tier verification (VIA-SD) adds a middle layer: a "slim verifier" submodel handles medium-confidence draft tokens, reserving the full model only for the genuinely uncertain ones. This reduces the number of expensive full-model calls and achieves 2.5–3x acceleration over standard decoding.
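The routing idea can be pictured as a small decision function. This sketch is only an illustration of tiered routing and not the VIA-SD implementation: the thresholds are invented, and treating the top tier as "accept the draft token directly" is an assumption, since the summary here describes only the medium and uncertain tiers.

```python
from collections import Counter
from typing import List


def route_draft_token(confidence: float,
                      high: float = 0.9, low: float = 0.5) -> str:
    """Pick a verification tier from the draft's confidence (hypothetical cutoffs)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return "accept_draft"   # assumed top tier: no extra verification
    if confidence >= low:
        return "slim_verifier"  # medium confidence: lightweight submodel
    return "full_model"         # uncertain: full verification


def tier_counts(confidences: List[float]) -> Counter:
    """Count how many draft tokens land in each tier."""
    return Counter(route_draft_token(c) for c in confidences)


draft_confidences = [0.98, 0.95, 0.70, 0.62, 0.30, 0.93]  # made-up values
counts = tier_counts(draft_confidences)
```

The fewer tokens that end up in the `full_model` tier, the fewer expensive calls are needed, which is where the additional speed would come from.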

Tree-based drafting (Graft) builds a branching tree of possible continuations rather than a single sequence, then prunes low-probability branches and fills gaps using retrieval. On large models like Qwen3-235B, this approach improved average speedup by up to 21.8% over a strong prior method called EAGLE-3.
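Why a tree helps can be shown with a toy verifier. The sketch below covers only the verification side under greedy decoding, with a hypothetical nested-dict tree and a toy target model. It does not implement Graft's pruning or retrieval, and a real system would score all branches in one forward pass rather than walking the tree.

```python
from typing import Callable, Dict, List

Tree = Dict[str, "Tree"]            # token -> subtree of possible continuations
Model = Callable[[List[str]], str]  # greedy toy model: context -> next token


def longest_accepted_path(tree: Tree, target: Model,
                          context: List[str]) -> List[str]:
    """Follow the branch that matches the target's greedy choice at each step."""
    accepted: List[str] = []
    node = tree
    while node:
        expected = target(context + accepted)
        if expected not in node:
            break  # no drafted branch matches; the target takes over here
        accepted.append(expected)
        node = node[expected]
    return accepted


def toy_target(context: List[str]) -> str:
    """Hypothetical target model that always writes the same sentence."""
    words = "the dog ran away".split()
    return words[len(context)] if len(context) < len(words) else "<eos>"


draft_tree: Tree = {
    "the": {"cat": {"sat": {}}, "dog": {"ran": {}}},
    "a": {"cat": {}},
}
path = longest_accepted_path(draft_tree, toy_target, context=[])
```

A single-sequence draft of "the cat sat" would have had only its first token accepted. Keeping the "dog" branch alive lets three tokens through in the same round.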

Beyond text: speech and diffusion models

Speculative decoding was originally designed for autoregressive text models — ones that generate one token at a time, left to right. Researchers have since extended it to other architectures.

For speech recognition, the Whisper demo showed the technique transfers cleanly: a smaller Whisper model drafts transcription tokens, the full model verifies, and the result is about twice as fast.

For diffusion language models, a newer class of AI that generates text by iteratively refining a masked sequence rather than predicting left to right, standard token-level speculative decoding did not apply, because the architecture attends to tokens bidirectionally. SimSD addresses this with a masking strategy that adds reference tokens from a draft model plus a custom attention mask, so the drafted tokens can be checked in a single forward pass. On SDAR-family models across four benchmarks, it reports up to 7.46x decoding throughput while maintaining or improving generation quality.

Where it's heading

Speculative decoding has moved from a research curiosity to a standard inference optimization, shipping in production libraries and running on hardware ranging from Intel Xeon CPUs to Intel Gaudi accelerators to consumer laptops. The active research frontier is about squeezing more acceptance rate out of the draft step — through better draft models, smarter trees, retrieval augmentation, and tighter integration with reinforcement learning training pipelines. The technique is no longer just a trick for saving time; it's becoming infrastructure.

Speculative decoding variants at a glance

| Variant | Draft source | Key benefit | Notable result |
|---|---|---|---|
| Standard (assisted generation) | Separate smaller model | Multiple tokens per pass | Practical LLM speedup |
| LayerSkip (self-speculative) | Early exit from main model's own layers | No extra model in memory | Reduced memory overhead |
| Universal Assisted Generation | Any smaller model, any vocabulary | Works across mismatched tokenizers | Broader open-weights compatibility |
| SimSD | Draft model + custom attention mask | Works on diffusion LLMs | Up to 7.46x throughput |
| Graft | Pruned tree + retrieval fill | Better draft tree quality | Up to 5.41x speedup, +21.8% over EAGLE-3 |
| VIA-SD (multi-tier) | Slim verifier submodel for mid-confidence tokens | Fewer full-model verification calls | 2.5–3x over standard decoding |

Timeline

  1. Hugging Face introduces assisted generation (speculative decoding) in Transformers

  2. Speculative decoding applied to Whisper for ~2x speech recognition speedup

  3. Universal Assisted Generation removes tokenizer-matching requirement

  4. LayerSkip self-speculative decoding: draft from early exits, no separate model

  5. SimSD brings speculative decoding to diffusion language models (up to 7.46x)

Related topics

Hugging Face · Hugging Face Transformers · Whisper · Assisted Generation · Graft · KV Cache · Intel · Optimum-Intel

FAQ

Does speculative decoding change what the AI says?

No. With standard verification, only tokens the full model would have chosen anyway are accepted, so the output matches what you would get from running the big model alone.

Do I need a second, separate model to use this?

Not always. LayerSkip drafts using the main model's own early layers, and Universal Assisted Generation lets you pair any smaller model even if it has a different vocabulary.

Where does the speedup actually come from?

Large AI models are slow because each token requires a full pass through billions of parameters. Speculative decoding lets the big model check several draft tokens in one pass, so you get multiple tokens for the cost of one verification step.

Is this only for text generation?

No — it has been applied to speech recognition (Whisper) and, more recently, to diffusion-based language models, with speedups reported in both cases.

More on speculative decoding (6)

Hugging Face Blog · 1mo ago

Speculative Decoding for 2x Faster Whisper Inference

Hugging Face demonstrates applying speculative decoding to OpenAI's Whisper speech recognition model, achieving approximately 2x inference speedup. The technique uses a smaller draft model to propose token sequences that the larger target model then verifies, reducing the number of full forward passes required. This post covers implementation details using the Hugging Face Transformers library and benchmarks the approach across different hardware configurations.

Hugging Face Blog · 1mo ago

Faster Text Generation with Self-Speculative Decoding via LayerSkip

This Hugging Face blog post covers LayerSkip, a self-speculative decoding technique that accelerates text generation by using early exit from transformer layers to draft tokens, then verifying them with the full model. Unlike standard speculative decoding, LayerSkip requires no separate draft model, reducing memory overhead while still achieving inference speedups. The post likely covers integration with the Hugging Face ecosystem and practical performance benchmarks.

Hugging Face Blog · 1mo ago

Assisted Generation: a new direction toward low-latency text generation

Hugging Face introduces assisted generation (speculative decoding) as a practical technique for reducing LLM inference latency. The approach uses a smaller draft model to propose token candidates that a larger model then verifies in parallel, enabling multiple tokens to be accepted per forward pass. The blog post explains the mechanism and demonstrates integration into the Hugging Face Transformers library.

arXiv · cs.AI · 1mo ago

Graft: Hybrid Tree Construction for Speculative Decoding via Prune-Then-Retrieve

Graft is a training-free framework that improves speculative decoding by coupling dynamic-depth pruning with retrieval-based token compensation. Pruning reduces VRAM and compute overhead while freeing budget for retrieval, which fills topological gaps in the draft tree with near-zero additional cost. On short-context benchmarks, Graft achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on Qwen3-235B. The method is evaluated across short- and long-context settings and extended to block-drafting paradigms.

arXiv · cs.AI · 18d ago

SimSD: Speculative Decoding Adapted for Diffusion Language Models

SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.

arXiv · cs.CL · 9d ago

VIA-SD: Multi-tier speculative decoding via intra-model routing cuts rejection rates and boosts inference speed

VIA-SD introduces a three-tier verification framework for speculative decoding that routes draft tokens to a lightweight 'slim verifier' submodel for medium-confidence cases, reserving full-model verification only for uncertain tokens. Across four tasks and multiple model families, the method reduces rejection rates by 0.10–0.22 and achieves 10–20% speedups over strong speculative decoding baselines, with 2.5–3x acceleration over standard decoding. The approach is compatible with existing speculative decoding frameworks without retraining. The work proposes multi-tier speculative decoding as a general paradigm for scalable LLM inference.