What speculative decoding is
Imagine you're proofreading a long document. Instead of reading every word yourself from scratch, a fast assistant pre-reads it and highlights the parts they're confident about. You glance over those sections in bulk, accept the ones that look right, and only slow down where the assistant was uncertain. You finish much faster, and the result is exactly what you would have produced on your own.
That's speculative decoding in a nutshell. A small, fast "draft" model proposes several tokens (words or word-pieces) ahead. The large, accurate "target" model then checks all those proposals in a single step — accepting the ones it agrees with and stopping at the first one it doesn't. Because the big model can verify a batch of guesses in roughly the same time it would take to generate just one token on its own, you get multiple tokens of output for the price of one verification pass.
With greedy decoding, the output is token-for-token identical to what the big model would have produced alone; with sampling, it follows exactly the same probability distribution. There's no quality tradeoff, only a speed gain.
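When tokens are sampled rather than picked greedily, "agreeing" with a draft token is usually implemented as a rejection-sampling rule, and that rule is what keeps the output distribution unchanged. The Python sketch below shows that rule for a single token. The function names and the dictionary-based distributions are illustrative assumptions, not taken from any particular library.

```python
import random
from typing import Dict

Dist = Dict[str, float]  # token -> probability (hypothetical toy representation)


def residual_distribution(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q): where the target wants more mass than the draft gave."""
    raw = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(raw.values())
    # total == 0 only when p and q coincide, in which case rejection never happens.
    return {t: v / total for t, v in raw.items()} if total > 0 else dict(p)


def accept_or_resample(token: str, p: Dist, q: Dist, rng: random.Random) -> str:
    """Keep a draft token (sampled from q) or replace it so the result follows p."""
    accept_prob = min(1.0, p.get(token, 0.0) / q[token])
    if rng.random() < accept_prob:
        return token  # target model "agrees" with the draft
    residual = residual_distribution(p, q)
    tokens = list(residual)
    return rng.choices(tokens, weights=[residual[t] for t in tokens])[0]
```

Under this rule, a token the target model likes at least as much as the draft does is always kept, and rejected tokens are redrawn only from the probability mass the draft under-covered.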
Why it matters
Running large AI models is expensive and slow. Every word a chatbot types, every line of code it suggests, every sentence it transcribes from audio requires a full trip through billions of parameters. Speculative decoding attacks this bottleneck directly, without requiring a smaller or less capable model to be deployed instead.
Hugging Face first integrated the technique into its Transformers library in 2023, making it accessible to anyone using open-weight models. Shortly after, the same approach was applied to OpenAI's Whisper speech recognition model, delivering roughly 2x faster transcription with no change in accuracy.
How it works (the plain version)
1. Draft: A small, cheap model generates a short sequence of candidate tokens (say, five words it thinks come next).
2. Verify: The big model looks at all five candidates at once in a single forward pass.
3. Accept or reject: Any candidate the big model agrees with gets accepted. The first disagreement triggers a stop, and the big model supplies the correct token there.
4. Repeat: The process starts again from wherever it left off.
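The four steps above can be sketched in Python for the greedy case. This is a toy under stated assumptions, not a production implementation: each "model" is a hypothetical function that maps a token list to its next token, and verification is emulated with one call per position, whereas a real target model would score all positions in a single forward pass. The sketch also includes a detail the steps leave implicit: when every draft token is accepted, the target model's pass yields one extra token for free.

```python
from typing import Callable, List

# Hypothetical interface: a greedy "model" maps a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft/verify round and return the tokens it produces."""
    # 1. Draft: the small model proposes k tokens, one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2-3. Verify and accept or reject. A real target model would check every
    # position in one forward pass; this toy calls it once per position.
    accepted: List[int] = []
    for token in proposal:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first disagreement: target supplies the fix
            return accepted
        accepted.append(token)

    # Every draft token matched, so the target's pass also gives one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def speculative_generate(prompt: List[int], draft: NextToken, target: NextToken,
                         k: int, max_new_tokens: int) -> List[int]:
    """4. Repeat rounds until enough new tokens exist."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(prompt: List[int], target: NextToken, max_new_tokens: int) -> List[int]:
    """Baseline: the target model alone, one token per call."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


# Toy models: the target counts upward modulo 10; the draft is wrong after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

With these toy models, `speculative_generate([0], toy_draft, toy_target, 4, 12)` returns the same sequence as `greedy_generate([0], toy_target, 12)`, even though the draft model is sometimes wrong. The draft only affects how many rounds are needed, never which tokens come out.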
The more often the draft model guesses correctly, the bigger the speedup. A good draft model on a predictable task can achieve acceptance rates as high as 95%, according to research on the Bebop framework.
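To get a feel for how acceptance rate translates into output per verification pass, here is a rough Python sketch. It assumes each draft token is accepted independently with the same probability, which is a simplification rather than something measured, and it counts the one token the target model contributes on every pass. The function name and parameters are illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes every draft token is accepted independently with probability
    `acceptance_rate` (an idealization). The count includes the one token
    the target model always supplies: a correction or a bonus token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)
```

Under these assumptions, a 95% acceptance rate with five draft tokens gives roughly 5.3 tokens per pass, while a 50% rate gives just under 2. This is tokens per pass, not wall-clock speedup: the time spent running the draft model still has to be subtracted.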
Variants: one idea, many shapes
The core idea has branched in several directions:
Self-speculative decoding (LayerSkip) eliminates the need for a separate draft model entirely. Instead, the main model "exits early" from its own layers to produce a draft, then completes the full pass to verify. This saves memory — you only load one model — while still achieving meaningful speedups.
Universal Assisted Generation, introduced by Hugging Face, removed a practical barrier: previously, the draft and target models had to share the same vocabulary. Now any smaller model can serve as a draft, regardless of which tokenizer it uses, opening the technique to far more combinations of open-weight models.
Dynamic speculation goes one step further by adjusting at runtime how many tokens the draft model attempts before verification. Fixed lookahead wastes time when the draft is likely to be wrong; dynamic lookahead tunes itself to the difficulty of each moment in the generation.
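One simple way to picture dynamic lookahead is to let the draft model stop proposing as soon as its own confidence drops. The Python sketch below illustrates that idea only. The draft interface, the threshold of 0.4, and the cap of 8 tokens are all hypothetical choices made for this example, not values from any specific paper or library.

```python
from typing import Callable, List, Tuple

# Hypothetical draft interface: returns (next_token, confidence in [0, 1]).
DraftStep = Callable[[List[int]], Tuple[int, float]]


def draft_until_unsure(prefix: List[int], draft_step: DraftStep,
                       threshold: float = 0.4, max_tokens: int = 8) -> List[int]:
    """Propose tokens until the draft's confidence falls below `threshold`."""
    proposal: List[int] = []
    for _ in range(max_tokens):
        token, confidence = draft_step(prefix + proposal)
        if confidence < threshold:
            break  # likely to be rejected anyway, so hand over to the target model
        proposal.append(token)
    return proposal


# Toy draft: confident while the sequence is short, unsure afterwards.
def toy_draft_step(seq: List[int]) -> Tuple[int, float]:
    return len(seq), (0.9 if len(seq) < 3 else 0.2)
```

On easy stretches the proposal grows up to `max_tokens`; on hard ones it may be empty, in which case the verification pass simply produces one token from the target model as usual.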
Multi-tier verification (VIA-SD) adds a middle layer: a "slim verifier" submodel handles medium-confidence draft tokens, reserving the full model only for the genuinely uncertain ones. This reduces the number of expensive full-model calls and achieves 2.5–3x acceleration over standard decoding.
Tree-based drafting (Graft) builds a branching tree of possible continuations rather than a single sequence, then prunes low-probability branches and fills gaps using retrieval. On large models like Qwen3-235B, this approach improved average speedup by up to 21.8% over a strong prior method called EAGLE-3.
Beyond text: speech and diffusion models
Speculative decoding was originally designed for autoregressive text models — ones that generate one token at a time, left to right. Researchers have since extended it to other architectures.
For speech recognition, the Whisper demo showed the technique transfers cleanly: a smaller Whisper model drafts transcription tokens, the full model verifies, and the result is about twice as fast.
For diffusion language models — a newer class of AI that generates text by iteratively refining a masked sequence rather than predicting left to right — standard speculative decoding didn't work at all, because the architecture processes tokens bidirectionally. SimSD solved this with a custom attention mask that lets a draft model's proposals be checked in a single forward pass, achieving up to 7.46x throughput improvement on the models tested.
Where it's heading
Speculative decoding has moved from a research curiosity to a standard inference optimization, shipping in production libraries and running on hardware ranging from Intel Xeon CPUs to Intel Gaudi accelerators to consumer laptops. The active research frontier is about squeezing more acceptance rate out of the draft step — through better draft models, smarter trees, retrieval augmentation, and tighter integration with reinforcement learning training pipelines. The technique is no longer just a trick for saving time; it's becoming infrastructure.




