Universal Assisted Generation: Faster Decoding with Any Assistant Model
Hugging Face introduces Universal Assisted Generation (UAG), a technique that extends speculative decoding to work with any assistant model regardless of tokenizer or vocabulary differences. The approach enables using smaller, mismatched draft models to accelerate inference of larger target models, removing the previous constraint that both models share the same tokenizer. This broadens the practical applicability of speculative decoding across the open-weights ecosystem.
Related guides (3)
Related events (8)
Assisted Generation: a new direction toward low-latency text generation
Hugging Face introduces assisted generation (speculative decoding) as a practical technique for reducing LLM inference latency. The approach uses a smaller draft model to propose token candidates that a larger model then verifies in parallel, enabling multiple tokens to be accepted per forward pass. The blog post explains the mechanism and demonstrates integration into the Hugging Face Transformers library.
Faster Assisted Generation with Dynamic Speculation
Hugging Face introduces dynamic speculation lookahead for assisted (speculative) decoding, a technique that adaptively adjusts the number of candidate tokens generated by a draft model before verification by the main model. This approach aims to improve throughput and reduce latency compared to fixed-lookahead speculative decoding by tuning the speculation depth at runtime. The blog post describes the method and its integration into the Hugging Face Transformers library.
Faster Assisted Generation Support for Intel Gaudi
Hugging Face has published a blog post detailing assisted generation (speculative decoding) support optimized for Intel Gaudi accelerators. The post covers implementation details and performance improvements achieved by running assisted/speculative decoding on Gaudi hardware. This represents an infrastructure and inference optimization development relevant to non-NVIDIA AI accelerator deployment.
Speculative Decoding for 2x Faster Whisper Inference
Hugging Face demonstrates applying speculative decoding to OpenAI's Whisper speech recognition model, achieving approximately 2x inference speedup. The technique uses a smaller draft model to propose token sequences that the larger target model then verifies, reducing the number of full forward passes required. This post covers implementation details using the Hugging Face Transformers library and benchmarks the approach across different hardware configurations.
Faster Text Generation with Self-Speculative Decoding via LayerSkip
This Hugging Face blog post covers LayerSkip, a self-speculative decoding technique that accelerates text generation by using early exit from transformer layers to draft tokens, then verifying them with the full model. Unlike standard speculative decoding, LayerSkip requires no separate draft model, reducing memory overhead while still achieving inference speedups. The post likely covers integration with the Hugging Face ecosystem and practical performance benchmarks.
ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models
Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
Remote VAEs for Decoding with Hugging Face Inference Endpoints
Hugging Face introduces Remote VAEs, a feature for Inference Endpoints that offloads the VAE decoding step of diffusion models to a separate remote service. This approach reduces GPU memory pressure on the primary inference host by decoupling the computationally expensive decoding stage. The pattern is relevant for large latent diffusion models where VAE decoding can be a significant memory and compute bottleneck.
Accelerating Qwen3-8B Agent on Intel Core Ultra with Depth-Pruned Draft Models
Hugging Face and Intel demonstrate speculative decoding acceleration for the Qwen3-8B model on Intel Core Ultra client hardware using depth-pruned draft models. The approach applies structured pruning to create a smaller draft model that enables speculative decoding, targeting on-device agent workloads. This work addresses inference efficiency for mid-size open-weight models on consumer-grade x86 silicon.


