Remote VAEs for Decoding with Hugging Face Inference Endpoints
Hugging Face introduces Remote VAEs, a feature for Inference Endpoints that offloads the VAE decoding step of diffusion models to a separate remote service. This approach reduces GPU memory pressure on the primary inference host by decoupling the computationally expensive decoding stage. The pattern is relevant for large latent diffusion models where VAE decoding can be a significant memory and compute bottleneck.
Related guides (3)
Related events (8)
Deploy Embedding Models with Hugging Face Inference Endpoints
Hugging Face published a guide on deploying embedding models using their Inference Endpoints service. The post covers how to set up dedicated endpoints for embedding models, enabling scalable vector generation for downstream tasks like semantic search and retrieval-augmented generation. This is part of Hugging Face's broader push to make production deployment of specialized model types more accessible.
Powerful ASR + Diarization + Speculative Decoding with Hugging Face Inference Endpoints
Hugging Face published a blog post describing a pipeline that combines automatic speech recognition (ASR), speaker diarization, and speculative decoding on their Inference Endpoints platform. The post demonstrates how these three techniques can be integrated to produce faster, speaker-attributed transcriptions. Speculative decoding is highlighted as a key inference optimization that reduces latency for ASR workloads.
RefDecoder: Reference-Conditioned Video VAE Decoder for Enhanced Visual Generation
RefDecoder addresses an architectural asymmetry in latent diffusion models where denoising networks are heavily conditioned but decoders remain unconditional, causing detail loss and inconsistency. The approach injects high-fidelity reference image signals into the VAE decoding process via reference attention, with a lightweight image encoder mapping reference frames into high-dimensional tokens co-processed at each decoder up-sampling stage. Evaluated on Inter4K, WebVid, and Large Motion benchmarks, RefDecoder achieves up to +2.1dB PSNR over unconditional baselines and improves VBench I2V scores across subject consistency, background consistency, and overall quality. The module is plug-and-play, compatible with existing video generation systems including Wan 2.1 and VideoVAE+ without additional fine-tuning.
Deploy LLMs with Hugging Face Inference Endpoints
Hugging Face published a guide on deploying large language models using their Inference Endpoints service. The post covers how to set up scalable, production-ready LLM deployments with minimal infrastructure overhead. It targets developers looking to move from experimentation to hosted inference without managing raw compute.
Running Privacy-Preserving Inferences on Hugging Face Endpoints
Hugging Face has published a blog post describing the integration of Fully Homomorphic Encryption (FHE) with its Inference Endpoints service, enabling privacy-preserving ML inference where data remains encrypted throughout computation. The approach allows clients to send encrypted inputs to a hosted model without the server ever seeing plaintext data. This represents a practical deployment of FHE-based ML, a technique that has historically been too slow for production use but is gaining traction with recent optimizations.
VQ-Diffusion: Vector Quantized Diffusion Models on Hugging Face
This Hugging Face blog post introduces VQ-Diffusion, a text-to-image generation approach that combines vector quantization with diffusion models. The method operates in a discrete latent space defined by a VQ-VAE codebook, applying the diffusion process to token sequences rather than continuous pixel or latent representations. The post likely covers integration into the Hugging Face diffusers ecosystem and demonstrates generation capabilities.
Universal Assisted Generation: Faster Decoding with Any Assistant Model
Hugging Face introduces Universal Assisted Generation (UAG), a technique that extends speculative decoding to work with any assistant model regardless of tokenizer or vocabulary differences. The approach enables using smaller, mismatched draft models to accelerate inference of larger target models, removing the previous constraint that both models share the same tokenizer. This broadens the practical applicability of speculative decoding across the open-weights ecosystem.
Bringing Serverless GPU Inference to Hugging Face Users via Cloudflare Workers AI
Hugging Face and Cloudflare have partnered to bring serverless GPU inference to Hugging Face users through Cloudflare Workers AI. The integration allows developers to run Hugging Face models on Cloudflare's global edge network without managing GPU infrastructure. This represents an expansion of serverless inference options for the Hugging Face ecosystem, lowering the barrier to deploying ML models at scale.


