Almanac
← Events
4Hugging Face Blog·1mo ago

Powerful ASR + Diarization + Speculative Decoding with Hugging Face Inference Endpoints

Hugging Face published a blog post describing a pipeline that combines automatic speech recognition (ASR), speaker diarization, and speculative decoding on their Inference Endpoints platform. The post demonstrates how these three techniques can be integrated to produce faster, speaker-attributed transcriptions. Speculative decoding is highlighted as a key inference optimization that reduces latency for ASR workloads.

Related guides (3)

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

Speculative Decoding for 2x Faster Whisper Inference

Hugging Face demonstrates applying speculative decoding to OpenAI's Whisper speech recognition model, achieving approximately 2x inference speedup. The technique uses a smaller draft model to propose token sequences that the larger target model then verifies, reducing the number of full forward passes required. This post covers implementation details using the Hugging Face Transformers library and benchmarks the approach across different hardware configurations.

5Hugging Face Blog·1mo ago·source ↗

Faster Assisted Generation with Dynamic Speculation

Hugging Face introduces dynamic speculation lookahead for assisted (speculative) decoding, a technique that adaptively adjusts the number of candidate tokens generated by a draft model before verification by the main model. This approach aims to improve throughput and reduce latency compared to fixed-lookahead speculative decoding by tuning the speculation depth at runtime. The blog post describes the method and its integration into the Hugging Face Transformers library.

4Hugging Face Blog·1mo ago·source ↗

Deploying Speech-to-Speech on Hugging Face

Hugging Face published a guide on deploying speech-to-speech (S2S) pipelines using their Inference Endpoints infrastructure. The post covers the technical setup for combining speech recognition, language model inference, and text-to-speech components into a unified real-time pipeline. This represents a practical deployment pattern for voice-based AI applications on managed cloud infrastructure.

5Hugging Face Blog·1mo ago·source ↗

Faster Text Generation with Self-Speculative Decoding via LayerSkip

This Hugging Face blog post covers LayerSkip, a self-speculative decoding technique that accelerates text generation by using early exit from transformer layers to draft tokens, then verifying them with the full model. Unlike standard speculative decoding, LayerSkip requires no separate draft model, reducing memory overhead while still achieving inference speedups. The post likely covers integration with the Hugging Face ecosystem and practical performance benchmarks.

4Hugging Face Blog·1mo ago·source ↗

Remote VAEs for Decoding with Hugging Face Inference Endpoints

Hugging Face introduces Remote VAEs, a feature for Inference Endpoints that offloads the VAE decoding step of diffusion models to a separate remote service. This approach reduces GPU memory pressure on the primary inference host by decoupling the computationally expensive decoding stage. The pattern is relevant for large latent diffusion models where VAE decoding can be a significant memory and compute bottleneck.

5Hugging Face Blog·1mo ago·source ↗

Assisted Generation: a new direction toward low-latency text generation

Hugging Face introduces assisted generation (speculative decoding) as a practical technique for reducing LLM inference latency. The approach uses a smaller draft model to propose token candidates that a larger model then verifies in parallel, enabling multiple tokens to be accepted per forward pass. The blog post explains the mechanism and demonstrates integration into the Hugging Face Transformers library.

4Github Trending·25d ago·source ↗

FunASR: Industrial-Grade Speech Recognition Toolkit with 170x Realtime Performance

FunASR is an open-source speech recognition toolkit from ModelScope supporting 50+ languages, speaker diarization, emotion detection, and streaming inference at 170x realtime speed. It exposes an OpenAI-compatible API, positioning it as a drop-in alternative for production ASR workloads. The repository has accumulated 16,317 stars with modest daily momentum (+42 today).

4Hugging Face Blog·1mo ago·source ↗

Blazingly Fast Whisper Transcriptions with Inference Endpoints

Hugging Face published a blog post detailing optimized Whisper speech-to-text transcription deployments via their Inference Endpoints service. The post covers performance improvements using faster-whisper or similar optimized backends to achieve significantly reduced transcription latency. This is positioned as a practical deployment guide for production speech recognition workloads.