5 · The Batch (DeepLearning.AI) · 17d ago

Apple researchers propose Feature Auto-Encoder to speed diffusion training via compressed DINOv2 embeddings

Researchers at Apple introduced Feature Auto-Encoder (FAE), a latent diffusion image generator that compresses embeddings from the DINOv2 vision encoder before learning to denoise them, then expands them back for decoding. The approach achieves image quality comparable to state-of-the-art diffusion models while training roughly 7x faster on ImageNet class-conditional generation. The key insight is that shrinking semantically rich vision embeddings reduces compute during diffusion training without sacrificing the representational benefits of large pretrained encoders.
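The compress, denoise, expand flow described above can be illustrated with a toy shape-bookkeeping sketch. This is not the FAE architecture: the sizes, the random linear projections, and the names `make_matrix` and `project` are hypothetical stand-ins chosen only to show where the smaller latents sit in the pipeline, and the weights here are untrained.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of Gaussian random values (untrained weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weights):
    """Multiply a (num_tokens x d_in) token list by a (d_in x d_out) matrix."""
    d_in, d_out = len(weights), len(weights[0])
    return [
        [sum(tok[i] * weights[i][j] for i in range(d_in)) for j in range(d_out)]
        for tok in tokens
    ]


rng = random.Random(0)

# Hypothetical toy sizes; real encoder embeddings are far larger.
NUM_TOKENS, EMBED_DIM, LATENT_DIM = 4, 8, 2

# Stand-in for the embeddings a pretrained vision encoder would produce.
embeddings = make_matrix(NUM_TOKENS, EMBED_DIM, rng)

# Assumed linear compress/expand maps; a trained model would learn these.
compress = make_matrix(EMBED_DIM, LATENT_DIM, rng)
expand = make_matrix(LATENT_DIM, EMBED_DIM, rng)

latents = project(embeddings, compress)  # a diffusion model would denoise these
restored = project(latents, expand)      # expanded back before image decoding

values_before = NUM_TOKENS * EMBED_DIM
values_after = NUM_TOKENS * LATENT_DIM
print(f"values per image: {values_before} -> {values_after}")
print(f"restored shape: {len(restored)} x {len(restored[0])}")
```

Because the projections are random, `restored` does not approximate `embeddings`; the sketch only shows that the diffusion stage would operate on `LATENT_DIM` values per token rather than `EMBED_DIM`, which is where the compute saving would come from.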

Training Infrastructure · Multimodal Progress · DINOv2 · Yuan Gao · MS COCO · SiT · SigLIP 2 · Jiatao Gu · CC12M · Apple · ImageNet · Feature Auto-Encoder

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Related events (8)

4 · Hugging Face Blog · 1mo ago

Faster Stable Diffusion with Core ML on iPhone, iPad, and Mac

Hugging Face published a blog post detailing optimizations for running Stable Diffusion models via Core ML on Apple devices, including iPhone, iPad, and Mac. The post covers techniques to accelerate on-device inference using Apple's Neural Engine and the Core ML framework. This represents progress in deploying capable diffusion models at the edge without cloud dependency.

Inference Economics · Multimodal Progress · Stable Diffusion 3 · Hugging Face · Core ML

6 · The Batch · 18d ago

Apple's AToken: A Unified Multimodal Tokenizer and Encoder for Images, Videos, and 3D Objects

Apple researchers introduced AToken, a transformer model with a single 4D tokenizer and encoder-decoder architecture that handles images, videos, and 3D objects in a shared token space. The model is trained to both reconstruct and classify all three media types, using a pretrained SigLIP 2 vision encoder extended to four dimensions with 4D Rotary Position Embedding. AToken approaches or matches specialized models on image classification (82.2% ImageNet), image generation (0.21 rFID), and 3D reconstruction (28.28 PSNR), while remaining competitive on video tasks. The work addresses a longstanding tension between generation-focused and classification-focused encoders by forcing embeddings to retain both fine visual detail and semantic content.

Frontier Model Releases · Multimodal Progress · FLUX.1-dev · Rotary Position Embedding (RoPE) · Jiasen Lu

5 · Hugging Face Blog · 1mo ago

Introducing Würstchen: Fast Diffusion for Image Generation

Hugging Face introduces Würstchen, a latent diffusion architecture designed for fast and efficient image generation. The model operates in a highly compressed latent space, reducing computational requirements significantly compared to standard diffusion models. It is being integrated into the Diffusers library, making it accessible for the broader community.

Open Weights Progress · Inference Economics · Hugging Face · Würstchen · latent diffusion

4 · Hugging Face Blog · 1mo ago

Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive

This Hugging Face blog post details how to accelerate Stable Diffusion Turbo and SDXL Turbo inference using ONNX Runtime and Microsoft's Olive optimization toolkit. The post covers the workflow for converting and optimizing diffusion models for faster deployment. This is a practical inference optimization guide targeting practitioners deploying image generation models.

Inference Economics · Agent and Tool Ecosystem · Stable Diffusion Turbo · SDXL Turbo · Microsoft

4 · Hugging Face Blog · 1mo ago

Optimizing Stable Diffusion for Intel CPUs with NNCF and Hugging Face Optimum

This Hugging Face blog post details techniques for optimizing Stable Diffusion inference on Intel CPUs using the Neural Network Compression Framework (NNCF) and the Optimum library. The workflow covers quantization and other compression methods to reduce latency and memory footprint on CPU hardware. This is relevant to the inference-economics and enterprise-deployment threads because it addresses running diffusion models without dedicated GPU hardware.

Inference Economics · Enterprise Deployment Patterns · Stable Diffusion 3 · Hugging Face · Hugging Face Optimum

5 · Hugging Face Blog · 1mo ago

Using Stable Diffusion with Core ML on Apple Silicon

Hugging Face published a guide on running Stable Diffusion models via Apple's Core ML framework on Apple Silicon hardware. The post covers converting diffusion model weights to Core ML format and integrating them into the Diffusers library for on-device inference. This represents an early effort to enable efficient local image generation on consumer Apple hardware without requiring cloud GPU resources.

Inference Economics · Agent and Tool Ecosystem · Hugging Face Diffusers · Stable Diffusion 3 · Hugging Face

4 · arXiv · cs.CL · 12d ago

DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast

Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving a speedup of up to 64.5%. Experiments span music and event-level benchmarks across two backbone architectures.

Multimodal Progress · DirectAudioEdit · DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

5 · Hugging Face Blog · 1mo ago

Stable Diffusion XL on Mac with Advanced Core ML Quantization

Hugging Face details the process of running Stable Diffusion XL (SDXL) on Apple Silicon Macs using Core ML with advanced quantization techniques. The post covers how quantization reduces model size and memory requirements to make SDXL feasible on consumer Mac hardware. This represents a practical deployment advance for running large diffusion models at the edge on Apple devices.

Inference Economics · Multimodal Progress · quantization · Stable Diffusion 3 · Hugging Face