Almanac
5 · arXiv · cs.AI (Artificial Intelligence) · 1mo ago

CARV: Compute-Aware Variance Reduction for Diffusion Teacher Gradient Estimation

CARV is a hierarchical Monte Carlo estimation framework that reduces gradient variance when frozen pretrained diffusion models serve as teachers in downstream pipelines such as text-to-3D distillation and data attribution. The approach amortizes expensive upstream computation (rendering, simulation, encoding) over cheap diffusion-noise resamples, and adds timestep importance sampling with a stratified inverse-CDF construction. In text-to-3D experiments, CARV delivers 2–3× effective compute multipliers; in single-step distillation, it cuts gradient variance by an order of magnitude but does not improve FID, indicating that Monte Carlo variance is not the bottleneck in that regime.
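The amortization idea in this summary can be illustrated with a small NumPy sketch. This is not the CARV implementation: the function names, the scalar stand-in for a gradient, and the choice of a uniform-over-timesteps target are all assumptions made for illustration. It shows one expensive evaluation being reused across several cheap (timestep, noise) draws, with timesteps drawn by stratified inverse-CDF sampling and reweighted so the estimate stays unbiased for the uniform-timestep average.

```python
import numpy as np

def stratified_inverse_cdf(weights, n, rng):
    """Draw n indices from the discrete distribution proportional to `weights`,
    using one uniform draw inside each of n equal-width strata of [0, 1)."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    cdf = np.cumsum(p)
    u = (np.arange(n) + rng.random(n)) / n        # one point per stratum
    idx = np.searchsorted(cdf, u, side="right")   # invert the CDF
    return np.minimum(idx, len(p) - 1), p         # clamp guards float round-off

def nested_estimate(expensive_fn, cheap_fn, n_outer, n_inner, weights, rng):
    """Hypothetical two-level estimator: each expensive sample (e.g. a render)
    is computed once and shared by n_inner cheap timestep/noise resamples."""
    T = len(weights)
    total = 0.0
    for _ in range(n_outer):
        x = expensive_fn(rng)                     # expensive upstream step, done once
        t, p = stratified_inverse_cdf(weights, n_inner, rng)
        eps = rng.standard_normal(n_inner)        # cheap diffusion-noise resamples
        w = (1.0 / T) / p[t]                      # importance weights: uniform / proposal
        total += np.mean(w * cheap_fn(x, t, eps))
    return total / n_outer
```

With `weights` set to all ones the importance weights are exactly 1 and the sketch reduces to plain stratified sampling over timesteps; a non-uniform `weights` vector concentrates inner samples on the timesteps assumed to contribute the most variance.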

Related events (8)

5 · arXiv · cs.LG · 3d ago

Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control

Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows a 17% improvement in episode reward and a 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and a 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.

4 · Hugging Face Blog · 1mo ago

Remote VAEs for Decoding with Hugging Face Inference Endpoints

Hugging Face introduces Remote VAEs, a feature for Inference Endpoints that offloads the VAE decoding step of diffusion models to a separate remote service. This approach reduces GPU memory pressure on the primary inference host by decoupling the computationally expensive decoding stage. The pattern is relevant for large latent diffusion models where VAE decoding can be a significant memory and compute bottleneck.

6 · arXiv · cs.AI · 22d ago

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA applies Multi-Head Latent Attention (MLA) to causal video diffusion, replacing per-head keys and values with a shared low-rank content latent and a decoupled 3D-RoPE positional key, achieving a 92.7% reduction in per-token KV memory. The paper investigates why MLA works even though pretrained video attention is not low-rank, contrary to the spectral assumption that motivates MLA in LLMs, and finds that the MLA bottleneck itself, rather than the pretrained spectrum, determines effective rank. On VBench, VideoMLA matches short-horizon baselines, achieves the best overall score at long horizons, and delivers a 1.23× throughput improvement on a single NVIDIA B200 GPU.

5 · The Batch · 17d ago

Apple researchers propose Feature Auto-Encoder to speed diffusion training via compressed DINOv2 embeddings

Researchers at Apple introduced Feature Auto-Encoder (FAE), a latent diffusion image generator that compresses DINOv2 vision encoder embeddings before learning to denoise them, then expands them back for decoding. The approach achieves comparable image quality to state-of-the-art diffusion models while training roughly 7x faster on ImageNet class-conditional generation. The key insight is that shrinking semantically rich vision embeddings reduces compute during diffusion training without sacrificing the representational benefits of large pretrained encoders.

6 · arXiv · cs.LG · 1mo ago

SURGE: Approximation-free Training-Free Particle Filter for Diffusion Surrogate

The paper introduces URGE (Unbiased Resampling via Girsanov Estimation), a derivative-free inference-time scaling algorithm for diffusion models that performs path-wise importance reweighting using a Girsanov change of measure. Unlike existing inference-time guidance methods, URGE requires no score, Hessian, or PDE evaluations; it attaches multiplicative weights to simulated trajectories and periodically resamples them. The authors establish a theoretical equivalence between path-wise and particle-wise sequential Monte Carlo (SMC), guaranteeing unbiased terminal distributions. Empirically, URGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks while being simpler to implement.
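The weight-then-resample loop described here follows the general pattern of sequential Monte Carlo, which can be sketched as below. This is a generic illustration, not the URGE algorithm: the summary does not give the Girsanov weight formula, so `log_increment` is a placeholder for whatever per-step log-weight the method computes, and the effective-sample-size trigger and systematic resampling scheme are assumptions chosen because they are common defaults.

```python
import numpy as np

def normalized_weights(log_w):
    """Convert log-weights to normalized weights in a numerically stable way."""
    w = np.exp(log_w - np.max(log_w))
    return w / w.sum()

def effective_sample_size(log_w):
    """ESS = 1 / sum(w_i^2); equals n for equal weights, 1 if one particle dominates."""
    w = normalized_weights(log_w)
    return 1.0 / np.sum(w ** 2)

def systematic_resample(log_w, rng):
    """Pick n ancestor indices using a single uniform offset shared by n strata."""
    n = len(log_w)
    cdf = np.cumsum(normalized_weights(log_w))
    positions = (rng.random() + np.arange(n)) / n
    return np.minimum(np.searchsorted(cdf, positions, side="right"), n - 1)

def reweight_and_resample(particles, log_w, log_increment, rng, ess_frac=0.5):
    """Multiply weights by exp(log_increment); resample when ESS drops too low."""
    log_w = log_w + log_increment
    if effective_sample_size(log_w) < ess_frac * len(log_w):
        idx = systematic_resample(log_w, rng)
        particles = particles[idx]          # duplicate high-weight trajectories
        log_w = np.zeros(len(log_w))        # weights reset to equal after resampling
    return particles, log_w
```

Working in log space keeps the multiplicative updates stable over many steps, and resampling only when the effective sample size falls below a threshold avoids discarding diversity when the weights are still close to uniform.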

7 · arXiv · cs.LG · 19d ago

RayDer: Scalable Self-Supervised Novel View Synthesis via Unified Feed-Forward Transformer

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone for self-supervised novel view synthesis (NVS). By treating dynamic content as a nuisance factor absorbed by a minimal dynamic state, it enables stable training on unconstrained real-world video without requiring dynamic-scene reconstruction. The model exhibits clean power-law scaling with both data and compute across multiple model sizes, and achieves zero-shot open-set performance competitive with supervised state-of-the-art methods on multiple benchmarks.

5 · arXiv · cs.CL · 24d ago

DIVE: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

DIVE is a frozen-backbone distillation framework that addresses a fundamental limitation in token-level in-context vector distillation: uniform cross-entropy supervision treats all output tokens equally, yet long-form outputs such as medical reports are dominated by low-information template tokens, so diagnostically critical tokens receive insufficient gradient signal. The method introduces decisive-token supervision (upweighting pathology-related tokens and EOS events) and state-conditioned dynamic steering (hidden-state-dependent adapters replacing fixed residuals) to correct supervision imbalance and autoregressive drift. Evaluated on MIMIC-CXR and CheXpert Plus with two medical VLM backbones, DIVE achieves the best BLEU-4, ROUGE-L, and RadGraph F1 across all dataset-backbone combinations while remaining competitive on CheXbert F1.
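The contrast between uniform and decisive-token supervision can be made concrete with a weighted cross-entropy sketch. This is an assumption-laden illustration rather than DIVE's actual loss: the function name, the boolean `decisive_mask`, and the single scalar `decisive_weight` are hypothetical, and the summary does not say how the paper selects or weights decisive tokens.

```python
import numpy as np

def weighted_token_ce(logits, targets, decisive_mask, decisive_weight=3.0):
    """Cross-entropy over a token sequence in which tokens flagged as decisive
    (e.g. pathology terms, EOS) count `decisive_weight` times as much as others."""
    logits = np.asarray(logits, dtype=float)        # shape (seq_len, vocab)
    targets = np.asarray(targets)                   # shape (seq_len,)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    weights = np.where(np.asarray(decisive_mask), decisive_weight, 1.0)
    # Normalizing by the weight sum keeps the loss scale comparable to plain CE.
    return float(np.sum(weights * nll) / np.sum(weights))
```

Setting `decisive_weight=1.0` recovers ordinary mean cross-entropy, which makes it easy to compare the two supervision schemes on the same batch.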

6 · arXiv · cs.AI · 25d ago

Channel-wise Vector Quantization (CVQ): A New Image Tokenization Paradigm with Next-Channel Prediction

Researchers introduce Channel-wise Vector Quantization (CVQ), which replaces conventional patch-wise discrete tokens with channel-wise tokens that represent an image as discrete levels of visual detail. Built on CVQ, the Channel-wise Autoregressive (CAR) model uses a 'next-channel prediction' objective, generating images by progressively refining from global structure to fine-grained attributes. CVQ achieves 100% codebook utilization with a 16K+ codebook, and the CAR model scores 86.7 on DPG and 0.79 on GenEval for text-to-image generation. The approach offers a structural alternative to raster-order, patch-based autoregressive image generation.