5 · arXiv cs.AI (Artificial Intelligence) · 19d ago

TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation

TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics, with text-alignment improvements that scale with the number of events.
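
The summary does not say how Event-Partitioned Masking is constructed, so the following is only an illustrative sketch of the general idea of partitioning attention by event, not the authors' algorithm. It assumes the event prompts are concatenated into one token sequence, that the timeline is split evenly across events, and that each frame may attend only to the tokens of its own event. The function name event_partition_mask and its parameters are hypothetical.

```python
def event_partition_mask(num_frames, prompt_lengths):
    """Return mask[f][t]: True if frame f may attend to prompt token t.

    Assumption (not from the paper): frames are split evenly across events,
    and event prompts are concatenated in order into one token sequence.
    """
    num_events = len(prompt_lengths)
    if num_events == 0 or num_frames < num_events:
        raise ValueError("need at least one event and one frame per event")

    # Token index range [start, end) owned by each event prompt.
    spans, start = [], 0
    for length in prompt_lengths:
        spans.append((start, start + length))
        start += length
    total_tokens = start

    mask = []
    for f in range(num_frames):
        # Even split of the timeline: frame f belongs to event floor(f * K / F).
        event = f * num_events // num_frames
        lo, hi = spans[event]
        mask.append([lo <= t < hi for t in range(total_tokens)])
    return mask


# Six frames, two events whose prompts have 2 and 3 tokens.
mask = event_partition_mask(num_frames=6, prompt_lengths=[2, 3])
```

In a real DiT such a mask would be passed to the cross-attention layers. Per the summary, steering is tied to turning points in the denoising trajectory, which this sketch does not model.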

Evaluation and Benchmarking · Inference Economics · Multimodal Progress · Meve · TunerDiT · Event-Partitioned Masking · Cross-Event Prompt Fusion · Linear Diffusion Transformer

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 12d ago

DirectAudioEdit: Training-free, inversion-free text-guided audio editing via diffusion prediction contrast

Researchers introduce DirectAudioEdit, the first training-free and inversion-free method for text-guided audio editing using diffusion denoising dynamics. The approach constructs a source-to-target editing path without requiring DDPM inversion, reducing macro-averaged FAD and KL divergence by ~16% compared to inversion-based baselines while achieving up to 64.5% speedup. Experiments span music and event-level benchmarks across two backbone architectures.

Multimodal Progress · DirectAudioEdit · DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

4 · Hugging Face Blog · 1mo ago

Instruction-tuning Stable Diffusion with InstructPix2Pix

This Hugging Face blog post describes a methodology for instruction-tuning Stable Diffusion using the InstructPix2Pix framework, enabling image editing via natural language instructions. The approach adapts techniques from language model instruction-tuning to the image generation domain. The post covers dataset construction, training procedures, and evaluation of the resulting models.

Alignment and RLHF · Multimodal Progress · Stable Diffusion 3 · InstructPix2Pix · Hugging Face

7 · arXiv · cs.LG · 19d ago

RayDer: Scalable Self-Supervised Novel View Synthesis via Unified Feed-Forward Transformer

RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone for self-supervised novel view synthesis (NVS). By treating dynamic content as a nuisance factor absorbed by a minimal dynamic state, it enables stable training on unconstrained real-world video without requiring dynamic-scene reconstruction. The model exhibits clean power-law scaling with both data and compute across multiple model sizes, and achieves zero-shot open-set performance competitive with supervised state-of-the-art methods on multiple benchmarks.

Training Infrastructure · Frontier Model Releases · feed-forward transformer · power-law scaling · CompVis

6 · arXiv · cs.LG · 18d ago

Drifting Preference Optimization (DrPO) for One-Step Text-to-Image Generators

DrPO is a new online preference fine-tuning method designed specifically for deterministic one-step text-to-image generators like SD-Turbo and SDXL-Turbo, which are difficult to align with standard RLHF methods that require policy likelihoods or differentiable reward gradients. The method samples candidates per prompt, ranks them with a target reward, and synthesizes a feature-space update direction via a non-parametric dipole preference field plus a reference drift from the frozen base model. Because the reward is used only for ranking, DrPO supports black-box and non-differentiable reward functions while keeping inference as a single forward pass. Evaluations on HPSv3 and GenEval show improved alignment over reward-gradient-free baselines and a 3.51× reduction in training compute by eliminating reward-model backpropagation.

Inference Economics · Alignment and RLHF · SDXL Turbo · HPSv3 · GenEval

5 · Hugging Face Blog · 1mo ago

Finetune Stable Diffusion Models with DDPO via TRL

Hugging Face's TRL library adds support for DDPO (Denoising Diffusion Policy Optimization), enabling reinforcement learning-based finetuning of Stable Diffusion models. This extends TRL's RLHF tooling beyond language models to image generation, allowing reward-driven optimization of diffusion models. The post demonstrates practical usage of the new DDPO trainer within the TRL ecosystem.

Agent and Tool Ecosystem · Alignment and RLHF · DDPO · Denoising Diffusion Policy Optimization · Stable Diffusion 3

4 · Hugging Face Blog · 1mo ago

Training Stable Diffusion with Dreambooth using Diffusers

This Hugging Face blog post describes how to fine-tune Stable Diffusion models using the DreamBooth technique via the Diffusers library. DreamBooth enables personalized text-to-image generation by training a model on a small set of reference images. The post covers the technical workflow for applying this fine-tuning approach within the Diffusers ecosystem.

Open Weights Progress · Agent and Tool Ecosystem · Hugging Face Diffusers · Stable Diffusion 3 · Hugging Face

4 · Hugging Face Blog · 1mo ago

Build Awesome Datasets for Video Generation

Hugging Face published a blog post on constructing high-quality datasets for video generation models. The post likely covers data collection, preprocessing, and curation pipelines relevant to training video diffusion or generation systems. This is a practical tooling and methodology guide aimed at practitioners working on video AI.

Agent and Tool Ecosystem · Multimodal Progress · Hugging Face · video generation

5 · arXiv · cs.CL · 24d ago

DIVE: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

DIVE is a frozen-backbone distillation framework that addresses a fundamental limitation in token-level in-context vector distillation: uniform cross-entropy supervision treats all output tokens equally, but long-form outputs such as medical reports are dominated by low-information template tokens, so diagnostically critical tokens receive insufficient gradient signal. The method introduces decisive-token supervision (upweighting pathology-related tokens and EOS events) to correct the supervision imbalance, and state-conditioned dynamic steering (hidden-state-dependent adapters replacing fixed residuals) to correct autoregressive drift. Evaluated on MIMIC-CXR and CheXpert Plus with two medical VLM backbones, DIVE achieves the best BLEU-4, ROUGE-L, and RadGraph F1 across all dataset-backbone combinations while remaining competitive on CheXbert F1.
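
To make the supervision-imbalance point concrete, here is a minimal sketch of a token-weighted negative log-likelihood, the generic technique behind upweighting selected tokens. It is not DIVE's actual loss: the function name weighted_token_nll, the decisive flags, and the weight value 3.0 are all assumptions chosen for illustration.

```python
import math


def weighted_token_nll(token_probs, decisive, decisive_weight=3.0):
    """Weighted mean negative log-likelihood over one target sequence.

    token_probs: probability the model assigns to each target token.
    decisive:    parallel list of bools marking tokens to upweight
                 (hypothetical stand-in for pathology or EOS tokens).
    """
    if not token_probs or len(token_probs) != len(decisive):
        raise ValueError("inputs must be non-empty and of equal length")
    weights = [decisive_weight if d else 1.0 for d in decisive]
    losses = [-math.log(p) for p in token_probs]
    # Normalize by the total weight so the result stays a weighted mean.
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Three well-predicted template tokens and one poorly predicted decisive token.
probs = [0.9, 0.9, 0.2, 0.9]
flags = [False, False, True, False]
uniform = weighted_token_nll(probs, flags, decisive_weight=1.0)
weighted = weighted_token_nll(probs, flags, decisive_weight=3.0)
```

With uniform weights the poorly predicted token is averaged against three easy ones; raising its weight makes it account for a larger share of the loss, and therefore of the gradient.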

Inference Economics Multimodal Progress State-Conditioned Dynamic Steering RadGraph F1 CheXbert F1 +5 more