DiT-Reward converts text-to-image diffusion transformers into reward models, outperforming HPSv3
DiT-Reward is a new reward modeling approach that repurposes pretrained text-to-image Diffusion Transformers (DiTs) by processing near-clean image latents and aggregating text-conditioned representations across transformer layers. Under matched training data, it outperforms HPSv3 on all four evaluated preference benchmarks, reaching 85.6% on HPDv2 and 77.6% on HPDv3. When used as the reward for optimizing Stable Diffusion 3.5 Large via Flow-GRPO, the optimized model shows clear gains in realism, and DiT-Reward runs 1.65× faster at inference than HPSv3. The work demonstrates that generative DiT representations transfer meaningfully to reward modeling and policy optimization.
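To make the idea of aggregating representations across layers concrete, here is a minimal sketch, not the authors' implementation. It assumes mean pooling over tokens, softmax-weighted mixing of layers, and a linear scoring head; the names `layer_aggregated_reward`, `layer_logits`, and `head_w` are hypothetical. It also assumes the benchmark percentages are pairwise preference accuracies, which the summary does not state.

```python
import math

def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_aggregated_reward(layer_states, layer_logits, head_w, head_b=0.0):
    """Hypothetical reward head: pool each layer, mix layers, project to a scalar.

    layer_states: list over layers of [num_tokens][dim] hidden states
    layer_logits: one mixing logit per layer (assumed learnable)
    head_w:       projection vector of length dim (assumed learnable)
    """
    weights = softmax(layer_logits)
    pooled = [mean_pool(states) for states in layer_states]
    dim = len(head_w)
    mixed = [sum(w * p[i] for w, p in zip(weights, pooled)) for i in range(dim)]
    return sum(m * hw for m, hw in zip(mixed, head_w)) + head_b

def pairwise_accuracy(scores_preferred, scores_rejected):
    """Fraction of pairs where the human-preferred image scores higher."""
    wins = sum(1 for a, b in zip(scores_preferred, scores_rejected) if a > b)
    return wins / len(scores_preferred)

# Toy input: two layers, two tokens each, hidden size 2.
states = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
reward = layer_aggregated_reward(states, layer_logits=[0.0, 0.0], head_w=[1.0, 1.0])
```

In a real system the hidden states would come from the DiT's forward pass on a near-clean latent, and the mixing weights and head would be trained on preference pairs.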
Related events (8)
Finetune Stable Diffusion Models with DDPO via TRL
Hugging Face's TRL library adds support for DDPO (Denoising Diffusion Policy Optimization), enabling reinforcement learning-based finetuning of Stable Diffusion models. This extends TRL's RLHF tooling beyond language models to image generation, allowing reward-driven optimization of diffusion models. The post demonstrates practical usage of the new DDPO trainer within the TRL ecosystem.
Drifting Preference Optimization (DrPO) for One-Step Text-to-Image Generators
DrPO is a new online preference fine-tuning method designed specifically for deterministic one-step text-to-image generators like SD-Turbo and SDXL-Turbo, which are difficult to align with standard RLHF methods that require policy likelihoods or differentiable reward gradients. The method samples candidates per prompt, ranks them with a target reward, and synthesizes a feature-space update direction via a non-parametric dipole preference field plus a reference drift from the frozen base model. Because the reward is used only for ranking, DrPO supports black-box and non-differentiable reward functions while keeping inference as a single forward pass. Evaluations on HPSv3 and GenEval show improved alignment over reward-gradient-free baselines and a 3.51× reduction in training compute by eliminating reward-model backpropagation.
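The point that the reward is used only for ranking can be illustrated with a small sketch. This is not DrPO's dipole preference field, which the summary does not specify. It only shows that a ranking step needs nothing but function calls, so a non-differentiable reward yields the same preferred and dispreferred sets as a smooth one when the ordering is unchanged. All names and toy values are hypothetical.

```python
import math

def rank_by_reward(candidates, reward_fn):
    """Return candidate indices sorted from highest to lowest reward.

    reward_fn is treated as a black box: it is only called, never differentiated.
    """
    scores = [reward_fn(c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def split_preferred(order, k):
    """Take the top-k indices as preferred and the bottom-k as dispreferred."""
    return order[:k], order[-k:]

def smooth_reward(features):
    """Toy differentiable reward: the first feature."""
    return features[0]

def stepped_reward(features):
    """Toy non-differentiable reward: the same signal quantized to integers."""
    return math.floor(10 * features[0])

# Four toy candidates for one prompt, each a small feature vector.
candidates = [[0.12, 0.9], [0.84, 0.3], [0.51, 0.5], [0.97, 0.1]]

order_smooth = rank_by_reward(candidates, smooth_reward)
order_stepped = rank_by_reward(candidates, stepped_reward)
preferred, dispreferred = split_preferred(order_smooth, k=1)
```

Because only the ordering feeds the update, no gradient has to flow through the reward model, which is consistent with the reported saving from eliminating reward-model backpropagation.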
TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation
TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics with improved text alignment scaling with event count.
In-Context Reward Adaptation for Robust Preference Modeling
This paper proposes In-Context Reward Adaptation (ICRA), a transformer-based framework that infers reward structures from small sets of preference demonstrations at inference time, without retraining. The key finding is that standard transformers exhibit asymptotic bias toward ground-truth rewards, but incorporating human response time as an auxiliary signal resolves this limitation and enables generalization to unseen preference domains. The approach addresses a core limitation of static RLHF reward models, which fail to handle heterogeneous or shifting human value distributions.
DelTA: Discriminative Token Credit Assignment for RLVR Training
DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.
DistIL: Distributional DAgger for RL from Rich Feedback beyond single-bit rewards
A new arXiv preprint introduces DistIL, a distributional variant of the DAgger imitation learning algorithm designed to exploit rich feedback signals (execution traces, tool outputs, expert corrections) rather than the single-bit correctness reward used in standard RLVR. The method uses a forward cross-entropy objective that provides monotonic policy improvement guarantees, unlike reverse KL or Jensen-Shannon divergence objectives used in prior self-distillation approaches. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and hard math benchmarks.
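For readers unfamiliar with the distinction, the sketch below contrasts the standard definitions of forward cross-entropy and reverse KL on small discrete distributions. It illustrates the two objectives in general, not DistIL's training loop; how DistIL builds its target distribution from rich feedback is not described in this summary, so `expert` here is a hypothetical stand-in.

```python
import math

def forward_cross_entropy(expert, policy):
    """H(expert, policy) = -sum_x expert(x) * log policy(x).

    The expectation is taken under the expert (target) distribution.
    """
    return -sum(p * math.log(q) for p, q in zip(expert, policy) if p > 0)

def reverse_kl(expert, policy):
    """KL(policy || expert) = sum_x policy(x) * log(policy(x) / expert(x)).

    The expectation is taken under the policy's own distribution.
    """
    return sum(q * math.log(q / p) for p, q in zip(expert, policy) if q > 0)

# Toy distributions over three actions (hypothetical values).
expert = [0.49, 0.49, 0.02]
policy = [0.40, 0.40, 0.20]

ce_at_expert = forward_cross_entropy(expert, expert)  # equals the expert's entropy
ce_at_policy = forward_cross_entropy(expert, policy)
rkl_at_expert = reverse_kl(expert, expert)
rkl_at_policy = reverse_kl(expert, policy)
```

Forward cross-entropy is minimized when the policy matches the expert, and it can be estimated from expert-labeled samples alone, whereas reverse KL weights errors by where the current policy puts its mass.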
Representation-Conditioned Diffusion Models for Controllable Image Generation
This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.
Diffusers welcomes Stable Diffusion 3
Hugging Face's Diffusers library adds support for Stable Diffusion 3, enabling users to run Stability AI's latest text-to-image model through the standard Diffusers API. The post covers integration details, usage patterns, and memory optimization techniques for running SD3 locally. This marks the open-weights availability of SD3 through a major ML tooling ecosystem.


