6 · arXiv cs.AI (Artificial Intelligence) · 15h ago

DiT-Reward converts text-to-image diffusion transformers into reward models, outperforming HPSv3

DiT-Reward is a reward modeling approach that repurposes pretrained text-to-image Diffusion Transformers (DiTs) as reward models: it processes near-clean image latents and aggregates text-conditioned representations across transformer layers. Trained on matched data, it outperforms HPSv3 on all four evaluated preference benchmarks, reaching 85.6% on HPDv2 and 77.6% on HPDv3. When used as the reward for optimizing Stable Diffusion 3.5 Large via Flow-GRPO, it yields clear gains in realism and runs inference 1.65× faster than HPSv3. The work demonstrates that generative DiT representations transfer meaningfully to reward modeling and policy optimization.

Evaluation and Benchmarking · Alignment and RLHF · Multimodal Progress · Flow-GRPO · HPSv3 · HPDv2 · HPDv3 · Stable Diffusion 3.5 Large · DiT-Reward
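
The DiT-Reward summary describes aggregating representations across transformer layers and reports benchmark percentages. A minimal sketch of how such a pipeline could be wired together is given below. Everything in it is an assumption made for illustration: the softmax layer weighting, the linear scoring head, the Bradley-Terry pairwise loss, and the reading of the benchmark figures as pairwise preference accuracy are common choices for preference reward models, not details confirmed by the summary. The per-layer feature vectors stand in for pooled DiT activations, and all function names are hypothetical.

```python
import math

def aggregate_layers(layer_feats, layer_logits):
    """Softmax-weighted average of per-layer feature vectors (assumed scheme)."""
    m = max(layer_logits)
    exps = [math.exp(l - m) for l in layer_logits]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, layer_feats)) for i in range(dim)]

def reward(layer_feats, layer_logits, head_w, head_b=0.0):
    """Scalar reward: linear head on the aggregated features (hypothetical head)."""
    pooled = aggregate_layers(layer_feats, layer_logits)
    return sum(w * x for w, x in zip(head_w, pooled)) + head_b

def bradley_terry_loss(r_preferred, r_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written as a stable softplus."""
    diff = r_preferred - r_rejected
    if diff > -30.0:
        return math.log1p(math.exp(-diff))
    return -diff  # for very negative diff, softplus(-diff) is approximately -diff

def pairwise_accuracy(score_pairs):
    """Fraction of (preferred, rejected) score pairs ranked in the right order."""
    correct = sum(1 for r_p, r_r in score_pairs if r_p > r_r)
    return correct / len(score_pairs)

if __name__ == "__main__":
    # Two layers, two feature dimensions; placeholder numbers, not real activations.
    feats_a = [[1.0, 2.0], [3.0, 4.0]]
    feats_b = [[0.0, 1.0], [1.0, 1.0]]
    logits = [0.0, 0.0]   # equal layer weights
    head = [1.0, 1.0]
    r_a = reward(feats_a, logits, head)
    r_b = reward(feats_b, logits, head)
    print(r_a, r_b, bradley_terry_loss(r_a, r_b))
```

In this sketch the layer logits and head weights would be the trainable parameters, with the loss minimized over human preference pairs; the actual DiT-Reward training objective may differ.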

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

Finetune Stable Diffusion Models with DDPO via TRL

Hugging Face's TRL library adds support for DDPO (Denoising Diffusion Policy Optimization), enabling reinforcement learning-based finetuning of Stable Diffusion models. This extends TRL's RLHF tooling beyond language models to image generation, allowing reward-driven optimization of diffusion models. The post demonstrates practical usage of the new DDPO trainer within the TRL ecosystem.

Agent and Tool Ecosystem · Alignment and RLHF · DDPO · Denoising Diffusion Policy Optimization · Stable Diffusion 3 · +3 more

6 · arXiv cs.LG · 21d ago

Drifting Preference Optimization (DrPO) for One-Step Text-to-Image Generators

DrPO is a new online preference fine-tuning method designed specifically for deterministic one-step text-to-image generators like SD-Turbo and SDXL-Turbo, which are difficult to align with standard RLHF methods that require policy likelihoods or differentiable reward gradients. The method samples candidates per prompt, ranks them with a target reward, and synthesizes a feature-space update direction via a non-parametric dipole preference field plus a reference drift from the frozen base model. Because the reward is used only for ranking, DrPO supports black-box and non-differentiable reward functions while keeping inference as a single forward pass. Evaluations on HPSv3 and GenEval show improved alignment over reward-gradient-free baselines and a 3.51× reduction in training compute by eliminating reward-model backpropagation.

Inference Economics · Alignment and RLHF · SDXL Turbo · HPSv3 · GenEval · +4 more

5 · arXiv cs.AI · 22d ago

TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation

TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics with improved text alignment scaling with event count.

Evaluation and Benchmarking · Inference Economics · Meve · TunerDiT · Event-Partitioned Masking · +3 more

6 · arXiv cs.LG · 25d ago

In-Context Reward Adaptation for Robust Preference Modeling

This paper proposes In-Context Reward Adaptation (ICRA), a transformer-based framework that infers reward structures from small sets of preference demonstrations at inference time, without retraining. The key finding is that standard transformers exhibit asymptotic bias toward ground-truth rewards, but incorporating human response time as an auxiliary signal resolves this limitation and enables generalization to unseen preference domains. The approach addresses a core limitation of static RLHF reward models, which fail to handle heterogeneous or shifting human value distributions.

Evaluation and Benchmarking · Alignment and RLHF · Transformers · In-Context Reward Adaptation · Reinforcement Learning from Human Feedback · +2 more

6 · arXiv cs.CL · 1mo ago

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.

Frontier Model Releases · Evaluation and Benchmarking · DelTA · Qwen3-8B-Base · policy gradient · +5 more

6 · arXiv cs.AI · 19d ago

DistIL: Distributional DAgger for RL from Rich Feedback beyond single-bit rewards

A new arXiv preprint introduces DistIL, a distributional variant of the DAgger imitation learning algorithm designed to exploit rich feedback signals (execution traces, tool outputs, expert corrections) rather than the single-bit correctness reward used in standard RLVR. The method uses a forward cross-entropy objective that provides monotonic policy improvement guarantees, unlike reverse KL or Jensen-Shannon divergence objectives used in prior self-distillation approaches. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and hard math benchmarks.

Frontier Model Releases · Alignment and RLHF · DAgger · DistIL · Reinforcement Learning with Verifiable Rewards · +1 more
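
The DistIL summary contrasts a forward cross-entropy objective with reverse KL and Jensen-Shannon objectives. The sketch below computes the three quantities for a pair of small discrete distributions, using their standard textbook definitions. Which distribution plays the target role and which the policy role in DistIL is an assumption here, and the example distributions are made-up placeholders rather than data from the paper.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); terms with p_i == 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]   # assumed target (e.g. expert) distribution
    policy = [0.4, 0.4, 0.2]   # assumed current policy distribution

    # Forward cross-entropy: expectation under the target of -log policy.
    print("forward CE:", cross_entropy(target, policy))
    # Reverse KL: expectation under the policy's own distribution.
    print("reverse KL:", kl(policy, target))
    # Jensen-Shannon: symmetric and bounded above by log(2).
    print("JS:", js(target, policy))
```

The forward cross-entropy differs from the forward KL only by the entropy of the target, which is constant with respect to the policy, so minimizing one minimizes the other. The reverse KL weights errors by the policy's own probabilities, which is the usual explanation for its mode-seeking behavior.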

5 · arXiv cs.LG · 27d ago

Representation-Conditioned Diffusion Models for Controllable Image Generation

This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.

Frontier Model Releases · Multimodal Progress · Representation-Conditioned Diffusion Models · Self-Supervised Learning · Disentangled Representation Learning · +1 more

6 · Hugging Face Blog · 1mo ago

Diffusers welcomes Stable Diffusion 3

Hugging Face's Diffusers library adds support for Stable Diffusion 3, enabling users to run Stability AI's latest text-to-image model through the standard Diffusers API. The post covers integration details, usage patterns, and memory optimization techniques for running SD3 locally. This marks the open-weights availability of SD3 through a major ML tooling ecosystem.

Open Weights Progress · Agent and Tool Ecosystem · Stable Diffusion 3 · Hugging Face · Stability AI · +2 more