3 · Hugging Face Blog · 17d ago

Direct Preference Optimization Beyond Chatbots

A Hugging Face blog post explores applications of Direct Preference Optimization (DPO), a technique for aligning language models with human preferences, outside conversational AI. It appears to survey how DPO can be applied to non-chatbot domains. The post's body was unavailable, so its specific claims and findings could not be assessed.

Alignment and RLHF · Direct Preference Optimization (DPO) · Hugging Face

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI


Direct Preference Optimization (DPO) · Concept

Direct Preference Optimization (DPO): Aligning AI Without a Reward Model


Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Related events (8)

5 · Hugging Face Blog · 1mo ago

Preference Tuning LLMs with Direct Preference Optimization Methods

A Hugging Face blog post surveys Direct Preference Optimization (DPO) and related preference tuning methods for aligning large language models. The post covers the landscape of DPO variants and their practical application via the TRL library. It serves as a technical reference for practitioners implementing RLHF alternatives.

Agent and Tool Ecosystem · Alignment and RLHF · Reinforcement Learning from Human Feedback · Direct Preference Optimization (DPO) · Hugging Face · +1 more

5 · Hugging Face Blog · 1mo ago

Preference Optimization for Vision Language Models

This Hugging Face blog post covers the application of Direct Preference Optimization (DPO) to vision-language models (VLMs). It likely discusses how preference learning techniques originally developed for text-only LLMs can be adapted to multimodal settings. The post addresses training methodology for aligning VLMs with human preferences across both visual and textual modalities.

Alignment and RLHF · Multimodal Progress · Direct Preference Optimization (DPO) · Vision-Language Models · Hugging Face

5 · Hugging Face Blog · 1mo ago

Fine-tune Llama 2 with DPO

This Hugging Face blog post provides a practical guide to fine-tuning Llama 2 with Direct Preference Optimization (DPO) via the TRL library. DPO is an alignment technique that, unlike RLHF, does not need a separate reward model. The post walks through dataset preparation, training configuration, and implementation details, and targets practitioners looking to apply preference-based alignment to open-weights models.

Open Weights Progress · Agent and Tool Ecosystem · Meta AI · Llama 2 · Direct Preference Optimization (DPO) · +3 more
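
The reward-model-free objective mentioned in the Llama 2 entry above can be illustrated with a minimal sketch of the standard per-pair DPO loss. This is not the TRL implementation: the function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for illustration. Real training computes these log-probabilities from a policy model and a frozen reference model over batches of token sequences.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value, not taken from the post
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, which is how the preference signal is applied without training a separate reward model.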

6 · arXiv · cs.LG · 18d ago

Drifting Preference Optimization (DrPO) for One-Step Text-to-Image Generators

DrPO is a new online preference fine-tuning method designed specifically for deterministic one-step text-to-image generators like SD-Turbo and SDXL-Turbo, which are difficult to align with standard RLHF methods that require policy likelihoods or differentiable reward gradients. The method samples candidates per prompt, ranks them with a target reward, and synthesizes a feature-space update direction via a non-parametric dipole preference field plus a reference drift from the frozen base model. Because the reward is used only for ranking, DrPO supports black-box and non-differentiable reward functions while keeping inference as a single forward pass. Evaluations on HPSv3 and GenEval show improved alignment over reward-gradient-free baselines and a 3.51× reduction in training compute by eliminating reward-model backpropagation.

Inference Economics · Alignment and RLHF · SDXL Turbo · HPSv3 · GenEval · +4 more

3 · Hugging Face Blog · 1mo ago

What Makes a Dialog Agent Useful?

A Hugging Face blog post from January 2023 examines the properties that make dialog agents useful, likely covering instruction-following, helpfulness, and alignment techniques. Published amid growing interest in ChatGPT and RLHF-trained conversational models, the post reflects the community's effort to understand and replicate capable dialog systems. As a tier-2 commentary piece, it offers analytical framing rather than new empirical results.

Agent and Tool Ecosystem · Alignment and RLHF · ChatGPT · Reinforcement Learning from Human Feedback · Hugging Face

5 · arXiv · cs.LG · 23d ago

AMRS: Rollout-Based World Model for Offline Affective Music Recommendation with DPO

LUCID's Affective Music Recommendation System (AMRS) uses a causal transformer world model trained on logged listening data to jointly predict engagement, ratings, and self-reported valence/arousal, enabling offline policy optimization without ethically problematic online experimentation. A recommender policy is initialized via behavior cloning and fine-tuned with Direct Preference Optimization (DPO) against a multi-objective utility function. The system is deployed on LUCID's health-and-wellness platforms serving clinical users (older adults with neurocognitive conditions) and consumer-wellness users across four modes. Under cold-start conditions, DPO improves predicted affective signals over the cloned baseline while maintaining diversity and avoiding distributional collapse.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · behavior cloning · world model · Direct Preference Optimization (DPO) · +4 more

7 · OpenAI Blog · 1mo ago

Learning from Human Preferences: OpenAI and DeepMind Collaborate on Reward Learning from Comparisons

OpenAI, in collaboration with DeepMind's safety team, published a method for learning reward functions directly from human preference comparisons between pairs of agent behaviors, eliminating the need to hand-code goal functions. The algorithm infers human intent by asking evaluators which of two proposed behaviors is preferable, addressing risks from misspecified reward functions. This work is an early foundational contribution to what would become reinforcement learning from human feedback (RLHF). It targets both safety and alignment concerns around reward hacking and proxy gaming.

Evaluation and Benchmarking · AI Safety Research · Reward Learning from Comparisons · DeepMind · Reinforcement Learning from Human Feedback · +2 more
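
Learning a reward function from pairwise comparisons, as described in the entry above, is commonly framed with a Bradley-Terry style model: the probability that an evaluator prefers one behavior is a logistic function of the difference in predicted total reward. The sketch below assumes that framing. The function names and the plain lists of per-step rewards are hypothetical and stand in for the outputs of a learned reward predictor.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_probability(
    rewards_a: Sequence[float], rewards_b: Sequence[float]
) -> float:
    """Modeled probability that behavior A is preferred over behavior B.

    Each sequence holds the predicted per-step rewards for one behavior clip.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # sigmoid(diff) written as exp(-softplus(-diff)) to avoid overflow
    return math.exp(-softplus(-diff))


def comparison_loss(
    rewards_a: Sequence[float],
    rewards_b: Sequence[float],
    human_prefers_a: bool,
) -> float:
    """Cross-entropy between the modeled preference and the human label."""
    diff = sum(rewards_a) - sum(rewards_b)
    # -log sigmoid(diff) if A was preferred, else -log sigmoid(-diff)
    return softplus(-diff) if human_prefers_a else softplus(diff)
```

Minimizing `comparison_loss` over many labeled pairs pushes the predicted rewards to agree with the evaluators' choices, so no hand-coded goal function is required.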

7 · arXiv · cs.CL · 1mo ago

General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks

GPRL proposes a new alignment framework that replaces scalar reward models with a General Preference Model (GPM) embedding responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes per-dimension group-relative advantages, normalizes across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work directly addresses the gap between verifiable-reward online RL (strong on math/code) and preference optimization (strong on open-ended tasks).

Frontier Model Releases · Evaluation and Benchmarking · WildBench · MT-Bench · General Preference Reinforcement Learning · +7 more