5 · Hugging Face Blog · 1mo ago

Preference Optimization for Vision Language Models

This Hugging Face blog post covers the application of Direct Preference Optimization (DPO) to vision-language models (VLMs). It likely discusses how preference learning techniques originally developed for text-only LLMs can be adapted to multimodal settings. The post addresses training methodology for aligning VLMs with human preferences across both visual and textual modalities.

Alignment and RLHF · Multimodal Progress · Direct Preference Optimization (DPO) · Vision-Language Models · Hugging Face

Related guides (4)

Hugging Face

Hugging Face: The Home of Open-Source AI

Direct Preference Optimization (DPO) · Concept

Direct Preference Optimization (DPO): Reward-Free Alignment for LLMs

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

5 · Hugging Face Blog · 1mo ago

Preference Tuning LLMs with Direct Preference Optimization Methods

A Hugging Face blog post surveys Direct Preference Optimization (DPO) and related preference tuning methods for aligning large language models. The post covers the landscape of DPO variants and their practical application via the TRL library. It serves as a technical reference for practitioners implementing RLHF alternatives.

Agent and Tool Ecosystem · Alignment and RLHF · Reinforcement Learning from Human Feedback · Direct Preference Optimization (DPO) · Hugging Face · +1 more

3 · Hugging Face Blog · 17d ago

Direct Preference Optimization Beyond Chatbots

A Hugging Face blog post explores applications of Direct Preference Optimization (DPO) outside of conversational AI contexts. The post appears to survey or analyze how DPO, a technique for aligning language models with human preferences, can be applied to non-chatbot domains. The body content is unavailable, limiting assessment of specific claims or findings.

Alignment and RLHF · Direct Preference Optimization (DPO) · Hugging Face

4 · Hugging Face Blog · 1mo ago

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.

Multimodal Progress · Contrastive Language-Image Pretraining (CLIP) · Vision-Language Models · Hugging Face

6 · Hugging Face Blog · 1mo ago

Vision Language Model Alignment in TRL

Hugging Face's TRL library has added support for aligning Vision Language Models (VLMs), extending existing RLHF and preference optimization tooling to multimodal settings. The blog post covers the new capabilities for training VLMs with alignment techniques such as DPO and related methods. This expands the open-source ecosystem for multimodal model fine-tuning and alignment.

Open Weights Progress · Agent and Tool Ecosystem · Direct Preference Optimization (DPO) · Vision-Language Models · Hugging Face · +3 more

5 · Hugging Face Blog · 1mo ago

Fine-tune Llama 2 with DPO

This Hugging Face blog post provides a practical guide to fine-tuning Llama 2 using Direct Preference Optimization (DPO) via the TRL library. It explains how DPO, unlike RLHF, avoids training a separate reward model, and walks through dataset preparation, training configuration, and implementation details. The post targets practitioners looking to apply preference-based alignment to open-weights models.

Open Weights Progress · Agent and Tool Ecosystem · Meta AI · Llama 2 · Direct Preference Optimization (DPO) · +3 more

5 · Hugging Face Blog · 1mo ago

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress · Inference Economics · Vision-Language Models · Hugging Face · +1 more

3 · Hugging Face Blog · 1mo ago

Vision Language Models Explained

A Hugging Face blog post provides a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.

Multimodal Progress · Vision-Language Models · Hugging Face

6 · arXiv · cs.CL · 22d ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking · Alignment and RLHF · LoMo · LLaVA-OneVision-1.5-8B · Qwen3-4B · +3 more