5arXiv cs.CL (Computation and Language)·23d ago

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

This paper compares matched LLM and VLM pairs in a text-only setting to isolate the effect of multimodal training history on human-like language processing. Using whole-cortex fMRI and eye-tracking data from natural reading, the authors find that multimodal pretraining does not confer a uniform global advantage in human alignment. However, VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from both neural and behavioral measures. The findings suggest language-internal representations remain the primary driver of human text processing alignment.

Evaluation and Benchmarking Alignment and RLHF Multimodal Progress large language models human alignment (neural/behavioral)fMRI eye-tracking Vision-Language Models multimodal pretraining

Related guides (3)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6Hugging Face Blog·1mo ago·source ↗

Vision Language Model Alignment in TRL

Hugging Face's TRL library has added support for aligning Vision Language Models (VLMs), extending existing RLHF and preference optimization tooling to multimodal settings. The blog post covers the new capabilities for training VLMs with alignment techniques such as DPO and related methods. This expands the open-source ecosystem for multimodal model fine-tuning and alignment.

Open Weights Progress Agent and Tool Ecosystem Direct Preference Optimization (DPO)Vision-Language Models Hugging Face +3 more

5arXiv · cs.CL·24d ago·source ↗

Real Images, Worse Judgments: Evaluating VLMs on Concreteness and Imagery

This paper evaluates whether vision-language models (VLMs) benefit from real image context when making lexical judgments about word concreteness and imagery. The authors find that real-image contexts frequently hurt alignment with human ratings, especially when visual evidence is least relevant to the word being judged. Probing and canonical correlation analysis reveal that real images cause representational shifts and increased sensitivity to spurious visual cues. Instructing models to focus on text-only content at inference time partially mitigates this degradation.

Evaluation and Benchmarking Multimodal Progress concreteness ratings canonical correlation analysis Vision-Language Models +2 more

4Hugging Face Blog·1mo ago·source ↗

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.

Multimodal Progress Contrastive Language-Image Pretraining (CLIP)Vision-Language Models Hugging Face

6arXiv · cs.CL·22d ago·source ↗

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking Alignment and RLHF LoMo LLaVA-OneVision-1.5-8B Qwen3-4B +3 more

5arXiv · cs.CL·17d ago·source ↗

Visual instruction tuning aligns modalities in intermediate LLM layers, not early ones

A new arXiv paper investigates how visual instruction tuning embeds image features into the layer-wise hierarchy of LLM backbones across diverse vision-language architectures. Using probing analyses and causal interventions, the authors find that instruction tuning routes visual features into intermediate semantic layers, bypassing early unimodal-processing layers. They further show that fine-tuning restricted to these intermediate layers alone preserves full fine-tuning performance on vision-centric benchmarks while reducing training time, suggesting multimodal integration is a localized phenomenon.

Alignment and RLHF Multimodal Progress Visual Instruction Tuning Aligns Modalities through Abstraction

5Hugging Face Blog·1mo ago·source ↗

Preference Optimization for Vision Language Models

This Hugging Face blog post covers the application of Direct Preference Optimization (DPO) to vision-language models (VLMs). It likely discusses how preference learning techniques originally developed for text-only LLMs can be adapted to multimodal settings. The post addresses training methodology for aligning VLMs with human preferences across both visual and textual modalities.

Alignment and RLHF Multimodal Progress Direct Preference Optimization (DPO)Vision-Language Models Hugging Face

5Hugging Face Blog·1mo ago·source ↗

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress Inference Economics Vision-Language Models Hugging Face +1 more

3Hugging Face Blog·1mo ago·source ↗

Vision Language Models Explained

A Hugging Face blog post providing a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.

Multimodal Progress Vision-Language Models Hugging Face