Visual Instruction Tuning Aligns Modalities through Abstraction
paper · active · provisional
visual-instruction-tuning-aligns-modalities-through-abstraction-6f270525 · 1 event · first seen 13d ago
Aliases: Visual Instruction Tuning Aligns Modalities through Abstraction
More like this (12)
Multimodal Continual Instruction Tuning
Multimodal Learning
instruction tuning
Latent World Recovery for Multimodal Learning with Missing Modalities
When to Align, When to Predict: A Phase Diagram for Multimodal Learning
Local Modality Substitution
multimodal pretraining
Multimodal Gain
Vision-Language-Action models
Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization
Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
Instruction Hierarchy
Recent events (1)
Visual instruction tuning aligns modalities in intermediate LLM layers, not early ones
A new arXiv paper investigates how visual instruction tuning embeds image features into the layer-wise hierarchy of LLM backbones across diverse vision-language architectures. Using probing analyses and causal interventions, the authors find that instruction tuning routes visual features into intermediate semantic layers, bypassing early unimodal-processing layers. They further show that fine-tuning restricted to these intermediate layers alone preserves full fine-tuning performance on vision-centric benchmarks while reducing training time, suggesting multimodal integration is a localized phenomenon.
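To make the idea of fine-tuning only the intermediate layers concrete, here is a minimal sketch of how a set of trainable parameters might be chosen by layer index. It is not the paper's procedure: the parameter naming scheme (`language_model.layers.<i>.`), the helper names, and the layer band used in the example are all assumptions for illustration, since the summary does not say which layers the authors selected.

```python
import re

# Assumed naming scheme (hypothetical): LLM decoder blocks are named
# "language_model.layers.<index>.<...>". Real checkpoints may differ.
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def layer_index(param_name, backbone_prefix="language_model"):
    """Return the decoder-block index of an LLM-backbone parameter, else None."""
    if not param_name.startswith(backbone_prefix):
        return None  # e.g. vision encoder or projector parameters
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def trainable_mask(param_names, first, last):
    """Map each parameter name to True if it lies in blocks first..last-1.

    The band [first, last) is an illustrative choice, not a value
    reported in the source.
    """
    mask = {}
    for name in param_names:
        idx = layer_index(name)
        mask[name] = idx is not None and first <= idx < last
    return mask


# Toy parameter names standing in for a real model's parameters.
example_names = [
    "language_model.embed_tokens.weight",
    "language_model.layers.0.mlp.weight",
    "language_model.layers.5.mlp.weight",
    "vision_tower.layers.5.mlp.weight",
]
example_mask = trainable_mask(example_names, first=4, last=8)
```

In a PyTorch setting, a mask like this could be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = mask[name]`, so that only the selected blocks receive gradient updates.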