6arXiv cs.LG (Machine Learning)·10d ago

Phase diagram framework for choosing between cross-modal alignment and prediction in multimodal learning

A new arXiv preprint develops a unified linear framework to determine when cross-modal alignment (CA) versus cross-modal prediction (CP) is the better objective for multimodal representation learning. Under a spiked signal-plus-noise model, the authors derive separation ratios that expose complementary failure modes for each paradigm, producing a four-regime phase diagram (Both, CA only, CP only, Neither). A data-driven procedure lets practitioners locate their dataset in this diagram using a small labeled subsample before committing to training. Experiments on synthetic data, stereo-vision, image-caption pairs, and astrophysical data validate the framework, including a 'Neither' regime where cross-modal training is actively harmful.

Evaluation and Benchmarking Multimodal Progress When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Related guides (2)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

7arXiv · cs.AI·29d ago·source ↗

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

Evaluation and Benchmarking AI Safety Research Invariant Risk Minimization Matching Principle Qwen2.5-7B +5 more

5arXiv · cs.CL·23d ago·source ↗

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

This paper compares matched LLM and VLM pairs in a text-only setting to isolate the effect of multimodal training history on human-like language processing. Using whole-cortex fMRI and eye-tracking data from natural reading, the authors find that multimodal pretraining does not confer a uniform global advantage in human alignment. However, VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from both neural and behavioral measures. The findings suggest language-internal representations remain the primary driver of human text processing alignment.

Evaluation and Benchmarking Alignment and RLHF large language models human alignment (neural/behavioral)fMRI +4 more

6arXiv · cs.CL·17d ago·source ↗

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking AI Safety Research Qwen3-4B Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models +1 more

6arXiv · cs.AI·26d ago·source ↗

Joint Energy-Based Models Reveal a Generative-Discriminative Sweet Spot for Human-Aligned Vision

Researchers use Joint Energy-Based Models (JEMs) to isolate the effect of learning objective—independent of architecture, scale, and data—on human alignment in visual representations. By varying a single mixing coefficient between discriminative and generative training, they evaluate models across six human-alignment benchmarks and find that alignment peaks at intermediate points on the generative-discriminative continuum rather than at either extreme. The results suggest that hybrid objectives combining categorical structure from discriminative learning with input-structure sensitivity from generative learning yield the most human-like visual behavior. This challenges the framing of generative vs. discriminative as a binary choice for building human-aligned vision systems.

Evaluation and Benchmarking Alignment and RLHF human alignment benchmarks (perceptual similarity, gloss, robustness, shape-texture)Joint Energy-Based Models (JEMs)generative-discriminative continuum

5arXiv · cs.AI·1mo ago·source ↗

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

This paper introduces a framework for evaluating alignment between artificial vision models and the human visual cortex that goes beyond scalar prediction accuracy. Using repeated fMRI data from the Natural Scenes Dataset, the authors decompose brain response spaces into reproducible dimensions and measure which of these dimensions are recovered by model predictions. A key finding is that pretrained and randomly initialized models can achieve similar prediction accuracy while showing distinct recovery profiles, revealing that accuracy alone can mask fundamental model-brain mismatches. The framework also enables brain-to-brain comparisons as a diagnostic human reference baseline.

Evaluation and Benchmarking Multimodal Progress Natural Scenes Dataset human visual cortex target-space recovery profiles +1 more

6arXiv · cs.CL·25d ago·source ↗

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.

Training Infrastructure Inference Economics MAGIC LLaVA-1.5-7B LLaVA-665K +5 more

6arXiv · cs.AI·18d ago·source ↗

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

Evaluation and Benchmarking Alignment and RLHF Perceptually Perturbed Judgment Dataset Multimodal Large Language Models GRPO +3 more

5arXiv · cs.AI·4d ago·source ↗

Internal Oppenheim-Lim test reveals phase/sign identity codes shared across image classifier architectures

A new arXiv preprint applies a causal intervention inspired by Oppenheim and Lim (1981) to probe whether trained image classifiers encode identity in Fourier phase rather than magnitude within their hidden layers. By transplanting phase or sign components between images at chosen layers in PRISM2D, GFNet, ViT-B/16, and ResNet-50, the authors find that predictions follow the phase/sign donor across all tested architectures, with image-specific magnitude largely dispensable. ResNet-50 requires a pre-ReLU intervention to reveal a latent sign code, exposing how rectification and readout geometry shape the basis in which the code is expressed. The findings offer a mechanistic account of the texture–shape gap between CNNs and attention-based models.

Evaluation and Benchmarking ViT-B/16 GFNet PRISM2D +2 more