When to Align, When to Predict: A Phase Diagram for Multimodal Learning
Phase diagram framework for choosing between cross-modal alignment and prediction in multimodal learning
A new arXiv preprint develops a unified linear framework for deciding when cross-modal alignment (CA) or cross-modal prediction (CP) is the better objective for multimodal representation learning. Under a spiked signal-plus-noise model, the authors derive separation ratios that expose complementary failure modes of the two paradigms and yield a four-regime phase diagram: Both, CA only, CP only, and Neither. A data-driven procedure lets practitioners locate their dataset in this diagram from a small labeled subsample before committing to training. Experiments on synthetic data, stereo vision, image-caption pairs, and astrophysical data validate the framework, including a "Neither" regime in which cross-modal training is actively harmful.
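To make the idea of locating a dataset in such a diagram concrete, here is a minimal NumPy sketch on a toy two-view spiked model. It is not the preprint's procedure and does not compute its separation ratios. Everything in it is an illustrative assumption: the generative model in `sample_views`, the linear stand-ins for CA (top singular direction of the cross-covariance) and CP (top singular direction of a least-squares map from X to Y), the PCA baseline, the squared-correlation probe, and the `margin` threshold. The sketch only shows the general shape of a check that uses paired data for training and a small labeled subsample for scoring.

```python
import numpy as np

def sample_views(n, d, theta_x, theta_y, theta_nuis, rng):
    """Toy spiked model (assumed): shared latent z, X-only nuisance g, unit noise."""
    z = rng.standard_normal(n)            # latent shared by both modalities
    g = rng.standard_normal(n)            # nuisance latent present only in X
    u, v, w = np.eye(d)[0], np.eye(d)[1], np.eye(d)[2]
    X = theta_x * np.outer(z, u) + theta_nuis * np.outer(g, w) + rng.standard_normal((n, d))
    Y = theta_y * np.outer(z, v) + rng.standard_normal((n, d))
    return X, Y, np.sign(z)               # binary label depends on z only

def top_left_sv(M):
    """Unit-norm top left singular vector of M."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, 0]

def ca_direction(X, Y):
    """Alignment stand-in: leading direction of the cross-covariance (zero-mean data assumed)."""
    return top_left_sv(X.T @ Y / len(X))

def cp_direction(X, Y):
    """Prediction stand-in: leading input direction of the least-squares map X -> Y."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return top_left_sv(W)

def pca_direction(X):
    """Unimodal baseline: leading principal direction of X."""
    return top_left_sv(X.T @ X / len(X))

def probe_score(X, labels, a):
    """Squared correlation between the 1-D feature X @ a and the labels."""
    return np.corrcoef(X @ a, labels)[0, 1] ** 2

def locate_regime(X, Y, labels_small, margin=0.05):
    """Fit on all paired data, score on the first len(labels_small) labeled rows."""
    Xs = X[:len(labels_small)]
    base = probe_score(Xs, labels_small, pca_direction(X))
    ca_helps = bool(probe_score(Xs, labels_small, ca_direction(X, Y)) > base + margin)
    cp_helps = bool(probe_score(Xs, labels_small, cp_direction(X, Y)) > base + margin)
    names = {(True, True): "Both", (True, False): "CA only",
             (False, True): "CP only", (False, False): "Neither"}
    return names[(ca_helps, cp_helps)]

rng = np.random.default_rng(0)

# Case A: weak shared signal in X masked by a strong X-only nuisance; Y is informative.
X, Y, labels = sample_views(4000, 20, theta_x=1.0, theta_y=2.0, theta_nuis=3.0, rng=rng)
regime_a = locate_regime(X, Y, labels[:500])

# Case B: X alone is already informative and Y carries no shared signal.
X, Y, labels = sample_views(4000, 20, theta_x=3.0, theta_y=0.0, theta_nuis=0.0, rng=rng)
regime_b = locate_regime(X, Y, labels[:500])

print(regime_a, regime_b)
```

In this toy, the two cross-modal stand-ins tend to succeed or fail together, so it does not reproduce the complementary failure modes the preprint describes. It only illustrates comparing each objective against a unimodal baseline before committing to training.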