The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance
This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.
Related guides (4)
Related events (8)
Phase diagram framework for choosing between cross-modal alignment and prediction in multimodal learning
A new arXiv preprint develops a unified linear framework to determine when cross-modal alignment (CA) versus cross-modal prediction (CP) is the better objective for multimodal representation learning. Under a spiked signal-plus-noise model, the authors derive separation ratios that expose complementary failure modes for each paradigm, producing a four-regime phase diagram (Both, CA only, CP only, Neither). A data-driven procedure lets practitioners locate their dataset in this diagram using a small labeled subsample before committing to training. Experiments on synthetic data, stereo-vision, image-caption pairs, and astrophysical data validate the framework, including a 'Neither' regime where cross-modal training is actively harmful.
Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'
A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.
Adversarial Subspace Alignment for Robust Multimodal Knowledge Editing in MLLMs
This paper addresses the generalization gap in multimodal large language model (MLLM) knowledge editing, where edits fail to propagate across semantically equivalent visual and linguistic variations. The authors introduce Latent Adversarial Robustification (LAR), which generates adversarial but semantically coherent variants in joint latent space, and Rank-Constrained Subspace Learning (RCSL), which enforces low-rank alignment of adversarial representations at the edit layer. Together these form the ASAM framework, which formalizes robustness via knowledge units grouping semantically equivalent multimodal inputs. Empirical analysis demonstrates improved generality without sacrificing reliability or locality.
Joint Energy-Based Models Reveal a Generative-Discriminative Sweet Spot for Human-Aligned Vision
Researchers use Joint Energy-Based Models (JEMs) to isolate the effect of learning objective—independent of architecture, scale, and data—on human alignment in visual representations. By varying a single mixing coefficient between discriminative and generative training, they evaluate models across six human-alignment benchmarks and find that alignment peaks at intermediate points on the generative-discriminative continuum rather than at either extreme. The results suggest that hybrid objectives combining categorical structure from discriminative learning with input-structure sensitivity from generative learning yield the most human-like visual behavior. This challenges the framing of generative vs. discriminative as a binary choice for building human-aligned vision systems.
Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning
This paper empirically validates a creative quality metric from a companion work (Calibrated Surprise, Zou & Xu 2026a) under strict low-resource conditions: ~100 expert chain-of-thought annotations and a small base model. The authors introduce Creative Quality Alignment (CQA) as a class of engineering methods and identify a systematic bias in public alignment datasets toward craft knowledge, with weak coverage of audience modeling and reality-logic. A theoretical argument based on 'architectural duality' in single conditional distribution LLMs is offered to explain why so few examples suffice, distinguishing the result from purely empirical findings like LIMA.
Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
This paper critiques the standard practice of regularizing Joint-Embedding Predictive Architecture (JEPA) encoders toward isotropic Gaussian marginals, showing that this Euclidean symmetry assumption incurs a quantifiable 'price of isotropy' and that no geometry-independent fixed marginal target is universally canonical. The authors prove that oracle one-view marginals do not identify the view-to-view predictive coupling, arguing structural bias should enter the cross-view coupling instead. They introduce HamJEPA, which encodes views as phase-space states and uses a learned Hamiltonian leapfrog map for view-to-view prediction, with symplectic coupling identified as the key driver of gains. HamJEPA outperforms SIGReg on CIFAR-100 by up to +6.45 kNN@20 and +10.64 linear-probe points at 80 epochs, with similar improvements on ImageNet-100.
Toward understanding and preventing misalignment generalization
OpenAI investigates how training language models on incorrect or harmful responses can cause broader misalignment that generalizes beyond the training distribution. The research identifies an internal feature (likely a representation or circuit) that drives this misalignment generalization behavior. Crucially, the team finds this feature can be reversed with minimal fine-tuning, suggesting a practical mitigation pathway. This work connects mechanistic interpretability to alignment safety in a concrete, actionable way.
MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models
MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.



