Joint Energy-Based Models Reveal a Generative-Discriminative Sweet Spot for Human-Aligned Vision
Researchers use Joint Energy-Based Models (JEMs) to isolate the effect of the learning objective on human alignment in visual representations, holding architecture, scale, and data fixed. By varying a single mixing coefficient between discriminative and generative training, they evaluate models across six human-alignment benchmarks and find that alignment peaks at intermediate points on the generative-discriminative continuum rather than at either extreme. The results suggest that hybrid objectives, which combine categorical structure from discriminative learning with input-structure sensitivity from generative learning, yield the most human-like visual behavior. This challenges the framing of generative vs. discriminative training as a binary choice for building human-aligned vision systems.
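To make the idea of a single mixing coefficient concrete, here is a minimal sketch of a hybrid JEM-style loss. It relies on the standard JEM reading of classifier logits f(x) as defining an energy E(x) = -logsumexp_y f(x)[y]. The convex combination via `alpha`, the function name `jem_hybrid_loss`, and the use of precomputed logits for model samples are assumptions made for illustration; the summary does not state the exact form the authors use, and a real setup would draw those samples with an MCMC procedure such as SGLD.

```python
import numpy as np

def logsumexp(z, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))

def jem_hybrid_loss(logits_data, labels, logits_samples, alpha):
    """Mix a discriminative and a generative loss term.

    alpha = 0 is purely discriminative, alpha = 1 purely generative
    (an assumed convention, not taken from the paper).
    logits_samples are the classifier logits on samples drawn from the
    model; here they are simply passed in as an array.
    """
    # Discriminative term: cross-entropy of softmax(logits) against labels.
    log_probs = logits_data - logsumexp(logits_data)[:, None]
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])

    # Generative term: contrastive surrogate for -log p(x).
    # Lower the energy of real data, raise the energy of model samples.
    energy_data = -logsumexp(logits_data)
    energy_samples = -logsumexp(logits_samples)
    gen = np.mean(energy_data) - np.mean(energy_samples)

    return (1.0 - alpha) * ce + alpha * gen
```

Sweeping `alpha` over a grid between 0 and 1 and training one model per value would trace out the kind of generative-discriminative continuum the summary describes.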
Related events (8)
Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment
This paper introduces a framework for evaluating alignment between artificial vision models and the human visual cortex that goes beyond scalar prediction accuracy. Using repeated fMRI data from the Natural Scenes Dataset, the authors decompose brain response spaces into reproducible dimensions and measure which of these dimensions are recovered by model predictions. A key finding is that pretrained and randomly initialized models can achieve similar prediction accuracy while showing distinct recovery profiles, revealing that accuracy alone can mask fundamental model-brain mismatches. The framework also enables brain-to-brain comparisons as a diagnostic human reference baseline.
Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
This paper critiques the standard practice of regularizing Joint-Embedding Predictive Architecture (JEPA) encoders toward isotropic Gaussian marginals, showing that this Euclidean symmetry assumption incurs a quantifiable 'price of isotropy' and that no geometry-independent fixed marginal target is universally canonical. The authors prove that oracle one-view marginals do not identify the view-to-view predictive coupling, arguing structural bias should enter the cross-view coupling instead. They introduce HamJEPA, which encodes views as phase-space states and uses a learned Hamiltonian leapfrog map for view-to-view prediction, with symplectic coupling identified as the key driver of gains. HamJEPA outperforms SIGReg on CIFAR-100 by up to +6.45 kNN@20 and +10.64 linear-probe points at 80 epochs, with similar improvements on ImageNet-100.
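For readers unfamiliar with leapfrog maps, the following is a minimal sketch of a leapfrog (Stormer-Verlet) integrator acting on a phase-space state (q, p). It assumes a separable Hamiltonian H(q, p) = 0.5 * |p|^2 + V(q) and takes the potential gradient as a plain function. In HamJEPA the Hamiltonian is learned, so the names `leapfrog_step` and `predict_view` and the toy setup are illustrative stand-ins rather than the authors' implementation.

```python
import numpy as np

def leapfrog_step(q, p, grad_potential, step_size):
    """One leapfrog step for H(q, p) = 0.5 * |p|^2 + V(q)."""
    p_half = p - 0.5 * step_size * grad_potential(q)           # half kick
    q_new = q + step_size * p_half                             # full drift
    p_new = p_half - 0.5 * step_size * grad_potential(q_new)   # half kick
    return q_new, p_new

def predict_view(q, p, grad_potential, step_size=0.1, num_steps=10):
    """Map one view's phase-space state to a predicted state for another
    view by composing leapfrog steps (hypothetical helper)."""
    for _ in range(num_steps):
        q, p = leapfrog_step(q, p, grad_potential, step_size)
    return q, p
```

Each step is symplectic and time-reversible: running the map, negating the momentum, and running it again returns the starting state up to floating-point error. That structure-preserving property is what distinguishes this kind of predictor from an unconstrained network.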
Implicit Generation and Generalization Methods for Energy-Based Models
OpenAI published research on stable and scalable training of energy-based models (EBMs), achieving sample quality competitive with GANs at low temperatures while retaining mode coverage guarantees of likelihood-based models. The approach uses iterative compute during generation to continually refine outputs. This work positions EBMs as a promising alternative generative modeling paradigm bridging GANs and likelihood-based models.
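The iterative refinement mentioned above can be pictured as repeated noisy gradient descent on the energy. The sketch below assumes a Langevin-style sampler, which the summary itself does not name. The function `langevin_refine`, its parameters, and the idea of exposing `noise_scale` as a rough stand-in for temperature are all illustrative assumptions.

```python
import numpy as np

def langevin_refine(x0, grad_energy, step_size=0.01, num_steps=100,
                    noise_scale=None, rng=None):
    """Refine a sample by following the negative energy gradient with noise.

    noise_scale defaults to sqrt(2 * step_size), the standard Langevin
    choice; smaller values act like a lower sampling temperature.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    if noise_scale is None:
        noise_scale = np.sqrt(2.0 * step_size)
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step_size * grad_energy(x) + noise_scale * noise
    return x
```

More steps mean more compute spent per sample. With `noise_scale=0.0` the loop reduces to plain gradient descent on the energy and settles at a nearby energy minimum.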
Semantic Generative Tuning (SGT) for Unified Multimodal Models
This paper introduces Semantic Generative Tuning (SGT), a post-training paradigm for unified multimodal models (UMMs) that bridges the gap between visual understanding and visual generation. The authors find that image segmentation tasks serve as optimal generative proxies, providing structural semantics that improve both perception and generative layout fidelity. SGT aligns representation spaces across understanding and generation objectives, improving feature linear separability and visual-textual attention allocation. Evaluations show consistent gains on multimodal comprehension and generative fidelity benchmarks.
The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance
This paper proposes the 'matching principle', a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) all estimate the same object: the covariance of label-preserving deployment nuisance. On this view, the core statistical problem is regularizing the encoder Jacobian along the range of that covariance. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks, including some on Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.
MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models
MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.
Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
This paper introduces Social Gaze Consistency (SGC), a high-level semantic detection axis based on the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals in images. The authors construct a controlled diagnostic dataset with region-specific gaze perturbations and a Block-Compositional Caption Supervision scheme to train detectors without generator-fingerprint memorization shortcuts. Cross-architecture validation shows +3.7 pp improvement on the COCOAI Interaction subset when applied to FakeVLM, with gains transferring from a single inpainter (FLUX.1-Fill) to multi-generator suites. The work argues that diffusion models share a spectral weakness in periocular structure, making gaze coherence a robust, backbone-agnostic detection signal orthogonal to existing low-level artifact methods.
Phase diagram framework for choosing between cross-modal alignment and prediction in multimodal learning
A new arXiv preprint develops a unified linear framework to determine when cross-modal alignment (CA) versus cross-modal prediction (CP) is the better objective for multimodal representation learning. Under a spiked signal-plus-noise model, the authors derive separation ratios that expose complementary failure modes for each paradigm, producing a four-regime phase diagram (Both, CA only, CP only, Neither). A data-driven procedure lets practitioners locate their dataset in this diagram using a small labeled subsample before committing to training. Experiments on synthetic data, stereo-vision, image-caption pairs, and astrophysical data validate the framework, including a 'Neither' regime where cross-modal training is actively harmful.

