A new arXiv preprint provides a rigorous theoretical analysis of distributed self-supervised learning (D-SSL) frameworks under non-IID (heterogeneous) data conditions. The key findings are that Masked Image Modeling (MIM) is inherently more robust to data heterogeneity than Contrastive Learning (CL), and that federated learning is no less robust than fully decentralized learning due to network connectivity effects. The authors also introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization, validated across multiple architectures and distributed settings.
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
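The distillation signal described above can be sketched as a per-token divergence between two next-token distributions scored on the same student-sampled rollout: one from the crop-conditioned teacher, one from the full-image student. This is a minimal illustration, not the paper's implementation. The function names are hypothetical, the logits are plain Python lists standing in for model outputs, and the KL direction (teacher to student) is an assumption, since the summary only says "token-level divergence".

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_level_kl(teacher_logits, student_logits):
    """Mean over rollout positions of KL(teacher || student).

    Each argument is a list of per-position logit vectors for the same
    sampled token sequence. The KL direction is an assumption.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("need two non-empty rollouts of equal length")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_lp = log_softmax(t_row)
        s_lp = log_softmax(s_row)
        # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
        total += sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_lp, s_lp))
    return total / len(teacher_logits)
```

In a training loop the student would minimize this quantity while the teacher's outputs are treated as fixed targets; because both roles come from the same MLLM, only the visual conditioning differs between them.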
A new arXiv paper critically evaluates seven state-of-the-art dataset distillation (DD) methods against coreset selection (CS) strategies using standardized protocols on ImageNet-1K, ImageNet100, and ImageNette. Results show that some DD methods fail to beat random subsets, and SOTA DD approaches are comparable to or worse than coresets on large-scale datasets while incurring substantially higher construction costs. The paper also finds coresets achieve better coverage of the original data distribution in terms of representativeness and diversity, challenging the prevailing assumption that synthetic samples are inherently more expressive than real-data subsets.
Researchers introduce MADreMIA, a model-agnostic framework for membership inference attacks (MIA) and dataset inference (DI) that uses iterative chained regeneration across modalities rather than shadow model training. The key insight is that memorized training samples exhibit higher coherence and slower degradation under repeated regeneration than non-member samples, yielding stronger membership signals at low false positive rates. The framework is evaluated across image autoregressive models, diffusion models, language models, and audio models, supporting white-, gray-, and black-box threat models. This work advances privacy auditing and copyright enforcement capabilities for large generative models.
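The chained-regeneration idea can be illustrated with a toy sketch: feed a sample through a generator repeatedly, track how far each regeneration drifts from the original, and treat slow drift as a membership signal. Everything here is an assumption made for illustration: `regenerate` and `distance` are hypothetical callables supplied by the caller, and the scoring rule (negative mean drift) is a stand-in, not the statistic the paper uses.

```python
def chained_distances(sample, regenerate, distance, steps=5):
    """Regenerate `sample` `steps` times in a chain, feeding each output back in.

    Returns the distance from the ORIGINAL sample after each step.
    `regenerate` and `distance` are hypothetical, caller-supplied functions.
    """
    distances = []
    current = sample
    for _ in range(steps):
        current = regenerate(current)
        distances.append(distance(sample, current))
    return distances


def membership_score(distances):
    """Higher score = slower degradation = more member-like (assumed rule)."""
    return -sum(distances) / len(distances)


def l1(a, b):
    """L1 distance between two equal-length numeric sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def make_drift(step):
    """Toy 'generator' that shifts every value by `step` per regeneration."""
    return lambda xs: [x + step for x in xs]


sample = [0.0, 1.0, 2.0]
# A memorized sample is modeled as drifting slowly, a non-member as drifting fast.
member_like = membership_score(chained_distances(sample, make_drift(0.01), l1))
non_member_like = membership_score(chained_distances(sample, make_drift(0.5), l1))
```

With a real generative model, `regenerate` would encode and resample the input, and a decision threshold on the score would be calibrated to the target false positive rate.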
For subject-driven image generation, this paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images. The MLLM conditioning is augmented with VAE-based identity conditioning, and the combination targets copy-paste artifacts and identity-preservation failures. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.
OLIVE (Online Latent prediction with Invariant Views and rEconstruction) is a new self-supervised speech representation learning framework that combines view-augmented masked latent prediction with waveform reconstruction under a unified objective. The reconstruction objective constrains early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance. The authors report improvements on generation and speaker tasks, competitive performance on recognition and semantic tasks, and better waveform reconstruction compared to prior SSL approaches.
Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.
Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.
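Modality masking of the kind described above can be sketched as a training-time transform that occasionally blanks one input stream so the model learns to cope when a sensor is missing. This is a minimal sketch under stated assumptions: the function name, the zero-fill convention, the `p_mask` parameter, and the rule that at most one modality is masked per example are all hypothetical choices, not details taken from the paper.

```python
import random


def mask_modalities(emg, video, p_mask=0.3, rng=random):
    """Randomly zero out at most one modality's features.

    With probability p_mask / 2 the sEMG features are zeroed, with
    probability p_mask / 2 the video features are zeroed, and otherwise
    both are passed through. At least one modality always survives.
    """
    if not 0.0 <= p_mask <= 1.0:
        raise ValueError("p_mask must be in [0, 1]")
    r = rng.random()
    if r < p_mask / 2:
        emg = [0.0] * len(emg)      # simulate a failed sEMG sensor
    elif r < p_mask:
        video = [0.0] * len(video)  # simulate a missing video stream
    return emg, video
```

Passing a seeded `random.Random` instance as `rng` makes the masking reproducible across runs.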
Researchers propose FedReLa, a data-level method for federated learning that addresses the coexistence of global class imbalance and cross-client data heterogeneity. The approach uses a feature-dependent label re-allocator to correct biased global decision boundaries without requiring knowledge of the global class distribution. FedReLa is model-agnostic and modular: it integrates with existing algorithmic methods without additional communication overhead, and the authors report state-of-the-art results on stepwise-imbalanced and long-tailed datasets.