4 · arXiv cs.LG (Machine Learning) · 20h ago

Theoretical analysis shows MIM more robust than contrastive learning in distributed self-supervised learning under non-IID data

A new arXiv preprint provides a rigorous theoretical analysis of distributed self-supervised learning (D-SSL) frameworks under non-IID (heterogeneous) data conditions. The key findings are that Masked Image Modeling (MIM) is inherently more robust to data heterogeneity than Contrastive Learning (CL), and that federated learning is no less robust than fully decentralized learning due to network connectivity effects. The authors also introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization, validated across multiple architectures and distributed settings.
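The summary does not state the MAR loss formula, so the following is only a rough sketch of the general idea it describes: a masked-reconstruction objective plus a penalty that pulls a client's local representations toward those of the global model. The function names, the cosine-based penalty, and the weight `lam` are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def mim_loss(pred, target, mask):
    """Mean squared reconstruction error over masked patches only.

    pred, target: arrays of shape (batch, patches, dim)
    mask: array of shape (batch, patches), 1.0 where a patch was masked
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / max(mask.sum(), 1.0))

def alignment_penalty(local_repr, global_repr, eps=1e-12):
    """Assumed regularizer: 1 - cosine similarity, averaged over the batch.

    local_repr, global_repr: arrays of shape (batch, dim)
    """
    num = (local_repr * global_repr).sum(axis=-1)
    den = (np.linalg.norm(local_repr, axis=-1)
           * np.linalg.norm(global_repr, axis=-1) + eps)
    return float((1.0 - num / den).mean())

def regularized_mim_loss(pred, target, mask, local_repr, global_repr, lam=0.1):
    """Hypothetical combined objective: MIM loss + lam * alignment penalty."""
    return mim_loss(pred, target, mask) + lam * alignment_penalty(local_repr, global_repr)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(2, 4, 8))
    pred = target + 0.1 * rng.normal(size=target.shape)
    mask = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]])
    local_repr = rng.normal(size=(2, 16))
    global_repr = local_repr + 0.05 * rng.normal(size=local_repr.shape)
    print(regularized_mim_loss(pred, target, mask, local_repr, global_repr))
```

In a distributed setting, each client would presumably compute the penalty against representations from the most recently received global model; how the paper actually does this is not specified in the summary.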

Training Infrastructure · Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data · MAR loss · Masked Image Modeling

Related guides (1)

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI


Related events (8)

6 · arXiv · cs.CL · May 19, 2026

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Large Language Models · Thinking-with-Images · on-policy self-distillation

5 · arXiv · cs.LG · Jun 17, 2026

Large-scale benchmarking finds dataset distillation methods fail to outperform coresets on ImageNet-scale tasks

A new arXiv paper critically evaluates seven state-of-the-art dataset distillation (DD) methods against coreset selection (CS) strategies using standardized protocols on ImageNet-1K, ImageNet100, and ImageNette. Results show that some DD methods fail to beat random subsets, and SOTA DD approaches are comparable to or worse than coresets on large-scale datasets while incurring substantially higher construction costs. The paper also finds coresets achieve better coverage of the original data distribution in terms of representativeness and diversity, challenging the prevailing assumption that synthetic samples are inherently more expressive than real-data subsets.

Training Infrastructure · Evaluation and Benchmarking · Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets? · ImageNette · ImageNet

6 · arXiv · cs.AI · 2d ago

MADreMIA: Chained Regeneration Framework for Amplifying Membership Inference Signals

Researchers introduce MADreMIA, a model-agnostic framework for membership inference attacks (MIA) and dataset inference (DI) that uses iterative chained regeneration across modalities rather than shadow model training. The key insight is that memorized training samples exhibit higher coherence and slower degradation under repeated regeneration than non-member samples, yielding stronger membership signals at low false positive rates. The framework is evaluated across image autoregressive models, diffusion models, language models, and audio models, supporting white-, gray-, and black-box threat models. This work advances privacy auditing and copyright enforcement capabilities for large generative models.

Evaluation and Benchmarking · AI Safety Research · Model Autophagy Disorder · MADreMIA

5 · arXiv · cs.LG · May 26, 2026

Squeezing Capacity from MLLMs for Subject-driven Image Generation via Dual Layer Aggregation

This paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, augmented with VAE-based identity conditioning to address copy-paste artifacts and identity preservation failures in subject-driven image generation. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.

Agent and Tool Ecosystem · Multimodal Progress · Multimodal Large Language Models · Dual Layer Aggregation (DLA) · Subject-driven Image Generation

4 · arXiv · cs.CL · 3d ago

OLIVE: Joint masked latent prediction and waveform reconstruction for self-supervised speech representation learning

OLIVE (Online Latent prediction with Invariant Views and rEconstruction) is a new self-supervised speech representation learning framework that combines view-augmented masked latent prediction with waveform reconstruction under a unified objective. The reconstruction objective constrains early encoder features to retain signal-level information, while masked latent prediction shapes later contextual representations toward invariance. The authors report improvements on generation and speaker tasks, competitive performance on recognition and semantic tasks, and better waveform reconstruction compared to prior SSL approaches.

Evaluation and Benchmarking · OLIVE

4 · arXiv · cs.CL · Jun 15, 2026

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

Evaluation and Benchmarking · MoDiCoL

4 · arXiv · cs.CL · Jun 9, 2026

Cross-modal masking framework improves silent speech synthesis from sEMG and lipreading

Researchers propose a masked multimodal speech synthesis framework that jointly trains on surface electromyography (sEMG) and video-based lipreading signals using modality masking to improve robustness to sensor failure or degradation. In multispeaker settings, the approach reduces word error rate by up to 14 absolute percentage points over the strongest unimodal baseline. Masking strategies outperform degradation-specific data augmentation for handling missing modalities, with phone-level analysis revealing complementary contributions across vowels and consonant groups.

Multimodal Progress · Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

4 · arXiv · cs.LG · Jun 25, 2026

FedReLa: Re-labeling approach for imbalanced federated learning under data heterogeneity

Researchers propose FedReLa, a data-level method for federated learning that addresses the coexistence of global class imbalance and cross-client data heterogeneity. The approach uses a feature-dependent label re-allocator to correct biased global decision boundaries without requiring knowledge of the global class distribution. FedReLa is model-agnostic and modular, integrating with existing algorithmic methods without additional communication overhead, and claims state-of-the-art results on stepwise-imbalanced and long-tailed datasets.

FedReLa
