5 · arXiv · cs.LG (Machine Learning) · 3d ago

Large-scale benchmarking finds dataset distillation methods fail to outperform coresets on ImageNet-scale tasks

A new arXiv paper critically evaluates seven state-of-the-art dataset distillation (DD) methods against coreset selection (CS) strategies using standardized protocols on ImageNet-1K, ImageNet100, and ImageNette. Results show that some DD methods fail to beat random subsets, and that state-of-the-art DD approaches are comparable to or worse than coresets on large-scale datasets while incurring substantially higher construction costs. The paper also finds that coresets cover the original data distribution better in terms of representativeness and diversity, challenging the prevailing assumption that synthetic samples are inherently more expressive than real-data subsets.
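
To make the coreset-versus-random-subset comparison concrete, here is a minimal sketch of one well-known coreset heuristic, greedy k-center selection, next to a uniformly random subset. It is an illustration only: the summary does not say which coreset strategies the paper evaluates, and the 2-D Gaussian "features", the function names, and the coverage-radius measure are all assumptions made for this example.

```python
import math
import random


def k_center_greedy(points, k, seed=0):
    """Greedy k-center: repeatedly add the point that is currently worst covered."""
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return selected


def coverage_radius(points, selected):
    """Largest distance from any point to its nearest selected point (lower is better)."""
    return max(min(math.dist(p, points[s]) for s in selected) for p in points)


# Hypothetical stand-in for image features: 500 points from a 2-D Gaussian.
rng = random.Random(0)
features = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]

coreset = k_center_greedy(features, k=20)
random_subset = rng.sample(range(len(features)), 20)

print("k-center coverage radius:", coverage_radius(features, coreset))
print("random coverage radius:  ", coverage_radius(features, random_subset))
```

Coverage radius is only one rough proxy for how well a subset represents the full set; the paper's own representativeness and diversity measures may be defined differently.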

Training Infrastructure · Evaluation and Benchmarking · Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets? · ImageNette · ImageNet · ImageNet100

Related guides (2)

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.AI · 10d ago

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.

Evaluation and Benchmarking · Alignment and RLHF · GRPO · The Role of Feedback Alignment in Self-Distillation

4 · arXiv · cs.CL · 5d ago

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

Evaluation and Benchmarking · MoDiCoL

5 · arXiv · cs.CL · 2d ago

RECOM benchmark reveals validity-discrimination tradeoff in automatic metrics for open-ended QA

Researchers introduce RECOM, a contamination-free evaluation dataset of 15,000 r/AskReddit questions paired with authentic community replies postdating all evaluated models' training cutoffs. Testing five open-source 7–10B LLMs, the paper finds that no standard automatic metric (cosine similarity, BERTScore, LLM judges) simultaneously achieves both validity (distinguishing real from random answers) and discriminative power (ranking models against each other). Cosine similarity is valid but cannot rank models; BERTScore's apparent ranking collapses when response length is controlled. The authors argue this tradeoff is a structural property of metric representation design and recommend reporting metrics on both axes with an explicit random-baseline floor.
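
The recommendation to report an explicit random-baseline floor can be sketched as follows. This is a toy illustration, not the paper's protocol: the bag-of-words cosine stands in for an embedding-based metric, the "floor" is computed by pairing each model answer with the reply to a different question, and the example questions, answers, and function names are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Bag-of-words cosine similarity between two strings (toy metric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def score_with_floor(model_answers, references):
    """Return (mean score on matched pairs, mean score on mismatched pairs)."""
    n = len(references)
    matched = sum(cosine(m, r) for m, r in zip(model_answers, references)) / n
    # Floor: pair each answer with the reply to a different question (shift by one).
    floor = sum(
        cosine(m, references[(i + 1) % n]) for i, m in enumerate(model_answers)
    ) / n
    return matched, floor


# Hypothetical community replies and model answers, one per question.
references = [
    "i would travel the world by train",
    "pizza with extra cheese every time",
    "my dog learned to open the fridge",
]
answers = [
    "i would travel by train around the world",
    "extra cheese pizza",
    "a dog that can open the fridge",
]

matched, floor = score_with_floor(answers, references)
print(f"matched={matched:.3f} floor={floor:.3f} margin={matched - floor:.3f}")
```

A metric that scores matched pairs well above the floor passes this kind of validity check; whether it can also rank models against each other is the separate, discriminative axis the summary describes.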

Evaluation and Benchmarking · BERTScore · RECOM · r/AskReddit

7 · arXiv · cs.AI · 1mo ago

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks—open-web search, evidence collection, and multi-step derivation—where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.

Frontier Model Releases · Evaluation and Benchmarking · deep research agents · DeepWeb-Bench · Retrieval-Augmented Generation

7 · arXiv · cs.AI · 23d ago

CORE: Contrastive Reflection for Sample-Efficient Reasoning Improvement

CORE (Contrastive Reflection) is a non-parametric learning algorithm that improves LLM reasoning by comparing successful and unsuccessful reasoning traces to generate compact natural-language 'insights' about reasoning strategies. Across four reasoning tasks, CORE outperforms both parametric baselines (GRPO/RLVR) and non-parametric baselines (GEPA, episodic RAG, MemRL) under fixed rollout budgets, achieving comparable or better gains with as few as five training samples. The method is also more context-efficient than prompt-optimization approaches, storing learned knowledge as interpretable natural-language descriptions rather than raw traces or weight updates. The results suggest contrastive distillation of reasoning traces may be a more efficient route to self-improvement than traditional fine-tuning.

Evaluation and Benchmarking · Inference Economics · RLVR · GRPO · CORE (Contrastive Reflection)

5 · arXiv · cs.AI · 25d ago

WSADBench: A Unified Benchmark for Weakly Supervised Anomaly Detection

WSADBench is the first benchmark to unify evaluation across the three primary weakly supervised anomaly detection (WSAD) paradigms—incomplete, inexact, and inaccurate supervision—testing 36 algorithms across 4 modalities with over 700K experiments. Key findings challenge the isolation of current WSAD research directions, showing strong correlations between supervision scenarios and that specialized WSAD methods are quickly outperformed by tabular foundation models as label availability increases. The benchmark also reveals inconsistent utility of unlabeled data and asymmetric model sensitivity to label noise types. Code and datasets are released open-source.

Evaluation and Benchmarking · WSADBench · weakly supervised anomaly detection · SUFE-AILAB

6 · arXiv · cs.LG · 26d ago

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
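
One common way to mix a language modeling loss with a distillation loss is a weighted sum of next-token cross-entropy and the KL divergence from teacher to student. The sketch below shows that textbook form for a single token position; the paper's exact objective and weighting are not given in this summary, so `mixed_loss`, `alpha`, and the toy logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_loss(student_logits, teacher_logits, target, alpha=0.5):
    """(1 - alpha) * LM cross-entropy + alpha * KL(teacher || student), one token."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    lm = -math.log(p_student[target])  # cross-entropy against the true next token
    kd = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return (1 - alpha) * lm + alpha * kd


# Toy 4-token vocabulary; the true next token is index 2.
student = [0.1, 0.3, 1.2, -0.5]
teacher = [0.0, 0.9, 1.0, -1.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, mixed_loss(student, teacher, target=2, alpha=alpha))
```

With `alpha=0.0` the student trains on the data alone, and with `alpha=1.0` it only imitates the teacher; intermediate values give the kind of mixture the summary refers to.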

Training Infrastructure · Frontier Model Releases · knowledge distillation · Language Modeling Loss · Weak-to-Strong Distillation

5 · arXiv · cs.LG · 1mo ago

AUDITS: A Comprehensive Benchmark for Image Manipulation Localization Across Multiple Analysis Axes

Researchers introduce AUDITS (Analysis Under Domain-shifts, qualIty, Type, and Size), a benchmark of over 530K images designed to evaluate image manipulation detection across multiple axes including domain shift, manipulation type, and size. The dataset draws from user and news photos and incorporates recent diffusion-based inpaintings. Experiments assess the robustness of existing manipulation detection methods under various domain shifts, aiming to advance development of more generalizable detection approaches.

Evaluation and Benchmarking · AI Safety Research · AUDITS · image manipulation detection · image manipulation localization