paper
Less is More: Quality-Aware Training Data Selection for Scientific Summarization
paper · active · provisional
less-is-more-quality-aware-training-data-selection-for-scientific-summarization-ebfa935f · 1 event · first seen 4d ago
Aliases: Less is More: Quality-Aware Training Data Selection for Scientific Summarization
More like this (12)
A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs
Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning
clinical text summarization
Learning to Summarize with Human Feedback
hierarchical summarization
quantization-aware training
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness
From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models
Recursive Summarization
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
Recent events (1)
Quality-aware training data selection improves scientific summarization on a 1.88M-article PMC dataset
Researchers construct and release one of the largest biomedical long-document summarization datasets, comprising 1.88 million PubMed Central (PMC) articles, and analyze how well author-written abstracts serve as gold reference summaries. Using source-grounded and model-based metrics, they show that abstract quality varies substantially and that training on high-quality subsets outperforms random sampling at matched sizes. The work demonstrates that quality-aware data selection can match or exceed larger random training sets on factuality-oriented metrics, suggesting that reference quality is a key lever for training efficiency in scientific summarization.
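The comparison described above (a quality-ranked subset versus a random subset of the same size) can be sketched as follows. This is a minimal illustration, not the paper's method: the summary does not say which metrics were used, so `grounding_score` is a hypothetical stand-in (the share of abstract tokens that also occur in the source), and the function names and toy data are assumptions made for the example.

```python
import random

def grounding_score(source: str, abstract: str) -> float:
    """Hypothetical proxy metric (an assumption, not the paper's):
    fraction of abstract tokens that also occur in the source text."""
    source_tokens = set(source.lower().split())
    abstract_tokens = abstract.lower().split()
    if not abstract_tokens:
        return 0.0
    hits = sum(token in source_tokens for token in abstract_tokens)
    return hits / len(abstract_tokens)

def select_top_k(pairs, k):
    """Keep the k (source, abstract) pairs with the highest proxy score."""
    ranked = sorted(pairs, key=lambda p: grounding_score(*p), reverse=True)
    return ranked[:k]

def select_random(pairs, k, seed=0):
    """Matched-size random baseline, seeded for reproducibility."""
    return random.Random(seed).sample(pairs, k)

# Toy (source, abstract) pairs, invented for illustration only.
pairs = [
    ("the drug reduced blood pressure in adult patients",
     "the drug reduced blood pressure"),
    ("we measured enzyme activity in liver tissue",
     "a revolutionary cure for all disease"),
    ("gene expression was profiled in mouse cells",
     "gene expression in mouse cells was altered"),
]

high_quality = select_top_k(pairs, k=2)  # quality-aware subset
baseline = select_random(pairs, k=2)     # random subset of the same size
```

In this toy setup the poorly grounded second pair is dropped from the quality-aware subset, while the random baseline may or may not include it. A real pipeline would replace the token-overlap proxy with whichever source-grounded or model-based metric is being studied.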