Less is More: Quality-Aware Training Data Selection for Scientific Summarization

paper · active · provisional · less-is-more-quality-aware-training-data-selection-for-scientific-summarization-ebfa935f · 1 event · first seen 4d ago

Recent events (1)

4 · arXiv · cs.CL · 4d ago

Quality-aware training data selection improves scientific summarization on a 1.88M-article PMC dataset

Researchers construct and release one of the largest biomedical long-document summarization datasets, comprising 1.88 million PubMed Central (PMC) articles, and analyze how well author-written abstracts serve as gold reference summaries. Using source-grounded and model-based metrics, they show that abstract quality varies substantially and that training on high-quality subsets outperforms random sampling at matched sizes. The work demonstrates that quality-aware data selection can match or exceed larger random training sets on factuality-oriented metrics, suggesting reference quality is a key lever for training efficiency in scientific summarization.
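
The comparison described above, a quality-ranked subset against a random subset of the same size, can be sketched roughly as follows. This is only an illustration, not the paper's pipeline: the summary does not say which source-grounded or model-based metrics were used, so `coverage_score` is a hypothetical stand-in (the share of abstract tokens that also occur in the article text), and the function names and toy data are assumptions.

```python
import random


def coverage_score(source: str, abstract: str) -> float:
    """Hypothetical source-grounded proxy: share of abstract tokens found in the source."""
    source_tokens = set(source.lower().split())
    abstract_tokens = abstract.lower().split()
    if not abstract_tokens:
        return 0.0
    return sum(t in source_tokens for t in abstract_tokens) / len(abstract_tokens)


def select_top_quality(pairs, k):
    """Keep the k (article, abstract) pairs whose abstracts score highest."""
    return sorted(pairs, key=lambda p: coverage_score(p[0], p[1]), reverse=True)[:k]


def select_random(pairs, k, seed=0):
    """Size-matched baseline: k pairs drawn uniformly at random."""
    return random.Random(seed).sample(pairs, k)


# Toy (article, abstract) pairs, invented for illustration only.
pairs = [
    ("the drug reduced tumor size in mice", "the drug reduced tumor size"),
    ("we measured blood pressure in adults", "a novel cure for hypertension"),
    ("gene x regulates cell growth", "gene x regulates growth"),
    ("samples were collected weekly", "results revolutionize sampling"),
]

top = select_top_quality(pairs, 2)   # abstracts fully supported by their articles
baseline = select_random(pairs, 2)   # same size, no quality filter
```

In a real experiment both subsets would be used to fine-tune the same summarization model, and the resulting models would then be compared on factuality-oriented metrics; a token-overlap proxy like this one would likely be far too crude for that purpose.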