Quality-aware training data selection improves scientific summarization on a 1.88M-article PMC dataset
Researchers construct and release one of the largest biomedical long-document summarization datasets, comprising 1.88 million PMC articles, and analyze the quality of author-written abstracts as gold reference summaries. Using source-grounded and model-based metrics, they show that abstract quality varies substantially and that training on high-quality subsets outperforms random sampling at matched sizes. The work demonstrates that quality-aware data selection can match or exceed larger random training sets on factuality-oriented metrics, suggesting reference quality is a key lever for training efficiency in scientific summarization.
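The selection idea can be illustrated with a minimal sketch that compares a quality-ranked subset against a random subset of the same size. Everything here is an assumption for illustration: the function names, the `article`/`abstract` record fields, and especially `grounding_score`, which is a crude token-overlap proxy and not one of the source-grounded or model-based metrics used in the work.

```python
import random
import re


def token_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(article, abstract):
    """Hypothetical quality proxy: the fraction of distinct abstract tokens
    that also appear in the article body. A stand-in, not the paper's metric."""
    abstract_tokens = token_set(abstract)
    if not abstract_tokens:
        return 0.0
    return len(abstract_tokens & token_set(article)) / len(abstract_tokens)


def select_top_quality(pairs, k):
    """Keep the k article/abstract pairs whose abstracts score highest."""
    ranked = sorted(
        pairs,
        key=lambda p: grounding_score(p["article"], p["abstract"]),
        reverse=True,
    )
    return ranked[:k]


def select_random(pairs, k, seed=0):
    """Size-matched baseline: k pairs sampled uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)
```

Under these assumptions, a comparison would train one model on `select_top_quality(pairs, k)` and another on `select_random(pairs, k)` with the same `k`, then evaluate both on the same held-out set so that any difference reflects reference quality and not training set size.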
Related events (8)
Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3
This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.
MetaSyn benchmark reveals critical screening bottleneck in LLM-based meta-analysis pipelines
Researchers introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, paired with a 140k-article PubMed retrieval corpus, PI/ECO criteria, verified positives, and hard negatives. Benchmarking twelve pipeline configurations, including nine RAG variants and a protocol-driven agent, shows that despite 90.9% retrieval recall at K=200, no system recovers more than 52.7% of ground-truth included studies. The core failure is LLMs' inability to reliably distinguish eligible studies from topically similar but criteria-failing distractors. The paper argues that end-to-end scores obscure where pipelines break down and proposes stage-attributed metrics.
Training-free mixture-of-agents framework combines LLMs and knowledge graphs for multi-document summarization
A new arXiv preprint proposes a training-free multi-agent framework for multi-document summarization (MDS) that decomposes the task into specialized agents for extractive selection, knowledge-aware abstraction, and iterative refinement, unified via a multi-perspective consistency mechanism. The system integrates LLMs with knowledge graphs without task-specific fine-tuning. Experiments across four datasets in English and Vietnamese show state-of-the-art or competitive performance, with the authors emphasizing cross-domain and cross-lingual generalization.
Unified defense framework detects and remediates data poisoning in text summarization fine-tuning
A new arXiv preprint introduces a post-hoc defense framework for detecting and recovering from training-time data poisoning in LLMs fine-tuned for abstractive summarization. The framework uses influence-function analysis in white-box settings and behavioral perturbation auditing in black-box settings, achieving 85-92% detection precision across nine architectures and six benchmarks. Gradient-ascent unlearning restores up to 96% of original model behavior with less than 0.6% ROUGE degradation. The authors also introduce novel attacks targeting factual distortion and representational bias that evade conventional evaluation metrics.
Learning to Summarize with Human Feedback
OpenAI published research applying reinforcement learning from human feedback (RLHF) to train language models for improved summarization quality. The work demonstrated that models trained with human preference signals outperform those trained purely on supervised objectives for summarization tasks. This paper is an early foundational contribution to the RLHF methodology that later became central to aligning large language models.
Summarizing Books with Human Feedback
OpenAI published research on using human feedback to train models to summarize entire books, addressing the challenge of scaling human oversight to tasks that are difficult for humans to evaluate directly. The work explores recursive task decomposition, where models summarize smaller chunks and then summarize those summaries, with humans providing feedback at each level. This represents an early concrete application of scalable oversight techniques to long-document understanding.
Fine-tuning LLMs on summary-expansion tasks strips copyright alignment guardrails, enabling up to 92% verbatim book reproduction
Researchers from Stony Brook University, Carnegie Mellon University, and Columbia Law School fine-tuned DeepSeek-V3.1, Gemini 2.5 Pro, and GPT-4o on a task of expanding plot summaries into prose paragraphs, finding that this caused models to regurgitate up to 91.9% of verbatim text from books in their pretraining data. The key finding is that alignment training suppresses but does not erase memorized text strings from model weights, and fine-tuning on verbatim-generation tasks can re-enable that recall, bypassing system-prompt-level copyright guardrails. The result has direct implications for model providers offering fine-tuning APIs and for organizations deploying customized models, as anti-plagiarism guardrails cannot be assumed to survive downstream fine-tuning.
OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training
Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating large vision-language models (LVLMs) on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

