Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation
A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.
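One way to make the "largely disjoint populations" finding concrete is to measure how much the two gates' rejection sets overlap. The following is a minimal sketch, not the authors' method: the function name `rejection_overlap`, the use of Jaccard overlap as the metric, and the sample IDs are all assumptions made for illustration, and the toy data is not drawn from the study.

```python
def rejection_overlap(hallucination_rejects, reward_rejects):
    """Jaccard overlap between the sample IDs rejected by two gates.

    Hypothetical helper: returns 0.0 when the gates share no rejections
    and 1.0 when they reject exactly the same samples.
    """
    a, b = set(hallucination_rejects), set(reward_rejects)
    union = a | b
    if not union:
        # Neither gate rejected anything, so there is no overlap to report.
        return 0.0
    return len(a & b) / len(union)


# Toy sample IDs, invented for illustration only (not data from the study).
hallucination_rejects = {"s1", "s2", "s3", "s4"}
reward_rejects = {"s4", "s5", "s6", "s7"}

overlap = rejection_overlap(hallucination_rejects, reward_rejects)

# Samples caught by exactly one gate; these would slip through if the
# gate that caught them were removed.
only_one_gate = hallucination_rejects ^ reward_rejects
```

Under this sketch, a low overlap score together with a large `only_one_gate` set would indicate that each gate catches failures the other misses, which is the situation the summary describes as making both gates necessary.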