Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
Recent events (1)
Interpretability-based pipeline for auditing and shaping post-training learning signals
Researchers introduce a data-centric post-training pipeline that applies interpretability methods to preference datasets before optimization, surfacing latent concepts that separate preferred from dispreferred generations. The approach unifies several interpretability-based training protocols as feature or data interventions that shape reward signals. Empirically, the pipeline diagnoses undesirable signals such as sycophancy and over-stylization, mitigates off-target learning, and can amplify desired properties like safety behaviors and model personality. The work reframes post-training from opaque scalar reward optimization into an auditable, concept-level sculpting process.
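To make the auditing step concrete, here is a minimal sketch of how concept-level differences between preferred and dispreferred generations might be surfaced, and how a simple data intervention could follow. It is an illustration under stated assumptions, not the authors' pipeline: it assumes each response has already been mapped to a vector of interpretable feature activations (for example by some feature extractor), and the function names, feature labels, and numbers are all hypothetical toy values.

```python
from statistics import mean

def rank_separating_features(chosen, rejected, feature_names):
    """Rank features by the mean activation gap between paired responses.

    chosen[i] and rejected[i] are feature-activation vectors for the
    preferred and dispreferred response of preference pair i.
    A large positive gap means the preference label rewards that feature.
    """
    gaps = []
    for j, name in enumerate(feature_names):
        gap = mean(c[j] - r[j] for c, r in zip(chosen, rejected))
        gaps.append((name, gap))
    # Largest absolute gap first: these concepts dominate the learning signal.
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)

def keep_pairs_below_gap(chosen, rejected, feature_index, max_gap):
    """Data intervention: keep only pairs whose gap on one flagged feature
    is at most max_gap, returning the indices of the retained pairs."""
    return [
        i
        for i, (c, r) in enumerate(zip(chosen, rejected))
        if c[feature_index] - r[feature_index] <= max_gap
    ]

# Hypothetical feature labels and made-up toy activations (not real data).
feature_names = ["sycophancy", "verbosity", "helpfulness"]
chosen = [[1.0, 0.5, 0.75], [0.75, 0.25, 0.5], [1.0, 0.5, 0.25]]
rejected = [[0.0, 0.5, 0.5], [0.25, 0.25, 0.5], [0.0, 0.25, 0.5]]

ranking = rank_separating_features(chosen, rejected, feature_names)
kept = keep_pairs_below_gap(chosen, rejected, feature_index=0, max_gap=0.5)

for name, gap in ranking:
    print(f"{name}: mean gap {gap:+.3f}")
print("pairs kept after filtering on sycophancy:", kept)
```

In this toy setup the top-ranked feature is the one an auditor would inspect first as a possibly undesirable signal. Filtering pairs is only one kind of data intervention; the summary above also mentions feature-level interventions, which this sketch does not cover.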