Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

paper · active · provisional · anatomy-of-post-training-using-interpretability-to-characterize-data-and-shape-the-learning-signal-86faf398 · 1 event · first seen 6d ago

Recent events (1)

7 · arXiv · cs.LG · 6d ago

Interpretability-based pipeline for auditing and shaping post-training learning signals

Researchers introduce a data-centric post-training pipeline that applies interpretability methods to preference datasets before optimization, surfacing latent concepts that separate preferred from dispreferred generations. The approach unifies several interpretability-based training protocols as feature or data interventions that shape reward signals. Empirically, the pipeline diagnoses undesirable signals such as sycophancy and over-stylization, mitigates off-target learning, and can amplify desired properties like safety behaviors and model personality. The work reframes post-training from opaque scalar reward optimization into an auditable, concept-level sculpting process.
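
As a rough illustration of what a concept-level audit of a preference dataset might look like, the sketch below ranks features by how strongly they separate preferred from dispreferred responses, then applies a simple data intervention that drops pairs whose preference leans on an unwanted feature. It is not the paper's method: the feature names, the toy activations, the mean-gap score, and the helpers `rank_separating_features` and `drop_pairs_favoring` are all assumptions made for this example. In practice the activations would presumably come from an interpretability tool applied to each response.

```python
from statistics import mean

def rank_separating_features(chosen, rejected, names):
    """Rank features by mean activation gap (chosen minus rejected) over all pairs."""
    gaps = []
    for j, name in enumerate(names):
        gap = mean(c[j] - r[j] for c, r in zip(chosen, rejected))
        gaps.append((name, gap))
    # Largest absolute gap first: these features best separate the two sides.
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)

def drop_pairs_favoring(chosen, rejected, names, feature, threshold):
    """Hypothetical data intervention: return indices of pairs to keep,
    i.e. pairs whose gap on `feature` does not exceed `threshold`."""
    j = names.index(feature)
    return [
        i for i, (c, r) in enumerate(zip(chosen, rejected))
        if c[j] - r[j] <= threshold
    ]

# Toy, made-up activations: one row per preference pair, one column per feature.
names = ["sycophancy", "safety", "verbosity"]
chosen = [[0.9, 0.2, 0.5], [0.8, 0.3, 0.4], [0.1, 0.9, 0.5]]
rejected = [[0.1, 0.2, 0.5], [0.2, 0.1, 0.4], [0.1, 0.2, 0.5]]

ranking = rank_separating_features(chosen, rejected, names)
kept = drop_pairs_favoring(chosen, rejected, names, "sycophancy", 0.5)

for name, gap in ranking:
    print(f"{name}: {gap:+.2f}")
print("pairs kept:", kept)
```

In this toy data the sycophancy feature separates the two sides most strongly, so filtering on it removes the first two pairs and keeps the one whose preference is driven by the safety feature. A feature intervention would instead act on the reward or training signal rather than on which pairs are kept.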