Entity · paper

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

paper · active · anatomy-of-post-training-using-interpretability-to-characterize-data-and-shape-the-learning-signal-86faf398 · 1 event · first seen Jun 11, 2026

Aliases: Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

More like this (12)

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
neural network interpretability
Localized Adaptation Reveals Distinct Learning Signatures in Transformers
Understanding Reasoning from Pretraining to Post-Training
When to Align, When to Predict: A Phase Diagram for Multimodal Learning
interpretable machine learning
Post-Training Shifts Confidence: A Three-Stage Analysis of How SFT, RL, and OPD Shape Pre-, Intra-, and Post-CoT Calibration
generative post-training
Train the Model, Not the Reader: Decodability Supervision for Verifiable Activation Explanations
The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning
Preserving Plasticity in Continual Learning via Dynamical Isometry
Conservation Laws from Data Symmetry in Neural Networks

Recent events (1)

7 · arXiv · cs.LG · Jun 11, 2026

Interpretability-based pipeline for auditing and shaping post-training learning signals

Researchers introduce a data-centric post-training pipeline that applies interpretability methods to preference datasets before optimization, surfacing latent concepts that separate preferred from dispreferred generations. The approach unifies several interpretability-based training protocols as feature or data interventions that shape reward signals. Empirically, the pipeline diagnoses undesirable signals such as sycophancy and over-stylization, mitigates off-target learning, and can amplify desired properties like safety behaviors and model personality. The work reframes post-training from opaque scalar reward optimization into an auditable, concept-level sculpting process.
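The audit-then-intervene idea in the summary can be illustrated with a toy sketch. It assumes that some interpretability method has already assigned each response a per-concept activation score; the concept names, the scores, and the helper functions `concept_gaps` and `drop_pairs_favoring` are all hypothetical and are not taken from the paper. The sketch ranks concepts by how strongly they separate preferred from dispreferred responses, then applies one simple data intervention: dropping pairs in which the preferred response scores higher on an undesired concept.

```python
from statistics import mean


def concept_gaps(pairs):
    """Mean (chosen - rejected) activation per concept across preference pairs."""
    concepts = sorted({c for chosen, rejected in pairs for c in (*chosen, *rejected)})
    return {
        c: mean(chosen.get(c, 0.0) - rejected.get(c, 0.0) for chosen, rejected in pairs)
        for c in concepts
    }


def drop_pairs_favoring(pairs, concept, margin=0.0):
    """Data intervention: remove pairs whose chosen side scores higher on `concept`."""
    return [
        (chosen, rejected)
        for chosen, rejected in pairs
        if chosen.get(concept, 0.0) - rejected.get(concept, 0.0) <= margin
    ]


# Toy, hand-written activations (assumption); not data from the paper.
pairs = [
    ({"helpfulness": 0.9, "sycophancy": 0.8}, {"helpfulness": 0.4, "sycophancy": 0.1}),
    ({"helpfulness": 0.7, "sycophancy": 0.6}, {"helpfulness": 0.5, "sycophancy": 0.2}),
    ({"helpfulness": 0.8, "sycophancy": 0.1}, {"helpfulness": 0.3, "sycophancy": 0.3}),
]

# Audit: which concepts does the preference signal reward on average?
gaps = concept_gaps(pairs)

# Intervene: drop pairs that reward the undesired concept, then re-audit.
filtered = drop_pairs_favoring(pairs, "sycophancy")
filtered_gaps = concept_gaps(filtered)
```

In this toy data the unfiltered preference signal rewards sycophancy on average, and after filtering it no longer does. A real pipeline would also need to check that the intervention does not remove too much of the desired signal.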

Evaluation and Benchmarking · AI Safety Research · Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal