Interpretability-based pipeline for auditing and shaping post-training learning signals
Researchers introduce a data-centric post-training pipeline that applies interpretability methods to preference datasets before optimization, surfacing latent concepts that separate preferred from dispreferred generations. The approach unifies several interpretability-based training protocols as feature or data interventions that shape reward signals. Empirically, the pipeline diagnoses undesirable signals such as sycophancy and over-stylization, mitigates off-target learning, and can amplify desired properties like safety behaviors and model personality. The work reframes post-training from opaque scalar reward optimization into an auditable, concept-level sculpting process.
Related events (8)
Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation
A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.
Interpretable Machine Learning Through Teaching
OpenAI published a method in 2018 that trains AI systems to teach each other using examples that are also interpretable to humans. The approach automatically selects maximally informative examples to convey a concept, such as representative images for a category like 'dogs'. Experiments showed the method effective at teaching both AI systems and humans, bridging machine learning interpretability with pedagogical example selection.
Influcoder: Distilling gradient influence rankings into an encoder for scalable data attribution
Influcoder is a proposed method for scalable data attribution in LLM training that distills decoder-based gradient influence rankings into a compact encoder representation. The approach targets the practical bottleneck of influence-function methods, namely their high computational cost and storage requirements, to make them viable for large-scale dataset curation. The work is relevant to training data quality filtering and to identifying sources of undesirable model behavior such as toxicity.
ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models
Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.
GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment
Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly; the authors claim improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.
Language models linearly encode a 'value axis' tracking expected goal success, study finds
Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.
Reinforcement Learning Recruits a Pre-Existing 'Functional Welfare' Axis in Language Models
Researchers trained language models in a semantically neutral maze environment and extracted concept vectors for rewarded and punished trajectories, finding that RL recruits a pre-existing representational axis encoding functional welfare, that is, how well or badly the system is doing relative to its goals. The punishment vector promotes failure tokens, aligns with negative emotion concepts, and induces refusal and uncertainty when used for steering; the reward vector is its near-antiparallel mirror. Critically, these vectors are effective in models before maze training and appear in pretrain-only models, suggesting the welfare axis predates post-training rather than being created by it. The findings have implications for interpretability, alignment, and understanding how minimal reward signals can broadly reshape model behavior.
RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training
Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.


