SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
SafeSteer is a safety alignment method that targets only the 'safety tokens' in the output distribution instead of applying global fine-tuning, on the argument that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to the selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation while requiring only 100 harmful samples, less than 1% of the data used by prior baselines.
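As a rough illustration of a reverse KL penalty restricted to selected tokens, the sketch below computes KL(student || teacher) per sequence position and averages it over flagged positions only. It is a minimal sketch under stated assumptions, not the SafeSteer implementation: the summary does not say how safety tokens are selected, so the boolean `safety_mask` is taken as given, and the function names, array shapes, and use of NumPy are all hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def masked_reverse_kl(student_logits, teacher_logits, safety_mask):
    """Mean reverse KL, KL(student || teacher), over masked positions only.

    student_logits, teacher_logits: arrays of shape (seq_len, vocab_size).
    safety_mask: boolean array of shape (seq_len,). ASSUMPTION: True marks
    a position treated as a 'safety token'; the selection rule is not
    described in the summary above.
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    # Reverse KL takes the expectation under the student distribution.
    per_position = (np.exp(log_p_student) * (log_p_student - log_p_teacher)).sum(axis=-1)
    mask = safety_mask.astype(per_position.dtype)
    selected = mask.sum()
    if selected == 0:
        return 0.0  # no safety positions, so no penalty
    return float((per_position * mask).sum() / selected)


# Toy usage: the teacher differs from the student at position 2 only.
rng = np.random.default_rng(0)
student = rng.normal(size=(5, 8))
teacher = student.copy()
teacher[2] += rng.normal(size=8)
mask = np.array([False, False, True, False, False])
penalty = masked_reverse_kl(student, teacher, mask)
```

Positions outside the mask contribute nothing to the penalty, which is the sense in which the update stays localized in this sketch.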
Related events (8)
ALIGNBEAM: Training-free safety alignment transfer across model families at inference time
ALIGNBEAM is a training-free inference-time method that transfers safety alignment from a safe anchor model to a domain-fine-tuned target model, even when the two models have different vocabularies. It works by translating anchor logits into the target model's vocabulary token-by-token at each decoding step, then using a small LLM judge to select the safest among K candidate continuations. The method addresses a known vulnerability where domain fine-tuning degrades safety, and demonstrates substantial refusal improvements on adversarial benchmarks without retraining either model or incurring prohibitive inference overhead.
Deliberative Alignment: Reasoning Enables Safer Language Models
OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.
Activation Steering for Synthetic Safety Data Generation: Diversity as a Critical Quality Axis
This paper investigates whether activation steering (AS) can generate high-quality synthetic training data for downstream safety detection classifiers, filling a gap in the literature. Across 4 safety concepts × 2 models × 4 steering methods, the authors find that AS-generated data outperforms prompt-generated data on 3 of 4 concepts, but only 41 of 136 configurations succeed, indicating a narrow effective regime. The study introduces sample- and set-level diversity as a previously absent quality axis, finding that higher steering strength reduces diversity and that the harmonic mean of success, coherence, and diversity correlates more reliably with downstream AUROC than prior metrics alone. The results provide a practical heuristic for practitioners tuning AS hyperparameters for safety data generation.
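The composite score mentioned above, the harmonic mean of success, coherence, and diversity, can be sketched as follows. This is a minimal sketch assuming each component is a score in [0, 1]; the function name and the zero-handling convention are hypothetical and not taken from the paper.

```python
def composite_quality(success, coherence, diversity):
    """Harmonic mean of three quality scores.

    ASSUMPTION: each score lies in [0, 1]. If any score is zero the
    composite is defined here as zero, since the harmonic mean is
    dominated by its weakest component.
    """
    scores = (success, coherence, diversity)
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# A configuration with strong success and coherence but low diversity
# scores well below its arithmetic mean.
low_diversity = composite_quality(0.9, 0.8, 0.2)
arithmetic_mean = (0.9 + 0.8 + 0.2) / 3
```

Unlike an arithmetic mean, the harmonic mean cannot be pushed up by maximizing one axis at the expense of another, which is consistent with the finding that stronger steering can reduce diversity.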
Self-Policy Distillation via Capability-Selective Subspace Projection
This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.
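The general idea of extracting a low-rank subspace and projecting activations into it can be sketched with a truncated SVD. This is an illustrative sketch only, not the SPD procedure: random matrices stand in for the model's gradients and KV activations, and the function names, shapes, and choice of rank are all hypothetical.

```python
import numpy as np


def capability_basis(gradients, rank):
    """Return an orthonormal basis (d, rank) for the top-`rank` directions.

    gradients: array of shape (n_samples, d). ASSUMPTION: each row is a
    flattened gradient collected on tokens of interest; here it is just
    placeholder data.
    """
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(gradients, full_matrices=False)
    return vt[:rank].T


def project(activations, basis):
    """Orthogonally project activations (n, d) onto the span of `basis`."""
    return activations @ basis @ basis.T


rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 16))        # stand-in for collected gradients
activations = rng.normal(size=(10, 16))  # stand-in for KV activations
basis = capability_basis(grads, rank=4)
projected = project(activations, basis)
```

Because the basis has orthonormal columns, projecting twice gives the same result as projecting once, so the operation discards components outside the subspace without rescaling what remains inside it.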
SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLMs via RL-Driven Prompt Optimisation
SafeCtrl-RL is a framework for controlling LLM safety at inference time without retraining or modifying model parameters. It formulates dialogue generation as a sequential decision process where an RL agent dynamically selects prompt adjustment strategies based on contextual feedback, iteratively suppressing unsafe outputs. The authors frame this as 'inference-time behavioural unlearning' and report improvements in safety and response quality across multiple LLMs and unsafe dialogue scenarios, outperforming existing prompt-based optimisation baselines.
STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training
Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.
Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation
A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.
Improving Model Safety Behavior with Rule-Based Rewards
OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This represents a practical contribution to alignment methodology from a Tier 1 lab.


