6arXiv cs.AI (Artificial Intelligence)·20d ago

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples—less than 1% of data used by prior baselines.

Evaluation and Benchmarking AI Safety Research Alignment and RLHF on-policy distillation SafeSteer alignment tax alignment tax Activation Steering reverse KL divergence

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·11d ago·source ↗

ALIGNBEAM: Training-free safety alignment transfer across model families at inference time

ALIGNBEAM is a training-free inference-time method that transfers safety alignment from a safe anchor model to a domain-fine-tuned target model, even when the two models have different vocabularies. It works by translating anchor logits into the target model's vocabulary token-by-token at each decoding step, then using a small LLM judge to select the safest among K candidate continuations. The method addresses a known vulnerability where domain fine-tuning degrades safety, and demonstrates substantial refusal improvements on adversarial benchmarks without retraining either model or incurring prohibitive inference overhead.

Inference Economics AI Safety Research ALIGNBEAM +1 more

7Openai Blog·1mo ago·source ↗

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.

Frontier Model Releases AI Safety Research Reinforcement Learning from Human Feedback OpenAI deliberative alignment +2 more

6arXiv · cs.CL·25d ago·source ↗

Activation Steering for Synthetic Safety Data Generation: Diversity as a Critical Quality Axis

This paper investigates whether activation steering (AS) can generate high-quality synthetic training data for downstream safety detection classifiers, filling a gap in the literature. Across 4 safety concepts × 2 models × 4 steering methods, the authors find that AS-generated data outperforms prompt-generated data on 3 of 4 concepts, but only 41 of 136 configurations succeed, indicating a narrow effective regime. The study introduces sample- and set-level diversity as a previously absent quality axis, finding that higher steering strength reduces diversity and that the harmonic mean of success, coherence, and diversity correlates more reliably with downstream AUROC than prior metrics alone. The results provide a practical heuristic for practitioners tuning AS hyperparameters for safety data generation.

Evaluation and Benchmarking AI Safety Research Safety Detection Classifier HHH (Helpful, Harmless, Honest)Activation Steering +3 more

6arXiv · cs.CL·1mo ago·source ↗

Self-Policy Distillation via Capability-Selective Subspace Projection

This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.

Frontier Model Releases Evaluation and Benchmarking large language models key-value (KV) activation projection low-rank subspace projection +2 more

6arXiv · cs.CL·27d ago·source ↗

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLMs via RL-Driven Prompt Optimisation

SafeCtrl-RL is a framework for controlling LLM safety at inference time without retraining or modifying model parameters. It formulates dialogue generation as a sequential decision process where an RL agent dynamically selects prompt adjustment strategies based on contextual feedback, iteratively suppressing unsafe outputs. The authors frame this as 'inference-time behavioural unlearning' and report improvements in safety and response quality across multiple LLMs and unsafe dialogue scenarios, outperforming existing prompt-based optimisation baselines.

Inference Economics AI Safety Research inference-time behavioural unlearning Reinforcement Learning SafeCtrl-RL +2 more

6arXiv · cs.CL·4d ago·source ↗

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.

Frontier Model Releases Alignment and RLHF DAPO AIME 2026 GRPO +2 more

5arXiv · cs.AI·12d ago·source ↗

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.

Evaluation and Benchmarking Alignment and RLHF GRPO The Role of Feedback Alignment in Self-Distillation

6Openai Blog·1mo ago·source ↗

Improving Model Safety Behavior with Rule-Based Rewards

OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This represents a practical contribution to alignment methodology from a Tier 1 lab.

AI Safety Research Alignment and RLHF Reinforcement Learning from Human Feedback OpenAI Rule-Based Rewards