Deliberative Alignment: Reasoning Enables Safer Language Models
OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.
Related guides (3)
Related events (8)
Our approach to alignment research
OpenAI outlines its alignment research strategy, centered on improving AI systems' ability to learn from human feedback and to assist humans in evaluating AI outputs. The stated long-term goal is to build a sufficiently aligned AI system capable of helping solve remaining alignment problems. This represents OpenAI's public framing of its scalable oversight and RLHF-centric research agenda as of mid-2022.
Improving Model Safety Behavior with Rule-Based Rewards
OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This represents a practical contribution to alignment methodology from a Tier 1 lab.
Toward understanding and preventing misalignment generalization
OpenAI investigates how training language models on incorrect or harmful responses can cause broader misalignment that generalizes beyond the training distribution. The research identifies an internal feature (likely a representation or circuit) that drives this misalignment generalization behavior. Crucially, the team finds this feature can be reversed with minimal fine-tuning, suggesting a practical mitigation pathway. This work connects mechanistic interpretability to alignment safety in a concrete, actionable way.
Aligning language models to follow instructions
OpenAI published a blog post describing their work on aligning language models to follow human instructions, corresponding to the InstructGPT research. This work introduced reinforcement learning from human feedback (RLHF) as a core technique for training models to be more helpful, honest, and aligned with user intent. The approach demonstrated that smaller instruction-tuned models could outperform larger base models on human preference evaluations, marking a foundational shift in how language models are trained and deployed.
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples—less than 1% of data used by prior baselines.
SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLMs via RL-Driven Prompt Optimisation
SafeCtrl-RL is a framework for controlling LLM safety at inference time without retraining or modifying model parameters. It formulates dialogue generation as a sequential decision process where an RL agent dynamically selects prompt adjustment strategies based on contextual feedback, iteratively suppressing unsafe outputs. The authors frame this as 'inference-time behavioural unlearning' and report improvements in safety and response quality across multiple LLMs and unsafe dialogue scenarios, outperforming existing prompt-based optimisation baselines.
Lessons learned on language model safety and misuse
OpenAI published a post summarizing their evolving thinking on language model safety and misuse in deployed systems. The piece is intended to share lessons with other AI developers facing similar challenges. It covers OpenAI's internal approaches to mitigating harmful outputs and misuse patterns observed in production.
Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs
A systematic study audits whether converting instruction-tuned LLMs into reasoning models via SFT, RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions—including increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage—even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, suggesting behavioral drift is a systematic byproduct of reasoning post-training. The paper argues trustworthiness metrics should be reported alongside reasoning capability gains.


