7 · OpenAI Blog · 1mo ago

From hard refusals to safe-completions: toward output-centric safety training

OpenAI introduces 'safe-completions' in GPT-5, an output-centric safety-training approach that replaces hard refusals when handling dual-use prompts. Rather than refusing requests outright, the model is trained to produce responses that are both helpful and safe by shaping the content of its outputs. This is a methodological shift in how safety and helpfulness are balanced during training, away from binary refusal behavior and toward graduated response strategies.
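
As a rough illustration of what an output-centric objective can look like, the sketch below combines a helpfulness score and a safety score multiplicatively, so an unsafe output earns little reward however helpful it is, and a safe but unhelpful refusal earns little as well. The function name, the [0, 1] score ranges, the example scores, and the multiplicative form are all assumptions made for illustration; the summary above does not specify the actual reward.

```python
def output_centric_reward(helpfulness: float, safety: float) -> float:
    """Combine two scores in [0, 1]. Hypothetical, for illustration only."""
    if not (0.0 <= helpfulness <= 1.0 and 0.0 <= safety <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    # The product is high only when the output is both helpful and safe.
    return helpfulness * safety


# Made-up scores for three candidate responses to one dual-use prompt.
candidates = {
    "hard_refusal": output_centric_reward(helpfulness=0.1, safety=1.0),
    "unsafe_detail": output_centric_reward(helpfulness=0.9, safety=0.0),
    "safe_completion": output_centric_reward(helpfulness=0.7, safety=1.0),
}
best = max(candidates, key=candidates.get)
print(best)
```

Under these made-up scores the safe, partially helpful response is preferred over both the refusal and the unsafe answer, which is the graduated behavior the summary describes.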

Frontier Model Releases · AI Safety Research · Alignment and RLHF · output-centric safety training · OpenAI · safe-completions · GPT-5.5

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

GPT-5.5

GPT-5.5: OpenAI's Most Capable Model — and Its Most Complicated

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Related events (8)

4 · OpenAI Blog · 1mo ago

SafetyKit scales risk agents with OpenAI's most capable models

SafetyKit, a content moderation and compliance platform, has integrated OpenAI's GPT-5 to power its risk-detection agents. The deployment targets content moderation accuracy and compliance enforcement, and SafetyKit positions it as a replacement for legacy safety systems. This is a production enterprise use case of GPT-5 in trust and safety workflows.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · OpenAI · SafetyKit · GPT-5.5

7 · OpenAI Blog · 1mo ago

Introducing gpt-oss-safeguard

OpenAI has released gpt-oss-safeguard, a set of open-weight reasoning models designed for safety classification tasks. The models are intended to help developers implement and iterate on custom content safety policies. This represents OpenAI's entry into the open-weight safety tooling space, providing infrastructure-level moderation capabilities that can be customized and deployed independently.

Open Weights Progress · AI Safety Research · gpt-oss-safeguard · OpenAI +2 more

6 · arXiv cs.CL · 25d ago

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLMs via RL-Driven Prompt Optimisation

SafeCtrl-RL is a framework for controlling LLM safety at inference time without retraining or modifying model parameters. It formulates dialogue generation as a sequential decision process where an RL agent dynamically selects prompt adjustment strategies based on contextual feedback, iteratively suppressing unsafe outputs. The authors frame this as 'inference-time behavioural unlearning' and report improvements in safety and response quality across multiple LLMs and unsafe dialogue scenarios, outperforming existing prompt-based optimisation baselines.
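
To make the idea of an agent choosing prompt adjustments from feedback more concrete, here is a deliberately simplified sketch. It uses an epsilon-greedy bandit rather than the full sequential decision process the paper describes, and the strategy names, the `toy_feedback` scorer, and all numbers are hypothetical stand-ins, not taken from SafeCtrl-RL.

```python
import random

# Hypothetical prompt-adjustment strategies (not from the paper).
STRATEGIES = ["none", "add_safety_reminder", "ask_clarifying_question"]


class EpsilonGreedySelector:
    """Picks a strategy, then updates its running mean reward from feedback."""

    def __init__(self, strategies, epsilon=0.1, seed=0):
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {s: 0 for s in self.strategies}
        self.values = {s: 0.0 for s in self.strategies}

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.values[s])

    def update(self, strategy, reward):
        # Incremental mean: no model parameters are touched, only this table.
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.values[strategy] += (reward - self.values[strategy]) / n


def toy_feedback(strategy):
    """Stand-in for a real safety/quality evaluator of the model's reply."""
    scores = {"none": 0.2, "add_safety_reminder": 0.9, "ask_clarifying_question": 0.6}
    return scores[strategy]


selector = EpsilonGreedySelector(STRATEGIES, epsilon=0.2)
for _ in range(200):
    chosen = selector.select()
    selector.update(chosen, toy_feedback(chosen))
```

In a real system the feedback would come from evaluating the LLM's actual response, and the state of the dialogue would inform the choice; the bandit here ignores context entirely.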

Inference Economics · AI Safety Research · inference-time behavioural unlearning · Reinforcement Learning · SafeCtrl-RL +2 more

5 · OpenAI Blog · 1mo ago

Helping ChatGPT better recognize context in sensitive conversations

OpenAI has released safety updates to ChatGPT aimed at improving context awareness in sensitive conversations. The updates focus on detecting risk signals over time within a conversation rather than evaluating individual messages in isolation. This represents an incremental improvement to ChatGPT's safety and harm-reduction capabilities in high-stakes interactions.
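
The difference between scoring messages in isolation and tracking risk over a conversation can be shown with a small sketch. It assumes a per-message risk score already exists (here just hard-coded numbers) and accumulates it with exponential decay; the class name, decay factor, and threshold are illustrative assumptions and say nothing about how ChatGPT actually does this.

```python
from dataclasses import dataclass


@dataclass
class ConversationRiskTracker:
    """Accumulates per-message risk with decay (hypothetical parameters)."""

    decay: float = 0.8      # how much earlier signals still count
    threshold: float = 1.0  # cumulative level that triggers a flag
    score: float = 0.0

    def observe(self, message_risk: float) -> bool:
        # Older signals fade but still contribute to the running total.
        self.score = self.decay * self.score + message_risk
        return self.score >= self.threshold


tracker = ConversationRiskTracker()
per_message_risk = [0.3, 0.4, 0.4, 0.5]  # made-up scores, none reaches 1.0 alone
flags = [tracker.observe(r) for r in per_message_risk]
print(flags)  # [False, False, False, True]
```

No single message crosses the threshold, but the fourth turn is flagged because the earlier signals have not fully decayed.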

AI Safety Research · Enterprise Deployment Patterns · ChatGPT · OpenAI

6 · OpenAI Blog · 1mo ago

Improving Model Safety Behavior with Rule-Based Rewards

OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This is a practical contribution to alignment methodology from a leading lab.
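
A minimal sketch of how explicit rules can be turned into a reward signal is shown below. The rules are simple string checks with hand-picked weights, which is an assumption made to keep the example self-contained; the rule names, weights, and the `rule_based_reward` function are hypothetical and are not OpenAI's implementation.

```python
from typing import Callable, List, Tuple

# (name, predicate over the response text, weight). All hypothetical.
Rule = Tuple[str, Callable[[str], bool], float]

RULES: List[Rule] = [
    ("contains_apology", lambda r: "sorry" in r.lower(), 0.5),
    ("offers_alternative", lambda r: "instead" in r.lower(), 0.5),
    ("judgmental_language", lambda r: "you should be ashamed" in r.lower(), -1.0),
]


def rule_based_reward(response: str, rules: List[Rule] = RULES) -> float:
    """Sum the weights of every rule whose predicate fires on the response."""
    return float(sum(weight for _, check, weight in rules if check(response)))


good = "Sorry, I can't help with that. Instead, here is a safer option."
bad = "You should be ashamed for asking."
print(rule_based_reward(good), rule_based_reward(bad))  # 1.0 -1.0
```

Because the reward comes from rules rather than per-example human labels, changing the desired behavior means editing the rule list and weights, not collecting a new dataset.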

AI Safety Research · Alignment and RLHF · Reinforcement Learning from Human Feedback · OpenAI · Rule-Based Rewards

5 · OpenAI Blog · 1mo ago

Building more helpful ChatGPT experiences for everyone

OpenAI is announcing a set of ChatGPT safety and helpfulness improvements including new parental controls for teen users, routing of sensitive conversations to reasoning models, and partnerships with external experts. The update reflects OpenAI's ongoing effort to balance accessibility with safeguards across different user demographics. Routing sensitive queries to reasoning models is a notable architectural/policy decision that may affect response quality and safety outcomes.

AI Safety Research · Enterprise Deployment Patterns · OpenAI Reasoning Models · ChatGPT · OpenAI

5 · OpenAI Blog · 1mo ago

Safety Gym: OpenAI Releases RL Safety Constraint Benchmark Suite

OpenAI released Safety Gym, a suite of environments and tools for measuring progress toward reinforcement learning agents that respect safety constraints while training. The toolkit targets constrained RL, where agents must optimize an objective without violating specified safety boundaries. It is an early formal effort by OpenAI to provide standardized benchmarking infrastructure for safe RL research.
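
One common way to handle such constraints is a Lagrangian method: the agent maximizes return minus a penalty on cost, and the penalty weight rises whenever the measured cost exceeds its limit. The sketch below shows only that multiplier update and the penalized objective; the function names, learning rate, and numbers are illustrative assumptions and do not use the Safety Gym API.

```python
def update_multiplier(lmbda: float, episode_cost: float,
                      cost_limit: float, lr: float = 0.05) -> float:
    """Dual ascent step: raise the penalty when cost exceeds the limit.

    The multiplier is clipped at zero so a satisfied constraint is never
    rewarded, only left unpenalized.
    """
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))


def penalized_return(episode_return: float, episode_cost: float,
                     lmbda: float) -> float:
    """Objective the policy optimizer would maximize for a fixed multiplier."""
    return episode_return - lmbda * episode_cost


# Made-up episode costs against a hypothetical limit of 25.
lmbda = 0.0
for cost in [40.0, 35.0, 30.0, 20.0]:
    lmbda = update_multiplier(lmbda, cost, cost_limit=25.0)
```

The multiplier grows while episodes violate the limit and shrinks once they fall below it, which is what lets the agent trade reward against constraint violations during training.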

Evaluation and Benchmarking · AI Safety Research · Constrained Reinforcement Learning · OpenAI · Safety Gym

7 · OpenAI Blog · 1mo ago

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.

Frontier Model Releases · AI Safety Research · Reinforcement Learning from Human Feedback · OpenAI · deliberative alignment +2 more