Almanac
7 · OpenAI Blog · 1mo ago

From hard refusals to safe-completions: toward output-centric safety training

OpenAI introduces 'safe-completions' in GPT-5, an output-centric safety training approach that replaces hard refusals when handling dual-use prompts. Rather than refusing requests outright, the model is trained to produce responses that are both helpful and safe by shaping what the output contains. The shift moves safety training away from binary refusal behavior and toward graduated response strategies.

Related events (8)

4 · OpenAI Blog · 1mo ago

SafetyKit scales risk agents with OpenAI's most capable models

SafetyKit, a content moderation and compliance platform, has integrated OpenAI's GPT-5 to power its risk-detection agents. The deployment targets content moderation accuracy and compliance enforcement, and SafetyKit positions it as a replacement for legacy safety systems. It is a production enterprise use of GPT-5 in trust and safety workflows.

7 · OpenAI Blog · 1mo ago

Introducing gpt-oss-safeguard

OpenAI has released gpt-oss-safeguard, a set of open-weight reasoning models designed for safety classification tasks. The models are intended to help developers implement and iterate on custom content safety policies. The release marks OpenAI's entry into open-weight safety tooling, with moderation models that developers can customize and deploy independently.

6 · arXiv · cs.CL · 25d ago

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLMs via RL-Driven Prompt Optimisation

SafeCtrl-RL is a framework for controlling LLM safety at inference time without retraining or modifying model parameters. It formulates dialogue generation as a sequential decision process where an RL agent dynamically selects prompt adjustment strategies based on contextual feedback, iteratively suppressing unsafe outputs. The authors frame this as 'inference-time behavioural unlearning' and report improvements in safety and response quality across multiple LLMs and unsafe dialogue scenarios, outperforming existing prompt-based optimisation baselines.
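
The selection loop described above can be illustrated with a toy sketch. This is not the paper's algorithm: it swaps in a plain epsilon-greedy bandit, which is much simpler than a full sequential decision process, and every name here (the strategy labels, `StrategySelector`, `simulated_feedback`, the reward values) is a hypothetical stand-in. The model itself is never modified; only the choice of prompt adjustment changes in response to feedback.

```python
import random
from typing import Dict, List

# Hypothetical prompt adjustment strategies; the paper's actual set is not given here.
STRATEGIES: List[str] = ["no_change", "add_safety_reminder", "ask_for_rephrase"]


class StrategySelector:
    """Epsilon-greedy selector over prompt adjustment strategies (a simplification)."""

    def __init__(self, strategies: List[str], epsilon: float = 0.2, seed: int = 0):
        self.q: Dict[str, float] = {s: 0.0 for s in strategies}  # estimated reward
        self.n: Dict[str, int] = {s: 0 for s in strategies}      # times tried
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self) -> str:
        # Try every strategy once before trusting the estimates.
        for strategy, count in self.n.items():
            if count == 0:
                return strategy
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, strategy: str, reward: float) -> None:
        # Incremental sample-average update of the reward estimate.
        self.n[strategy] += 1
        self.q[strategy] += (reward - self.q[strategy]) / self.n[strategy]


def simulated_feedback(strategy: str) -> float:
    """Stand-in for contextual feedback: 1.0 means a safe, useful reply (made-up values)."""
    return {"no_change": 0.0, "add_safety_reminder": 1.0, "ask_for_rephrase": 0.5}[strategy]


selector = StrategySelector(STRATEGIES)
for _ in range(200):
    chosen = selector.select()
    selector.update(chosen, simulated_feedback(chosen))

best = max(selector.q, key=selector.q.get)
```

In a real system the feedback would come from a safety and quality evaluator over the model's actual reply, and the selector would condition on dialogue context rather than keeping one global estimate per strategy.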

5 · OpenAI Blog · 1mo ago

Helping ChatGPT better recognize context in sensitive conversations

OpenAI has released safety updates to ChatGPT aimed at improving context awareness in sensitive conversations. The updates focus on detecting risk signals over time within a conversation rather than evaluating individual messages in isolation. This represents an incremental improvement to ChatGPT's safety and harm-reduction capabilities in high-stakes interactions.

6 · OpenAI Blog · 1mo ago

Improving Model Safety Behavior with Rule-Based Rewards

OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. It is a practical contribution to alignment methodology from a major lab.
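
As a rough illustration of turning explicit rules into a reward signal, here is a minimal sketch. It is not OpenAI's implementation: the `Rule` type, the three example rules, their weights, and the keyword checks are all hypothetical, and the keyword checks simply stand in for whatever grader evaluates each rule in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A hypothetical rule: a predicate over the response plus a weight."""
    name: str
    check: Callable[[str], bool]
    weight: float


def rule_based_reward(response: str, rules: List[Rule]) -> float:
    """Sum the weights of every rule the response satisfies."""
    return sum(rule.weight for rule in rules if rule.check(response))


# Illustrative rules only; these are assumptions, not rules from the source.
RULES = [
    Rule("brief_apology", lambda r: "sorry" in r.lower(), 1.0),
    Rule("states_inability", lambda r: "can't help" in r.lower(), 1.0),
    Rule("judgmental_tone", lambda r: "you should be ashamed" in r.lower(), -2.0),
]

good = "I'm sorry, but I can't help with that."
bad = "You should be ashamed for asking. I can't help."

good_reward = rule_based_reward(good, RULES)  # satisfies the two positive rules
bad_reward = rule_based_reward(bad, RULES)    # one positive rule, one penalty
```

The scalar returned by `rule_based_reward` is the kind of signal a training loop could consume in place of, or alongside, a reward learned from human preference data.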

5 · OpenAI Blog · 1mo ago

Building more helpful ChatGPT experiences for everyone

OpenAI announced a set of ChatGPT safety and helpfulness improvements, including new parental controls for teen users, routing of sensitive conversations to reasoning models, and partnerships with external experts. The update reflects OpenAI's ongoing effort to balance accessibility with safeguards across different user demographics. Routing sensitive queries to reasoning models is a notable architectural and policy decision that may affect response quality and safety outcomes.

5 · OpenAI Blog · 1mo ago

Safety Gym: OpenAI Releases RL Safety Constraint Benchmark Suite

OpenAI released Safety Gym, a suite of environments and tools designed to measure progress toward reinforcement learning agents that respect safety constraints while they train. The toolkit targets the challenge of constrained RL, where agents must optimize objectives without violating specified safety boundaries. It is an early formal effort by OpenAI to provide standardized benchmarking infrastructure for safe RL research.

7 · OpenAI Blog · 1mo ago

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.