technique

output-centric safety training

technique · active · output-centric-safety-training-521d7ff1 · 1 event · first seen 28d ago

Aliases: output-centric safety training

Co-occurring entities

More like this (12)

adversarial training · structured output generation · Structured Outputs · Safety Gym · outcome supervision · collaborative distributed training · large-batch training · safe-completions · distributed training · CoT-Output 2x2 safety matrix · instruction tuning · joint safety evaluation

Recent events (1)

7 · OpenAI Blog · 28d ago

From hard refusals to safe-completions: toward output-centric safety training

OpenAI introduces a 'safe-completions' approach in GPT-5 that replaces hard refusals with output-centric safety training for handling dual-use prompts. Rather than refusing such requests outright, the model is trained to produce responses that are both helpful and safe by shaping the content of its outputs. This is a methodological shift in how safety and helpfulness are balanced during training: away from binary refusal behavior and toward graduated response strategies.

Frontier Model Releases · AI Safety Research · output-centric safety training · OpenAI · safe-completions