Confessions (training method)
technique · active
confessions-training-method--3417a186 · 1 event · first seen 28d ago · Aliases: Confessions (training method)
Recent events (1)
How Confessions Can Keep Language Models Honest
OpenAI researchers are developing a training method called "confessions" that teaches language models to admit explicitly when they have made a mistake or behaved undesirably. The goal is to make model outputs more honest and transparent, and so easier for users to trust. As an alignment intervention, it targets the model's self-reporting of its own failures.