Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
Recent events (1)
Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs
A systematic study audits whether converting instruction-tuned LLMs into reasoning models via supervised fine-tuning (SFT), RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions (increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage among them) even as reasoning benchmark scores improve. The regressions are quantified as KL divergence from the instruction-tuned baseline, which suggests that behavioral drift is a systematic byproduct of reasoning post-training. The paper argues that trustworthiness metrics should be reported alongside reasoning capability gains.
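To make the KL-divergence measurement concrete, here is a minimal sketch of how drift from a baseline could be computed over categorical response labels. The summary does not say which distributions the authors compare or in which direction the divergence is taken, so everything here is an assumption for illustration: the label set `CATEGORIES`, the helper names `response_distribution` and `kl_divergence`, the choice of KL(reasoning || baseline), the add-one smoothing, and the toy label lists, which are made-up inputs and not results from the paper.

```python
import math

# Hypothetical label set for responses to harmful prompts (not from the paper).
CATEGORIES = ["refuse", "comply", "partial"]


def response_distribution(labels, categories):
    """Empirical distribution over categories, with add-one smoothing
    so that no category ends up with zero probability."""
    counts = {c: 1 for c in categories}
    for label in labels:
        counts[label] += 1
    total = sum(counts.values())
    return [counts[c] / total for c in categories]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


# Toy inputs, invented for illustration only.
baseline_labels = ["refuse"] * 8 + ["comply"] * 1 + ["partial"] * 1
reasoning_labels = ["refuse"] * 4 + ["comply"] * 5 + ["partial"] * 1

p_baseline = response_distribution(baseline_labels, CATEGORIES)
p_reasoning = response_distribution(reasoning_labels, CATEGORIES)

# Assumed direction: how far the reasoning model has moved from the baseline.
drift = kl_divergence(p_reasoning, p_baseline)
print(f"KL(reasoning || baseline) = {drift:.3f} nats")
```

A value of zero means the two models distribute their responses identically across the categories, and larger values indicate more drift. KL divergence is asymmetric, so the direction should be stated wherever such a number is reported.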