Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

paper · active · provisional · does-reasoning-preserve-alignment-on-the-trustworthiness-of-large-reasoning-models-d4c42c3a · 1 event · first seen 7d ago

Recent events (1)

7 · arXiv · cs.CL · 7d ago

Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs

A systematic study audits whether converting instruction-tuned LLMs into reasoning models via supervised fine-tuning (SFT), RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions (including increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage) even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, suggesting that behavioral drift is a systematic byproduct of reasoning post-training. The paper argues that trustworthiness metrics should be reported alongside reasoning capability gains.
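
To make the KL-divergence measurement concrete, here is a minimal sketch of how drift from a baseline could be computed. The summary does not give the paper's exact formulation, so everything here is an assumption for illustration: the function name `kl_divergence`, the three response categories (refuse, partial, comply), the direction KL(reasoning || baseline), and the probabilities themselves, which are made up and not results from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    p and q are sequences of probabilities over the same categories.
    eps guards against division by zero when q assigns zero mass.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:  # terms with p_i == 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical response distributions on harmful prompts,
# ordered as (refuse, partial, comply). Illustrative values only.
baseline = [0.90, 0.08, 0.02]   # instruction-tuned model (assumed)
reasoning = [0.60, 0.25, 0.15]  # reasoning model derived from it (assumed)

# Assumed direction: how far the reasoning model has moved from the baseline.
drift = kl_divergence(reasoning, baseline)
print(f"KL(reasoning || baseline) = {drift:.4f} nats")
```

A value of zero would mean the two models behave identically on this category breakdown; larger values indicate more drift. KL divergence is asymmetric, so the choice of direction matters and should be stated whenever such numbers are reported.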