Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

paper · active · does-reasoning-preserve-alignment-on-the-trustworthiness-of-large-reasoning-models-d4c42c3a · 1 event · first seen Jun 10, 2026

More like this (12)

Large Reasoning Models
Quantifying Faithful Confidence Expression in Large Reasoning Models
Reasoning Language Models
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans
Predicting Future Behaviors in Reasoning Models Enables Better Steering
Long-context Reasoning Benchmarks
Understanding Reasoning from Pretraining to Post-Training
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
Reasoning Before Translation: Enhancing Legal Machine Translation with Structured Reasoning
CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning
Token Budget Saturation and Mechanistic Early Detection of Reasoning Non-Convergence in Chain-of-Thought Models

Recent events (1)

7 · arXiv · cs.CL · Jun 10, 2026

Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs

A systematic study audits whether converting instruction-tuned LLMs into reasoning models via supervised fine-tuning (SFT), RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage, even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, which suggests that behavioral drift is a systematic byproduct of reasoning post-training. The paper argues that trustworthiness metrics should be reported alongside reasoning capability gains.
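To make the KL-divergence measurement concrete, here is a minimal sketch of one way such drift could be computed. The summary does not say which distributions the paper compares, so everything here is an assumption: the behavior categories in `LABELS`, the direction KL(reasoning || baseline), the smoothing constant, and the toy counts, which are illustrative only and not results from the paper.

```python
import math
from collections import Counter

# Hypothetical behavior categories; the paper's actual label set is not given here.
LABELS = ("refuse", "comply", "partial")


def label_distribution(labels, categories=LABELS, eps=1e-6):
    """Turn observed behavior labels into a smoothed probability vector."""
    counts = Counter(labels)
    total = len(labels) + eps * len(categories)
    return [(counts[c] + eps) / total for c in categories]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy responses to the same 100 prompts (made-up numbers, for illustration only).
baseline = ["refuse"] * 90 + ["comply"] * 5 + ["partial"] * 5
reasoning = ["refuse"] * 60 + ["comply"] * 30 + ["partial"] * 10

p = label_distribution(reasoning)  # converted reasoning model
q = label_distribution(baseline)   # instruction-tuned baseline
drift = kl_divergence(p, q)
print(f"KL(reasoning || baseline) = {drift:.3f} nats")
```

A value of zero would mean the two models distribute their behavior identically over these categories; larger values indicate more drift. The smoothing term keeps the ratio finite when a category never appears for one of the models.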

Evaluation and Benchmarking · AI Safety Research · Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models