6 · arXiv cs.CL (Computation and Language) · 17d ago

Framework for quantifying faithful confidence expression in large reasoning models

A new arXiv preprint introduces a framework to measure faithful calibration (FC) in large reasoning models (LRMs)—the alignment between a model's intrinsic confidence and its linguistically expressed confidence. The authors analyze linguistic decisiveness against three internal uncertainty sources (token probabilities, hidden states, sampled response consistency) and introduce prefix-conditioned sampling to handle structural variation in chain-of-thought traces. Applying the framework across leading models, they find FC is a significant and distinct failure mode for LRMs: extended reasoning traces do not automatically improve calibration, prompt interventions that help non-reasoning models fail in the reasoning setting, and different confidence estimators produce divergent assessments of the same traces.

Frontier Model Releases · Evaluation and Benchmarking · AI Safety Research · Alignment and RLHF · Quantifying Faithful Confidence Expression in Large Reasoning Models
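
To make the idea in the summary above concrete, here is a minimal sketch of one of the three uncertainty sources it names, sampled response consistency, compared against an expressed-confidence score. The function names, the majority-share estimator, and the absolute-difference gap are illustrative assumptions, not the paper's definitions; the decisiveness score is assumed to already be a number in [0, 1].

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return the majority answer and the share of samples that agree with it.

    Assumption: agreement among sampled final answers is used as a proxy
    for the model's intrinsic confidence.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_answer, majority_count / len(sampled_answers)


def faithfulness_gap(decisiveness, intrinsic_confidence):
    """Absolute gap between expressed and intrinsic confidence (both in [0, 1]).

    Hypothetical metric for illustration only.
    """
    return abs(decisiveness - intrinsic_confidence)


# Five sampled final answers to the same question (made-up values).
samples = ["42", "42", "41", "42", "40"]
answer, confidence = consistency_confidence(samples)

# A response worded very decisively (0.9) while samples agree only 60% of the time.
gap = faithfulness_gap(0.9, confidence)
```

A large gap here would indicate wording that is more decisive than the sampled behavior supports; a different estimator (token probabilities or hidden-state probes) could give a different gap for the same trace, which is the divergence the summary describes.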

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

5 · arXiv · cs.CL · 23d ago

Systematic Study of LLM Linguistic Uncertainty Markers and Intrinsic Confidence Calibration

This paper introduces 'marker internal confidence' (MIC) as a formalization of the intrinsic confidence a model associates with epistemic markers (e.g., 'it is likely...') in a given task domain. The authors present 7 metrics to evaluate MIC stability within and across distributions, finding that LLMs remain miscalibrated even under model-centric interpretation of marker meanings. Models struggle to differentiate markers by internal confidence across distributions, though they preserve a somewhat consistent ranking order across tasks. The work provides complementary evidence toward understanding faithful calibration in LLMs and highlights the need for more stable, aligned marker use.

Evaluation and Benchmarking · AI Safety Research · large language models · Marker Internal Confidence (MIC) · epistemic markers

7 · arXiv · cs.CL · 10d ago

Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs

A systematic study audits whether converting instruction-tuned LLMs into reasoning models via SFT, RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions—including increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage—even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, suggesting behavioral drift is a systematic byproduct of reasoning post-training. The paper argues trustworthiness metrics should be reported alongside reasoning capability gains.

Evaluation and Benchmarking · AI Safety Research · Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
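
The summary above says regressions are quantified via KL divergence from the instruction-tuned baseline. As a rough sketch of what such a measurement could look like, the snippet below computes KL divergence between two discrete distributions over behavior categories. Which distributions the paper compares, and in which direction, is not stated in the summary, so the category layout and the direction KL(reasoning || baseline) are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical response-category frequencies: [safe refusal, compliant, unsafe].
baseline = [0.70, 0.25, 0.05]   # instruction-tuned model (made-up numbers)
reasoning = [0.40, 0.40, 0.20]  # reasoning variant (made-up numbers)

drift = kl_divergence(reasoning, baseline)
```

A value of zero means the two behavior distributions match; larger values indicate more drift from the baseline.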

5 · arXiv · cs.CL · 25d ago

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

This paper investigates uncertainty quantification (UQ) for activation oracles—systems that make LLM internal activations human-legible—by evaluating 6 confidence estimation methods across 6,000 samples per oracle. The authors find that bootstrap mode frequency achieves the best calibration (ECE 5.7% vs. 25.5% for log-probability baseline on Qwen3-8B), while the log-prob baseline remains useful as a cheap triage signal. Experiments vary verbalizer and context prompts across two Qwen3 model sizes. Code and a patched trainer are released publicly.

Evaluation and Benchmarking · AI Safety Research · Expected Calibration Error · Activation Oracles · Qwen3-4B
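
For readers unfamiliar with the ECE figures quoted above, the following is a minimal sketch of expected calibration error with equal-width confidence bins. The bin count of 10 and the function name are assumptions; the summary does not say which binning the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions at 95% confidence, three of them right (made-up data).
ece = expected_calibration_error([0.95] * 4, [True, True, True, False])
```

In this toy case the model claims 95% confidence but is right 75% of the time, so the ECE is 0.20, that is, 20%.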

5 · arXiv · cs.CL · 4d ago

Semi-supervised framework scales LLM reasoning with minimal labeled data via lightweight verifier

A new arXiv preprint proposes a semi-supervised framework for training LLMs to reason with very few labeled examples, using a lightweight classifier to judge the validity of intermediate reasoning traces. An entropy-based confidence threshold filters unreliable pseudo-labels before fine-tuning. Experiments on math reasoning (Orca-Math subset) and visual QA (GQA) show accuracy comparable to using 10-15x more labeled data. The approach reduces dependence on expensive answer-level supervision by turning verification into a data-creation mechanism.

Evaluation and Benchmarking · Alignment and RLHF · GQA · Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier · Orca-Math
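
The entropy-based filtering step described above can be sketched roughly as follows. This assumes the verifier outputs a probability distribution over classes (for example valid vs. invalid) for each reasoning trace; the threshold value, the data layout, and the function names are hypothetical and not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_pseudo_labels(examples, max_entropy):
    """Keep only examples whose verifier distribution is confident enough.

    examples: list of (item, probs) pairs, where probs is the verifier's
    class distribution. Returns (item, predicted_class_index) pairs.
    """
    kept = []
    for item, probs in examples:
        if entropy(probs) <= max_entropy:
            label = max(range(len(probs)), key=probs.__getitem__)
            kept.append((item, label))
    return kept


# Two made-up traces: one confidently judged, one a coin flip.
examples = [
    ("trace_a", [0.95, 0.05]),
    ("trace_b", [0.50, 0.50]),
]
pseudo_labeled = filter_pseudo_labels(examples, max_entropy=0.3)
```

Only `trace_a` survives the filter here, since a 50/50 verifier output has the maximum entropy for two classes (about 0.693 nats).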

6 · arXiv · cs.CL · 1mo ago

Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models

This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.

Frontier Model Releases · Evaluation and Benchmarking · Max-Pooling · Chain-of-Thought Reasoning · Probe Trajectories
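
As an illustration of the trajectory features named above, the sketch below turns a sequence of per-token probe probabilities into volatility, trend, steady-state, and max-pooled values. The concrete definitions (standard deviation of step changes, least-squares slope, mean of the last few tokens) are plausible assumptions, not the paper's formulas.

```python
from statistics import mean, pstdev


def trajectory_features(probs, tail=3):
    """Summarize a probe-probability trajectory over reasoning tokens.

    probs: probe outputs in [0, 1], one per token position (length >= 2).
    tail: how many final positions count as the steady state (assumed).
    """
    if len(probs) < 2:
        raise ValueError("need at least two positions")
    steps = [b - a for a, b in zip(probs, probs[1:])]

    # Least-squares slope of probability against token position.
    xs = list(range(len(probs)))
    x_bar, y_bar = mean(xs), mean(probs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, probs)) / sum(
        (x - x_bar) ** 2 for x in xs
    )

    return {
        "volatility": pstdev(steps),
        "trend": slope,
        "steady_state": mean(probs[-tail:]),
        "max_pool": max(probs),
    }


# A made-up trajectory that rises steadily across five tokens.
features = trajectory_features([0.1, 0.2, 0.3, 0.4, 0.5])
```

A single static probe would see only one of these positions, while the feature vector captures how the signal evolves, which is the contrast the summary draws.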

7 · arXiv · cs.CL · 1mo ago

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

This paper establishes a quantitative scaling law linking LLM factual recall to both model parameter count and topic frequency in training data, evaluated across 38 models on 8,900+ scholarly references. Recall quality follows a sigmoid function in the log-linear combination of these two variables, explaining 60% of variance across 16 dense models from four families and 74-94% within individual families. The authors propose a superposition-inspired mechanism where recall is gated by a signal-to-noise ratio: concept frequency provides signal and model capacity sets the noise floor. This provides a predictive framework for understanding and anticipating LLM confabulation patterns.

Frontier Model Releases · Evaluation and Benchmarking · Automated Reference Verification System · Factual Recall Scaling Law · Superposition Model (neural networks)
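
The functional form described above, a sigmoid in a log-linear combination of model size and topic frequency, can be written down in a few lines. The coefficients below are placeholders chosen only to make the example run; the summary does not report fitted values.

```python
import math


def predicted_recall(n_params, topic_freq, a, b, c):
    """Sigmoid of a log-linear combination of parameter count and topic frequency.

    a, b, c are hypothetical coefficients, not values from the paper.
    """
    z = a * math.log(n_params) + b * math.log(topic_freq) + c
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder coefficients for illustration only.
A, B, C = 0.5, 0.8, -18.0

small_rare = predicted_recall(1e9, 10, A, B, C)       # 1B params, rare topic
large_rare = predicted_recall(7e10, 10, A, B, C)      # 70B params, rare topic
large_common = predicted_recall(7e10, 1000, A, B, C)  # 70B params, common topic
```

With positive `a` and `b`, predicted recall rises with both model size and topic frequency and saturates toward 1, matching the qualitative shape the summary describes.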

4 · arXiv · cs.AI · 11d ago

Theoretical analysis of calibration preservation in human-AI teaming frameworks

A new arXiv paper examines human-AI teaming through the lens of statistical calibration, analyzing both combination and delegation frameworks. The authors show that existing combination methods fail to preserve the human's calibration, while delegation methods shift the calibration burden to a rejector meta-model that must be calibrated finely enough to identify where each party excels. This demand grows with human expertise and becomes unattainable when the human uses information unavailable to the system.

Evaluation and Benchmarking · AI Safety Research · Human-AI Teaming Through the Lens of Calibration

6 · arXiv · cs.LG · 10d ago

Future Probe Controlled Generation enables steering of reasoning models without quality degradation

Researchers introduce Future Probe Controlled Generation (FPCG), a text-level steering method for large reasoning models (LRMs) that trains activation probes to predict future behavior likelihoods from intermediate reasoning steps rather than detecting behavior in already-generated text. The probes achieve 64–91% accuracy in predicting the most likely future behavior, revealing a distinct class of internal prediction features separate from detection features. FPCG steers model outputs by sampling candidate sentences and selecting the best according to these probes, achieving steering with minimal output quality degradation and succeeding in cases where activation steering fails. The work provides a principled distinction between detection and prediction features as intervention targets for controlling LRM behavior.

Frontier Model Releases · AI Safety Research · Predicting Future Behaviors in Reasoning Models Enables Better Steering · Future Probe Controlled Generation
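
The sample-and-select loop described above can be sketched generically as best-of-n selection at the sentence level. Everything here is a stand-in: `sample_candidates` and `probe_score` are hypothetical callables representing the model's sentence sampler and a trained future-behavior probe, and the toy versions below exist only so the example runs.

```python
def steer_generation(prefix, sample_candidates, probe_score, n_steps, n_candidates=4):
    """Extend `prefix` one sentence at a time, keeping the candidate the probe prefers.

    sample_candidates(text, n) -> list of candidate next sentences (hypothetical).
    probe_score(text) -> float, higher means the target behavior is judged
    more likely to follow (hypothetical).
    """
    text = prefix
    for _ in range(n_steps):
        candidates = sample_candidates(text, n_candidates)
        if not candidates:
            break
        best = max(candidates, key=lambda sentence: probe_score(text + sentence))
        text += best
    return text


# Toy stand-ins so the sketch is runnable; a real setup would call a model and a probe.
def toy_sampler(text, n):
    return [" maybe.", " certainly.", " perhaps."][:n]


def toy_probe(text):
    return text.count("certainly")


steered = steer_generation("Answer:", toy_sampler, toy_probe, n_steps=2, n_candidates=3)
```

Because selection happens over generated text rather than by editing activations, the output stays within what the model would plausibly sample, which is consistent with the summary's claim of minimal quality degradation.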