5 · arXiv cs.CL (Computation and Language) · 23d ago

Systematic Study of LLM Linguistic Uncertainty Markers and Intrinsic Confidence Calibration

This paper introduces 'marker internal confidence' (MIC), a formalization of the intrinsic confidence a model associates with an epistemic marker (e.g., 'it is likely...') in a given task domain. The authors present seven metrics for evaluating MIC stability within and across distributions and find that LLMs remain miscalibrated even under a model-centric interpretation of marker meanings. Models struggle to differentiate markers by internal confidence across distributions, though they preserve a somewhat consistent ranking order across tasks. The work provides complementary evidence toward understanding faithful calibration in LLMs and highlights the need for more stable, aligned marker use.
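The distinction between a preserved ranking and poor differentiation can be made concrete with a small sketch. It assumes, purely for illustration, that MIC can be approximated as the mean of a per-response internal confidence score grouped by marker; the summary does not give the paper's actual definition or its seven metrics, and the function names and data below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def marker_internal_confidence(records):
    """Average internal confidence per marker.

    records: iterable of (marker, confidence) pairs, where confidence is a
    hypothetical per-response internal confidence score in [0, 1].
    This averaging is an assumption, not the paper's definition of MIC.
    """
    by_marker = defaultdict(list)
    for marker, conf in records:
        by_marker[marker].append(conf)
    return {marker: mean(values) for marker, values in by_marker.items()}


def ranking(mic):
    """Markers ordered from highest to lowest average confidence."""
    return sorted(mic, key=mic.get, reverse=True)


def spread(mic):
    """Gap between the most and least confident markers."""
    return max(mic.values()) - min(mic.values())


# Made-up scores for two task domains (illustrative only).
task_a = [("certainly", 0.92), ("certainly", 0.88),
          ("likely", 0.74), ("likely", 0.70), ("possibly", 0.55)]
task_b = [("certainly", 0.71), ("likely", 0.66),
          ("likely", 0.62), ("possibly", 0.60)]

mic_a = marker_internal_confidence(task_a)
mic_b = marker_internal_confidence(task_b)

same_order = ranking(mic_a) == ranking(mic_b)
print("ranking A:", ranking(mic_a))
print("ranking B:", ranking(mic_b))
print("same order:", same_order)
print("spread A:", round(spread(mic_a), 2), "spread B:", round(spread(mic_b), 2))
```

In this toy data both domains rank the markers identically, yet the second domain squeezes them into a much narrower confidence band, which is one way a model could keep a consistent order while barely separating the markers.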

Evaluation and Benchmarking · AI Safety Research · Alignment and RLHF · large language models · Marker Internal Confidence (MIC) · epistemic markers

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 17d ago

Framework for quantifying faithful confidence expression in large reasoning models

A new arXiv preprint introduces a framework to measure faithful calibration (FC) in large reasoning models (LRMs)—the alignment between a model's intrinsic confidence and its linguistically expressed confidence. The authors analyze linguistic decisiveness against three internal uncertainty sources (token probabilities, hidden states, sampled response consistency) and introduce prefix-conditioned sampling to handle structural variation in chain-of-thought traces. Applying the framework across leading models, they find FC is a significant and distinct failure mode for LRMs: extended reasoning traces do not automatically improve calibration, prompt interventions that help non-reasoning models fail in the reasoning setting, and different confidence estimators produce divergent assessments of the same traces.

Frontier Model Releases · Evaluation and Benchmarking · Quantifying Faithful Confidence Expression in Large Reasoning Models

5 · arXiv · cs.CL · 17d ago

Clustered Self-Assessment: LLM uncertainty quantification via semantic clustering and multiple-choice self-evaluation

A new arXiv preprint proposes Clustered Self-Assessment, a method for uncertainty quantification in LLMs that groups sampled generations into semantically distinct clusters, reformats them as multiple-choice options, and uses the model's own probability assignments as confidence estimates. The approach outperforms entropy-based baselines across multiple models and datasets, achieving competitive performance with as few as two additional samples. The method is notable for directly leveraging the model's self-assessment capability rather than relying on indirect distributional signals.
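The three steps described above (cluster the samples, reformat them as multiple-choice options, read off the model's option probabilities) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: exact-match clustering on normalized strings stands in for semantic clustering, and `option_probs` is a hypothetical callable standing in for a real model call that returns one probability per option letter.

```python
import string


def normalize(text):
    """Crude stand-in for semantic equivalence: lowercase, drop punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def cluster_samples(samples):
    """Group samples with the same normalized form; return one representative each."""
    clusters = {}
    for sample in samples:
        clusters.setdefault(normalize(sample), []).append(sample)
    return [members[0] for members in clusters.values()]


def build_multiple_choice(question, options):
    """Format the cluster representatives as lettered options."""
    lines = [f"Question: {question}"]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def self_assess(question, samples, option_probs):
    """Return (chosen answer, confidence) from the model's own option probabilities.

    option_probs is a hypothetical callable: (prompt, n_options) -> list of floats.
    """
    options = cluster_samples(samples)
    prompt = build_multiple_choice(question, options)
    probs = option_probs(prompt, len(options))
    total = sum(probs)
    probs = [p / total for p in probs]  # renormalize over the listed options
    best = max(range(len(options)), key=probs.__getitem__)
    return options[best], probs[best]


def fake_option_probs(prompt, n_options):
    """Placeholder for a model call: puts 0.8 on the first option."""
    if n_options == 1:
        return [1.0]
    return [0.8] + [0.2 / (n_options - 1)] * (n_options - 1)


samples = ["Paris", "paris.", "Lyon"]
answer, confidence = self_assess("What is the capital of France?", samples, fake_option_probs)
print(answer, round(confidence, 2))
```

In practice the clustering step would use an embedding or entailment model, and the probabilities would come from the model's next-token distribution over the option letters; both are replaced by stand-ins here so the sketch runs on its own.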

Evaluation and Benchmarking · AI Safety Research · Clustered Self-Assessment

5 · arXiv · cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge

6 · arXiv · cs.CL · 22d ago

BeliefTrack: Benchmarking and Improving Contextual Belief Management in LLMs

This paper introduces Contextual Belief Management (CBM) as a framework for studying how LLMs should update, preserve, or ignore information across long-horizon interactions. The authors release BeliefTrack, a closed-world benchmark with symbolic verifiers enabling exact turn-level evaluation across Rule Discovery and Circuit Diagnosis tasks. Vanilla LLMs show severe CBM failures; reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average, while representation-level steering achieves 46.1% reduction. Probing experiments reveal latent belief-state dynamics underlying these failures.

Evaluation and Benchmarking · Agent and Tool Ecosystem · reinforcement learning with belief-state rewards · Contextual Belief Management (CBM) · BeliefTrack

5 · arXiv · cs.CL · 25d ago

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

This paper investigates uncertainty quantification (UQ) for activation oracles—systems that make LLM internal activations human-legible—by evaluating 6 confidence estimation methods across 6,000 samples per oracle. The authors find that bootstrap mode frequency achieves the best calibration (ECE 5.7% vs. 25.5% for log-probability baseline on Qwen3-8B), while the log-prob baseline remains useful as a cheap triage signal. Experiments vary verbalizer and context prompts across two Qwen3 model sizes. Code and a patched trainer are released publicly.
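For readers unfamiliar with the ECE figures quoted above, the following is a minimal sketch of the standard equal-width binned Expected Calibration Error. Whether the paper uses this exact binning scheme or number of bins is an assumption; the function name and the toy inputs are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: 0/1 outcomes of the same length.
    Uses equal-width bins, which is a common convention and an assumption here.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, outcome))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(outcome for _, outcome in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example: the estimator claims 90% confidence but is right 3 times out of 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Toy example: 50% confidence, right half the time, so no calibration gap.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
print(round(overconfident, 2), round(calibrated, 2))
```

A lower value means stated confidence tracks observed accuracy more closely, which is the sense in which 5.7% is better calibrated than 25.5%.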

Evaluation and Benchmarking · AI Safety Research · Expected Calibration Error · Activation Oracles · Qwen3-4B

5 · OpenAI Blog · 1mo ago

Teaching Models to Express Their Uncertainty in Words

OpenAI published research on training language models to verbally express their own uncertainty rather than stating answers with uniform confidence. The work explores calibration of model outputs through natural language hedging, aiming to make models more honest about what they do and do not know. This is an early contribution to the broader alignment and calibration research agenda.

Evaluation and Benchmarking · Alignment and RLHF · Verbal Uncertainty Expression · Uncertainty Calibration · OpenAI

6 · arXiv · cs.CL · 24d ago

MUSE Framework Disentangles Sycophancy from Epistemic Uncertainty in LLM Conformity

This paper introduces MUSE, a two-stage evaluation framework that separates two distinct mechanisms driving LLM conformity to user pushback: sycophantic conformity (yielding despite high certainty) and uncertainty-driven conformity (yielding proportional to epistemic uncertainty). The authors demonstrate that prior work's attribution of all conformity to RLHF-induced sycophancy is incomplete, as a model's inference-time uncertainty is an independent contributing factor. Ablation studies show both conformity types increase with perceived user expertise and plausibility of user suggestions, pointing toward distinct intervention strategies for each mechanism.

Evaluation and Benchmarking · AI Safety Research · Reinforcement Learning from Human Feedback · MUSE · epistemic uncertainty

6 · arXiv · cs.CL · 25d ago

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: on the GSM8K, MATH, and HotpotQA benchmarks, meaning-bearing perturbations (paraphrase, synonym substitution) produce final-answer inconsistency at a rate roughly 19.69 percentage points higher than presentation-level perturbations (formatting, reordering) of comparable severity. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps; two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and its pre-registered replication.

Evaluation and Benchmarking · AI Safety Research · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence