5 · arXiv cs.CL (Computation and Language) · 25d ago

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

This paper investigates uncertainty quantification (UQ) for activation oracles, systems that make LLM internal activations human-legible, by evaluating 6 confidence estimation methods on 6,000 samples per oracle. The authors find that bootstrap mode frequency achieves the best calibration (expected calibration error, ECE, of 5.7% vs. 25.5% for the log-probability baseline on Qwen3-8B), while the log-probability baseline remains useful as a cheap triage signal. Experiments vary verbalizer and context prompts across two Qwen3 model sizes. Code and a patched trainer are publicly released.
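For readers unfamiliar with the metrics named above, here is a minimal sketch of how ECE and a mode-frequency confidence score are commonly computed. It is not the authors' released code: the equal-width binning, the function names, and the reading of "mode frequency" as the share of resampled answers that agree with the most common one are all assumptions made for illustration.

```python
from collections import Counter


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


def mode_frequency(samples):
    """Assumed definition: fraction of samples equal to the most common answer."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)


# Hypothetical data: four resampled oracle answers for one input.
confidence = mode_frequency(["cat", "cat", "dog", "cat"])
# Hypothetical evaluation set: stated confidences and whether each answer was right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower ECE means stated confidence tracks observed accuracy more closely, which is the sense in which the summary compares 5.7% against 25.5%.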

Evaluation and Benchmarking · AI Safety Research · Expected Calibration Error · Activation Oracles · Qwen3-4B · Federico Torrielli · Qwen3.6-27B · Bootstrap Mode Frequency · Uncertainty Quantification

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 17d ago

Framework for quantifying faithful confidence expression in large reasoning models

A new arXiv preprint introduces a framework to measure faithful calibration (FC) in large reasoning models (LRMs)—the alignment between a model's intrinsic confidence and its linguistically expressed confidence. The authors analyze linguistic decisiveness against three internal uncertainty sources (token probabilities, hidden states, sampled response consistency) and introduce prefix-conditioned sampling to handle structural variation in chain-of-thought traces. Applying the framework across leading models, they find FC is a significant and distinct failure mode for LRMs: extended reasoning traces do not automatically improve calibration, prompt interventions that help non-reasoning models fail in the reasoning setting, and different confidence estimators produce divergent assessments of the same traces.

Frontier Model Releases · Evaluation and Benchmarking · Quantifying Faithful Confidence Expression in Large Reasoning Models

5 · arXiv · cs.CL · 23d ago

Systematic Study of LLM Linguistic Uncertainty Markers and Intrinsic Confidence Calibration

This paper introduces 'marker internal confidence' (MIC) as a formalization of the intrinsic confidence a model associates with epistemic markers (e.g., 'it is likely...') in a given task domain. The authors present 7 metrics to evaluate MIC stability within and across distributions, finding that LLMs remain miscalibrated even under model-centric interpretation of marker meanings. Models struggle to differentiate markers by internal confidence across distributions, though they preserve a somewhat consistent ranking order across tasks. The work provides complementary evidence toward understanding faithful calibration in LLMs and highlights the need for more stable, aligned marker use.

Evaluation and Benchmarking · AI Safety Research · large language models · Marker Internal Confidence (MIC) · epistemic markers

5 · OpenAI Blog · 1mo ago

Teaching Models to Express Their Uncertainty in Words

OpenAI published research on training language models to verbally express their own uncertainty rather than stating answers with uniform confidence. The work explores calibration of model outputs through natural language hedging, aiming to make models more honest about what they do and do not know. This is an early contribution to the broader alignment and calibration research agenda.

Evaluation and Benchmarking · Alignment and RLHF · Verbal Uncertainty Expression · Uncertainty Calibration · OpenAI

7 · arXiv · cs.CL · 4d ago

Language models linearly encode a 'value axis' tracking expected goal success, study finds

Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.

AI Safety Research · Alignment and RLHF · The Value Axis: Language Models Encode Whether They're on the Right Track · Direct Preference Optimization (DPO) · Qwen3-4B

5 · arXiv · cs.AI · 23d ago

Reverse Probing: Supervised Token-level Uncertainty Quantification for LLMs in Clinical Text

The paper introduces Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical text summarization that estimates token-level uncertainty from pre-existing labeled summaries rather than sampling new outputs. It extracts uncertainty signals from four categories of internal model activations, treating text as a probe into the model's internal state. Evaluated on two expert-annotated clinical datasets, it outperforms eight adapted baselines on all metrics, achieving up to 4× higher AUPRC while reducing inference time and compute. Feature analysis identifies delta energy and neighborhood context as the most consistent predictors of uncertainty across models.

Evaluation and Benchmarking · AI Safety Research · Reverse Probing · delta energy · AUPRC

5 · arXiv · cs.CL · 11d ago

Three-axis uncertainty estimation framework for code generation outperforms NL-derived baselines

A new arXiv preprint argues that uncertainty estimation (UE) for code generation requires code-specific design rather than methods ported from natural language. The authors propose three orthogonal uncertainty axes—lexical (token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency)—grounded in properties unique to code: token fragility, intent-code gap, and executability. Evaluated across five code LLMs, their ensemble improves average AUROC from 0.696 to 0.776 (+8.1 points) over the strongest NL-derived baseline, with a single-pass token entropy method on Qwen3-14B matching multi-pass baselines at 3x lower cost. The work is directly relevant to safe deployment of LLMs in agentic coding pipelines.
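The lexical axis in this summary is described only as "token entropy". As a rough illustration, the sketch below computes the standard Shannon entropy of each next-token distribution and averages it over a generated sequence. The function names, the averaging step, and the toy distributions are assumptions for illustration; the summary does not specify how the paper aggregates per-token values.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_distributions):
    """Average per-token entropy over all decoding steps of one generation."""
    entropies = [token_entropy(dist) for dist in step_distributions]
    return sum(entropies) / len(entropies)


# Hypothetical two-step generation: one confident step, one evenly split step.
steps = [[1.0], [0.5, 0.5]]
score = mean_token_entropy(steps)
```

Because this needs only the probabilities from a single forward pass, it is cheap compared with methods that sample many completions, which matches the single-pass versus multi-pass contrast the summary draws.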

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen3-14B · Code Is More Than Text: Uncertainty Estimation for Code Generation

6 · arXiv · cs.CL · 25d ago

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency about 19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across the GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (Qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and its pre-registered replication.

Evaluation and Benchmarking · AI Safety Research · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence

7 · arXiv · cs.LG · 23d ago

Ω-QVLA: Training-Free W4A4 Quantization for Full Vision-Language-Action Models Including Diffusion Action Heads

Omega-QVLA is a post-training quantization framework that compresses both the LLM backbone and the diffusion-based action head of VLA models to uniform W4A4 precision without mixed-precision schemes or fine-tuning. It combines composite SVD-Hadamard rotation for weight energy equalization with per-step DiT activation scaling to handle dynamic-range drift across denoising steps. On the LIBERO benchmark, it achieves 98.0% and 87.8% task success on Pi 0.5 and GR00T N1.5 respectively—matching or exceeding FP16 baselines—while reducing static memory footprint by 71.3%. Real-world manipulation experiments confirm the approach generalizes beyond simulation.

Inference Economics · Agent and Tool Ecosystem · Pi 0.5 · SVD-Hadamard rotation · LIBERO