
The Value Axis: Language Models Encode Whether They're on the Right Track

paper · active · the-value-axis-language-models-encode-whether-they-re-on-the-right-track-61633f04 · 1 event · first seen Jun 16, 2026

Aliases: The Value Axis: Language Models Encode Whether They're on the Right Track

Co-occurring entities

Direct Preference Optimization (DPO) · Qwen3-4B

More like this (12)

Reasoning Language Models
Small Vision-Language Models Know When They Are Wrong But Cannot Say So
What, Where, and How: Disentangling the Roles of Task, Language, and Model in Code Model Representations
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Language Models as Measurement Apparatus for Culture
OptimismBench: Forecasting Bias and the Alignment Effect in Language Model Judgment
From Found to Designed: Concepts as a Design Axis for Large Language Models
Arithmetic Pedagogy for Language Models
Understanding the Impact of Linguistic Realization Choices on LLM Stance with Causal Tracing
Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Transformer Language Models

Recent events (1)

7 · arXiv · cs.CL · Jun 16, 2026

Language models linearly encode a 'value axis' tracking expected goal success, study finds

Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.
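The summary does not say how the axis is fitted or applied. As a rough illustration of what "linearly encode" and "steering along this axis" can mean, the sketch below assumes a difference-of-means direction over synthetic stand-in activations. The function names (`fit_axis`, `project`, `steer`), the dimensionality, and the data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_axis(high, low):
    """Unit vector pointing from the low-value mean to the high-value mean.

    Assumption: a difference-of-means direction; the paper's actual
    construction is not described in the summary.
    """
    direction = high.mean(axis=0) - low.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project(acts, axis):
    """Scalar 'value' readout: projection of each activation onto the axis."""
    return acts @ axis

def steer(acts, axis, alpha):
    """Shift activations by alpha units along the axis."""
    return acts + alpha * axis

# Synthetic stand-ins for hidden states (not real model activations).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
high = rng.normal(size=(200, dim)) + 2.0 * true_dir  # "on track" examples
low = rng.normal(size=(200, dim)) - 2.0 * true_dir   # "off track" examples

axis = fit_axis(high, low)
scores_high = project(high, axis)
scores_low = project(low, axis)
steered = steer(low, axis, alpha=3.0)
```

Because the axis has unit length, steering by `alpha` raises each projection by exactly `alpha`, which is what makes a scalar readout along a single direction convenient for this kind of intervention.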

AI Safety Research · Alignment and RLHF · The Value Axis: Language Models Encode Whether They're on the Right Track · Direct Preference Optimization (DPO) · Qwen3-4B