The Value Axis: Language Models Encode Whether They're on the Right Track
Language models linearly encode a 'value axis' tracking expected goal success, study finds
Researchers construct a 'value axis' in the activation space of Qwen3-8B using synthetic in-context reinforcement learning (RL) data. The axis separates high from low confidence, backtracking from non-backtracking rollouts, and correct from corrupted code. Steering along it causally modulates self-correction behavior and verbosity, and DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis indicates that, after post-training, Qwen assigns low internal value to politically sensitive queries, and that supervised fine-tuning (SFT) increases domain-specific confidence. The findings suggest that LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.
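To make the idea of a linear axis and of steering along it concrete, here is a minimal sketch on synthetic vectors. It assumes a difference-of-means direction, which is one common way to obtain a linear axis and is not necessarily how the study built its value axis. The random arrays stand in for model activations, and the names `fit_axis`, `project`, and `steer` are hypothetical helpers introduced for illustration only.

```python
import numpy as np

def fit_axis(high, low):
    """Unit vector pointing from the low-value mean to the high-value mean.

    Assumption: difference of class means; the study's construction may differ.
    """
    direction = high.mean(axis=0) - low.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project(acts, axis):
    """Scalar 'value' readout: dot product of each activation with the axis."""
    return acts @ axis

def steer(acts, axis, alpha):
    """Shift every activation by alpha along the (unit-norm) axis."""
    return acts + alpha * axis

# Synthetic stand-ins for activations (not real Qwen3-8B hidden states).
rng = np.random.default_rng(0)
d_model = 64
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

# "High-value" and "low-value" states differ only along true_dir, plus noise.
high = rng.normal(size=(200, d_model)) + 2.0 * true_dir
low = rng.normal(size=(200, d_model)) - 2.0 * true_dir

axis = fit_axis(high, low)
high_scores = project(high, axis)
low_scores = project(low, axis)

# Steering the low-value states raises their readout by exactly alpha,
# because the axis has unit norm.
steered_scores = project(steer(low, axis, 3.0), axis)
```

In this toy setting the projection separates the two groups on average, and adding a multiple of the axis moves the readout by that multiple. With a real model, the same two operations would be applied to hidden states captured at a chosen layer, which is beyond what this sketch covers.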