Functional Welfare Axis
functional-welfare-axis-b27c027e · 1 event · first seen 18d ago
Aliases: Functional Welfare Axis
Recent events (1)
Reinforcement Learning Recruits a Pre-Existing 'Functional Welfare' Axis in Language Models
Researchers trained language models in a semantically neutral maze environment and extracted concept vectors for rewarded and punished trajectories. They found that RL recruits a pre-existing representational axis encoding functional welfare: how well or badly the system is doing relative to its goals. The punishment vector promotes failure tokens, aligns with negative emotion concepts, and induces refusal and uncertainty when used for steering; the reward vector is its near-antiparallel mirror. Critically, these vectors are already effective in models before maze training and also appear in pretrain-only models, suggesting that the welfare axis predates post-training rather than being created by it. The findings bear on interpretability, alignment, and the question of how minimal reward signals can broadly reshape model behavior.
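To make "concept vector", "near-antiparallel", and "steering" concrete, here is a minimal sketch in plain Python. The summary does not say how the researchers extracted their vectors, so the difference-of-means approach below is an assumption, chosen because it is a common way to build such directions. The activations are synthetic random numbers standing in for real model hidden states, and every name (`mean_vector`, `steer`, `axis`, `sample`) is hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in a))


def cosine(a, b):
    """Cosine similarity: +1 parallel, -1 antiparallel."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden state."""
    length = norm(direction)
    return [h + alpha * d / length for h, d in zip(hidden, direction)]


# Synthetic stand-in for hidden states (assumption: NOT real model data).
rng = random.Random(0)
DIM = 16
axis = [rng.gauss(0, 1) for _ in range(DIM)]  # toy "welfare" direction


def sample(shift, n=200):
    """Toy activations: shifted along `axis`, plus Gaussian noise."""
    return [[shift * a + rng.gauss(0, 0.5) for a in axis] for _ in range(n)]


rewarded = sample(+1.0)  # stands in for rewarded trajectories
punished = sample(-1.0)  # stands in for punished trajectories
neutral = sample(0.0)    # stands in for a neutral baseline

# Difference-of-means concept vectors relative to the neutral baseline.
baseline = mean_vector(neutral)
reward_vec = subtract(mean_vector(rewarded), baseline)
punish_vec = subtract(mean_vector(punished), baseline)

similarity = cosine(reward_vec, punish_vec)
print(f"cosine(reward, punish) = {similarity:.3f}")

# Steering: push a hidden state along the punishment direction.
steered = steer(baseline, punish_vec, alpha=2.0)
```

In this toy setup the two vectors come out strongly antiparallel by construction, since the synthetic classes were shifted in opposite directions along one axis. In a real model that relationship is an empirical finding rather than a given, and steering would be applied to the residual stream during a forward pass rather than to a standalone list.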