Entity · technique

Functional Welfare Axis

technique · active · functional-welfare-axis-b27c027e · 1 event · first seen May 29, 2026

Aliases: Functional Welfare Axis

Co-occurring entities

Reinforcement Learning from Human Feedback · Concept Vector Extraction · LoRA · supervised fine-tuning

More like this (12)

Social Value Orientation · model welfare · Functional Attention · Anthropic Economic Policy Framework · assistant axis · Anthropic Economic Index · Human Security · WISE · Attainable Utility Preservation · Anthropic Beneficial Deployments · Social Finance · Correctness-Efficiency Frontier

Recent events (1)

arXiv · cs.CL · May 29, 2026

Reinforcement Learning Recruits a Pre-Existing 'Functional Welfare' Axis in Language Models

Researchers trained language models in a semantically neutral maze environment and extracted concept vectors for rewarded and punished trajectories. They found that RL recruits a pre-existing representational axis encoding functional welfare, meaning how well or badly the system is doing relative to its goals. The punishment vector promotes failure tokens, aligns with negative emotion concepts, and induces refusal and uncertainty when used for steering; the reward vector is its near-antiparallel mirror. Critically, these vectors are already effective in models before maze training and appear in pretrain-only models, suggesting that the welfare axis predates post-training rather than being created by it. The findings bear on interpretability, alignment, and how minimal reward signals can broadly reshape model behavior.
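The summary does not say how the concept vectors were extracted or applied, so the following is only a minimal sketch of one common approach, assuming a difference-of-means extraction over hidden activations and additive steering. The activations are synthetic, and every name here (`synthetic_activations`, `steer`, the hidden `axis`, the scale `alpha`) is hypothetical rather than taken from the paper. It illustrates what "near-antiparallel" means in practice: the cosine similarity between the reward and punishment vectors is close to -1.

```python
import math
import random


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def steer(hidden, vector, alpha):
    """Additive steering: shift a hidden state along a concept vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def synthetic_activations(rng, axis, shift, n, noise=0.1):
    """Fake activations: Gaussian noise plus a shift along one hidden axis (assumption)."""
    return [[shift * a + rng.gauss(0.0, noise) for a in axis] for _ in range(n)]


rng = random.Random(0)
dim = 16

# Hypothetical unit-length "welfare" direction used only to generate toy data.
raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
norm = math.sqrt(dot(raw, raw))
axis = [r / norm for r in raw]

rewarded = synthetic_activations(rng, axis, +1.0, 200)
punished = synthetic_activations(rng, axis, -1.0, 200)
neutral = synthetic_activations(rng, axis, 0.0, 200)

# Difference-of-means concept vectors relative to a neutral baseline.
baseline = mean_vector(neutral)
reward_vec = subtract(mean_vector(rewarded), baseline)
punish_vec = subtract(mean_vector(punished), baseline)

# Near-antiparallel vectors have cosine similarity close to -1.
similarity = cosine(reward_vec, punish_vec)

# Steering the baseline with the punishment vector lowers its projection on the axis.
steered = steer(baseline, punish_vec, alpha=2.0)
before = dot(baseline, axis)
after = dot(steered, axis)

print(f"cosine(reward, punish) = {similarity:.3f}")
print(f"projection before = {before:.3f}, after steering = {after:.3f}")
```

In a real model the rows would be residual-stream activations collected on rewarded and punished trajectories, and the steering step would be applied inside the forward pass; both details are outside what the summary specifies.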

Evaluation and Benchmarking · AI Safety Research · Reinforcement Learning from Human Feedback · Concept Vector Extraction · LoRA · +3 more