Golden Gate Claude
golden-gate-claude-3cbc41dc · 1 event · first seen 13d ago
Aliases: Golden Gate Claude
Recent events (1)
Anthropic demonstrates feature steering in Claude 3 Sonnet via interpretability research
Anthropic released a 24-hour public demo called 'Golden Gate Claude' to illustrate findings from a major interpretability paper on Claude 3 Sonnet. The research identifies millions of internal 'features' (combinations of neurons that activate for specific concepts) and shows that these can be selectively amplified or suppressed to alter the model's behavior without any change to the prompt or any fine-tuning. For the demo, the Golden Gate Bridge feature was amplified, causing the model to reference the bridge in nearly all responses. Anthropic argues that this mechanistic control over internal activations has direct implications for AI safety, including the ability to modulate safety-relevant features such as those tied to deception or dangerous code.
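The idea of amplifying or suppressing a feature can be illustrated with a toy sketch. This is not Anthropic's implementation: it assumes, for illustration only, that a feature corresponds to a single direction in a small activation vector, and that steering means shifting the activation along that direction until its projection hits a chosen target. The names `clamp_feature` and `bridge_direction` and all numeric values are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def clamp_feature(activation, direction, target):
    """Shift `activation` along `direction` so that its projection onto
    the (unit-normalized) direction equals `target`.

    Components orthogonal to the direction are left unchanged.
    """
    d = normalize(direction)
    current = dot(activation, d)
    return [a + (target - current) * di for a, di in zip(activation, d)]


# Hypothetical 4-dimensional activation and feature direction.
activation = [0.5, -1.0, 2.0, 0.0]
bridge_direction = [0.0, 3.0, 4.0, 0.0]

# Amplify: force the feature's activation to a large value.
steered = clamp_feature(activation, bridge_direction, target=10.0)

# Suppress: force the feature's activation to zero.
suppressed = clamp_feature(activation, bridge_direction, target=0.0)
```

In this sketch, amplification and suppression are the same operation with different targets, and nothing about the input (the "prompt") or the other weights changes. A real model would apply such an edit to internal activations during the forward pass, at much higher dimensionality.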