technique

mechanistic interpretability

technique · active · mechanistic-interpretability-9e951816 · 10 events · first seen 1mo ago

Aliases: mechanistic interpretability

More like this (12)

automated mechanistic interpretability · interpretability · neural network interpretability · interpretable machine learning · monitorability · Thinking Machines Interaction Model · Explainable AI (XAI) · Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability · Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter · meta-cognitive configurator · outcome indistinguishability · representational inefficiency

Guides (1)

mechanistic interpretability · Concept

Mechanistic Interpretability: Looking Inside the AI Black Box

Recent events (10)

6 · OpenAI Blog · 1mo ago

Understanding Neural Networks Through Sparse Circuits

OpenAI has published work on mechanistic interpretability using a sparse model approach aimed at understanding how neural networks reason internally. The research seeks to make AI systems more transparent by identifying sparse circuits within neural networks. This work is positioned as supporting safer and more reliable AI behavior through improved interpretability.

Evaluation and Benchmarking · AI Safety Research · Sparse Circuits · mechanistic interpretability · OpenAI

6 · arXiv · cs.CL · 1mo ago

Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs

Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Backdoor Circuit Analysis (Language-Switching) · 8B autoregressive language model · +2 more
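The point about orthogonal encoding in the summary above can be illustrated with a toy example. This is only a sketch under stated assumptions: the vectors `lang_dir`, `raw_trigger`, and `hidden` are made-up 4-dimensional stand-ins, not values from the paper, and the "probe" is a plain dot product with the language-identity direction. It shows that a signal written into a direction orthogonal to `lang_dir` leaves such a probe's reading unchanged, however strong the signal is.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(v, d):
    """Return v with its component along direction d removed."""
    scale = dot(v, d) / dot(d, d)
    return [vi - scale * di for vi, di in zip(v, d)]


# Hypothetical language-identity direction (illustrative values only).
lang_dir = [1.0, 2.0, 0.0, -1.0]

# Hypothetical trigger direction, made orthogonal to lang_dir by construction.
raw_trigger = [0.5, -1.0, 3.0, 2.0]
trigger_dir = project_out(raw_trigger, lang_dir)

# A hidden state before and after a strong trigger signal is added.
hidden = [0.2, 0.1, -0.3, 0.4]
patched = [h + 5.0 * t for h, t in zip(hidden, trigger_dir)]

# A probe that reads the language-identity direction sees no change ...
probe_before = dot(hidden, lang_dir)
probe_after = dot(patched, lang_dir)

# ... while a probe along the trigger direction sees a large change.
trigger_shift = dot(patched, trigger_dir) - dot(hidden, trigger_dir)
```

In this toy setting `probe_before` and `probe_after` agree up to floating-point error, which is the sense in which a defense that only inspects language-like directions would miss the signal.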

7 · OpenAI Blog · 1mo ago

Extracting Concepts from GPT-4: 16 Million Patterns via Sparse Autoencoders

OpenAI applied scaled sparse autoencoders (SAEs) to GPT-4 to automatically identify approximately 16 million interpretable features or patterns in the model's internal computations. This represents a significant scaling of mechanistic interpretability techniques previously demonstrated on smaller models. The work advances the ability to understand what concepts and representations large frontier models encode internally.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Sparse Autoencoder · OpenAI · +1 more
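For readers unfamiliar with sparse autoencoders, the forward pass of one common variant can be sketched in a few lines. This is a minimal illustration, not the architecture from the OpenAI work: the summary above does not specify the design, so the top-k activation, the tied initialization, the tiny dimensions, and the function name `topk_sae_forward` are all assumptions made here. Plain Python lists keep it dependency-free; a real implementation would use an array library and a training loop.

```python
import random


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Return (features, reconstruction) for one activation vector x.

    W_enc and W_dec each hold n_features rows of length d_model; row j of
    W_dec is the direction that feature j writes back into activation space.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(W_enc, b_enc)]

    # Keep only the k largest pre-activations (clamped at zero); zero the rest.
    top = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k]
    feats = [0.0] * len(pre)
    for j in top:
        feats[j] = max(pre[j], 0.0)

    # Reconstruct the activation as a sparse sum of decoder directions.
    recon = list(b_dec)
    for j in top:
        for i, w in enumerate(W_dec[j]):
            recon[i] += feats[j] * w
    return feats, recon


random.seed(0)
d_model, n_features, k = 8, 32, 4  # toy sizes, chosen for illustration
W_enc = [[random.gauss(0, 1) / d_model ** 0.5 for _ in range(d_model)]
         for _ in range(n_features)]
W_dec = [row[:] for row in W_enc]  # tied initialization, an illustrative choice
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]
feats, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active = sum(1 for f in feats if f != 0.0)  # at most k features fire
```

Training would adjust the weights to minimize the reconstruction error between `x` and `recon`; interpretability work then inspects which inputs make each of the sparse features fire.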

5 · arXiv · cs.CL · 1mo ago

Conditional Scale Entropy: A Wavelet-Derived Tool for Mechanistic Interpretability of Metaphor Processing in Transformers

This paper introduces Conditional Scale Entropy (CSE), a wavelet-derived measure of how transformer computation engages across frequency scales at each layer, and applies it to study metaphor processing in decoder-only language models. The authors prove CSE is invariant to update magnitude, isolating structural computation patterns from intensity. Across architectures ranging from GPT-2 (124M) to LLaMA-2 7B and GPT-oss 20B, metaphorical tokens consistently produce higher spectral breadth than literal tokens in early-to-mid layers, with the effect surviving permutation correction and specificity controls. The work establishes multi-scale coordination as a consistent mechanistic signature of metaphorical language processing and positions CSE as a general interpretability tool for cross-depth structure in transformers.

Evaluation and Benchmarking · AI Safety Research · Conditional Scale Entropy · mechanistic interpretability · GPT-2 · +3 more

6 · arXiv · cs.CL · 22d ago

Do Language Models Track Entities Across State Changes?

This paper investigates the mechanistic basis of entity tracking (ET) in transformer language models under realistic, multi-operation scenarios involving state changes (PUT, REMOVE, MOVE). The authors find that LMs do not incrementally update world states but instead aggregate relevant information in parallel at the final token once a query is apparent. A key finding is that the REMOVE operation is implemented via a fragile global suppression tag, which predicts specific failure modes confirmed behaviorally. The authors propose a mechanistic fix—nullifying this tag—and argue that behavioral and mechanistic analyses can productively inform each other.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Entity Tracking · Global Suppression Tag · +1 more

7 · OpenAI Blog · 1mo ago

Toward understanding and preventing misalignment generalization

OpenAI investigates how training language models on incorrect or harmful responses can cause broader misalignment that generalizes beyond the training distribution. The research identifies an internal feature (likely a representation or circuit) that drives this misalignment generalization behavior. Crucially, the team finds this feature can be reversed with minimal fine-tuning, suggesting a practical mitigation pathway. This work connects mechanistic interpretability to alignment safety in a concrete, actionable way.

Evaluation and Benchmarking · AI Safety Research · emergent misalignment · mechanistic interpretability · OpenAI · +2 more

6 · arXiv · cs.CL · 24d ago

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves a 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

Training Infrastructure · Evaluation and Benchmarking · mechanistic interpretability · GRPO · Reinforcement Learning from Human Feedback · +6 more
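The idea of using SAE-space clusters to diversify training batches can be sketched as follows. This is a hypothetical illustration, not SAERL's procedure: the summary above does not describe the clustering method, so grouping examples by their most active SAE feature is a stand-in chosen here for brevity, and the names `dominant_feature` and `diverse_batches` and the toy feature vectors are invented for the example.

```python
from collections import defaultdict
from itertools import zip_longest


def dominant_feature(feats):
    """Index of the most active SAE feature (a stand-in for real clustering)."""
    return max(range(len(feats)), key=lambda j: feats[j])


def diverse_batches(examples, batch_size):
    """Group examples by cluster, then interleave clusters into batches.

    examples: list of (example_id, sae_feature_vector) pairs.
    """
    clusters = defaultdict(list)
    for ex_id, feats in examples:
        clusters[dominant_feature(feats)].append(ex_id)

    # Round-robin across clusters so neighbouring examples differ in cluster.
    interleaved = [ex for group in zip_longest(*clusters.values())
                   for ex in group if ex is not None]
    return [interleaved[i:i + batch_size]
            for i in range(0, len(interleaved), batch_size)]


# Toy SAE feature vectors (made-up values) for six training prompts.
examples = [
    ("a1", [0.9, 0.0, 0.1]),
    ("a2", [0.8, 0.1, 0.0]),
    ("b1", [0.0, 0.7, 0.2]),
    ("b2", [0.1, 0.9, 0.0]),
    ("c1", [0.0, 0.2, 0.6]),
    ("c2", [0.1, 0.0, 0.5]),
]
batches = diverse_batches(examples, batch_size=3)
# batches == [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
```

Each batch here draws one example from every cluster, which is the kind of diversity control the summary attributes to the framework.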

6 · Anthropic News · 26d ago

Anthropic co-founder Chris Olah speaks at Vatican on Pope Leo XIV's AI encyclical 'Magnifica humanitas'

Pope Leo XIV released an encyclical titled 'Magnifica humanitas: On safeguarding the human person in the time of artificial intelligence' on May 25, 2026, and Anthropic co-founder Chris Olah was invited to speak at its presentation in Vatican City. Olah acknowledged that frontier AI labs operate under incentives that can conflict with doing the right thing, and called for external moral voices—including religious institutions—to serve as informed critics of AI development. He highlighted three areas requiring discernment: AI's impact on the global poor and labor displacement, the conditions for human flourishing in an AI-saturated world, and the uncertain nature of AI models themselves, noting that his interpretability research has found internal states that functionally mirror emotions. The remarks represent Anthropic's effort to broaden the AI governance conversation beyond the technical community.

AI Safety Research · Regulatory Developments · mechanistic interpretability · Magnifica humanitas · Vatican City · +5 more

8 · Anthropic News · 18d ago

Anthropic raises Series E at $61.5B post-money valuation

Anthropic has closed a $3.5 billion Series E round at a $61.5 billion post-money valuation, led by Lightspeed Venture Partners with participation from Bessemer, Cisco, Fidelity, General Catalyst, Salesforce Ventures, and others. Proceeds will fund next-generation AI system development, expanded compute capacity, mechanistic interpretability and alignment research, and international expansion. The raise follows the launch of Claude 3.7 Sonnet and Claude Code, with Anthropic citing strong enterprise adoption across customers including Cursor, Zoom, Snowflake, Pfizer, and Amazon's Alexa+.

Training Infrastructure · Frontier Model Releases · mechanistic interpretability · Salesforce Ventures · Replit · +15 more

6 · Anthropic News · 18d ago

Anthropic publishes foundational 'Core Views on AI Safety' position paper

Anthropic released a detailed position paper outlining their core views on AI safety, arguing that transformative AI could arrive within a decade driven by predictable scaling laws, and that no one currently knows how to train powerful AI systems to robustly behave well. The document explains Anthropic's founding rationale and research strategy, highlighting four priority areas: scaling supervision, mechanistic interpretability, process-oriented learning, and understanding AI generalization. Originally published March 2023, this represents Anthropic's canonical public statement of their safety philosophy and strategic priorities.

AI Safety Research · Alignment and RLHF · GPT-3 · mechanistic interpretability · Anthropic