6 · arXiv cs.CL (Computation and Language) · 15d ago

Phantom specialization in circuit discovery: structural differences don't imply distinct mechanisms

A new arXiv preprint challenges a core assumption in mechanistic interpretability: that structurally different circuits discovered for the same task imply distinct computational mechanisms. Using Literal Sequence Copying across token-frequency bands in five Pythia models (70M–1.4B), the authors extract 75 circuits and show that structurally distinct circuits implement the same computation, with band-specific edges transferring broadly and a shared core recovering ≥99% of circuit performance. The paper introduces the term 'phantom specialization' for this pattern and argues that standard source-level evaluation inflates apparent faithfulness, while edge-level evaluation and cross-condition transfer tests are needed to detect the many-to-one mapping from structure to function.
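The structure-versus-function distinction in the summary above can be illustrated with a minimal sketch. It assumes a circuit is represented as a set of (source, target) edges and that a "shared core" is simply the set of edges present in every circuit; both are assumptions made for illustration, and the node names, band labels, and helper functions are hypothetical rather than taken from the paper. The sketch only measures structural overlap. Checking whether the core recovers circuit performance, or whether band-specific edges transfer, would require running the model.

```python
# Hypothetical toy circuits, one per token-frequency band.
# Each circuit is a set of (source, target) edges; all names are illustrative.
circuits = {
    "high_freq": {
        ("embed", "a0.h1"), ("a0.h1", "a1.h3"), ("a1.h3", "logits"),
        ("a0.h2", "a1.h3"),
    },
    "mid_freq": {
        ("embed", "a0.h1"), ("a0.h1", "a1.h3"), ("a1.h3", "logits"),
        ("a0.h5", "a1.h3"),
    },
    "low_freq": {
        ("embed", "a0.h1"), ("a0.h1", "a1.h3"), ("a1.h3", "logits"),
        ("embed", "a0.h7"),
    },
}


def shared_core(circuits):
    """Edges present in every circuit (one possible definition of a core)."""
    edge_sets = list(circuits.values())
    if not edge_sets:
        return set()
    return set.intersection(*edge_sets)


def band_specific(circuits):
    """Edges of each circuit that fall outside the shared core."""
    core = shared_core(circuits)
    return {band: edges - core for band, edges in circuits.items()}


def jaccard(a, b):
    """Structural overlap of two edge sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


core = shared_core(circuits)
extras = band_specific(circuits)
overlap = jaccard(circuits["high_freq"], circuits["mid_freq"])
```

In this toy setup every pair of circuits differs structurally (the Jaccard overlap is below 1), yet all of them contain the same three-edge core. That is the kind of pattern a structure-only comparison would read as specialization.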

Evaluation and Benchmarking · AI Safety Research · Pythia · Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · OpenAI Blog · 1mo ago

Understanding Neural Networks Through Sparse Circuits

OpenAI has published work on mechanistic interpretability using a sparse model approach aimed at understanding how neural networks reason internally. The research seeks to make AI systems more transparent by identifying sparse circuits within neural networks. This work is positioned as supporting safer and more reliable AI behavior through improved interpretability.

Evaluation and Benchmarking · AI Safety Research · Sparse Circuits · mechanistic interpretability · OpenAI

5 · arXiv · cs.CL · 4d ago

Post-hoc falsification operators for frozen small code models fail to beat Best-of-N in leakage-free evaluation

A measurement study evaluates 26 post-hoc operators (selection, verification, repair, elimination, portfolios) applied to frozen small code models (≤1.5B parameters) against a Best-of-N baseline under a strict leakage-free, matched-compute protocol. None of the semantic operators improves held-out accuracy over BoN, with the failure traced to three structural mechanisms: a coverage wall, a capability scissors, and a near-empty consensus trap. Two non-semantic operators do provide value: an expression-layer recovery method (M1) lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop saves ~19% compute with no accuracy harm. The paper's core lesson is that harness quality and coverage measurement should precede investment in semantic post-hoc reasoning.

Evaluation and Benchmarking · Inference Economics · Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models · deepseek-coder · Best-of-N · +2 more

5 · arXiv · cs.LG · 1mo ago

Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find

This paper distinguishes two protocols for measuring transformer layer redundancy—replacement (can one layer substitute for another in place?) and interchange (do two layers approximately commute when swapped?)—and shows they can disagree substantially. Experiments on Pythia (410M, 1.4B) and 8B-scale models (Qwen3-8B, Llama-3.1-8B) reveal that the protocol gap grows during training and can change which layers appear safe to prune by several-fold. Notably, Qwen3-8B shows interchange-guided removal is far safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols despite lower interchange KL. The authors recommend scoring both swap-KL metrics before any layer removal or merging, requiring only unlabeled forward passes.

Evaluation and Benchmarking · Inference Economics · WikiText-2 · layer pruning · Pythia · +3 more

6 · arXiv · cs.LG · 2d ago

Program synthesis used to reverse-engineer transformer attention heads with executable Python surrogates

Researchers propose a pipeline that approximates transformer attention heads with executable Python programs generated by a language model, then re-ranked by held-out predictive accuracy. Applied to GPT-2, TinyLlama-1.1B, and Llama-3B, fewer than 1,000 programs reproduce attention patterns with >75% average IoU similarity on TinyStories. Replacing 25% of attention heads with programmatic surrogates incurs only a 16% average perplexity increase while preserving downstream QA performance, demonstrating a path toward symbolic transparency in neural models.

Evaluation and Benchmarking · AI Safety Research · Llama 3.2 · GPT-2 · Explaining Attention with Program Synthesis · +2 more

6 · arXiv · cs.CL · 10d ago

PhantomBench: Large-scale benchmark reveals staggering hallucination rates on non-existent concepts

PhantomBench is a new benchmark comprising over 60,000 non-existent terms and entities derived from real concepts, designed to test whether language models can recognize the limits of their knowledge. Evaluating 21 models of various types and sizes, the authors find average hallucination rates as high as 86.7%, with even frontier models failing to abstain when inputs presuppose the existence of fabricated concepts. The benchmark also serves as a proxy for studying model behavior on rare real concepts, and includes a pipeline for scalable generation of custom non-existent concept sets.

Evaluation and Benchmarking · AI Safety Research · PhantomBench

7 · Anthropic News · 16d ago

Anthropic demonstrates feature steering in Claude 3 Sonnet via interpretability research

Anthropic released a 24-hour public demo called 'Golden Gate Claude' to illustrate findings from a major interpretability paper on Claude 3 Sonnet. The research identifies millions of internal 'features' — neuron combinations that activate for specific concepts — and shows these can be surgically amplified or suppressed to alter model behavior without prompting or fine-tuning. The Golden Gate Bridge feature was amplified as a demonstration, causing the model to reference the bridge in nearly all responses. Anthropic argues this mechanistic control over internal activations has direct implications for AI safety, including the ability to modulate safety-relevant features like those tied to deception or dangerous code.

AI Safety Research · Alignment and RLHF · Golden Gate Claude · Claude 3 Sonnet · Anthropic

5 · arXiv · cs.LG · 5d ago

Paper argues Compressed Computation toy model is not computation in superposition

A new arXiv preprint challenges the Compressed Computation (CC) toy model introduced by Braun et al. (2025), which appeared to compute 100 ReLU functions using only 50 neurons. The authors show that apparent performance gains arise from unintended input mixing via a noisy residual stream rather than genuine superposition, with learned neuron directions concentrating in the subspace of the top 50 eigenvalues of the mixing matrix. A semi-non-negative matrix factorization baseline derived purely from the mixing matrix reproduces the qualitative loss profile, supporting the conclusion that CC is not a valid toy model of computation in superposition.

Evaluation and Benchmarking · AI Safety Research · superposition · Compressed Computation is (probably) not Computation in Superposition · Braun et al. 2025 · Compressed Computation

5 · arXiv · cs.AI · 4d ago

Internal Oppenheim-Lim test reveals phase/sign identity codes shared across image classifier architectures

A new arXiv preprint applies a causal intervention inspired by Oppenheim and Lim (1981) to probe whether trained image classifiers encode identity in Fourier phase rather than magnitude within their hidden layers. By transplanting phase or sign components between images at chosen layers in PRISM2D, GFNet, ViT-B/16, and ResNet-50, the authors find that predictions follow the phase/sign donor across all tested architectures, with image-specific magnitude largely dispensable. ResNet-50 requires a pre-ReLU intervention to reveal a latent sign code, exposing how rectification and readout geometry shape the basis in which the code is expressed. The findings offer a mechanistic account of the texture–shape gap between CNNs and attention-based models.

Evaluation and Benchmarking · ViT-B/16 · GFNet · PRISM2D · +2 more