4 · arXiv cs.AI (Artificial Intelligence) · 3d ago

Sparsity regularizers improve interpretability of Top-k sparse autoencoders for vision models

A new arXiv preprint proposes two sparsity regularizers compatible with Top-k sparse autoencoders (SAEs), a standard tool for mechanistic interpretability of vision foundation models. The regularizers, an ℓ1 penalty on off-support units and a scale-invariant ℓ1/ℓ2-ratio penalty, are applied before Top-k selection. Across two datasets and three vision models, they consistently improve monosemanticity without degrading reconstruction quality. The central finding is that hard architectural sparsity and soft regularization are complementary: the soft penalties address known limitations of fixed-budget Top-k SAEs, such as overfitting to the value of k used in training.

Evaluation and Benchmarking · AI Safety Research · Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders · Sparse Autoencoder
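
As a rough illustration of how the two penalties could sit alongside Top-k selection, the sketch below computes them for a single vector of encoder pre-activations. The summary does not give the paper's exact formulation, so the function names, the use of absolute values, and the choice to compute the ratio penalty over the whole pre-activation vector are assumptions made for illustration only.

```python
import math

def top_k_support(pre_acts, k):
    """Indices of the k largest pre-activations (the Top-k support)."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    return set(order[:k])

def top_k_codes(pre_acts, support):
    """Hard sparsity: keep values on the support, zero everything else."""
    return [v if i in support else 0.0 for i, v in enumerate(pre_acts)]

def off_support_l1(pre_acts, support):
    """Assumed form of the first penalty: L1 norm of the units Top-k discards."""
    return sum(abs(v) for i, v in enumerate(pre_acts) if i not in support)

def l1_l2_ratio(pre_acts, eps=1e-12):
    """Assumed form of the second penalty: ||z||_1 / ||z||_2, unchanged by rescaling z."""
    l1 = sum(abs(v) for v in pre_acts)
    l2 = math.sqrt(sum(v * v for v in pre_acts))
    return l1 / (l2 + eps)

# Toy pre-activations for one input (hypothetical values).
pre_acts = [3.0, 0.5, -0.2, 2.0, 0.1]
support = top_k_support(pre_acts, k=2)
codes = top_k_codes(pre_acts, support)
penalty_off = off_support_l1(pre_acts, support)
penalty_ratio = l1_l2_ratio(pre_acts)
```

In a training loop, terms like these would typically be added to the reconstruction loss with small weights; the weights themselves are not stated in the summary.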

Related events (8)

6 · arXiv · cs.CL · 18d ago

Study finds SAE unstable features reflect reproducible subspaces, not pure noise

A new arXiv paper investigates feature stability in sparse autoencoders (SAEs), measuring the probability that individual learned features reappear across independent training runs. The authors find a functional asymmetry: stable features carry most reconstruction-relevant signal, while unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting seed dependence reflects basis ambiguity rather than noise. A synthetic model confirms that low-rank ground-truth features can be recovered at the subspace level even when individual SAE latents are non-identifiable across seeds. The work has direct implications for interpretability research that relies on SAE features as meaningful, stable units of analysis.

Evaluation and Benchmarking · AI Safety Research · Sparse Autoencoders · Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
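
One simple way to estimate how often a feature reappears across training runs is to match decoder directions by cosine similarity. The sketch below is a minimal version of that idea; the matching rule, the 0.9 threshold, and the toy 2-D directions are assumptions, not the paper's stated procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reappearance_rate(feature, other_runs, threshold=0.9):
    """Fraction of other runs that contain a feature within `threshold` cosine."""
    hits = sum(
        1 for run in other_runs
        if max(cosine(feature, g) for g in run) >= threshold
    )
    return hits / len(other_runs)

# Toy decoder directions from a reference run and two other seeds (hypothetical).
reference = [[1.0, 0.0], [0.6, 0.8]]
other_runs = [
    [[0.99, 0.05], [0.0, 1.0]],
    [[1.0, 0.01], [0.8, 0.6]],
]
rates = [reappearance_rate(f, other_runs) for f in reference]
```

Here the first reference feature finds a close match in both other runs, while the second finds one in only a single run, so it would count as less stable under this criterion.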

6 · arXiv · cs.CL · 1mo ago

SAEs as Stethoscopes: Interpretability-Guided Layer Selection for Task Vector Model Editing

This paper evaluates a Sparse Autoencoder (SAE)-guided model editing pipeline for mathematical reasoning on Gemma-3-4B-IT, finding that projecting task vectors onto SAE feature subspaces discards ~97% of modification energy due to geometric misalignment between activation-space SAE directions and weight-space task vectors. The authors reframe SAEs as diagnostic tools ('stethoscopes') rather than intervention filters ('scalpels'), using SAE-derived specificity scores to identify which layers to inject unfiltered task vectors into. This approach improves Number Theory accuracy from 29.6% to 39.4% on Minerva Math (p=0.0007), with 5 of 7 math subjects significantly improved and none degraded. The method is fully deterministic and adds no inference cost.

Evaluation and Benchmarking · AI Safety Research · Subspace Projection · Gemma-3-4B-IT · Sparse Autoencoders (SAEs)
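
The "discarded energy" figure can be read as one minus the fraction of a vector's squared norm that survives projection onto a subspace. The sketch below computes that fraction for toy vectors. It is a simplification under stated assumptions: the task vector and feature directions are treated as plain vectors in the same space, the helper names are hypothetical, and the numbers are not from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis

def retained_energy(task_vector, feature_directions):
    """Fraction of squared norm kept after projecting onto the feature span."""
    basis = orthonormal_basis(feature_directions)
    projected_sq = sum(dot(task_vector, b) ** 2 for b in basis)
    return projected_sq / dot(task_vector, task_vector)

# Toy example: most of the task vector lies outside the feature subspace.
features = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
task_vector = [1.0, 1.0, 4.0, 4.0]
kept = retained_energy(task_vector, features)
discarded = 1.0 - kept
```

In this toy case the feature span covers only the first two coordinates, so 2/34 of the energy is kept and the rest is discarded, which mirrors the kind of misalignment the summary describes.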

6 · arXiv · cs.CL · 1mo ago

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves a 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

Training Infrastructure · Evaluation and Benchmarking · mechanistic interpretability · GRPO · Reinforcement Learning from Human Feedback

5 · arXiv · cs.CL · 13h ago

VASAE: Vocabulary-Aligned Sparse Autoencoder assigns intrinsic token names to SAE features during training

Researchers introduce VASAE (Vocabulary-Aligned Sparse Autoencoder), a method that trains SAE features with vocabulary-aligned anchoring so each feature is intrinsically named by the nearest token in the model's embedding space. Applied to GPT-2-small and Llama-3.1-8B, VASAE achieves ~90% feature alignment in shallow-to-middle layers without degrading reconstruction quality, though final-layer alignment is limited. The work addresses a longstanding interpretability bottleneck where SAE dictionary features require expensive post-hoc labeling, potentially enabling more scalable mechanistic analysis.

Evaluation and Benchmarking · AI Safety Research · GPT-2-small · VASAE · Llama-3.1-8B
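
The naming step, taken on its own, amounts to a nearest-neighbor lookup in embedding space. The sketch below shows only that lookup, not the training-time anchoring that VASAE adds; the use of cosine similarity, the function names, and the three-token toy vocabulary are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_token(feature_direction, token_embeddings):
    """Return the token whose embedding is most similar to the feature direction."""
    return max(
        token_embeddings,
        key=lambda tok: cosine(feature_direction, token_embeddings[tok]),
    )

# Hypothetical 3-D token embeddings and SAE feature directions.
token_embeddings = {
    "cat": [1.0, 0.1, 0.0],
    "run": [0.0, 1.0, 0.2],
    "blue": [0.1, 0.0, 1.0],
}
features = [[0.9, 0.2, 0.0], [0.0, 0.3, 1.0]]
names = [nearest_token(f, token_embeddings) for f in features]
```

A feature counts as well aligned when its nearest token is also a close match in absolute terms; how the paper thresholds that alignment is not stated in the summary.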

7 · OpenAI Blog · 1mo ago

Extracting Concepts from GPT-4: 16 Million Patterns via Sparse Autoencoders

OpenAI applied scaled sparse autoencoders (SAEs) to GPT-4 to automatically identify approximately 16 million interpretable features or patterns in the model's internal computations. This represents a significant scaling of mechanistic interpretability techniques previously demonstrated on smaller models. The work advances the ability to understand what concepts and representations large frontier models encode internally.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Sparse Autoencoder · OpenAI

5 · arXiv · cs.CL · 19d ago

Predictor-gated bank-wise sparsity recipe for dense-to-sparse LLM upcycling from Qwen2.5-8B

A new arXiv preprint introduces a continual training recipe to convert dense LLMs into channel-sparse models without post-hoc pruning. Starting from a Qwen2.5-8B checkpoint, the method uses a low-rank predictor to gate FFN channel routing, achieving 4x sparsity in FFN intermediate activations via a bank-wise top-k rule at 32K context. The routing module is trained on the main language modeling path, making the resulting sparsity hardware-oriented rather than approximate. The authors also identify and patch a layer-local long-context failure mode on the RULER-CWE benchmark.

Training Infrastructure · Inference Economics · Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs · SwiGLU · RULER-CWE
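
A bank-wise top-k rule can be pictured as splitting the FFN channels into equal groups and keeping a fixed number of channels in each group. The sketch below is a minimal version under stated assumptions: contiguous banks, a bank size of 4 with k=1 (which gives a 4x reduction in active channels), and scores standing in for the predictor's output. None of these specifics come from the summary.

```python
def bankwise_top_k_mask(scores, bank_size, k):
    """Keep the k highest-scoring channels inside each contiguous bank."""
    if len(scores) % bank_size != 0:
        raise ValueError("number of channels must be a multiple of bank_size")
    mask = [False] * len(scores)
    for start in range(0, len(scores), bank_size):
        bank = range(start, start + bank_size)
        for i in sorted(bank, key=lambda j: scores[j], reverse=True)[:k]:
            mask[i] = True
    return mask

# Hypothetical predictor scores for 8 FFN channels, grouped into 2 banks of 4.
scores = [0.9, 0.1, 0.3, 0.2, 0.05, 0.7, 0.6, 0.8]
mask = bankwise_top_k_mask(scores, bank_size=4, k=1)
active_fraction = sum(mask) / len(mask)
```

Because every bank keeps the same number of channels, the active set has a regular layout, which is the usual motivation for bank-wise rules over a single global top-k.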

5 · arXiv · cs.CL · 3d ago

Feature steering via sparse autoencoders reduces look-ahead bias in LLM forecasting

Researchers apply sparse autoencoders to inspect LLM internal states during forecasting tasks, identifying features associated with time-aware versus look-ahead-biased reasoning. Amplifying time-awareness features causally reduces look-ahead bias while preserving general reasoning performance, whereas directly steering look-ahead-bias features has no effect. The work demonstrates that interpretable temporal features can shift LLMs toward more historically grounded forecasting. This is a mechanistic interpretability result with practical implications for LLM-based prediction systems.

Evaluation and Benchmarking · AI Safety Research · Forecasting With LLMs: Improved Generalization Through Feature Steering · Sparse Autoencoders
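
Amplifying a feature is commonly implemented by adding a scaled copy of its SAE decoder direction to a hidden activation. The sketch below shows that additive form. Whether the paper steers in exactly this way is not stated in the summary, and the function name, strength value, and toy vectors are all hypothetical.

```python
def steer(activation, decoder_direction, strength):
    """Add a scaled SAE decoder direction to a hidden activation."""
    if len(activation) != len(decoder_direction):
        raise ValueError("activation and direction must have the same length")
    return [a + strength * d for a, d in zip(activation, decoder_direction)]

# Toy hidden state and a hypothetical "time-awareness" feature direction.
activation = [0.2, -0.1, 0.5]
time_awareness_direction = [0.0, 1.0, 0.0]
steered = steer(activation, time_awareness_direction, strength=2.0)
```

Only the coordinate aligned with the chosen direction changes; the rest of the activation is left untouched, which is why this style of intervention can preserve unrelated behavior.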

6 · arXiv · cs.AI · 21d ago

Sparse AutoEncoder steering reduces Whisper hallucination rate by up to ~5x without fine-tuning

Researchers investigate hallucination detection and mitigation in OpenAI's Whisper ASR model by probing internal encoder representations. They find that both raw activations and Sparse AutoEncoder (SAE) latents encode linearly separable hallucination signals concentrated in deeper layers. SAE-based activation steering reduces hallucination rates from 72.6% to 14.1% (Whisper small) and 86.9% to 27.3% (Whisper large-v3) on non-speech audio, with minimal WER degradation, approaching fine-tuning-level performance without weight updates.

Evaluation and Benchmarking · AI Safety Research · Sparse Autoencoder · OpenAI Whisper
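
The reduction factors implied by the reported rates can be checked directly. The short calculation below uses only the percentages quoted in the summary; the variable names are illustrative.

```python
# Hallucination rates on non-speech audio, in percent: (before, after steering).
rates = {
    "Whisper small": (72.6, 14.1),
    "Whisper large-v3": (86.9, 27.3),
}

# Multiplicative reduction (before / after) and absolute drop in percentage points.
reduction_factor = {m: round(before / after, 2) for m, (before, after) in rates.items()}
point_drop = {m: round(before - after, 1) for m, (before, after) in rates.items()}
```

On these figures the roughly fivefold reduction applies to Whisper small (about 5.15x), while Whisper large-v3 improves by a little more than threefold (about 3.18x); the absolute drops are similar, at 58.5 and 59.6 percentage points.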