Sparse Autoencoder

technique · active · sparse-autoencoder-52aa1950 · 8 events · first seen May 20, 2026

Aliases: Sparse Autoencoder, Top-k Sparse Autoencoder

More like this (12)

Sparse Autoencoders
Sparse Autoencoders (SAEs)
Sparse Embedding Models
Sparse Transformer
Feature Auto-Encoder
Natural Language Autoencoders
Variational Autoencoder (VAE)
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Block Sparse Attention
sparse attention
Sparse Autoencoders Encode Both Concepts and Functions: The Downstream Geometry of Feature Effects
Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

Recent events (8)

6 · arXiv · cs.CL · 2d ago

Framework for discovering and controlling personality-like internal representations in LLMs via SAE decomposition

Researchers introduce a framework grounded in Funder's person-situation-behavior triad to identify, intervene on, and validate trait-like internal representations in LLMs. Using sparse autoencoder (SAE) decomposition on contrastive behavior pairs, they locate sparse features corresponding to opposing poles of personality traits, then demonstrate that feature-level interventions produce consistent cross-situational behavioral shifts. Behavioral outcomes on social intelligence tasks show benefit-tradeoff patterns matching human personality research, providing mechanistic evidence that LLMs contain controllable internal structures linking representations to behaviors. The work advances interpretability and controllability of model personality beyond surface-level prompt conditioning.

AI Safety Research · Alignment and RLHF · Sparse Autoencoder · Funder's Personality Triad · From Representations to Behaviors: Exploring the Person-Situation-Behavior Triad in LLMs
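
One way to read the contrastive step in the summary above is to encode activations from both poles of a trait with the SAE and rank features by the difference in mean activation. The sketch below illustrates that reading under stated assumptions and is not the authors' pipeline: the ReLU encoder, the orthonormal toy weights, the synthetic activations, and the names `sae_encode` and `contrastive_feature_scores` are all hypothetical.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Hypothetical ReLU SAE encoder: latents = relu(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def contrastive_feature_scores(pos_acts, neg_acts, W_enc, b_enc):
    """Per-feature difference in mean latent activation between the two poles."""
    z_pos = sae_encode(pos_acts, W_enc, b_enc)
    z_neg = sae_encode(neg_acts, W_enc, b_enc)
    return z_pos.mean(axis=0) - z_neg.mean(axis=0)

rng = np.random.default_rng(0)
d_model = 8

# Toy assumption: an orthonormal encoder, so each feature reads one direction.
W_enc, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
b_enc = np.zeros(d_model)

# Synthetic stand-in for contrastive behavior pairs: the two poles share the
# same base activations and differ only along one planted feature direction.
planted = 3
base = rng.normal(size=(200, d_model))
pos_acts = base + 3.0 * W_enc[:, planted]
neg_acts = base - 3.0 * W_enc[:, planted]

scores = contrastive_feature_scores(pos_acts, neg_acts, W_enc, b_enc)
top_features = np.argsort(-np.abs(scores))[:3]  # candidate trait features
```

In this toy setup the planted feature ranks first because no other encoder direction overlaps with it. Real SAE dictionaries are overcomplete and correlated, so a ranking like this would only produce candidates to be validated by intervention.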

4 · arXiv · cs.CL · Jul 10, 2026

Procrustes-conditioned Joint SAE extracts cross-seed universal features from BERT models

Researchers introduce a Procrustes-conditioned Joint End-to-end Top-K Sparse Autoencoder (SAE) to address cross-seed feature universality in mechanistic interpretability of BERT models. By applying an orthogonal Procrustes rotation between independently trained models' activation spaces before joint SAE training, the method produces more consistent features (Pearson r ≥ 0.70) than post-hoc alignment baselines across three NLP benchmarks. The work targets a fundamental challenge in dictionary learning: non-convex optimization causes independently trained networks to learn misaligned feature spaces, making it difficult to identify truly universal features. High-universality features are shown to encode interpretable sociolinguistic patterns.

Evaluation and Benchmarking · AI Safety Research · SST-2 · Sparse Autoencoder · Cross-seed explainability using Procrustes-conditioned Joint End-to-end Top-K Sparse Autoencoders
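
The orthogonal Procrustes rotation mentioned in the summary above has a closed-form solution through the SVD. The following is a minimal sketch of that standard solution on synthetic data, assuming two activation matrices with matching rows; the function name `orthogonal_procrustes_rotation` and the toy data are illustrative, and the paper's actual conditioning and joint training are not reproduced here.

```python
import numpy as np

def orthogonal_procrustes_rotation(A, B):
    """Return the orthogonal R minimizing ||A @ R - B||_F.

    A, B: (n_samples, d) activation matrices from two models, rows aligned
    to the same inputs.
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
n, d = 500, 8

# Toy assumption: model B's activation space is an exact rotation of model A's.
A = rng.normal(size=(n, d))
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
B = A @ Q_true

R = orthogonal_procrustes_rotation(A, B)
aligned = A @ R  # A's activations expressed in B's coordinate frame
```

With real models the two spaces are only approximately related by a rotation, so `A @ R` would match `B` up to a residual instead of exactly.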

4 · arXiv · cs.AI · Jun 26, 2026

Sparsity regularizers improve interpretability of Top-k sparse autoencoders for vision models

A new arXiv preprint proposes two sparsity regularizers compatible with Top-k sparse autoencoders (SAEs), a standard tool for mechanistic interpretability of vision foundation models. The regularizers — an ℓ1 penalty on off-support units and a scale-invariant ℓ1/ℓ2-ratio penalty — are applied before Top-k selection and consistently improve monosemanticity without degrading reconstruction quality across two datasets and three vision models. The central finding is that hard architectural sparsity and soft regularization are complementary, addressing known limitations of fixed-budget Top-k SAEs such as overfitting to training k values.

Evaluation and Benchmarking · AI Safety Research · Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders · Sparse Autoencoder
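
To make the two penalties in the summary above concrete, here is a forward-pass sketch of a Top-k SAE loss with both terms computed on the dense activations before selection. It is one plausible reading of the summary, not the preprint's exact formulation: the ReLU placement, the weighting coefficients `lam_off` and `lam_ratio`, and all function names are assumptions.

```python
import numpy as np

def topk_mask(a, k):
    """Boolean mask selecting k entries with the largest values in each row."""
    idx = np.argpartition(a, -k, axis=1)[:, -k:]
    mask = np.zeros(a.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def topk_sae_loss(x, W_enc, b_enc, W_dec, b_dec, k,
                  lam_off=1e-3, lam_ratio=1e-3, eps=1e-8):
    # Dense activations, computed before Top-k selection.
    a = np.maximum(x @ W_enc + b_enc, 0.0)
    mask = topk_mask(a, k)

    # Hard sparsity: keep only the selected units for reconstruction.
    z = np.where(mask, a, 0.0)
    x_hat = z @ W_dec + b_dec
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))

    # Soft sparsity 1: l1 penalty on units outside the Top-k support.
    off_support_l1 = np.mean(np.sum(np.abs(a) * ~mask, axis=1))

    # Soft sparsity 2: l1/l2 ratio, unchanged when a is rescaled.
    l1 = np.sum(np.abs(a), axis=1)
    l2 = np.sqrt(np.sum(a ** 2, axis=1))
    ratio = np.mean(l1 / (l2 + eps))

    total = recon + lam_off * off_support_l1 + lam_ratio * ratio
    parts = {"recon": recon, "off_support_l1": off_support_l1,
             "l1_l2_ratio": ratio}
    return total, parts, z

rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 8
x = rng.normal(size=(32, d_model))
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc, b_dec = np.zeros(n_features), np.zeros(d_model)

total, parts, z = topk_sae_loss(x, W_enc, b_enc, W_dec, b_dec, k)
```

The off-support term pushes the units that Top-k discards toward zero, while the ratio term rewards concentrating activation mass in few units regardless of overall scale. That difference is the sense in which the two soft penalties complement the hard budget.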

7 · arXiv · cs.CL · Jun 9, 2026

RLHF produces shallow political neutrality by severing causal pathways, not erasing partisan structure

Researchers compare internal representations of Llama 3.1 8B before and after RLHF, finding that alignment training does not remove partisan political geometry from the model but instead compresses output variance to produce balanced responses. Sparse autoencoder decomposition shows that policy-encoding features active in the base model become completely inactive in the instruction-tuned version, while feature-level steering experiments confirm the causal disconnect is real. The underlying partisan structure remains intact and can be reactivated by inferring and amplifying a user's partisan identity, suggesting RLHF alignment is functionally fragile. The authors argue this 'disconnection rather than removal' pattern may generalize to other value domains beyond political orientation.

AI Safety Research · Alignment and RLHF · Reinforcement Learning from Human Feedback · The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model · Sparse Autoencoder

6 · arXiv · cs.AI · Jun 8, 2026

Sparse AutoEncoder steering reduces Whisper hallucination rate by ~5x without fine-tuning

Researchers investigate hallucination detection and mitigation in OpenAI's Whisper ASR model by probing internal encoder representations. They find that both raw activations and Sparse AutoEncoder (SAE) latents encode linearly separable hallucination signals concentrated in deeper layers. SAE-based activation steering reduces hallucination rates from 72.6% to 14.1% (Whisper small) and 86.9% to 27.3% (Whisper large-v3) on non-speech audio, with minimal WER degradation, approaching fine-tuning-level performance without weight updates.

Evaluation and Benchmarking · AI Safety Research · Sparse Autoencoder · OpenAI Whisper
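
SAE-based activation steering, as referenced in the summary above, generally means editing a hidden state along a feature's decoder direction without touching model weights. The sketch below shows one common variant, clamping a single feature toward a target value. It assumes a tied, orthonormal toy SAE, and the names `steer_feature` and `hallucination_feature` are hypothetical; the Whisper study's actual steering rule is not given in the summary.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """Hypothetical ReLU SAE encoder: latents = relu(h @ W_enc + b_enc)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def steer_feature(h, W_enc, b_enc, W_dec, feature, target=0.0):
    """Shift h along one decoder direction to move a feature toward target.

    h: (n, d_model) hidden states; W_dec: (n_features, d_model).
    Only the activations are edited; no weights are updated.
    """
    z = sae_encode(h, W_enc, b_enc)
    delta = (target - z[:, feature])[:, None] * W_dec[feature][None, :]
    return h + delta

rng = np.random.default_rng(0)
d_model = 8

# Toy assumption: tied, orthonormal encoder and decoder.
W_enc, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_dec = W_enc.T
b_enc = np.zeros(d_model)

h = rng.normal(size=(50, d_model))
hallucination_feature = 2  # hypothetical index of a feature to suppress

z_before = sae_encode(h, W_enc, b_enc)
h_steered = steer_feature(h, W_enc, b_enc, W_dec, hallucination_feature)
z_after = sae_encode(h_steered, W_enc, b_enc)
```

Under the orthonormal assumption the suppressed feature drops to zero wherever it was active and every other feature is left unchanged. With a real, overcomplete SAE the edit would also leak into correlated features, which is one reason to track a downstream metric such as WER after steering.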

7 · arXiv · cs.LG · Jun 2, 2026

Auditing Asset-Specific Preferences in Financial LLMs: Bitcoin Representations and Portfolio Allocation

Researchers develop a three-level audit protocol to test whether LLMs carry built-in biases toward specific financial assets, applying it to Bitcoin across eight frontier models. Using sparse autoencoder features in Gemma 3, they identify a dominant Bitcoin-selective internal feature whose amplification raises Bitcoin's portfolio share by 5.2 percentage points and suppression lowers it by 4.6 pp, even when 'Bitcoin' never appears in the prompt. The work introduces the concept of 'bounded behavioral leverage'—causal influence over outputs via identifiable internal representations—and frames the framework as a foundation for 'know-your-agent' (KYA) standards for autonomous financial agents.

Evaluation and Benchmarking · AI Safety Research · Gemma 3 · know-your-agent (KYA) · Sparse Autoencoder

6 · arXiv · cs.CL · May 27, 2026

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves a 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. The SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

Training Infrastructure · Evaluation and Benchmarking · mechanistic interpretability · GRPO · Reinforcement Learning from Human Feedback
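
The "SAE-space clustering for batch diversity" idea in the summary above can be illustrated by clustering per-example SAE latent vectors and drawing each batch round-robin across clusters. This is a sketch under stated assumptions, not SAERL's implementation: the plain k-means with farthest-point initialization, the round-robin sampler, the synthetic latents, and all function names are hypothetical.

```python
import numpy as np

def farthest_point_init(Z, k):
    """Deterministic init: start at row 0, then repeatedly add the farthest row."""
    centers = [Z[0]]
    d = np.linalg.norm(Z - Z[0], axis=1)
    for _ in range(1, k):
        i = int(np.argmax(d))
        centers.append(Z[i])
        d = np.minimum(d, np.linalg.norm(Z - Z[i], axis=1))
    return np.stack(centers)

def kmeans_labels(Z, k, n_iter=20):
    """Plain Lloyd iterations over SAE latent vectors; returns cluster labels."""
    centers = farthest_point_init(Z, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

def diverse_batch(labels, batch_size, rng):
    """Draw example indices round-robin across clusters until the batch is full."""
    pools = {int(c): list(rng.permutation(np.flatnonzero(labels == c)))
             for c in np.unique(labels)}
    batch = []
    while len(batch) < batch_size and any(pools.values()):
        for c in pools:
            if pools[c] and len(batch) < batch_size:
                batch.append(int(pools[c].pop()))
    return batch

rng = np.random.default_rng(0)

# Synthetic stand-in for SAE latents: three groups of 30 examples, each
# dominated by a different feature.
means = 10.0 * np.eye(8)[:3]
Z = np.concatenate(
    [np.maximum(m + 0.5 * rng.normal(size=(30, 8)), 0.0) for m in means]
)
true_group = np.repeat(np.arange(3), 30)

labels = kmeans_labels(Z, k=3)
batch = diverse_batch(labels, batch_size=6, rng=rng)
```

Sampling evenly across clusters keeps any one region of SAE space from dominating a batch. The number of clusters and the distance metric are choices a real pipeline would need to tune.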

7 · OpenAI Blog · May 20, 2026

Extracting Concepts from GPT-4: 16 Million Patterns via Sparse Autoencoders

OpenAI applied scaled sparse autoencoders (SAEs) to GPT-4 to automatically identify approximately 16 million interpretable features or patterns in the model's internal computations. This represents a significant scaling of mechanistic interpretability techniques previously demonstrated on smaller models. The work advances the ability to understand what concepts and representations large frontier models encode internally.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Sparse Autoencoder · OpenAI