Entity · technique

Sparse Autoencoders (SAEs)

technique · active · sparse-autoencoders-saes--2ab7a5b8 · 1 event · first seen May 28, 2026

Aliases: Sparse Autoencoders (SAEs)

Co-occurring entities

Subspace Projection · Gemma-3-4B-IT · Minerva Math · Task Vectors · SAE Specificity Score

More like this (12)

Sparse Autoencoders · Sparse Autoencoder Feature · Auto-Encoder · Natural Language Autoencoders · Variational Autoencoder (VAE) · VAE Encoder · Sparse Autoencoders Encode Both Concepts and Functions: The Downstream Geometry of Feature Effects · Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders · conditional variational autoencoder · Cross-seed explainability using Procrustes-conditioned Joint End-to-end Top-K Sparse Autoencoders · masked autoencoding · Sparse Embedding Models

Recent events (1)

6 · arXiv · cs.CL · May 28, 2026

SAEs as Stethoscopes: Interpretability-Guided Layer Selection for Task Vector Model Editing

This paper evaluates a Sparse Autoencoder (SAE)-guided model editing pipeline for mathematical reasoning on Gemma-3-4B-IT, finding that projecting task vectors onto SAE feature subspaces discards ~97% of modification energy due to geometric misalignment between activation-space SAE directions and weight-space task vectors. The authors reframe SAEs as diagnostic tools ('stethoscopes') rather than intervention filters ('scalpels'), using SAE-derived specificity scores to identify which layers to inject unfiltered task vectors into. This approach improves Number Theory accuracy from 29.6% to 39.4% on Minerva Math (p=0.0007), with 5 of 7 math subjects significantly improved and none degraded. The method is fully deterministic and adds no inference cost.
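The "discarded modification energy" figure in the summary can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the paper's pipeline. It treats a task vector as a single flat vector, treats SAE feature directions as rows of a matrix, and measures how much of the vector's squared norm survives orthogonal projection onto their span. The function name `retained_energy_fraction`, the toy sizes, and the random data are all hypothetical, and the directions are assumed to be linearly independent.

```python
import numpy as np


def retained_energy_fraction(task_vector, feature_directions):
    """Fraction of ||task_vector||^2 kept after orthogonal projection
    onto the span of the rows of feature_directions.

    Assumes the rows are linearly independent (hypothetical setup).
    """
    # Reduced QR of the transposed matrix: the columns of q form an
    # orthonormal basis for the subspace spanned by the directions.
    q, _ = np.linalg.qr(feature_directions.T)
    # Coordinates of the task vector inside that subspace.
    coeffs = q.T @ task_vector
    return float(coeffs @ coeffs) / float(task_vector @ task_vector)


# Toy data only: random stand-ins for SAE directions and a task vector.
rng = np.random.default_rng(0)
d_model, n_features = 1024, 32
directions = rng.standard_normal((n_features, d_model))
tau = rng.standard_normal(d_model)

kept = retained_energy_fraction(tau, directions)
print(f"retained: {kept:.3f}, discarded: {1.0 - kept:.3f}")
```

With directions unrelated to the vector, the expected retained fraction is roughly `n_features / d_model`, so a low-dimensional feature subspace keeps only a small share of the energy. This toy case only illustrates the geometry of misalignment; it does not reproduce the paper's ~97% measurement.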

Evaluation and Benchmarking · AI Safety Research · Subspace Projection · Gemma-3-4B-IT · Sparse Autoencoders (SAEs) · +4 more