paper

Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

paper · active · provisional · beyond-the-hard-budget-sparsity-regularizers-for-more-interpretable-top-k-sparse-autoencoders-427ead72 · 1 event · first seen 3d ago

Aliases: Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

Co-occurring entities

Sparse Autoencoder

More like this (12)

Sparse Autoencoders · Sparse Autoencoder · Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders · Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs · Entropy Regularization · Entropy-Regularized Reinforcement Learning · Feature Auto-Encoder · Sparse Autoencoders (SAEs) · intra-frame entropy-guided sparsification · Sparse Embedding Models · conditional variational autoencoder · KL-Cov regularization

Recent events (1)

4 · arXiv · cs.AI · 3d ago

Sparsity regularizers improve interpretability of Top-k sparse autoencoders for vision models

A new arXiv preprint proposes two sparsity regularizers compatible with Top-k sparse autoencoders (SAEs), a standard tool for mechanistic interpretability of vision foundation models. The two regularizers, an ℓ1 penalty on off-support units and a scale-invariant ℓ1/ℓ2-ratio penalty, are applied before Top-k selection. Across two datasets and three vision models, they consistently improve monosemanticity without degrading reconstruction quality. The central finding is that hard architectural sparsity and soft regularization are complementary: adding the soft penalties addresses known limitations of fixed-budget Top-k SAEs, such as overfitting to the value of k used during training.
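
The two penalties named above can be illustrated with a small sketch. The summary does not give the preprint's exact formulation, so everything here is an assumption made for illustration: "off-support units" is read as the latents not selected by Top-k, both penalties are computed on the pre-selection activations, and the function names (`top_k_codes`, `off_support_l1`, `l1_l2_ratio`) are hypothetical. It uses plain Python rather than a deep learning framework and is not the authors' implementation.

```python
import math


def top_k_support(pre_acts, k):
    """Indices of the k largest pre-activations (the Top-k support)."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    return set(order[:k])


def top_k_codes(pre_acts, k):
    """Hard sparsity: keep the k largest pre-activations, zero the rest."""
    support = top_k_support(pre_acts, k)
    return [a if i in support else 0.0 for i, a in enumerate(pre_acts)]


def off_support_l1(pre_acts, k):
    """Assumed form: sum of |a_i| over units NOT selected by Top-k."""
    support = top_k_support(pre_acts, k)
    return sum(abs(a) for i, a in enumerate(pre_acts) if i not in support)


def l1_l2_ratio(pre_acts, eps=1e-12):
    """Assumed form: ||a||_1 / ||a||_2, unchanged when a is rescaled."""
    l1 = sum(abs(a) for a in pre_acts)
    l2 = math.sqrt(sum(a * a for a in pre_acts))
    return l1 / (l2 + eps)  # eps guards against an all-zero vector


# Toy pre-activations for a single input (illustrative values only).
pre_acts = [3.0, 0.5, -0.2, 2.0, 0.1]
k = 2

codes = top_k_codes(pre_acts, k)            # only indices 0 and 3 survive
penalty_off = off_support_l1(pre_acts, k)   # mass left outside the support
ratio = l1_l2_ratio(pre_acts)
ratio_scaled = l1_l2_ratio([10 * a for a in pre_acts])  # same ratio after rescaling
```

In a training loop, such penalties would typically be added to the reconstruction loss with small weights. The weights and the exact point at which they are applied are not stated in the summary and would have to come from the preprint itself.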

Evaluation and Benchmarking · AI Safety Research · Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders · Sparse Autoencoder