Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Recent events (1)
Study finds unstable SAE features reflect reproducible subspaces, not pure noise
A new arXiv paper investigates feature stability in sparse autoencoders (SAEs), measuring the probability that individual learned features reappear across independent training runs. The authors find a functional asymmetry: stable features carry most reconstruction-relevant signal, while unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting seed dependence reflects basis ambiguity rather than noise. A synthetic model confirms that low-rank ground-truth features can be recovered at the subspace level even when individual SAE latents are non-identifiable across seeds. The work has direct implications for interpretability research that relies on SAE features as meaningful, stable units of analysis.
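The difference between feature-level and subspace-level reproducibility can be illustrated with a small NumPy sketch. This is not the paper's method, data, or code: the two "runs" are simulated rather than trained, and every name here (`fake_run`, `max_cosine_match`, `subspace_overlap`, the dimensions, the noise level) is a hypothetical choice made for illustration. It assumes that stable features reappear almost unchanged across seeds, while unstable features are a seed-dependent rotation of one shared low-rank subspace.

```python
import numpy as np

def unit_rows(m):
    """Scale each row to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def max_cosine_match(a, b):
    """For each row of a, the largest |cosine| with any row of b."""
    return np.abs(unit_rows(a) @ unit_rows(b).T).max(axis=1)

def subspace_overlap(a, b):
    """Mean squared cosine of the principal angles between the row spans."""
    qa, _ = np.linalg.qr(a.T)  # orthonormal basis of span(rows of a)
    qb, _ = np.linalg.qr(b.T)
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Hypothetical sizes: 64-dim activations, 5 stable features,
# 8 unstable features living in a shared rank-8 subspace.
d, n_stable, rank = 64, 5, 8
rng = np.random.default_rng(0)
stable_truth = unit_rows(rng.normal(size=(n_stable, d)))
shared_basis = np.linalg.qr(rng.normal(size=(d, rank)))[0].T  # (rank, d)

def fake_run(seed):
    """Simulate one run's decoder directions (assumption, not a trained SAE)."""
    r = np.random.default_rng(seed)
    stable = stable_truth + 0.01 * r.normal(size=stable_truth.shape)
    rotation = np.linalg.qr(r.normal(size=(rank, rank)))[0]  # seed-specific basis
    unstable = rotation @ shared_basis
    return stable, unstable

stable_a, unstable_a = fake_run(1)
stable_b, unstable_b = fake_run(2)

stable_match = max_cosine_match(stable_a, stable_b)
unstable_match = max_cosine_match(unstable_a, unstable_b)
overlap = subspace_overlap(unstable_a, unstable_b)

print("stable feature matches:  ", np.round(stable_match, 3))
print("unstable feature matches:", np.round(unstable_match, 3))
print("unstable subspace overlap:", round(overlap, 4))
```

In this toy setup the stable features match nearly one-to-one across the two runs, the unstable features have no close individual counterpart, and yet the spans of the unstable features coincide. That is the sense in which seed dependence can reflect basis ambiguity rather than noise.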