
Sparse Autoencoders

technique · active · provisional · sparse-autoencoders-c951245e · 1 event · first seen 6d ago

Aliases: Sparse Autoencoders

Recent events (1)

6 · arXiv · cs.CL · 6d ago

Study finds unstable SAE features reflect reproducible subspaces, not pure noise

A new arXiv paper investigates feature stability in sparse autoencoders (SAEs), measuring the probability that individual learned features reappear across independent training runs. The authors find a functional asymmetry: stable features carry most reconstruction-relevant signal, while unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting seed dependence reflects basis ambiguity rather than noise. A synthetic model confirms that low-rank ground-truth features can be recovered at the subspace level even when individual SAE latents are non-identifiable across seeds. The work has direct implications for interpretability research that relies on SAE features as meaningful, stable units of analysis.
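
The distinction between feature-level and subspace-level reproducibility can be illustrated with a toy NumPy sketch. This is not the paper's method or data: the two "runs" below are hypothetical decoder matrices built by hand so that they span the same 2-dimensional subspace in different bases, and the helper names `max_cosine_match` and `principal_cosines` are assumptions introduced for illustration.

```python
import numpy as np

def max_cosine_match(A, B):
    """For each row of A, the best absolute cosine similarity with any row of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.abs(A @ B.T).max(axis=1)

def principal_cosines(A, B):
    """Cosines of the principal angles between the row spaces of A and B."""
    Qa, _ = np.linalg.qr(A.T)  # orthonormal basis for the row space of A
    Qb, _ = np.linalg.qr(B.T)  # orthonormal basis for the row space of B
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

rng = np.random.default_rng(0)
d, k = 64, 2  # hypothetical model dimension and subspace rank

# "Run 1": two orthonormal feature directions spanning a 2-D subspace.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
run1 = Q.T  # shape (k, d)

# "Run 2": the same subspace, but with the basis rotated by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
run2 = R @ run1

feature_match = max_cosine_match(run1, run2)    # per-feature agreement
subspace_match = principal_cosines(run1, run2)  # subspace agreement

print("best per-feature cosine:", feature_match)
print("principal-angle cosines:", subspace_match)
```

In this constructed case no individual direction from one run has a close counterpart in the other (each best cosine is cos 45°, about 0.71), while the principal-angle cosines are all 1, so the two runs agree exactly at the subspace level. Real SAE decoders would call for a matching threshold and a way of choosing which latents to group, neither of which is specified here.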