Entity · technique

Sparse Autoencoders

technique · active · sparse-autoencoders-c951245e · 3 events · first seen Jun 11, 2026

Aliases: Sparse Autoencoders

Co-occurring entities

Cross-sample Consistency Regularization
C²R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders
Forecasting With LLMs: Improved Generalization Through Feature Steering
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

More like this (12)

Sparse Autoencoders (SAEs)
Sparse Autoencoder
Feature Auto-Encoder
Natural Language Autoencoders
Sparse Autoencoders Encode Both Concepts and Functions: The Downstream Geometry of Feature Effects
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
masked autoencoding
conditional variational autoencoder
Cross-seed explainability using Procrustes-conditioned Joint End-to-end Top-K Sparse Autoencoders
Multiplayer Interactive World Models with Representation Autoencoders
Graph Neural Network Encoder
Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

Recent events (3)

5 · arXiv · cs.LG · Jun 30, 2026

C²R regularization method addresses feature splitting and absorption in Sparse Autoencoders

A new arXiv preprint introduces C²R (Cross-sample Consistency Regularization), a training technique for Sparse Autoencoders (SAEs) that mitigates two known failure modes: feature splitting, where coherent concepts fragment across multiple latents, and feature absorption, where general features develop arbitrary exceptions. C²R penalizes co-activation of directionally similar latents across a batch, encouraging each semantic concept to map consistently to a single latent. The authors report that C²R reduces both pathologies while preserving reconstruction fidelity, with source code released publicly.
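As a rough illustration of the idea in the summary above, the sketch below computes a penalty that grows when latents with similar decoder directions fire on the same batch. It is not the authors' loss: the function name `coactivation_penalty`, the cosine threshold `tau`, and the similarity-weighted product form are all assumptions made for illustration, and the preprint and its released code define the actual objective.

```python
import numpy as np

def coactivation_penalty(acts, decoder, tau=0.5):
    """Hypothetical penalty on co-activation of directionally similar latents.

    acts:    (batch, n_latents) non-negative SAE activations
    decoder: (n_latents, d_model) decoder directions, one non-zero row per latent
    tau:     assumed cosine threshold for "directionally similar"
    """
    # Cosine similarity between every pair of decoder directions.
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    cos = unit @ unit.T
    # Keep only distinct pairs whose directions are similar enough.
    mask = (cos > tau) & ~np.eye(len(decoder), dtype=bool)
    # Batch-averaged co-activation strength for every pair of latents.
    coact = acts.T @ acts / acts.shape[0]
    # Weight co-activation by similarity; count each unordered pair once.
    return 0.5 * float(np.sum(coact * cos * mask))

# Toy data: latents 0 and 1 point in nearly the same direction.
decoder = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
split = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]])  # 0 and 1 fire together
clean = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one latent per sample

penalty_split = coactivation_penalty(split, decoder)
penalty_clean = coactivation_penalty(clean, decoder)
```

In this toy setup the penalty is positive for `split`, where two near-duplicate latents fire on the same samples, and zero for `clean`. In training, a term like this would be added to the usual reconstruction and sparsity losses with some weight, which is also an assumption here.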

Evaluation and Benchmarking · AI Safety Research · Sparse Autoencoders · Cross-sample Consistency Regularization · C²R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders

5 · arXiv · cs.CL · Jun 26, 2026

Feature steering via sparse autoencoders reduces look-ahead bias in LLM forecasting

Researchers apply sparse autoencoders to inspect LLM internal states during forecasting tasks, identifying features associated with time-aware versus look-ahead-biased reasoning. Amplifying time-awareness features causally reduces look-ahead bias while preserving general reasoning performance, whereas directly steering look-ahead-bias features has no effect. The work demonstrates that interpretable temporal features can shift LLMs toward more historically grounded forecasting. This is a mechanistic interpretability result with practical implications for LLM-based prediction systems.
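Amplifying an SAE feature is commonly implemented by adding a scaled copy of that feature's decoder direction to the model's hidden state at the layer where the SAE was trained. The sketch below shows that generic form only. The summary does not say which intervention the paper uses, so the function name `steer`, the strength `alpha`, and the unit-normalization step are assumptions.

```python
import numpy as np

def steer(hidden, decoder, feature_idx, alpha):
    """Add alpha times the unit decoder direction of one SAE feature.

    hidden:      (seq_len, d_model) activations at the hooked layer
    decoder:     (n_latents, d_model) SAE decoder directions
    feature_idx: index of the feature to amplify (hypothetical choice)
    alpha:       steering strength; negative values would suppress instead
    """
    direction = decoder[feature_idx]
    direction = direction / np.linalg.norm(direction)
    # Broadcasting applies the same shift at every sequence position.
    return hidden + alpha * direction

# Toy example: a 4-dimensional model whose decoder is the identity.
hidden = np.zeros((3, 4))
decoder = np.eye(4)
steered = steer(hidden, decoder, feature_idx=2, alpha=5.0)
```

In a real model the returned array would replace the layer's output through a forward hook, and `alpha` would be tuned so that the target behavior shifts without degrading general performance.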

Evaluation and Benchmarking · AI Safety Research · Forecasting With LLMs: Improved Generalization Through Feature Steering · Sparse Autoencoders

6 · arXiv · cs.CL · Jun 11, 2026

Study finds unstable SAE features reflect reproducible subspaces, not pure noise

A new arXiv paper investigates feature stability in sparse autoencoders (SAEs), measuring the probability that individual learned features reappear across independent training runs. The authors find a functional asymmetry: stable features carry most reconstruction-relevant signal, while unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting seed dependence reflects basis ambiguity rather than noise. A synthetic model confirms that low-rank ground-truth features can be recovered at the subspace level even when individual SAE latents are non-identifiable across seeds. The work has direct implications for interpretability research that relies on SAE features as meaningful, stable units of analysis.
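The distinction between feature-level and subspace-level reproducibility can be made concrete with a small sketch. It compares the decoder directions of two hypothetical runs in two ways: the best per-latent cosine match, and the overlap of the spans via principal angles. Both function names and the toy data are assumptions, and the paper's own stability metric may differ. The subspace measure assumes a small group of linearly independent latents with fewer rows than `d_model`.

```python
import numpy as np

def max_cosine_match(dec_a, dec_b):
    """For each latent in run A, best cosine similarity to any latent in run B."""
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    return np.max(a @ b.T, axis=1)

def subspace_overlap(dec_a, dec_b):
    """Mean squared cosine of the principal angles between the two row spans."""
    qa, _ = np.linalg.qr(dec_a.T)  # orthonormal basis for the span of run A
    qb, _ = np.linalg.qr(dec_b.T)  # orthonormal basis for the span of run B
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Two runs that span the same plane in 3-D but with bases rotated by 45 degrees.
theta = np.pi / 4
dec_a = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
dec_b = np.array([[np.cos(theta), np.sin(theta), 0.0],
                  [-np.sin(theta), np.cos(theta), 0.0]])

per_latent = max_cosine_match(dec_a, dec_b)
overlap = subspace_overlap(dec_a, dec_b)
```

Here no latent in run A has a close individual match in run B (each best cosine is cos 45°), yet the two spans coincide, so the overlap is 1 up to floating-point error. This is the basis-ambiguity pattern the summary describes: non-reproducible latents inside a reproducible subspace.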

Evaluation and Benchmarking · AI Safety Research · Sparse Autoencoders · Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders