VASAE: Vocabulary-Aligned Sparse Autoencoder assigns intrinsic token names to SAE features during training
Researchers introduce VASAE (Vocabulary-Aligned Sparse Autoencoder), a method that trains SAE features with vocabulary-aligned anchoring so each feature is intrinsically named by the nearest token in the model's embedding space. Applied to GPT-2-small and Llama-3.1-8B, VASAE achieves ~90% feature alignment in shallow-to-middle layers without degrading reconstruction quality, though final-layer alignment is limited. The work addresses a longstanding interpretability bottleneck where SAE dictionary features require expensive post-hoc labeling, potentially enabling more scalable mechanistic analysis.
Related guides (2)
Related events (8)
SAEs as Stethoscopes: Interpretability-Guided Layer Selection for Task Vector Model Editing
This paper evaluates a Sparse Autoencoder (SAE)-guided model editing pipeline for mathematical reasoning on Gemma-3-4B-IT, finding that projecting task vectors onto SAE feature subspaces discards ~97% of modification energy due to geometric misalignment between activation-space SAE directions and weight-space task vectors. The authors reframe SAEs as diagnostic tools ('stethoscopes') rather than intervention filters ('scalpels'), using SAE-derived specificity scores to identify which layers to inject unfiltered task vectors into. This approach improves Number Theory accuracy from 29.6% to 39.4% on Minerva Math (p=0.0007), with 5 of 7 math subjects significantly improved and none degraded. The method is fully deterministic and adds no inference cost.
SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering
SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs) — a mechanistic interpretability tool — to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.
Study finds SAE unstable features reflect reproducible subspaces, not pure noise
A new arXiv paper investigates feature stability in sparse autoencoders (SAEs), measuring the probability that individual learned features reappear across independent training runs. The authors find a functional asymmetry: stable features carry most reconstruction-relevant signal, while unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting seed dependence reflects basis ambiguity rather than noise. A synthetic model confirms that low-rank ground-truth features can be recovered at the subspace level even when individual SAE latents are non-identifiable across seeds. The work has direct implications for interpretability research that relies on SAE features as meaningful, stable units of analysis.
Extracting Concepts from GPT-4: 16 Million Patterns via Sparse Autoencoders
OpenAI applied scaled sparse autoencoders (SAEs) to GPT-4 to automatically identify approximately 16 million interpretable features or patterns in the model's internal computations. This represents a significant scaling of mechanistic interpretability techniques previously demonstrated on smaller models. The work advances the ability to understand what concepts and representations large frontier models encode internally.
Sparse AutoEncoder steering reduces Whisper hallucination rate by ~5x without fine-tuning
Researchers investigate hallucination detection and mitigation in OpenAI's Whisper ASR model by probing internal encoder representations. They find that both raw activations and Sparse AutoEncoder (SAE) latents encode linearly separable hallucination signals concentrated in deeper layers. SAE-based activation steering reduces hallucination rates from 72.6% to 14.1% (Whisper small) and 86.9% to 27.3% (Whisper large-v3) on non-speech audio, with minimal WER degradation, approaching fine-tuning-level performance without weight updates.
VEPO: Vision-anchored token selection improves RL for visual reasoning
A new arXiv paper identifies a failure mode of entropy-based credit assignment in multimodal reinforcement learning: vision-sensitive tokens with naturally low entropy are systematically ignored, causing the mechanism to collapse in visual reasoning tasks. The authors propose VEPO (Vision-Entropy token-selection for Policy Optimization), which couples visual sensitivity with token entropy via a multiplicative scheme to redirect gradient credit toward tokens that are both visually grounded and semantically informative. VEPO outperforms entropy-only baselines by 2.28 points at 7B scale and 3.15 points at 3B scale on visual reasoning benchmarks.
Sparsity regularizers improve interpretability of Top-k sparse autoencoders for vision models
A new arXiv preprint proposes two sparsity regularizers compatible with Top-k sparse autoencoders (SAEs), a standard tool for mechanistic interpretability of vision foundation models. The regularizers — an ℓ1 penalty on off-support units and a scale-invariant ℓ1/ℓ2-ratio penalty — are applied before Top-k selection and consistently improve monosemanticity without degrading reconstruction quality across two datasets and three vision models. The central finding is that hard architectural sparsity and soft regularization are complementary, addressing known limitations of fixed-budget Top-k SAEs such as overfitting to training k values.
CADE framework proposes direct timestep embedding and contrastive alignment for time-series question answering
A new arXiv preprint introduces CADE (Contrastive Alignment with Direct Embedding), a framework for time-series question answering (TSQA) that bypasses the tokenization bottleneck of standard LLMs by mapping each timestep directly into the LLM embedding space via a point-wise linear encoder and MLP projector. The approach also introduces a one-directional supervised contrastive loss to align time-series embeddings with frozen class-name text anchors, bridging the semantic gap between numerical and language representations. Evaluated on the Time-MQA benchmark across six TSQA tasks, CADE outperforms both open-source and proprietary LLM baselines. The work addresses a concrete limitation of patch-based encoders — fixed granularity and poor cross-dataset transfer — with a cleaner architectural alternative.

