
Sparse Autoencoder
sparse-autoencoder-52aa1950 · 5 events · first seen 28d ago
Aliases: Sparse Autoencoder
Recent events (5)
Extracting Concepts from GPT-4: 16 Million Patterns via Sparse Autoencoders
OpenAI applied sparse autoencoders (SAEs) at scale to GPT-4, automatically identifying approximately 16 million interpretable features, or patterns, in the model's internal computations. This is a significant scale-up of mechanistic interpretability techniques previously demonstrated on smaller models, and it advances the ability to understand which concepts and representations large frontier models encode internally.
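For readers unfamiliar with the technique, the following is a minimal sketch of what a sparse autoencoder computes on a single activation vector. It assumes a generic ReLU encoder with an L1 sparsity penalty; the summary above does not say which architecture or loss the GPT-4 work used, and the names `sae_forward`, `sae_loss`, and `l1_coeff`, the toy dimensions, and the random weights are all illustrative assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative features, then decode back to x's space.

    w_enc has shape (n_features, d_model); w_dec has shape (d_model, n_features).
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre_acts]  # ReLU: most entries end up at zero after training
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy example: a 4-dimensional activation expanded into 16 candidate features.
rng = random.Random(0)
d_model, n_features = 4, 16
w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
# Hypothetical initialization choice: decoder starts as the encoder's transpose.
w_dec = [[w_enc[j][i] for j in range(n_features)] for i in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.01)
```

The feature dictionary is deliberately wider than the input (16 versus 4 here), which is what lets individual features specialize; a count like 16 million refers to the width of that dictionary.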
SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering
SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves a 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.
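As a rough illustration of how cluster assignments could drive batch diversity, the sketch below draws examples round-robin across clusters. It is not SAERL's actual procedure, which the summary does not specify: the function name `diverse_batch` is hypothetical, and the cluster ids are assumed to have been computed already (for example, by clustering SAE feature vectors).

```python
from collections import defaultdict, deque


def diverse_batch(cluster_ids, batch_size):
    """Pick example indices round-robin across clusters.

    cluster_ids[i] is the (precomputed, assumed) cluster of example i.
    Returns up to batch_size indices, visiting clusters in sorted order.
    """
    buckets = defaultdict(deque)
    for index, cluster in enumerate(cluster_ids):
        buckets[cluster].append(index)

    queues = deque(buckets[cluster] for cluster in sorted(buckets))
    batch = []
    while queues and len(batch) < batch_size:
        queue = queues.popleft()
        batch.append(queue.popleft())
        if queue:  # cluster still has examples: send it to the back of the line
            queues.append(queue)
    return batch


# Cluster 0 dominates the pool, but the batch still covers all three clusters.
batch = diverse_batch([0, 0, 0, 0, 1, 1, 2], batch_size=4)
```

Round-robin is only one simple way to spread a batch across clusters; weighted sampling by cluster size would be an equally plausible choice.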
Auditing Asset-Specific Preferences in Financial LLMs: Bitcoin Representations and Portfolio Allocation
Researchers develop a three-level audit protocol to test whether LLMs carry built-in biases toward specific financial assets, applying it to Bitcoin across eight frontier models. Using sparse autoencoder features in Gemma 3, they identify a dominant Bitcoin-selective internal feature whose amplification raises Bitcoin's portfolio share by 5.2 percentage points and whose suppression lowers it by 4.6 percentage points, even when 'Bitcoin' never appears in the prompt. The work introduces the concept of 'bounded behavioral leverage' (causal influence over outputs via identifiable internal representations) and frames the framework as a foundation for 'know-your-agent' (KYA) standards for autonomous financial agents.
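Amplifying or suppressing an SAE feature is commonly done by shifting a hidden state along that feature's direction. The sketch below shows the arithmetic under that assumption; the summary does not state how the authors intervened, and `steer`, `projection`, `alpha`, and the toy vectors are illustrative names and values rather than anything from the study.

```python
def _norm(vector):
    """Euclidean length of a vector."""
    return sum(v * v for v in vector) ** 0.5


def projection(hidden, direction):
    """Scalar component of hidden along the unit-normalized direction."""
    length = _norm(direction)
    if length == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / length


def steer(hidden, direction, alpha):
    """Shift hidden by alpha units along direction.

    alpha > 0 amplifies the feature; alpha < 0 suppresses it.
    """
    length = _norm(direction)
    if length == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / length for h, d in zip(hidden, direction)]


hidden = [1.0, 2.0, 0.0]             # toy hidden state
feature_direction = [0.0, 3.0, 4.0]  # toy stand-in for a feature's decoder vector

amplified = steer(hidden, feature_direction, alpha=2.0)
suppressed = steer(hidden, feature_direction, alpha=-2.0)
```

In a real model the shifted vector would replace the hidden state at a chosen layer during the forward pass; the shift changes the projection onto the feature direction by exactly `alpha` and leaves orthogonal components untouched.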
Sparse AutoEncoder steering reduces Whisper hallucination rate by ~5x without fine-tuning
Researchers investigate hallucination detection and mitigation in OpenAI's Whisper ASR model by probing internal encoder representations. They find that both raw activations and sparse autoencoder (SAE) latents encode linearly separable hallucination signals concentrated in deeper layers. On non-speech audio, SAE-based activation steering reduces hallucination rates from 72.6% to 14.1% on Whisper small (about a 5.1x reduction) and from 86.9% to 27.3% on Whisper large-v3 (about 3.2x), with minimal degradation in word error rate (WER), approaching fine-tuning-level performance without weight updates.
RLHF produces shallow political neutrality by severing causal pathways, not erasing partisan structure
Researchers compare internal representations of Llama 3.1 8B before and after RLHF, finding that alignment training does not remove partisan political geometry from the model but instead compresses output variance to produce balanced responses. Sparse autoencoder decomposition shows that policy-encoding features active in the base model become completely inactive in the instruction-tuned version, while feature-level steering experiments confirm the causal disconnect is real. The underlying partisan structure remains intact and can be reactivated by inferring and amplifying a user's partisan identity, suggesting RLHF alignment is functionally fragile. The authors argue this 'disconnection rather than removal' pattern may generalize to other value domains beyond political orientation.