6arXiv cs.CL (Computation and Language)·24d ago

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs) — a mechanistic interpretability tool — to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

Training Infrastructure Evaluation and Benchmarking AI Safety Research Alignment and RLHF mechanistic interpretability GRPO Reinforcement Learning from Human Feedback SAERL Qwen Sparse Autoencoder Qwen2.5-Math-PRM

Related guides (3)

mechanistic interpretabilityConcept

Mechanistic Interpretability: Looking Inside the AI Black Box

Read asBeginner In-depth

GRPOConcept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·23d ago·source ↗

SAEs as Stethoscopes: Interpretability-Guided Layer Selection for Task Vector Model Editing

This paper evaluates a Sparse Autoencoder (SAE)-guided model editing pipeline for mathematical reasoning on Gemma-3-4B-IT, finding that projecting task vectors onto SAE feature subspaces discards ~97% of modification energy due to geometric misalignment between activation-space SAE directions and weight-space task vectors. The authors reframe SAEs as diagnostic tools ('stethoscopes') rather than intervention filters ('scalpels'), using SAE-derived specificity scores to identify which layers to inject unfiltered task vectors into. This approach improves Number Theory accuracy from 29.6% to 39.4% on Minerva Math (p=0.0007), with 5 of 7 math subjects significantly improved and none degraded. The method is fully deterministic and adds no inference cost.

Evaluation and Benchmarking AI Safety Research Subspace Projection Gemma-3-4B-IT Sparse Autoencoders (SAEs)+4 more

6arXiv · cs.LG·4d ago·source ↗

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking Alignment and RLHF ExpRL: Exploratory RL for LLM Mid-Training GRPO ExpRL

6arXiv · cs.AI·12d ago·source ↗

Sparse AutoEncoder steering reduces Whisper hallucination rate by ~5x without fine-tuning

Researchers investigate hallucination detection and mitigation in OpenAI's Whisper ASR model by probing internal encoder representations. They find that both raw activations and Sparse AutoEncoder (SAE) latents encode linearly separable hallucination signals concentrated in deeper layers. SAE-based activation steering reduces hallucination rates from 72.6% to 14.1% (Whisper small) and 86.9% to 27.3% (Whisper large-v3) on non-speech audio, with minimal WER degradation, approaching fine-tuning-level performance without weight updates.

Evaluation and Benchmarking AI Safety Research Sparse Autoencoder OpenAI Whisper

6arXiv · cs.CL·9d ago·source ↗

Study finds SAE unstable features reflect reproducible subspaces, not pure noise

A new arXiv paper investigates feature stability in sparse autoencoders (SAEs), measuring the probability that individual learned features reappear across independent training runs. The authors find a functional asymmetry: stable features carry most reconstruction-relevant signal, while unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting seed dependence reflects basis ambiguity rather than noise. A synthetic model confirms that low-rank ground-truth features can be recovered at the subspace level even when individual SAE latents are non-identifiable across seeds. The work has direct implications for interpretability research that relies on SAE features as meaningful, stable units of analysis.

Evaluation and Benchmarking AI Safety Research Sparse Autoencoders Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

4arXiv · cs.LG·12d ago·source ↗

SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs

Researchers introduce SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning), a framework addressing catastrophic forgetting in LLMs via adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.

Evaluation and Benchmarking Open Weights Progress LLaMA-7B Qwen3-4B Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning +1 more

6arXiv · cs.CL·2d ago·source ↗

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.

Frontier Model Releases Alignment and RLHF DAPO AIME 2026 GRPO +2 more

5arXiv · cs.AI·2d ago·source ↗

MAST: Mechanism-guided selective unlearning for RLVR-trained reasoning models

Researchers introduce MAST (Mechanism-Aligned Selective Targeting), a method for selectively unlearning capabilities induced by reinforcement learning from verifiable rewards (RLVR) in language models while minimizing collateral damage to retained knowledge. The approach ranks attention-projection tensors by off-principal energy and gradient coupling to identify a targeted subset for update, rather than applying full-parameter gradient ascent. Evaluated on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, MAST achieves statistically significant forgetting on target MATH problems while preserving GSM8K performance, whereas full-parameter unlearning collapses retained capabilities. The method generalizes across seeds and unlearning objectives (NPO/SimNPO).

AI Safety Research Alignment and RLHF Qwen3-1.7B-Base MATH MAST +2 more

7arXiv · cs.CL·1mo ago·source ↗

RELEX: Extrapolating LLM RLVR Training via Rank-1 Parameter Trajectories

This paper demonstrates that RLVR weight update trajectories are extremely low-rank and near-linearly predictable, with a rank-1 approximation capturing most downstream performance gains. The authors propose RELEX, a compute-efficient method that observes a short training window, estimates the rank-1 subspace, and extrapolates future checkpoints via linear regression—requiring no additional training. Evaluated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX matches or exceeds full RLVR performance using as few as 15% of training steps, and can extrapolate up to 10–20× beyond the observed prefix. The authors attribute the method's effectiveness to a denoising effect from rank-1 projection that discards stochastic optimization noise.

Training Infrastructure Frontier Model Releases RLVR Qwen3-8B-Base Qwen3-4B-Base +8 more