Entity · product

SAERL

product · active · saerl-f43a8960 · 1 event · first seen May 27, 2026

Aliases: SAERL

Co-occurring entities

mechanistic interpretability, GRPO, Reinforcement Learning from Human Feedback, Qwen, Sparse Autoencoder, Qwen2.5-Math-PRM

More like this (12)

SE-RRM, SARDI, RELAI, SDAR, SAIR, SEDD, SEAME, SARA, RLAES, SPEAR, SERA, NERSC

Recent events (1)

arXiv · cs.CL · May 27, 2026

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic signals from the model and uses those signals to control data diversity, difficulty, and quality during RL fine-tuning. It has three components: SAE-space clustering to diversify batches, a difficulty proxy to order the curriculum, and a quality probe to filter training data. On Qwen2.5-Math-1.5B trained with GRPO, SAERL yields an average accuracy improvement of 3% and reaches the target accuracy in 20% fewer training steps. SAE representations also transfer across model families and scales, which suggests SAERL could be broadly applicable as a lightweight data engineering tool.
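The summary does not specify SAERL's actual probe, difficulty proxy, or clustering method, so the following is only a minimal sketch of how the three ideas could fit together, not the paper's implementation. Everything beyond the standard ReLU SAE encoder is an assumption chosen for illustration: the toy SAE and probe weights, a linear quality probe with a 0.5 cutoff, "number of active SAE features" as the difficulty proxy, k-means on unit-normalized SAE codes for diversity, and round-robin interleaving of clusters for batching. The names `order_for_rl`, `quality_score`, and `difficulty_proxy` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    hidden: list[float]  # hypothetical: a policy-model activation for this prompt


def sae_encode(x, w_enc, b_enc, b_dec):
    """Standard ReLU SAE encoder: f = ReLU(W_enc @ (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
            for row, b in zip(w_enc, b_enc)]


def quality_score(f, probe_w, probe_b):
    """Hypothetical linear quality probe on SAE features, mapped to (0, 1)."""
    z = sum(w * fi for w, fi in zip(probe_w, f)) + probe_b
    return 1.0 / (1.0 + math.exp(-z))


def difficulty_proxy(f):
    """Hypothetical proxy: count of active SAE features (more = harder)."""
    return sum(1 for fi in f if fi > 0.0)


def unit(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard all-zero codes
    return [x / norm for x in v]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def order_for_rl(samples, sae, probe, k=2, min_quality=0.5):
    """Filter by quality, cluster in SAE space, then interleave clusters easy-first."""
    feats = [sae_encode(s.hidden, *sae) for s in samples]
    kept = [i for i, f in enumerate(feats) if quality_score(f, *probe) >= min_quality]
    if not kept:
        return []
    labels = kmeans([unit(feats[i]) for i in kept], min(k, len(kept)))
    clusters = {}
    for i, lab in zip(kept, labels):
        clusters.setdefault(lab, []).append(i)
    # Curriculum inside each cluster: lowest difficulty proxy first.
    queues = [sorted(clusters[lab], key=lambda i: difficulty_proxy(feats[i]))
              for lab in sorted(clusters)]
    order = []
    while any(queues):
        for q in queues:  # round-robin so consecutive samples span clusters
            if q:
                order.append(samples[q.pop(0)])
    return order


# Toy SAE (4 features over a 2-d activation) and probe; weights are made up.
sae = ([[1, 0], [0, 1], [1, 0], [0, 1]], [0, 0, -1, -1], [0, 0])
probe = ([1, 1, 1, 1], -0.25)
data = [Sample("a1", [0.5, 0]), Sample("a2", [2, 0]), Sample("g1", [0, 0.5]),
        Sample("g2", [0, 2]), Sample("junk", [-1, -1])]
print([s.text for s in order_for_rl(data, sae, probe)])  # ['a1', 'g1', 'a2', 'g2']
```

Here `junk` activates no features and falls below the probe cutoff, the remaining samples split into two SAE-space clusters, and the output alternates between clusters while moving from fewer to more active features. In a real pipeline the activations would come from the model being fine-tuned and the SAE and probe weights would be trained rather than hand-set.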

Training Infrastructure, Evaluation and Benchmarking, mechanistic interpretability, GRPO, Reinforcement Learning from Human Feedback