SAERL: Using Sparse Autoencoders to Guide LLM Post-Training Data Engineering
Researchers have introduced SAERL, a framework that uses Sparse Autoencoders (SAEs) to read a model's internals and engineer post-training datasets for large language models (LLMs). SAE-derived features capture three intrinsic data properties: diversity, difficulty, and quality. SAERL acts on each one: it controls batch diversity by clustering examples in SAE space and mixing clusters moderately within each batch, orders training from easy to hard using a difficulty proxy, and filters data with a quality probe. On Qwen2.5-Math-1.5B, SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches the target accuracy in 20% fewer training steps. The gains are consistent across model scales and reinforcement learning (RL) algorithms.
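The three operations can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each example already carries an SAE feature vector, a scalar difficulty proxy, and a scalar quality-probe score. The names build_schedule and kmeans, the dictionary fields, and the threshold are hypothetical. k-means stands in for whatever clustering method SAERL actually uses, and the round-robin draw across clusters is only one possible reading of "moderate batch mixing".

```python
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Tiny k-means with farthest-point initialisation; returns one label per point."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(list(max(points, key=lambda p: min(sq_dist(p, c) for c in centers))))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def build_schedule(examples, k=2, batch_size=4, min_quality=0.5):
    """Filter by quality, cluster in feature space, then emit mixed easy-to-hard batches."""
    # 1. Quality filtering: drop examples the (hypothetical) probe scores too low.
    kept = [e for e in examples if e["quality"] >= min_quality]
    # 2. Diversity: group the remaining examples by cluster in SAE feature space.
    labels = kmeans([e["features"] for e in kept], k)
    queues = defaultdict(list)
    for e, l in zip(kept, labels):
        queues[l].append(e)
    # 3. Curriculum: sort hardest-first so that pop() returns the easiest remaining.
    for q in queues.values():
        q.sort(key=lambda e: e["difficulty"], reverse=True)
    # Round-robin across clusters so each batch mixes clusters while difficulty rises.
    order = []
    while any(queues.values()):
        for l in sorted(queues):
            if queues[l]:
                order.append(queues[l].pop())
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Toy data: two obvious clusters, rising difficulty, one low-quality example (id 3).
examples = [
    {"id": i,
     "features": [1.0, 0.0] if i % 2 == 0 else [0.0, 1.0],
     "difficulty": i / 10,
     "quality": 0.1 if i == 3 else 0.9}
    for i in range(8)
]
batches = build_schedule(examples, k=2, batch_size=4, min_quality=0.5)
print([[e["id"] for e in b] for b in batches])  # [[0, 1, 2, 5], [4, 7, 6]]
```

In this toy run the low-quality example is dropped, the first batch draws from both clusters, and later batches contain harder examples. A real pipeline would take the features, difficulty proxy, and quality scores from the SAE and the probes rather than from hand-written values.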
Key facts
- SAERL is a data engineering framework for LLM reinforcement learning (RL).
- It uses Sparse Autoencoders (SAEs) to extract features from model internals.
- It targets three intrinsic data properties: diversity, difficulty, and quality.
- Operations: SAE-space clustering with moderate batch mixing, difficulty proxy for curriculum ordering, quality probe for filtering.
- Improves average accuracy by 3.00% over vanilla GRPO.
- Reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B.
- Consistent gains across model scales and RL algorithms.
- Published on arXiv: 2605.27354.
Entities
Institutions
- arXiv