ARTFEED — Contemporary Art Intelligence

Interpretability-Guided Data Selection Boosts LLM Fine-Tuning Efficiency

ai-technology · 2026-04-30

Researchers have introduced Interpretability-Guided Data Selection (IGDS), a framework that uses mechanistic interpretability tools, specifically Sparse Autoencoders (SAEs), to pinpoint causal, task-relevant features inside Large Language Models (LLMs) and to select 'Feature-Resonant Data' for fine-tuning. IGDS first identifies task features through frequency recall and interventional filtering, then selects the data that most strongly activates those features. Evaluated on mathematical reasoning, summarization, and translation with the Gemma-2, LLaMA-3.1, and Qwen3 model families, IGDS proves highly data-efficient: on the Math task it outperforms full-dataset fine-tuning by 17.4% on Gemma-2-2B while using considerably less data, effectively bridging mechanistic interpretability and model optimization.
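The two-stage feature identification described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function names, the activation-frequency gap used for "frequency recall", the `min_effect` threshold, and the stubbed ablation-loss oracle are all assumptions made for the sketch; in practice the loss oracle would re-run the model with a single SAE feature ablated.

```python
import numpy as np

def frequency_recall(task_acts, background_acts, top_k=16):
    """Frequency recall (sketch): rank SAE features by how much more
    often they fire on task examples than on generic background text."""
    gap = (task_acts > 0).mean(axis=0) - (background_acts > 0).mean(axis=0)
    return np.argsort(gap)[-top_k:][::-1]  # top_k features, largest gap first

def interventional_filter(candidates, ablated_loss, base_loss, min_effect=0.05):
    """Interventional filtering (sketch): keep only candidate features
    whose ablation measurably hurts task loss, i.e. causal features."""
    return [f for f in candidates if ablated_loss(f) - base_loss > min_effect]

# Toy demo: random SAE activations, and a stubbed ablation-loss oracle
# under which only even-indexed features are 'causal' (hypothetical).
rng = np.random.default_rng(0)
task_acts = np.maximum(rng.normal(0.4, 1.0, (64, 128)), 0)  # 64 task examples
bg_acts = np.maximum(rng.normal(-0.4, 1.0, (64, 128)), 0)   # 64 background examples
cands = frequency_recall(task_acts, bg_acts, top_k=16)
ablated_loss = lambda f: 1.0 + (0.2 if f % 2 == 0 else 0.0)
causal = interventional_filter(cands, ablated_loss, base_loss=1.0)
```

The design point the sketch illustrates is that frequency statistics alone are correlational; the interventional pass is what upgrades a frequently-firing feature to a causal one.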

Key facts

  • IGDS framework uses Sparse Autoencoders (SAEs) to identify causal task features.
  • Features are identified through frequency recall and interventional filtering.
  • Selected 'Feature-Resonant Data' maximally activates task features for fine-tuning.
  • Validated on Gemma-2, LLaMA-3.1, and Qwen3 models.
  • Tasks include mathematical reasoning, summarization, and translation.
  • On Math task, IGDS outperforms full-dataset fine-tuning by 17.4% on Gemma-2-2B.
  • IGDS achieves higher performance with less data.
  • Framework turns mechanistic interpretability insights into an actionable data-selection criterion.
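Given the causal features from the identification stage, the selection step in the bullets above reduces to scoring and ranking. The sketch below is a hedged illustration: scoring candidates by the mean activation of the identified task features and keeping a fixed budget is one plausible reading of "maximally activates task features", not the paper's confirmed scoring rule.

```python
import numpy as np

def select_feature_resonant(candidate_acts, task_features, budget=50):
    """Feature-resonant selection (sketch): score each candidate example
    by the mean SAE activation over the identified task features, then
    keep the `budget` highest-scoring examples, best first."""
    scores = candidate_acts[:, task_features].mean(axis=1)
    return np.argsort(scores)[-budget:][::-1]

# Toy demo with random SAE activations (hypothetical shapes and indices).
rng = np.random.default_rng(1)
acts = np.maximum(rng.normal(0.0, 1.0, (500, 64)), 0)  # 500 candidates x 64 features
task_features = np.array([3, 7, 12, 40])               # from the identification stage
keep = select_feature_resonant(acts, task_features, budget=50)
```

Fine-tuning would then run on `keep` only, which is how the framework reaches higher accuracy with far fewer examples than full-dataset training.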
