MechaRule: Grounding LLM Rule Extraction in Neural Circuits
Researchers propose MechaRule, a pipeline that extracts symbolic rules from large language models (LLMs) and grounds each rule in specific neurons. The method identifies 'agonist' neurons: a sparse set of neurons whose activations, when neutralized, disrupt the behavior associated with a rule. It relies on the empirical observation that the effects of sparse agonists are approximately monotone and saturating, which makes it possible to localize them efficiently without hand-crafted hypotheses. The approach thereby bridges global rule extraction and mechanistic interpretability.
Key facts
- MechaRule is a pipeline for rule extraction from LLMs grounded in neural circuits.
- It identifies a sparse set of neurons, called agonists, whose neutralization disrupts rule-related behaviors.
- The method rests on the empirical observation that agonist effects are approximately monotone and saturating.
- It avoids hand-crafted hypotheses and expensive neuron-level interventions.
- The approach combines global rule extraction with mechanistic interpretability.
- The paper is available on arXiv under ID 2605.03058.
- The paper is categorized under explainable AI (XAI).
- The method uses contrastive hierarchical ablation for neuron localization.
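
To make the localization idea concrete, here is a minimal toy sketch of hierarchical ablation with a contrastive score. It is not the paper's algorithm: the summary does not specify how groups are formed or scored, so the bisection scheme, the threshold, and every name below (`disruption`, `contrastive_effect`, `localize`, `AGONISTS`) are assumptions made for illustration. The toy `disruption` function simply hard-codes an effect that is monotone and saturating in the number of ablated agonists, which is the property the summary says makes efficient search possible.

```python
import math

N_NEURONS = 1024
# Hypothetical ground truth for the toy; a real model would not expose this.
AGONISTS = frozenset({17, 300, 301, 900})


def disruption(ablated, on_rule_inputs):
    """Toy stand-in for measuring behavior change after ablating neurons.

    On rule-following inputs the effect grows with the number of ablated
    agonists and saturates toward 1.0; on control inputs it is zero.
    """
    if not on_rule_inputs:
        return 0.0
    hits = len(AGONISTS & set(ablated))
    return 1.0 - math.exp(-1.5 * hits)


def contrastive_effect(ablated):
    # Contrast: disruption on rule inputs minus disruption on control inputs.
    return disruption(ablated, True) - disruption(ablated, False)


def localize(lo, hi, threshold=0.05, counter=None):
    """Return neuron indices in [lo, hi) found by recursive group ablation."""
    if counter is not None:
        counter[0] += 1  # count how many group ablations were evaluated
    if contrastive_effect(range(lo, hi)) < threshold:
        # Under the monotonicity assumption, no subset of this group
        # can have a larger effect, so the whole group is pruned.
        return []
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return (localize(lo, mid, threshold, counter)
            + localize(mid, hi, threshold, counter))


calls = [0]
found = localize(0, N_NEURONS, counter=calls)
print("agonists found:", found)
print("group ablations evaluated:", calls[0], "of", N_NEURONS, "neurons")
```

In this toy setting the search recovers the planted agonists while evaluating far fewer ablations than one per neuron, because groups with no effect are discarded in a single test. The pruning step is only sound when effects are (approximately) monotone; in a real network, compensating neurons could mask an effect at the group level.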
Entities
Institutions
- arXiv