PLOT: New Framework for Neural Causal Abstraction via Optimal Transport
PLOT (Progressive Localization via Optimal Transport) is a new transport-based framework for neural causal abstraction, an approach to mechanistic interpretability. It localizes causal variables by comparing the geometry of the output effects produced by abstract interventions with that of the output effects produced by neural interventions, then fits an optimal transport coupling between abstract variables and candidate neural sites. The coupling gives a global soft correspondence that can be calibrated into intervention handles. The framework targets the computational burden of existing methods such as distributed alignment search (DAS), which require a search over candidate sites. In simple settings, a single coupling over individual neurons suffices; in larger models, PLOT is applied progressively.
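For readers unfamiliar with optimal transport, a coupling of this kind can be written in the standard discrete (Kantorovich) form. The notation here is an assumption made for illustration and is not taken from the PLOT paper: $a \in \mathbb{R}^m$ and $b \in \mathbb{R}^n$ are weight vectors over $m$ abstract variables and $n$ candidate neural sites, and $C_{ij}$ is some cost comparing the output effect of intervening on abstract variable $i$ with the output effect of intervening on site $j$.

```latex
P^\star \;=\; \arg\min_{P \in \mathbb{R}_{\ge 0}^{m \times n}}
  \sum_{i=1}^{m} \sum_{j=1}^{n} P_{ij}\, C_{ij}
\quad \text{subject to} \quad
P \mathbf{1}_n = a, \qquad P^{\top} \mathbf{1}_m = b
```

Each entry $P^\star_{ij}$ is the share of mass that links abstract variable $i$ to site $j$. Because all entries are solved for jointly, the result is a soft, global correspondence rather than a separate decision per site.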
Key facts
- PLOT stands for Progressive Localization via Optimal Transport.
- It is a transport-based framework for neural causal abstraction.
- It localizes causal variables from the output-effect geometry of abstract and neural interventions.
- It fits an optimal transport coupling between abstract variables and candidate neural sites.
- The coupling yields a global soft correspondence that can be calibrated into intervention handles.
- In simple settings, a single coupling over individual neurons suffices.
- In larger models, PLOT is applied progressively.
- The framework addresses the computational burden of existing methods such as DAS.
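
The coupling step listed above can be illustrated with a minimal sketch. This is not the PLOT implementation: the effect signatures are random synthetic data, the squared-distance cost, the uniform marginals, and the entropic (Sinkhorn) solver are all assumptions chosen for simplicity, and every name below (`sinkhorn_coupling`, `abstract_effects`, `site_effects`) is hypothetical.

```python
import numpy as np


def sinkhorn_coupling(cost, a, b, reg=0.2, n_iters=1000):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost : (m, n) array of pairwise costs
    a, b : marginal weights over rows and columns (each sums to 1)
    reg  : entropic regularization strength (larger means softer coupling)
    """
    K = np.exp(-cost / reg)          # Gibbs kernel, strictly positive
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]


rng = np.random.default_rng(0)
n_abstract, n_sites, n_probes = 3, 8, 16

# ASSUMPTION: each abstract variable and each candidate site is summarized
# by a vector of output effects measured on the same set of probe inputs.
abstract_effects = rng.normal(size=(n_abstract, n_probes))
site_effects = rng.normal(size=(n_sites, n_probes))

# Synthetic ground truth: make three sites mimic the three abstract variables.
planted = {0: 2, 1: 5, 2: 7}
for var, site in planted.items():
    site_effects[site] = abstract_effects[var] + 0.05 * rng.normal(size=n_probes)

# ASSUMPTION: cost is the mean squared difference between effect signatures,
# rescaled to [0, 1] so that `reg` has a predictable meaning.
diff = abstract_effects[:, None, :] - site_effects[None, :, :]
cost = (diff ** 2).mean(axis=-1)
cost = cost / cost.max()

# ASSUMPTION: uniform weights over abstract variables and over sites.
a = np.full(n_abstract, 1.0 / n_abstract)
b = np.full(n_sites, 1.0 / n_sites)

coupling = sinkhorn_coupling(cost, a, b)

# Row-normalize to read each row as a soft assignment over candidate sites.
soft_assignment = coupling / coupling.sum(axis=1, keepdims=True)
print(np.round(soft_assignment, 3))
```

Each row of `soft_assignment` is a distribution over candidate sites for one abstract variable. Note that the uniform column marginal forces every site to receive some mass, so the planted sites are favored but do not absorb a whole row; how PLOT sets its marginals and calibrates the coupling into intervention handles is not specified here.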