Metagame Framework Quantifies Second-Order Effects in AI Model Explanations
Researchers have developed a conceptual framework, called the metagame, for assessing second-order interaction effects in model explanations. The framework quantifies the directional influence of one feature on the attribution of another, termed meta-attribution, by treating the attribution method itself as a cooperative game and computing its Shapley value. The authors show theoretically that attributions decompose hierarchically into meta-attributions, which extend existing interaction indices with directionality. Empirically, the metagame yields useful insights across several interpretability settings: measuring token interactions in instruction-tuned language models, explaining cross-modal similarities in vision-language encoders, and interpreting concepts in text-to-image multimodal diffusion transformers. The study is available on arXiv in the computer science and machine learning sections.
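The core idea above can be illustrated with a minimal sketch: treat the attribution of a fixed feature i as the payoff of a second cooperative game over the remaining features, and take the Shapley value of feature j in that game as its meta-attribution. The toy model, the occlusion-style attribution, and all function names below are illustrative assumptions, not the paper's actual implementation.

```python
from itertools import combinations
from math import factorial

# Toy model with an interaction between features 0 and 1: f(x) = x0*x1 + x2.
def model(x):
    return x[0] * x[1] + x[2]

BASELINE = [0.0, 0.0, 0.0]   # absent features are set to this baseline
X = [1.0, 2.0, 3.0]          # the input being explained

def masked_input(subset):
    # Features in `subset` keep their true value; the rest are ablated.
    return [X[k] if k in subset else BASELINE[k] for k in range(len(X))]

def attribution(i, subset):
    # Occlusion-style attribution of feature i, computed with only the
    # features in `subset` (plus i itself) present. This plays the role
    # of the payoff function of the metagame.
    present = set(subset) | {i}
    return model(masked_input(present)) - model(masked_input(present - {i}))

def meta_attribution(i, j, n=3):
    # Shapley value of feature j in the metagame whose payoff is the
    # attribution of feature i: the directional influence of j on phi_i.
    others = [k for k in range(n) if k not in (i, j)]
    m = len(others)
    total = 0.0
    for size in range(m + 1):
        for S in combinations(others, size):
            weight = factorial(size) * factorial(m - size) / factorial(m + 1)
            total += weight * (attribution(i, set(S) | {j}) - attribution(i, set(S)))
    return total
```

On this toy model, `meta_attribution(0, 1)` is nonzero because x1 gates the contribution of x0 through the product term, while `meta_attribution(0, 2)` is zero since x2 enters additively. Exact enumeration is exponential in the number of features; the paper's settings (language models, vision-language encoders, diffusion transformers) would require the usual Shapley approximations.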
Key facts
- The metagame framework quantifies second-order interaction effects of model explanations.
- Meta-attribution measures directional influence of feature j on attribution of feature i.
- Attribution method is treated as a cooperative game and its Shapley value computed.
- Attributions hierarchically decompose into meta-attributions.
- Meta-attributions are directional extensions of existing interaction indices.
- Applications include token interactions in instruction-tuned language models.
- Applications include cross-modal similarity in vision-language encoders.
- Applications include interpreting text-to-image concepts in multimodal diffusion transformers.
Entities
Institutions
- arXiv