Metagame Framework Quantifies Second-Order Effects in AI Model Explanations
Researchers have developed a conceptual framework, called the metagame, for assessing second-order interaction effects in model explanations. The framework quantifies the directional influence of one feature on the attribution of another, termed meta-attribution, by treating the attribution method itself as a cooperative game and computing its Shapley value. The authors show theoretically that attributions decompose hierarchically into meta-attributions, which extend existing interaction indices with directionality. Empirically, the metagame yields useful insights across several interpretability settings: measuring token interactions in instruction-tuned language models, explaining cross-modal similarities in vision-language encoders, and interpreting concepts in text-to-image multimodal diffusion transformers. The study is available on arXiv in the computer science and machine learning sections.
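The core idea above can be illustrated with a minimal sketch: treat the attribution of a fixed feature i as the payoff of a second cooperative game over the remaining features, and take the Shapley value of feature j in that game as its meta-attribution. The toy model, the occlusion-style attribution, and all function names below are illustrative assumptions, not the paper's actual implementation.

```python
from itertools import combinations
from math import factorial

# Toy model with an interaction between features 0 and 1: f(x) = x0*x1 + x2.
def model(x):
    return x[0] * x[1] + x[2]

BASELINE = [0.0, 0.0, 0.0]   # absent features are set to this baseline
X = [1.0, 2.0, 3.0]          # the input being explained

def masked_input(subset):
    # Features in `subset` keep their true value; the rest are ablated.
    return [X[k] if k in subset else BASELINE[k] for k in range(len(X))]

def attribution(i, subset):
    # Occlusion-style attribution of feature i, computed with only the
    # features in `subset` (plus i itself) present. This plays the role
    # of the payoff function of the metagame.
    present = set(subset) | {i}
    return model(masked_input(present)) - model(masked_input(present - {i}))

def meta_attribution(i, j, n=3):
    # Shapley value of feature j in the metagame whose payoff is the
    # attribution of feature i: the directional influence of j on phi_i.
    others = [k for k in range(n) if k not in (i, j)]
    m = len(others)
    total = 0.0
    for size in range(m + 1):
        for S in combinations(others, size):
            weight = factorial(size) * factorial(m - size) / factorial(m + 1)
            total += weight * (attribution(i, set(S) | {j}) - attribution(i, set(S)))
    return total
```

On this toy model, `meta_attribution(0, 1)` is nonzero because x1 gates the contribution of x0 through the product term, while `meta_attribution(0, 2)` is zero since x2 enters additively. Exact enumeration is exponential in the number of features; the paper's settings (language models, vision-language encoders, diffusion transformers) would require the usual Shapley approximations.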
Key facts
- The metagame framework quantifies second-order interaction effects of model explanations.
- Meta-attribution measures directional influence of feature j on attribution of feature i.
- Attribution method is treated as a cooperative game and its Shapley value computed.
- Attributions hierarchically decompose into meta-attributions.
- Meta-attributions are directional extensions of existing interaction indices.
- Applications include token interactions in instruction-tuned language models.
- Applications include cross-modal similarity in vision-language encoders.
- Applications include interpreting text-to-image concepts in multimodal diffusion transformers.
Entities
Institutions
- arXiv