ARTFEED — Contemporary Art Intelligence

Metagame Framework Quantifies Second-Order Effects in AI Model Explanations

ai-technology · 2026-05-09

Researchers have developed a conceptual framework, called the metagame, to assess second-order interaction effects in model explanations. The framework evaluates the directional impact of one feature on the attribution of another — termed meta-attribution — by modeling the attribution method itself as a cooperative game and computing its Shapley value. The researchers show theoretically that attributions decompose hierarchically into meta-attributions, which act as directional extensions of existing interaction indices. Empirically, the metagame yields insights across several interpretability settings, including measuring token interactions in instruction-tuned language models, elucidating cross-modal similarities in vision-language encoders, and interpreting concepts in text-to-image multimodal diffusion transformers. The study is available on arXiv under its computer science and machine learning listings.

Key facts

  • The metagame framework quantifies second-order interaction effects of model explanations.
  • Meta-attribution measures the directional influence of feature j on the attribution of feature i.
  • The attribution method is treated as a cooperative game whose Shapley value is computed.
  • Attributions hierarchically decompose into meta-attributions.
  • Meta-attributions are directional extensions of existing interaction indices.
  • Applications include token interactions in instruction-tuned language models.
  • Applications include cross-modal similarity in vision-language encoders.
  • Applications include interpreting text-to-image concepts in multimodal diffusion transformers.
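The idea behind the metagame can be made concrete with a small worked example. The sketch below is an illustration under stated assumptions, not the paper's implementation: absent features are masked to zero, exact Shapley attributions are computed for a three-feature toy model `f` with one interaction term, and the attribution of a feature i — recomputed with only a coalition of the other features unmasked — is treated as its own cooperative game whose Shapley value gives the directional meta-attribution of each feature j on i. The masking scheme, the toy model, and the restriction rule are all assumptions chosen for clarity.

```python
from itertools import combinations
from math import factorial

def shapley(value, players):
    """Exact Shapley values for the cooperative game `value` over `players`."""
    players = sorted(players)
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Standard Shapley weight for a coalition of size k
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Toy model: features 0 and 2 act additively; 0 and 1 interact.
x = [1.0, 2.0, 3.0]
def f(S):
    # Masking game (an assumption): features outside S are set to 0.
    a = [x[i] if i in S else 0.0 for i in range(3)]
    return a[0] + a[2] + a[0] * a[1]

players = {0, 1, 2}
attr = shapley(f, players)  # ordinary first-order attributions

def meta_attributions(i):
    """Shapley values of the metagame for feature i: how much each
    other feature j directionally shifts the attribution of i."""
    def v(S):
        # Attribution of i when only S ∪ {i} can be unmasked (an assumption
        # about how the restricted attribution is defined).
        sub = set(S) | {i}
        return shapley(lambda T: f(T & sub), players)[i]
    return shapley(v, players - {i})

m = meta_attributions(0)
```

In this toy setting the meta-attribution of feature 1 on feature 0 is positive (the interaction term x0·x1 inflates feature 0's credit once feature 1 is present), while feature 2, which only acts additively, has no directional influence on feature 0's attribution — and, by Shapley efficiency, the meta-attributions sum to the change in feature 0's attribution between the empty and full coalitions, mirroring the hierarchical decomposition the summary describes.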

Entities

Institutions

  • arXiv

Sources