ARTFEED — Contemporary Art Intelligence

New Benchmark Exposes Evaluation Collapse in VLM Explainability

ai-technology · 2026-05-23

Researchers have identified a fundamental flaw in how vision-language model (VLM) explainability is evaluated. Current post-hoc explainers are scored with unimodal perturbation metrics, which fail because multimodal datasets contain language priors and modality biases that make the two modalities redundant. The result is an evaluation collapse in which explainer rankings under visual and textual metrics show essentially no agreement (Kendall's τ = -0.06). To address this, the team introduces Synergistic Faithfulness (F_syn), a metric based on the Shapley Interaction Index that isolates the joint contribution of the two modalities. F_syn reportedly achieves high accuracy (ρ = 0.92) with a 24× computational speedup. The benchmark evaluates eight explainers across multiple VLM architectures, providing a more reliable standard for interpreting cross-modal reasoning.
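To make the reported Kendall's τ concrete, the sketch below computes the tie-free form of the statistic (tau-a) for two rankings of the same items. The function name `kendall_tau` and the example rankings of eight explainers are hypothetical and chosen only for illustration; they are not data from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length score lists (assumes no ties)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = 0
    discordant = 0
    # Compare every pair of items: do both rankings order the pair the same way?
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of eight explainers under a visual and a textual metric.
visual_rank = [1, 2, 3, 4, 5, 6, 7, 8]
textual_rank = [4, 8, 1, 5, 2, 7, 6, 3]

print(kendall_tau(visual_rank, visual_rank))         # 1.0: identical rankings
print(kendall_tau(visual_rank, visual_rank[::-1]))   # -1.0: fully reversed rankings
print(kendall_tau(visual_rank, textual_rank))        # 0.0: no agreement either way
```

A value near zero, such as the reported -0.06, sits in the third regime: knowing an explainer's rank under one metric says almost nothing about its rank under the other.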

Key facts

  • Current VLM explainability metrics suffer from evaluation collapse due to cross-modal redundancy.
  • Kendall's τ = -0.06 indicates that visual and textual rankings show essentially no agreement.
  • Synergistic Faithfulness (F_syn) is based on the Shapley Interaction Index.
  • F_syn achieves high accuracy (ρ = 0.92) and a 24× computational speedup.
  • The benchmark evaluates eight explainers across multiple VLM architectures.
  • Multimodal datasets contain language priors and modality biases.
  • Unimodal perturbation metrics penalize faithful explainers.
  • The paper is published on arXiv with ID 2605.22168.
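The Shapley Interaction Index mentioned above can be illustrated for the simplest case of two players, here the image and the text. With only two players the index reduces to v(both) - v(image) - v(text) + v(none). The sketch below is not the paper's F_syn metric; the function names and the value-function numbers (for example, model confidence when only some modalities are kept) are assumptions made for illustration.

```python
def pairwise_interaction(v_none, v_image, v_text, v_both):
    """Shapley Interaction Index for a two-player game (image, text).

    Positive: the modalities are synergistic (the whole exceeds the parts).
    Negative: the modalities are redundant (each alone already explains most).
    """
    return v_both - v_image - v_text + v_none


def shapley_values(v_none, v_image, v_text, v_both):
    """Per-modality Shapley values for the same two-player game."""
    # Average each player's marginal contribution over both join orders.
    phi_image = 0.5 * ((v_image - v_none) + (v_both - v_text))
    phi_text = 0.5 * ((v_text - v_none) + (v_both - v_image))
    return phi_image, phi_text


# Hypothetical redundant case: a language prior lets text alone solve the task,
# and the image alone does equally well, so combining them adds nothing.
redundant = pairwise_interaction(v_none=0.0, v_image=0.8, v_text=0.8, v_both=0.8)

# Hypothetical synergistic case: neither modality helps much on its own.
synergistic = pairwise_interaction(v_none=0.0, v_image=0.1, v_text=0.1, v_both=0.9)

print(redundant < 0, synergistic > 0)
```

In the redundant case the per-modality Shapley values split the shared credit, which is one way a unimodal perturbation score can look weak for an explainer that is in fact faithful; the interaction term separates that shared credit from genuinely joint contributions.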

Entities

Institutions

  • arXiv