New Benchmark Exposes Evaluation Collapse in VLM Explainability

ai-technology · 2026-05-23

Researchers have identified a fundamental flaw in how explainability for vision-language models (VLMs) is evaluated. Post-hoc explainers are currently scored with unimodal perturbation metrics, which break down because multimodal datasets contain language priors and modality biases. These produce cross-modal redundancy: when one modality alone can carry the prediction, perturbing the other barely changes the output, so a faithful explainer can still score poorly. The result is an evaluation collapse in which the visual and textual rankings of explainers show almost no agreement (Kendall's τ = -0.06). To address this, the team introduces Synergistic Faithfulness (F_syn), a metric based on the Shapley Interaction Index that isolates the joint contribution of the two modalities. F_syn is reported to achieve high accuracy (a correlation of ρ = 0.92) with a 24× computational speedup. The benchmark evaluates eight explainers across multiple VLM architectures, with the aim of providing a more reliable standard for interpreting cross-modal reasoning.
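
The idea of isolating a joint contribution can be illustrated with the general two-player form of the Shapley Interaction Index, where the players are the image and the text. The sketch below is not the paper's F_syn definition, which this summary does not give; the function name, the coalition labels, and the toy scores are all hypothetical.

```python
from typing import Callable, FrozenSet

Coalition = FrozenSet[str]


def pairwise_interaction(value: Callable[[Coalition], float]) -> float:
    """Shapley Interaction Index for a two-player game (image, text).

    With only two players the index reduces to
    v({image, text}) - v({image}) - v({text}) + v({}).
    """
    both = value(frozenset({"image", "text"}))
    image_only = value(frozenset({"image"}))
    text_only = value(frozenset({"text"}))
    neither = value(frozenset())
    return both - image_only - text_only + neither


# Hypothetical model confidences when only the listed modalities are kept.
# Synergistic case: the answer needs both modalities together.
synergistic = {
    frozenset(): 0.10,
    frozenset({"image"}): 0.20,
    frozenset({"text"}): 0.60,
    frozenset({"image", "text"}): 0.90,
}

# Redundant case (language prior): the text alone already gives the answer.
redundant = {
    frozenset(): 0.10,
    frozenset({"image"}): 0.10,
    frozenset({"text"}): 0.90,
    frozenset({"image", "text"}): 0.90,
}

print(pairwise_interaction(synergistic.__getitem__))  # approximately 0.2
print(pairwise_interaction(redundant.__getitem__))    # approximately 0
```

In the redundant case the interaction term is roughly zero even though the full input scores highly, which is the kind of situation where perturbing the image alone says little about whether an explainer is faithful.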

Key facts

Current VLM explainability metrics suffer from evaluation collapse due to cross-modal redundancy.
Kendall's τ = -0.06 between visual and textual rankings means the two rankings show almost no agreement.
Synergistic Faithfulness (F_syn) is based on the Shapley Interaction Index.
F_syn achieves a correlation of ρ = 0.92 and a 24× computational speedup.
The benchmark evaluates eight explainers across multiple VLM architectures.
Multimodal datasets contain language priors and modality biases.
Unimodal perturbation metrics penalize faithful explainers.
The paper is published on arXiv with ID 2605.22168.
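
As a rough illustration of what a Kendall's τ near zero means, the sketch below computes τ for two rankings of eight explainers. The rankings are invented for the example and are not the benchmark's data; the helper implements the simple tau-a form and assumes there are no ties.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs, no ties."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both rankings order items i and j the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical ranks of eight explainers under a visual and a textual metric.
visual_ranks = [1, 2, 3, 4, 5, 6, 7, 8]
textual_ranks = [5, 2, 8, 3, 6, 1, 7, 4]

print(kendall_tau(visual_ranks, visual_ranks))        # 1.0, identical rankings
print(kendall_tau(visual_ranks, visual_ranks[::-1]))  # -1.0, fully reversed
print(kendall_tau(visual_ranks, textual_ranks))       # 0.0, no agreement
```

A value close to zero, like the reported -0.06, sits far from both extremes: knowing how an explainer ranks under one metric tells you almost nothing about its rank under the other.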
