New Benchmark Exposes Evaluation Collapse in VLM Explainability
Researchers have identified a fundamental flaw in how explainability for vision-language models (VLMs) is evaluated. Current post-hoc explainers are scored with unimodal perturbation metrics, which break down because multimodal datasets contain language priors and modality biases that make the two modalities partly redundant. The result is an evaluation collapse in which visual and textual rankings disagree (Kendall's τ = -0.06, essentially no rank correlation). To address this, the team introduces Synergistic Faithfulness (F_syn), a metric based on the Shapley Interaction Index that isolates the joint contribution of the two modalities. F_syn is reported to achieve high accuracy (ρ = 0.92) with a 24× computational speedup. The benchmark evaluates eight explainers across multiple VLM architectures, aiming to provide a more reliable standard for interpreting cross-modal reasoning.
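To make the τ = -0.06 figure concrete, the following is a minimal sketch of how Kendall's τ compares two rankings of the same items. It uses the tie-free tau-a form; which variant the paper uses is not stated here, and the function name and the example scores are purely illustrative, not data from the benchmark.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a for two score lists over the same items (assumes no ties).

    tau = (concordant pairs - discordant pairs) / total pairs, in [-1, 1].
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same items")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")

    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both metrics order items i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Illustrative (made-up) scores for four explainers under two metrics.
visual_metric = [1, 2, 3, 4]
print(kendall_tau(visual_metric, [1, 2, 3, 4]))  # identical order -> 1.0
print(kendall_tau(visual_metric, [4, 3, 2, 1]))  # reversed order  -> -1.0
```

A value near zero, such as -0.06, sits between these extremes: the two metrics order the items almost independently of each other.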
Key facts
- Current VLM explainability metrics suffer from evaluation collapse due to cross-modal redundancy.
- Kendall's τ = -0.06 between visual and textual rankings means the two show essentially no agreement.
- Synergistic Faithfulness (F_syn) is based on the Shapley Interaction Index.
- F_syn achieves high accuracy (ρ = 0.92) and a 24× computational speedup.
- The benchmark evaluates eight explainers across multiple VLM architectures.
- Multimodal datasets contain language priors and modality biases.
- Unimodal perturbation metrics penalize faithful explainers.
- The paper is published on arXiv with ID 2605.22168.
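As a rough illustration of the idea behind a Shapley-Interaction-based metric, the sketch below treats vision and text as the two players of a cooperative game. This is not the paper's definition of F_syn, which is not given here; it shows only the standard two-player Shapley values and pairwise interaction term. The function name and the value-function inputs (model scores with neither, one, or both modalities intact) are hypothetical.

```python
def two_modality_shapley(v_none, v_vision, v_text, v_both):
    """Shapley values and interaction for a two-player game (vision, text).

    Each argument is a hypothetical model score when only the named
    modalities are left intact (v_none: both masked, v_both: both present).
    """
    # Shapley value: average marginal contribution over both join orders.
    phi_vision = 0.5 * ((v_vision - v_none) + (v_both - v_text))
    phi_text = 0.5 * ((v_text - v_none) + (v_both - v_vision))
    # Two-player Shapley interaction: what the pair adds beyond the sum
    # of the individual effects (positive = synergy, negative = redundancy).
    interaction = v_both - v_vision - v_text + v_none
    return {"vision": phi_vision, "text": phi_text, "interaction": interaction}


# Redundant case: either modality alone already yields the full score.
redundant = two_modality_shapley(v_none=0.0, v_vision=1.0, v_text=1.0, v_both=1.0)

# Synergistic case: the score appears only when both modalities are present.
synergistic = two_modality_shapley(v_none=0.0, v_vision=0.0, v_text=0.0, v_both=1.0)

print(redundant)
print(synergistic)
```

In both toy cases each modality receives the same Shapley value (0.5), yet the interaction term is -1.0 for the redundant case and +1.0 for the synergistic one. This is one way to see why a per-modality score alone cannot separate redundancy from genuine joint reasoning.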
Entities
Institutions
- arXiv