VL-LCM: A New Metric for Evaluating Vision-Language Logical Consistency in MLLMs Without Ground-Truth Annotations
Researchers propose the Vision-Language Logical Consistency Metric (VL-LCM) to evaluate multimodal large language models (MLLMs) for logical consistency without requiring ground-truth annotations. The metric is grounded in basic logic principles, assessing both sufficient and necessary cause-effect relations in vision-language tasks. VL-LCM is applied both to traditional multiple-choice VQA (MC-VQA) tests and to the more recent NaturalBench tests. Systematic experiments on the MMMU and NaturalBench benchmarks cover 11 open-source MLLMs from 4 frontier model families. The findings show that while recent MLLMs have made significant progress in accuracy, their logical consistency still lags behind. The study also examines the correlation between VL-LCM and ground-truth metrics, the reliability of VL-LCM, and related questions.
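To make the idea concrete, here is a minimal Python sketch of how a ground-truth-free consistency score over sufficiency/necessity probe pairs might be computed. All names here (`logical_consistency_score`, the probe pairs, the compatibility checker, `dummy_answer_fn`) are illustrative assumptions, not the paper's actual VL-LCM formulation: the point is only that consistency between logically linked answers can be scored without any answer labels.

```python
"""Hypothetical sketch of a ground-truth-free logical-consistency score.

A "probe pair" links two logically related questions about the same image:
  - sufficiency: if the model answers Q with A, it should also affirm
    a question that answer A logically entails;
  - necessity: if the model denies a necessary condition of A, it should
    not answer Q with A.
Consistency is checked between the model's own answers, so no
ground-truth labels are needed.
"""

from typing import Callable, List, Tuple

# (question_1, question_2, checker deciding whether two answers are compatible)
ProbePair = Tuple[str, str, Callable[[str, str], bool]]


def logical_consistency_score(
    answer_fn: Callable[[str, str], str],  # (image_path, question) -> answer
    image_path: str,
    probe_pairs: List[ProbePair],
) -> float:
    """Fraction of probe pairs on which the model's two answers are
    logically compatible."""
    if not probe_pairs:
        return 0.0
    consistent = 0
    for q1, q2, compatible in probe_pairs:
        a1 = answer_fn(image_path, q1)
        a2 = answer_fn(image_path, q2)
        if compatible(a1, a2):
            consistent += 1
    return consistent / len(probe_pairs)


# Example: a multiple-choice question plus a yes/no probe derived from one
# option. Choosing "(B) black" entails answering "yes" to the probe, and
# vice versa; the lambda encodes exactly that biconditional.
pairs: List[ProbePair] = [
    (
        "What color is the dog? (A) white (B) black (C) brown",
        "Is the dog black? Answer yes or no.",
        lambda a1, a2: a1.strip().upper().startswith("B")
        == a2.strip().lower().startswith("yes"),
    ),
]


def dummy_answer_fn(image_path: str, question: str) -> str:
    # Stand-in for an MLLM call; always claims the dog is black.
    return "B" if "(A)" in question else "yes"


print(logical_consistency_score(dummy_answer_fn, "dog.jpg", pairs))  # 1.0
```

A model can score high on such a metric while being wrong about the image, and low while often being right; that independence from correctness is what lets the score be computed without annotations, and it is why the study separately checks how the metric correlates with ground-truth accuracy.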
Key facts
- VL-LCM evaluates vision-language logical consistency without ground-truth annotations.
- Metric is based on sufficient and necessary cause-effect relations.
- Applied to MC-VQA and NaturalBench tests.
- Tested on 11 open-source MLLMs from 4 frontier families.
- Evaluated on MMMU and NaturalBench benchmarks.
- Recent MLLMs show accuracy progress but logical consistency lags.
- Study also examines correlations between VL-LCM and ground-truth metrics, and the reliability of the metric.
- Published on arXiv with ID 2605.06201.