VL-LCM: A New Metric for Evaluating Vision-Language Logical Consistency in MLLMs Without Ground-Truth Annotations
Researchers propose the Vision-Language Logical Consistency Metric (VL-LCM) to evaluate multimodal large language models (MLLMs) for logical consistency without requiring ground-truth annotations. The metric is grounded in basic logic principles, assessing both sufficient and necessary cause-effect relations in vision-language tasks. VL-LCM is applied both to traditional multiple-choice VQA (MC-VQA) tests and to the more recent NaturalBench tests. Systematic experiments on the MMMU and NaturalBench benchmarks cover 11 open-source MLLMs from 4 frontier model families. The findings show that while recent MLLMs have made significant progress in accuracy, their logical consistency still lags behind. The study also examines the correlation between VL-LCM and ground-truth metrics, the reliability of VL-LCM, and related questions.
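To make the idea concrete, here is a minimal Python sketch of how a ground-truth-free consistency score over sufficiency/necessity probe pairs might be computed. All names here (`logical_consistency_score`, the probe pairs, the compatibility checker, `dummy_answer_fn`) are illustrative assumptions, not the paper's actual VL-LCM formulation: the point is only that consistency between logically linked answers can be scored without any answer labels.

```python
"""Hypothetical sketch of a ground-truth-free logical-consistency score.

A "probe pair" links two logically related questions about the same image:
  - sufficiency: if the model answers Q with A, it should also affirm
    a question that answer A logically entails;
  - necessity: if the model denies a necessary condition of A, it should
    not answer Q with A.
Consistency is checked between the model's own answers, so no
ground-truth labels are needed.
"""

from typing import Callable, List, Tuple

# (question_1, question_2, checker deciding whether two answers are compatible)
ProbePair = Tuple[str, str, Callable[[str, str], bool]]


def logical_consistency_score(
    answer_fn: Callable[[str, str], str],  # (image_path, question) -> answer
    image_path: str,
    probe_pairs: List[ProbePair],
) -> float:
    """Fraction of probe pairs on which the model's two answers are
    logically compatible."""
    if not probe_pairs:
        return 0.0
    consistent = 0
    for q1, q2, compatible in probe_pairs:
        a1 = answer_fn(image_path, q1)
        a2 = answer_fn(image_path, q2)
        if compatible(a1, a2):
            consistent += 1
    return consistent / len(probe_pairs)


# Example: a multiple-choice question plus a yes/no probe derived from one
# option. Choosing "(B) black" entails answering "yes" to the probe, and
# vice versa; the lambda encodes exactly that biconditional.
pairs: List[ProbePair] = [
    (
        "What color is the dog? (A) white (B) black (C) brown",
        "Is the dog black? Answer yes or no.",
        lambda a1, a2: a1.strip().upper().startswith("B")
        == a2.strip().lower().startswith("yes"),
    ),
]


def dummy_answer_fn(image_path: str, question: str) -> str:
    # Stand-in for an MLLM call; always claims the dog is black.
    return "B" if "(A)" in question else "yes"


print(logical_consistency_score(dummy_answer_fn, "dog.jpg", pairs))  # 1.0
```

A model can score high on such a metric while being wrong about the image, and low while often being right; that independence from correctness is what lets the score be computed without annotations, and it is why the study separately checks how the metric correlates with ground-truth accuracy.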
Key facts
- VL-LCM evaluates vision-language logical consistency without ground-truth annotations.
- Metric is based on sufficient and necessary cause-effect relations.
- Applied to MC-VQA and NaturalBench tests.
- Tested on 11 open-source MLLMs from 4 frontier families.
- Evaluated on MMMU and NaturalBench benchmarks.
- Recent MLLMs show accuracy progress but logical consistency lags.
- Study also examines correlations between VL-LCM and ground-truth metrics, and the reliability of the metric.
- Published on arXiv with ID 2605.06201.