Omnimodal LLMs Fail to Act on Perceptual Contradictions
A new arXiv preprint introduces IMAVB, a benchmark of 500 long-form movie clips that tests whether omnimodal large language models can detect conflicts between a textual premise and their own sensory input. The study identifies a "Representation-Action Gap": Gemini 3.1 Pro and eight open-source systems encode premise-perception mismatches in their hidden states, yet almost never reject false claims in their outputs. The benchmark uses a 2x2 design crossing target modality (vision or audio) with premise condition (standard or misleading). The findings suggest that current omnimodal LLMs fail at a basic form of grounding, raising questions about their reliability as perception-grounded agents.
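A gap of this kind is typically demonstrated by training a linear probe on the model's hidden states and comparing probe accuracy against the model's behavioral rejection rate; the summary does not describe the paper's exact method, so the sketch below is a generic, hypothetical illustration on synthetic data, not the study's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: 400 "clips" x 32 dims.
# Conflict clips (label 1) are shifted along a random direction, mimicking
# a mismatch signal that is linearly decodable from the representation.
n, d = 400, 32
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
hidden = rng.normal(size=(n, d)) + np.outer(labels, direction)

# Hold out the last 100 examples for evaluating the probe.
Xtr, Xte = hidden[:300], hidden[300:]
ytr, yte = labels[:300], labels[300:]

# Minimal logistic-regression probe trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(Xtr @ w + b, -30, 30)       # clip to avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (Xtr.T @ (p - ytr)) / len(ytr)
    b -= 0.5 * np.mean(p - ytr)

probe_acc = np.mean(((Xte @ w + b) > 0).astype(int) == yte)
print(f"probe accuracy: {probe_acc:.2f}")
```

If a probe like this reads out the conflict near-perfectly while the model's generated answers almost never reject the false premise, that contrast is what the article calls a Representation-Action Gap; the behavioral rejection rate itself would be measured from the model's text outputs, which this synthetic sketch does not simulate.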
Key facts
- IMAVB benchmark contains 500 long-form movie clips.
- Study tests conflict detection across vision and audio modalities.
- Eight open-source omnimodal LLMs and Gemini 3.1 Pro were evaluated.
- Representation-Action Gap: hidden states encode mismatches but outputs do not reject false claims.
- Models fall into two behavioral categories.
- Benchmark uses 2x2 design: target modality (vision, audio) and premise condition (standard, misleading).
- Research highlights untested grounding in omnimodal models.
- Published on arXiv (arXiv:2605.13737).
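The 2x2 design listed above can be spelled out as a small condition grid; the condition names come from the summary, while the tuple layout is an illustrative assumption.

```python
from itertools import product

# The two crossed factors reported for the benchmark.
modalities = ("vision", "audio")            # target modality
premises = ("standard", "misleading")       # premise condition

# Each clip/question pair can fall into one of four cells.
conditions = list(product(modalities, premises))
print(conditions)
# [('vision', 'standard'), ('vision', 'misleading'),
#  ('audio', 'standard'), ('audio', 'misleading')]
```

The "misleading" cells are the ones that matter for the gap: there the textual premise contradicts what the clip actually shows or plays, so a grounded model should reject it.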