Omnimodal LLMs Fail to Act on Perceptual Contradictions
A new arXiv preprint introduces IMAVB, a benchmark of 500 long-form movie clips that tests whether omnimodal large language models can detect conflicts between a textual premise and their own sensory input. The study identifies a "Representation-Action Gap": Gemini 3.1 Pro and eight open-source systems encode premise-perception mismatches in their hidden states, yet almost never reject false claims in their outputs. The benchmark uses a 2x2 design crossing target modality (vision or audio) with premise condition (standard or misleading). The findings suggest that current omnimodal LLMs fail at a basic form of grounding, raising questions about their reliability as perception-grounded agents.
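A gap of this kind is typically demonstrated by training a linear probe on the model's hidden states and comparing probe accuracy against the model's behavioral rejection rate; the summary does not describe the paper's exact method, so the sketch below is a generic, hypothetical illustration on synthetic data, not the study's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: 400 "clips" x 32 dims.
# Conflict clips (label 1) are shifted along a random direction, mimicking
# a mismatch signal that is linearly decodable from the representation.
n, d = 400, 32
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
hidden = rng.normal(size=(n, d)) + np.outer(labels, direction)

# Hold out the last 100 examples for evaluating the probe.
Xtr, Xte = hidden[:300], hidden[300:]
ytr, yte = labels[:300], labels[300:]

# Minimal logistic-regression probe trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(Xtr @ w + b, -30, 30)       # clip to avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (Xtr.T @ (p - ytr)) / len(ytr)
    b -= 0.5 * np.mean(p - ytr)

probe_acc = np.mean(((Xte @ w + b) > 0).astype(int) == yte)
print(f"probe accuracy: {probe_acc:.2f}")
```

If a probe like this reads out the conflict near-perfectly while the model's generated answers almost never reject the false premise, that contrast is what the article calls a Representation-Action Gap; the behavioral rejection rate itself would be measured from the model's text outputs, which this synthetic sketch does not simulate.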
Key facts
- IMAVB benchmark contains 500 long-form movie clips.
- Study tests conflict detection across vision and audio modalities.
- Eight open-source omnimodal LLMs and Gemini 3.1 Pro were evaluated.
- Representation-Action Gap: hidden states encode mismatches but outputs do not reject false claims.
- Models fall into two behavioral categories.
- Benchmark uses 2x2 design: target modality (vision, audio) and premise condition (standard, misleading).
- Research highlights untested grounding in omnimodal models.
- Published on arXiv (arXiv:2605.13737).
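The 2x2 design listed above can be spelled out as a small condition grid; the condition names come from the summary, while the tuple layout is an illustrative assumption.

```python
from itertools import product

# The two crossed factors reported for the benchmark.
modalities = ("vision", "audio")            # target modality
premises = ("standard", "misleading")       # premise condition

# Each clip/question pair can fall into one of four cells.
conditions = list(product(modalities, premises))
print(conditions)
# [('vision', 'standard'), ('vision', 'misleading'),
#  ('audio', 'standard'), ('audio', 'misleading')]
```

The "misleading" cells are the ones that matter for the gap: there the textual premise contradicts what the clip actually shows or plays, so a grounded model should reject it.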