ARTFEED — Contemporary Art Intelligence

Omnimodal LLMs Fail to Act on Perceptual Contradictions

ai-technology · 2026-05-14

A new arXiv preprint introduces IMAVB, a benchmark of 500 long-form movie clips that tests whether omnimodal large language models can detect conflicts between textual premises and their own sensory input. The study reveals a 'Representation-Action Gap': Gemini 3.1 Pro and eight open-source systems encode premise-perception mismatches in their hidden states, yet almost never reject false claims in their outputs. The benchmark uses a 2x2 design crossing target modality (vision, audio) with premise condition (standard, misleading). The findings suggest that current omnimodal LLMs fail at a basic form of grounding, raising questions about their reliability as perception-grounded agents.

Key facts

  • IMAVB benchmark contains 500 long-form movie clips.
  • Study tests conflict detection across vision and audio modalities.
  • Eight open-source omnimodal LLMs and Gemini 3.1 Pro were evaluated.
  • Representation-Action Gap: hidden states encode mismatches but outputs do not reject false claims.
  • Models fall into two behavioral categories.
  • Benchmark uses 2x2 design: target modality (vision, audio) and premise condition (standard, misleading).
  • Research highlights untested grounding in omnimodal models.
  • Published on arXiv with ID 2605.13737.
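To make the 2x2 design and the gap concrete, here is a minimal sketch of how one might quantify a Representation-Action Gap: compare how often a probe on hidden states detects a premise-perception mismatch against how often the model's text output actually rejects the false premise. All function names and scores below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: Representation-Action Gap per benchmark condition.
# probe accuracy  = how often internal states encode the mismatch
# rejection rate  = how often the model's output rejects the false claim
# Scores are fabricated for illustration only.
from itertools import product

MODALITIES = ("vision", "audio")        # target modality axis
PREMISES = ("standard", "misleading")   # premise condition axis


def representation_action_gap(probe_acc: float, rejection_rate: float) -> float:
    """Gap between internal detection and behavioral rejection."""
    return probe_acc - rejection_rate


# Illustrative per-condition scores (invented numbers).
scores = {
    ("vision", "misleading"): {"probe": 0.85, "reject": 0.10},
    ("audio", "misleading"): {"probe": 0.80, "reject": 0.05},
}

for modality, premise in product(MODALITIES, PREMISES):
    cond = (modality, premise)
    if cond in scores:  # only misleading conditions have a false premise
        s = scores[cond]
        gap = representation_action_gap(s["probe"], s["reject"])
        print(f"{modality}/{premise}: gap = {gap:.2f}")
```

A large positive gap under the misleading conditions would match the paper's finding: the information is present internally but does not surface in behavior.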

Entities

Institutions

  • arXiv
