Research Reveals Visual Preference Shift in Omni-modal Large Language Models
A recent study posted on arXiv (ID: 2604.16902v1) investigates modality preference in Omni-modal Large Language Models (OLLMs) and documents a shift away from text toward visual inputs. The researchers built a conflict-based benchmark, in which the textual and visual inputs support conflicting answers, together with a modality selection rate metric, and used them to evaluate ten representative OLLMs. Most of the models exhibit a clear visual preference, in contrast to the text-centric behavior of traditional Vision-Language Models (VLMs). Layer-wise probing shows that this preference emerges progressively in the mid-to-late layers. Leveraging these internal signals, the authors diagnose cross-modal hallucinations and achieve competitive performance on three downstream multi-modal benchmarks, challenging prevailing assumptions about text dominance in multi-modal AI.
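The paper's exact formulation of the modality selection rate is not reproduced here; the sketch below is a minimal Python illustration assuming each conflict item pairs a model prediction with the answer supported by the image and the answer supported by the text. Field names such as `vision_answer` and `text_answer` are illustrative placeholders, not the authors' data format.

```python
from typing import Dict, List

def modality_selection_rate(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of conflict items where the model's answer matches the answer
    supported by each modality. Keys and matching rule are illustrative only."""
    counts = {"vision": 0, "text": 0, "other": 0}
    for r in records:
        pred = r["prediction"].strip().lower()
        if pred == r["vision_answer"].strip().lower():
            counts["vision"] += 1
        elif pred == r["text_answer"].strip().lower():
            counts["text"] += 1
        else:
            counts["other"] += 1
    total = max(len(records), 1)
    return {k: v / total for k, v in counts.items()}

# Example: two conflict items where image and text support different answers.
records = [
    {"prediction": "red", "vision_answer": "red", "text_answer": "blue"},
    {"prediction": "blue", "vision_answer": "red", "text_answer": "blue"},
]
print(modality_selection_rate(records))  # {'vision': 0.5, 'text': 0.5, 'other': 0.0}
```

A high "vision" rate on such items would correspond to the visual preference the study reports.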
Key facts
- Paper (arXiv ID: 2604.16902v1) presents new research on modality preference in Omni-modal Large Language Models
- Study systematically quantifies modality preference using a conflict-based benchmark
- Evaluation of ten representative OLLMs reveals a shift from text dominance to visual preference
- Modality selection rate metric developed for evaluation
- Layer-wise probing shows preference emerges in mid-to-late layers (see the probing sketch after this list)
- Research leverages internal signals to diagnose cross-modal hallucinations
- Achieves competitive performance across three downstream multi-modal benchmarks
- Addresses critical gap in understanding native OLLM architecture behavior
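As a purely illustrative picture of what layer-wise probing involves (the paper's actual probing setup is not detailed here), the sketch below fits a logistic-regression probe to per-layer hidden states and reports cross-validated accuracy at predicting which modality the model followed. The activation tensor is synthetic random data standing in for real OLLM activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(hidden_states: np.ndarray, labels: np.ndarray) -> list:
    """hidden_states: (num_layers, num_examples, hidden_dim) activations on
    conflict items; labels: 1 if the model followed the visual answer, else 0.
    Returns cross-validated probe accuracy for each layer."""
    accuracies = []
    for layer_acts in hidden_states:
        probe = LogisticRegression(max_iter=1000)
        scores = cross_val_score(probe, layer_acts, labels, cv=5)
        accuracies.append(scores.mean())
    return accuracies

# Toy stand-in for real OLLM activations: 24 layers, 200 items, 64-dim states.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(24, 200, 64))
labels = rng.integers(0, 2, size=200)
hidden[12:] += labels[None, :, None] * 0.5  # make only the later layers label-informative
print(probe_layers(hidden, labels))  # probe accuracy climbs in the mid-to-late layers
```

In a setup like this, the depth at which probe accuracy starts to rise indicates where a modality preference becomes linearly decodable, which is the kind of signal the study associates with the mid-to-late layers.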
Entities
Institutions
- arXiv