Research Reveals Visual Preference Shift in Omni-modal Large Language Models
A recent study posted on arXiv (ID: 2604.16902v1) investigates modality preference in Omni-modal Large Language Models (OLLMs) and documents a shift away from text toward visual inputs. The researchers built a conflict-based benchmark, in which the textual and visual inputs support conflicting answers, together with a modality selection rate metric, and used them to evaluate ten representative OLLMs. Most of the models exhibit a clear visual preference, in contrast to the text-centric behavior of traditional Vision-Language Models (VLMs). Layer-wise probing shows that this preference emerges progressively in the mid-to-late layers. Leveraging these internal signals, the authors diagnose cross-modal hallucinations and achieve competitive performance on three downstream multi-modal benchmarks, challenging prevailing assumptions about text dominance in multi-modal AI.
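The paper's exact formulation of the modality selection rate is not reproduced here; the sketch below is a minimal Python illustration assuming each conflict item pairs a model prediction with the answer supported by the image and the answer supported by the text. Field names such as `vision_answer` and `text_answer` are illustrative placeholders, not the authors' data format.

```python
from typing import Dict, List

def modality_selection_rate(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of conflict items where the model's answer matches the answer
    supported by each modality. Keys and matching rule are illustrative only."""
    counts = {"vision": 0, "text": 0, "other": 0}
    for r in records:
        pred = r["prediction"].strip().lower()
        if pred == r["vision_answer"].strip().lower():
            counts["vision"] += 1
        elif pred == r["text_answer"].strip().lower():
            counts["text"] += 1
        else:
            counts["other"] += 1
    total = max(len(records), 1)
    return {k: v / total for k, v in counts.items()}

# Example: two conflict items where image and text support different answers.
records = [
    {"prediction": "red", "vision_answer": "red", "text_answer": "blue"},
    {"prediction": "blue", "vision_answer": "red", "text_answer": "blue"},
]
print(modality_selection_rate(records))  # {'vision': 0.5, 'text': 0.5, 'other': 0.0}
```

A high "vision" rate on such items would correspond to the visual preference the study reports.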
Key facts
- Paper (arXiv ID: 2604.16902v1) presents new research on modality preference in Omni-modal Large Language Models
- Study systematically quantifies modality preference using a conflict-based benchmark
- Evaluation of ten representative OLLMs reveals a shift from text dominance to visual preference
- Modality selection rate metric developed for evaluation
- Layer-wise probing shows preference emerges in mid-to-late layers (see the probing sketch after this list)
- Research leverages internal signals to diagnose cross-modal hallucinations
- Achieves competitive performance across three downstream multi-modal benchmarks
- Addresses critical gap in understanding native OLLM architecture behavior
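As a purely illustrative picture of what layer-wise probing involves (the paper's actual probing setup is not detailed here), the sketch below fits a logistic-regression probe to per-layer hidden states and reports cross-validated accuracy at predicting which modality the model followed. The activation tensor is synthetic random data standing in for real OLLM activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(hidden_states: np.ndarray, labels: np.ndarray) -> list:
    """hidden_states: (num_layers, num_examples, hidden_dim) activations on
    conflict items; labels: 1 if the model followed the visual answer, else 0.
    Returns cross-validated probe accuracy for each layer."""
    accuracies = []
    for layer_acts in hidden_states:
        probe = LogisticRegression(max_iter=1000)
        scores = cross_val_score(probe, layer_acts, labels, cv=5)
        accuracies.append(scores.mean())
    return accuracies

# Toy stand-in for real OLLM activations: 24 layers, 200 items, 64-dim states.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(24, 200, 64))
labels = rng.integers(0, 2, size=200)
hidden[12:] += labels[None, :, None] * 0.5  # make only the later layers label-informative
print(probe_layers(hidden, labels))  # probe accuracy climbs in the mid-to-late layers
```

In a setup like this, the depth at which probe accuracy starts to rise indicates where a modality preference becomes linearly decodable, which is the kind of signal the study associates with the mid-to-late layers.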
Entities
Institutions
- arXiv