Cross-Modal Information Flow in Audio-Visual LLMs
A new paper on arXiv (2605.10815) investigates how audio-visual large language models (AVLLMs) route cross-modal information between the audio and video streams. Analyzing multiple recent AVLLMs, the authors find that integrated audio-visual information is primarily encoded in attention-sink tokens (positions that attract a disproportionate share of attention mass), and that these sink tokens do not hold cross-modal information uniformly. The study aims to illuminate the internal mechanisms of AVLLMs, which remain largely unexplored compared with those of text-only and vision-language models.
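To make "sink tokens" concrete: in transformer LLMs, a few early positions (often the very first token) tend to absorb a disproportionate share of attention. The snippet below is a minimal sketch, not the paper's method, of how one might measure that sink mass given a model's post-softmax attention weights; the `sink_attention_mass` helper and the tensor shapes are illustrative assumptions.

```python
import torch

def sink_attention_mass(attn_weights: torch.Tensor, n_sink: int = 1) -> torch.Tensor:
    """Fraction of attention mass each query assigns to the first
    `n_sink` positions (the usual attention-sink locations).

    attn_weights: (batch, heads, q_len, k_len) post-softmax attention.
    Returns: (batch, heads) sink fraction, averaged over queries.
    """
    sink_mass = attn_weights[..., :n_sink].sum(dim=-1)  # (batch, heads, q_len)
    return sink_mass.mean(dim=-1)                        # average over queries

# Synthetic illustration; in practice the weights would come from a real
# AVLLM forward pass (e.g., with attentions exposed by the framework).
if __name__ == "__main__":
    torch.manual_seed(0)
    attn = torch.randn(2, 8, 16, 16).softmax(dim=-1)
    print(sink_attention_mass(attn, n_sink=1))  # per-head sink fractions
```

A high sink fraction alone does not show that the sink encodes useful content; it only flags where attention concentrates, which is why probing (sketched after the list below) is the natural follow-up.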
Key facts
- Paper available on arXiv with ID 2605.10815
- Focuses on cross-modal information flow between audio and visual modalities in AVLLMs
- Analyzes multiple recent AVLLMs
- Finds that AVLLMs primarily encode integrated audio-visual information in sink tokens
- Sink tokens do not hold cross-modal information uniformly (see the probing sketch after this list)
- AVLLMs support joint reasoning over audio, visual, and textual inputs within a single model
- The bidirectional interaction between audio and video introduces intricate internal processing dynamics
- Internal workings of AVLLMs remain largely unexplored
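The non-uniformity finding suggests probing each sink position separately. Below is a hedged sketch, not the authors' actual protocol, of a standard linear-probing setup: train a logistic-regression probe on hidden states extracted at one sink position and compare accuracy across positions. The `probe_sink_token` helper, the labels, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_sink_token(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Mean 5-fold cross-validated accuracy of a linear probe trained on
    one sink position's hidden states to predict an audio-visual property.

    hidden_states: (n_examples, d_model) activations at a single sink position.
    labels: (n_examples,) class labels for the probed property.
    """
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, hidden_states, labels, cv=5).mean()

# Synthetic illustration: probe two hypothetical sink positions. Differing
# accuracies would indicate cross-modal information is unevenly distributed.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=200)
    # "Sink A" activations weakly encode the label; "Sink B" is pure noise.
    sink_a = labels[:, None] * 0.5 + rng.normal(size=(200, 64))
    sink_b = rng.normal(size=(200, 64))
    print("sink A accuracy:", probe_sink_token(sink_a, labels))
    print("sink B accuracy:", probe_sink_token(sink_b, labels))
```

If probe accuracy is well above chance at some sink positions and near chance at others, that pattern would be consistent with the paper's claim that sink tokens do not uniformly hold cross-modal information.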