Cross-Modal Information Flow in Audio-Visual LLMs
A new paper on arXiv (2605.10815) investigates how audio-visual large language models (AVLLMs) route cross-modal information between the audio and video streams. Analyzing multiple recent AVLLMs, the authors find that integrated audio-visual information is primarily encoded in attention-sink tokens (positions that attract a disproportionate share of attention mass), and that these sink tokens do not hold cross-modal information uniformly. The study aims to illuminate the internal mechanisms of AVLLMs, which remain largely unexplored compared with those of text-only and vision-language models.
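To make "sink tokens" concrete: in transformer LLMs, a few early positions (often the very first token) tend to absorb a disproportionate share of attention. The snippet below is a minimal sketch, not the paper's method, of how one might measure that sink mass given a model's post-softmax attention weights; the `sink_attention_mass` helper and the tensor shapes are illustrative assumptions.

```python
import torch

def sink_attention_mass(attn_weights: torch.Tensor, n_sink: int = 1) -> torch.Tensor:
    """Fraction of attention mass each query assigns to the first
    `n_sink` positions (the usual attention-sink locations).

    attn_weights: (batch, heads, q_len, k_len) post-softmax attention.
    Returns: (batch, heads) sink fraction, averaged over queries.
    """
    sink_mass = attn_weights[..., :n_sink].sum(dim=-1)  # (batch, heads, q_len)
    return sink_mass.mean(dim=-1)                        # average over queries

# Synthetic illustration; in practice the weights would come from a real
# AVLLM forward pass (e.g., with attentions exposed by the framework).
if __name__ == "__main__":
    torch.manual_seed(0)
    attn = torch.randn(2, 8, 16, 16).softmax(dim=-1)
    print(sink_attention_mass(attn, n_sink=1))  # per-head sink fractions
```

A high sink fraction alone does not show that the sink encodes useful content; it only flags where attention concentrates, which is why probing (sketched after the list below) is the natural follow-up.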
Key facts
- Paper available on arXiv with ID 2605.10815
- Focuses on cross-modal information flow between audio and visual modalities in AVLLMs
- Analyzes multiple recent AVLLMs
- Finds that AVLLMs primarily encode integrated audio-visual information in sink tokens
- Sink tokens do not hold cross-modal information uniformly (see the probing sketch after this list)
- AVLLMs support joint reasoning over audio, visual, and textual inputs within a single model
- The bidirectional interaction between audio and video introduces intricate internal processing dynamics
- Internal workings of AVLLMs remain largely unexplored
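The non-uniformity finding suggests probing each sink position separately. Below is a hedged sketch, not the authors' actual protocol, of a standard linear-probing setup: train a logistic-regression probe on hidden states extracted at one sink position and compare accuracy across positions. The `probe_sink_token` helper, the labels, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_sink_token(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Mean 5-fold cross-validated accuracy of a linear probe trained on
    one sink position's hidden states to predict an audio-visual property.

    hidden_states: (n_examples, d_model) activations at a single sink position.
    labels: (n_examples,) class labels for the probed property.
    """
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, hidden_states, labels, cv=5).mean()

# Synthetic illustration: probe two hypothetical sink positions. Differing
# accuracies would indicate cross-modal information is unevenly distributed.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=200)
    # "Sink A" activations weakly encode the label; "Sink B" is pure noise.
    sink_a = labels[:, None] * 0.5 + rng.normal(size=(200, 64))
    sink_b = rng.normal(size=(200, 64))
    print("sink A accuracy:", probe_sink_token(sink_a, labels))
    print("sink B accuracy:", probe_sink_token(sink_b, labels))
```

If probe accuracy is well above chance at some sink positions and near chance at others, that pattern would be consistent with the paper's claim that sink tokens do not uniformly hold cross-modal information.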