ARTFEED — Contemporary Art Intelligence

Cross-Modal Information Flow in Audio-Visual LLMs

publication · 2026-05-12

A new paper on arXiv (2605.10815) investigates how audio-visual large language models (AVLLMs) route information between the audio and video modalities. Analyzing several recent AVLLMs, the authors find that integrated audio-visual information is concentrated in sink tokens (tokens that attract a disproportionately large share of attention), but that these sink tokens do not hold cross-modal information uniformly. The study targets the internal mechanisms of AVLLMs, which remain far less explored than those of text-only or vision-language models.

Key facts

  • Paper available on arXiv with ID 2605.10815
  • Focuses on cross-modal information flow between audio and visual modalities in AVLLMs
  • Analyzes multiple recent AVLLMs
  • Finds that AVLLMs primarily encode integrated audio-visual information in sink tokens
  • Sink tokens do not hold cross-modal information uniformly; the amount varies across sink tokens
  • AVLLMs support joint reasoning over audio, visual, and textual inputs within a single architecture
  • Bidirectional interaction between audio and video introduces intricate processing dynamics
  • Internal workings of AVLLMs remain largely unexplored
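The sink-token finding can be made concrete with a small illustration. The sketch below builds a toy attention matrix in which one token is biased to receive outsized attention, then flags tokens whose average received attention exceeds a multiple of the uniform share. This is a hypothetical demonstration of what "sink token" means, not the paper's measurement protocol; the bias value and threshold are assumptions chosen for clarity.

```python
import numpy as np

# Toy attention matrix: rows = query tokens, columns = key tokens.
# A "sink" token is one that receives a disproportionate share of
# attention from most queries.
rng = np.random.default_rng(0)
n_tokens = 8
logits = rng.normal(size=(n_tokens, n_tokens))
logits[:, 0] += 4.0  # bias attention toward token 0, making it a sink

# Row-wise softmax gives each query token's attention distribution.
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Average attention each key token receives across all queries.
received = attn.mean(axis=0)

# Flag tokens receiving more than 2x the uniform share (assumed threshold).
sink_threshold = 2.0 / n_tokens
sinks = np.flatnonzero(received > sink_threshold)
print("attention received:", np.round(received, 3))
print("sink token indices:", sinks)
```

In a real AVLLM one would instead read attention maps out of the model's layers; the paper's observation is that the cross-modal content carried by such high-attention tokens differs from token to token.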

Entities

Institutions

  • arXiv

Sources