ARTFEED — Contemporary Art Intelligence

Multi-layer Cross-Attention Proves Optimal for Multi-modal In-context Learning

ai-technology · 2026-04-30

A new theoretical paper on arXiv (2602.04872) shows that multi-layer cross-attention is provably optimal for multi-modal in-context learning. The study introduces a mathematically tractable framework, based on latent factor models, for analyzing transformer-like architectures. It proves that single-layer linear self-attention cannot achieve Bayes-optimal prediction uniformly across task distributions. To overcome this limitation, the authors propose a linearized cross-attention mechanism and show that stacking multiple cross-attention layers recovers Bayes-optimal performance in multi-modal settings. The work extends the theoretical understanding of in-context learning from unimodal to multi-modal data, providing a foundation for designing more effective multi-modal AI systems.
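The paper's exact construction is not reproduced in this summary, but the general idea of a latent factor model for multi-modal data can be sketched as follows: a shared latent vector generates observations in each modality through modality-specific loading matrices plus noise. All dimensions, matrices, and noise scales below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): latent dim k,
# modality dims d1 and d2, n in-context examples.
k, d1, d2, n = 4, 8, 6, 32

# Modality-specific loading matrices map the shared latent factor
# into each modality's observation space.
A1 = rng.standard_normal((d1, k))
A2 = rng.standard_normal((d2, k))

# A shared latent factor per example, plus independent observation noise.
Z = rng.standard_normal((n, k))
X1 = Z @ A1.T + 0.1 * rng.standard_normal((n, d1))  # modality-1 observations
X2 = Z @ A2.T + 0.1 * rng.standard_normal((n, d2))  # modality-2 observations

print(X1.shape, X2.shape)  # → (32, 8) (32, 6)
```

Because both modalities are driven by the same latent factor, information from one modality is predictive of the other, which is what a cross-attention mechanism can exploit in-context.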

Key facts

  • Paper on arXiv: 2602.04872
  • Focuses on multi-modal in-context learning
  • Uses latent factor model to represent multi-modal data
  • Single-layer linear self-attention is not Bayes-optimal
  • Proposes linearized cross-attention mechanism
  • Multi-layer cross-attention achieves Bayes-optimal performance
  • Extends theory from unimodal to multi-modal data
  • Provides framework for studying multi-modal learning
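The linearized cross-attention named in the key facts can be illustrated with a minimal sketch: queries come from one modality, keys and values from the other, and the softmax is dropped so the attention map is a plain bilinear form. The weight shapes and scaling are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

def linear_cross_attention(Xq, Xkv, Wq, Wk, Wv):
    """Linearized cross-attention (illustrative): queries from one
    modality attend over keys/values from another, with no softmax."""
    Q = Xq @ Wq                    # (n, p) queries from modality 1
    K = Xkv @ Wk                   # (n, p) keys from modality 2
    V = Xkv @ Wv                   # (n, p) values from modality 2
    scores = Q @ K.T / K.shape[1]  # linear attention scores, no softmax
    return scores @ V              # (n, p) attended output

# Hypothetical sizes: n tokens, modality dims d1/d2, projection dim p.
n, d1, d2, p = 16, 8, 6, 5
X1 = rng.standard_normal((n, d1))  # modality-1 tokens
X2 = rng.standard_normal((n, d2))  # modality-2 tokens

Wq = rng.standard_normal((d1, p))
Wk = rng.standard_normal((d2, p))
Wv = rng.standard_normal((d2, p))

out = linear_cross_attention(X1, X2, Wq, Wk, Wv)
print(out.shape)  # → (16, 5)
```

The paper's result concerns stacking several such layers: a single linear self-attention layer provably falls short of Bayes-optimality, while a multi-layer cross-attention stack can recover it.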

Entities

Institutions

  • arXiv

Sources