Multi-layer Cross-Attention Proves Optimal for Multi-modal In-context Learning
A new theoretical paper on arXiv (2602.04872) demonstrates that multi-layer cross-attention mechanisms are provably optimal for multi-modal in-context learning. The study introduces a mathematically tractable framework based on latent factor models to analyze transformer-like architectures. It proves that single-layer linear self-attention cannot achieve Bayes-optimal prediction uniformly across task distributions. To overcome this limitation, the authors propose a linearized cross-attention mechanism and show that stacking multiple cross-attention layers recovers Bayes-optimal performance in multi-modal settings. The work extends the theoretical understanding of in-context learning from unimodal to multi-modal data, providing a foundation for designing more effective multi-modal AI systems.
Key facts
- Paper on arXiv: 2602.04872
- Focuses on multi-modal in-context learning
- Uses latent factor model to represent multi-modal data
- Single-layer linear self-attention is not Bayes-optimal
- Proposes linearized cross-attention mechanism
- Multi-layer cross-attention achieves Bayes-optimal performance
- Extends theory from unimodal to multi-modal data
- Provides framework for studying multi-modal learning
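The mechanism at the heart of these results can be illustrated with a toy sketch. The paper's exact parameterization is not given here, so the dimensions, weight matrices, and normalization below are illustrative assumptions: queries come from one modality, keys and values from the other, and the softmax is dropped in favor of a plain linear form, with layers stacked by feeding each output back in as the next layer's queries.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_cross_attention(x_query, x_context, W_q, W_k, W_v):
    """One linearized cross-attention layer (illustrative sketch):
    queries from one modality attend to keys/values from the other,
    with raw dot-product scores in place of a softmax."""
    Q = x_query @ W_q            # (n, d)
    K = x_context @ W_k          # (m, d)
    V = x_context @ W_v          # (m, d)
    # Linear attention: no softmax; average over context tokens.
    return (Q @ K.T) @ V / K.shape[0]

# Toy multi-modal in-context setup (all sizes are arbitrary choices):
d = 8
x_a = rng.normal(size=(5, d))    # modality A tokens, e.g. text-like features
x_b = rng.normal(size=(7, d))    # modality B tokens, e.g. image-like features
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Multi-layer stacking: the output becomes the next layer's queries,
# while the second modality keeps supplying keys and values.
h = x_a
for _ in range(2):
    h = linear_cross_attention(h, x_b, W_q, W_k, W_v)

print(h.shape)
```

Because the softmax is removed, each layer is a purely linear map of its inputs, which is what makes this family of models analytically tractable while still letting one modality condition on the other.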