Multi-layer Cross-Attention Proves Optimal for Multi-modal In-context Learning
A new theoretical paper on arXiv (2602.04872) demonstrates that multi-layer cross-attention mechanisms are provably optimal for multi-modal in-context learning. The study introduces a mathematically tractable framework based on latent factor models to analyze transformer-like architectures. It proves that single-layer linear self-attention cannot achieve Bayes-optimal prediction uniformly across task distributions. To overcome this limitation, the authors propose a linearized cross-attention mechanism and show that stacking multiple cross-attention layers recovers Bayes-optimal performance in multi-modal settings. The work extends the theoretical understanding of in-context learning from unimodal to multi-modal data, providing a foundation for designing more effective multi-modal AI systems.
Key facts
- Paper on arXiv: 2602.04872
- Focuses on multi-modal in-context learning
- Uses latent factor model to represent multi-modal data
- Single-layer linear self-attention is not Bayes-optimal
- Proposes linearized cross-attention mechanism
- Multi-layer cross-attention achieves Bayes-optimal performance
- Extends theory from unimodal to multi-modal data
- Provides framework for studying multi-modal learning
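The mechanism at the heart of these results can be illustrated with a toy sketch. The paper's exact parameterization is not given here, so the dimensions, weight matrices, and normalization below are illustrative assumptions: queries come from one modality, keys and values from the other, and the softmax is dropped in favor of a plain linear form, with layers stacked by feeding each output back in as the next layer's queries.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_cross_attention(x_query, x_context, W_q, W_k, W_v):
    """One linearized cross-attention layer (illustrative sketch):
    queries from one modality attend to keys/values from the other,
    with raw dot-product scores in place of a softmax."""
    Q = x_query @ W_q            # (n, d)
    K = x_context @ W_k          # (m, d)
    V = x_context @ W_v          # (m, d)
    # Linear attention: no softmax; average over context tokens.
    return (Q @ K.T) @ V / K.shape[0]

# Toy multi-modal in-context setup (all sizes are arbitrary choices):
d = 8
x_a = rng.normal(size=(5, d))    # modality A tokens, e.g. text-like features
x_b = rng.normal(size=(7, d))    # modality B tokens, e.g. image-like features
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Multi-layer stacking: the output becomes the next layer's queries,
# while the second modality keeps supplying keys and values.
h = x_a
for _ in range(2):
    h = linear_cross_attention(h, x_b, W_q, W_k, W_v)

print(h.shape)
```

Because the softmax is removed, each layer is a purely linear map of its inputs, which is what makes this family of models analytically tractable while still letting one modality condition on the other.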