ML-FOP-SOAP: Second-Order Optimization for Multimodal Models
ML-FOP-SOAP is a new optimization framework that addresses modality competition in multimodal models trained with autoregressive next-token prediction. It introduces Multi-Level Variance Correction via Fisher-Orthogonal Projection to reduce conflicts between visual generation and text understanding. The method builds on second-order preconditioning (SOAP) to handle cross-modality gradient heterogeneity, to which first-order optimizers such as AdamW are vulnerable. A hierarchical folding strategy keeps large-batch training practical by capturing fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent improvements. The paper is available on arXiv (2605.16165).
Key facts
- ML-FOP-SOAP is a second-order optimization framework with Multi-Level Variance Correction
- It addresses modality competition in multimodal autoregressive models
- Fisher-Orthogonal Projection suppresses variance-induced modality conflicts
- First-order optimizers like AdamW are vulnerable to cross-modality gradient heterogeneity
- Second-order preconditioning (SOAP) provides a more stable basis for multimodal alignment
- A hierarchical folding strategy captures fine-grained variance with low micro-step overhead
- Experiments were conducted on Janus and Emu3 models
- The paper is published on arXiv with ID 2605.16165
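The summary names Fisher-Orthogonal Projection but does not give its formula, so the sketch below only illustrates the generic linear-algebra idea the name suggests: removing from one gradient-like vector its component along another, measured under a Fisher (metric) inner product. It is not the paper's implementation. The function name `fisher_orthogonal_component`, the toy gradients, and the 2x2 Fisher matrix are all hypothetical, and pairing a mean gradient with a difference gradient is an assumption made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    """Plain Euclidean dot product."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for a dense row-major matrix."""
    return [dot(row, v) for row in m]


def fisher_orthogonal_component(
    diff: Vector, mean: Vector, fisher: Matrix, eps: float = 1e-12
) -> Vector:
    """Return diff minus its projection onto mean under <u, v>_F = u^T F v.

    Assumes `fisher` is symmetric positive semi-definite. This is a generic
    metric projection, not the update rule from the paper.
    """
    f_mean = matvec(fisher, mean)
    denom = dot(mean, f_mean)  # <mean, mean>_F
    if denom <= eps:
        # Degenerate direction: nothing to project out.
        return list(diff)
    coeff = dot(diff, f_mean) / denom  # <diff, mean>_F / <mean, mean>_F
    return [d - coeff * m for d, m in zip(diff, mean)]


# Hypothetical toy gradients from two micro-batches (e.g. text vs. image).
g_text = [1.0, 0.0]
g_image = [0.0, 1.0]
mean = [(a + b) / 2 for a, b in zip(g_text, g_image)]
diff = [a - b for a, b in zip(g_text, g_image)]

# Hypothetical diagonal Fisher approximation.
fisher = [[2.0, 0.0], [0.0, 1.0]]

perp = fisher_orthogonal_component(diff, mean, fisher)
# The result should be orthogonal to `mean` in the Fisher inner product.
residual = dot(perp, matvec(fisher, mean))
```

By construction `residual` is zero up to floating-point error, which is the sense in which the remaining component is "Fisher-orthogonal" to the mean direction. How ML-FOP-SOAP combines such components across levels and with SOAP preconditioning is specified in the paper, not here.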
Entities
Institutions
- arXiv