ML-FOP-SOAP: Second-Order Optimization for Multimodal Models

other · 2026-05-18

A new optimization framework, ML-FOP-SOAP, addresses modality competition in multimodal models trained with autoregressive next-token prediction. The method introduces Multi-Level Variance Correction via Fisher-Orthogonal Projection to reduce conflicts between visual generation and text understanding. It builds on second-order preconditioning (SOAP) to handle cross-modality gradient heterogeneity, a problem that first-order optimizers such as AdamW handle poorly. A hierarchical folding strategy keeps the overhead low enough for practical large-batch training. The authors report consistent improvements in experiments on Janus and Emu3. The paper is available on arXiv (2605.16165).

Key facts

ML-FOP-SOAP is a second-order optimization framework with Multi-Level Variance Correction
It addresses modality competition in multimodal autoregressive models
Fisher-Orthogonal Projection suppresses variance-induced modality conflicts
First-order optimizers like AdamW are vulnerable to cross-modality gradient heterogeneity
Second-order preconditioning (SOAP) provides a more stable basis for multimodal alignment
A hierarchical folding strategy captures fine-grained variance with low micro-step overhead
Experiments were conducted on Janus and Emu3 models
The paper is published on arXiv with ID 2605.16165
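
The summary does not give the paper's formulas, so the following is only an illustrative sketch of what an orthogonal projection under a Fisher-weighted inner product can look like. It is not the ML-FOP-SOAP algorithm. The function names, the toy gradients `g_text` and `g_image`, and the diagonal Fisher proxy built from squared gradients are all assumptions made for this example.

```python
# Illustrative sketch only: a generic Fisher-weighted orthogonal projection,
# NOT the update rule from the ML-FOP-SOAP paper. All names are hypothetical.

def fisher_inner(u, v, fisher_diag):
    """Inner product <u, v>_F = sum_i u_i * F_ii * v_i for a diagonal Fisher F."""
    return sum(ui * fi * vi for ui, fi, vi in zip(u, fisher_diag, v))

def fisher_orthogonal_project(g, ref, fisher_diag, eps=1e-12):
    """Remove from g its component along ref, measured in the Fisher metric."""
    coef = fisher_inner(g, ref, fisher_diag) / (fisher_inner(ref, ref, fisher_diag) + eps)
    return [gi - coef * ri for gi, ri in zip(g, ref)]

# Toy per-modality gradients (assumed values, for illustration).
g_text = [0.5, -1.0, 2.0, 0.1]
g_image = [1.0, 0.5, -0.5, 2.0]

# Assumed diagonal Fisher proxy: mean of squared gradients plus a small floor.
fisher_diag = [0.5 * (a * a + b * b) + 1e-8 for a, b in zip(g_text, g_image)]

# Project the image gradient so it no longer overlaps the text gradient
# under the Fisher-weighted inner product.
g_image_proj = fisher_orthogonal_project(g_image, g_text, fisher_diag)
residual = fisher_inner(g_image_proj, g_text, fisher_diag)
print(abs(residual) < 1e-9)
```

After projection, the Fisher-weighted overlap between the two gradients is numerically zero, which is the sense in which one modality's update stops interfering with the other in this toy setting. How the paper estimates the Fisher information, applies the correction at multiple levels, and combines it with SOAP preconditioning is not stated in this summary.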
