ARTFEED — Contemporary Art Intelligence

ML-FOP-SOAP: Second-Order Optimization for Multimodal Models

other · 2026-05-18

ML-FOP-SOAP is a new optimization framework that targets modality competition in multimodal models trained with autoregressive next-token prediction. It introduces Multi-Level Variance Correction via Fisher-Orthogonal Projection to reduce conflicts between visual generation and text understanding. The method builds on second-order preconditioning (SOAP) to handle cross-modality gradient heterogeneity, which first-order optimizers such as AdamW handle poorly. A hierarchical folding strategy keeps overhead low enough for practical large-batch training. The authors report consistent improvements in experiments on Janus and Emu3. The paper is available on arXiv (2605.16165).

Key facts

  • ML-FOP-SOAP is a second-order optimization framework with Multi-Level Variance Correction
  • It addresses modality competition in multimodal autoregressive models
  • Fisher-Orthogonal Projection suppresses variance-induced modality conflicts
  • First-order optimizers like AdamW are vulnerable to cross-modality gradient heterogeneity
  • Second-order preconditioning (SOAP) provides a more stable basis for multimodal alignment
  • A hierarchical folding strategy captures fine-grained variance with low micro-step overhead
  • Experiments were conducted on Janus and Emu3 models
  • The paper is available on arXiv under ID 2605.16165
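
This summary does not give the paper's exact formulation, so the sketch below only illustrates the general idea behind a Fisher-orthogonal projection. Given two gradient estimates, it keeps the part of their difference that is orthogonal to their mean under a Fisher-weighted inner product. The diagonal Fisher approximation, the mixing weight `alpha`, and all function names are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def fisher_inner(u: Sequence[float], v: Sequence[float],
                 fisher_diag: Sequence[float]) -> float:
    """Inner product <u, v>_F = sum_i F_ii * u_i * v_i.

    Assumes a diagonal Fisher approximation (an illustrative simplification).
    """
    return sum(f * a * b for f, a, b in zip(fisher_diag, u, v))


def fisher_orthogonal_component(diff: Sequence[float], mean: Sequence[float],
                                fisher_diag: Sequence[float],
                                eps: float = 1e-12) -> Vector:
    """Remove from `diff` its component along `mean` in the Fisher metric."""
    denom = fisher_inner(mean, mean, fisher_diag)
    if denom <= eps:
        # The mean gradient is (numerically) zero, so there is nothing to project out.
        return list(diff)
    coeff = fisher_inner(diff, mean, fisher_diag) / denom
    return [d - coeff * m for d, m in zip(diff, mean)]


def combine_gradients(g_a: Sequence[float], g_b: Sequence[float],
                      fisher_diag: Sequence[float],
                      alpha: float = 0.5) -> Tuple[Vector, Vector, Vector]:
    """Hypothetical combination rule: mean gradient plus a scaled,
    Fisher-orthogonal part of the difference between the two estimates.

    Returns (combined, diff_perp, mean).
    """
    mean = [(a + b) / 2.0 for a, b in zip(g_a, g_b)]
    diff = [a - b for a, b in zip(g_a, g_b)]
    diff_perp = fisher_orthogonal_component(diff, mean, fisher_diag)
    combined = [m + alpha * d for m, d in zip(mean, diff_perp)]
    return combined, diff_perp, mean


if __name__ == "__main__":
    # Toy two-parameter example; g_a and g_b stand in for two gradient estimates
    # (for instance, from two micro-batches or two modalities).
    g_a, g_b = [1.0, 0.0], [0.0, 1.0]
    fisher_diag = [2.0, 1.0]
    combined, diff_perp, mean = combine_gradients(g_a, g_b, fisher_diag)
    print("orthogonality check:", fisher_inner(diff_perp, mean, fisher_diag))
```

Because the projected difference is orthogonal to the mean in the Fisher metric, adding it changes the update without rescaling the mean direction. In a full second-order method, a preconditioner such as SOAP would then be applied to the combined vector. That step is omitted here.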

Entities

Institutions

  • arXiv

Sources