StoSignSGD Algorithm Fixes SignSGD Divergence Issues for Large Language Model Training
A new optimization algorithm called StoSignSGD has been developed to address the divergence problems of SignSGD when training large foundation models. Sign-based methods such as SignSGD have shown impressive results in distributed learning and in training large language models, but they can fail to converge on the non-smooth objectives common in modern architectures. Such non-smoothness arises from components like ReLUs, max-pooling layers, and mixture-of-experts routing.

StoSignSGD injects structural stochasticity into the sign operator while preserving an unbiased update step; a minimal sketch of one such randomization is given below. Theoretical analysis shows that StoSignSGD achieves a sharp convergence rate matching the known lower bound for (online) convex optimization; the standard form of that bound is recalled after the code sketch. For the harder non-convex non-smooth setting, the researchers introduce generalized stationarity measures that subsume earlier definitions and prove convergence guarantees for StoSignSGD with respect to them.

The algorithm specifically targets the training of large language models, where earlier sign-based methods struggled to converge, and it addresses a fundamental limitation of optimization techniques for contemporary machine learning systems. The work was announced on arXiv under identifier 2604.15416v1 as a cross-listed submission.
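The announcement does not describe how the stochasticity is injected, so the following is a minimal sketch assuming one standard construction from the gradient-compression literature, 1-bit stochastic quantization: each coordinate is rounded to plus or minus one with probabilities chosen so that the output is unbiased up to a known scale factor. The helper names stochastic_sign and sto_signsgd_step are illustrative, not taken from the paper.

```python
import numpy as np

def stochastic_sign(g, scale, rng=np.random.default_rng()):
    """Round each coordinate of g to +/-1 with
    P(+1) = (1 + g_i / scale) / 2, so that
    E[output_i] = g_i / scale, i.e. unbiased up to the factor 1/scale.
    Coordinates are clipped, so exact unbiasedness needs |g_i| <= scale."""
    g = np.asarray(g, dtype=float)
    p_plus = 0.5 * (1.0 + np.clip(g / scale, -1.0, 1.0))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

def sto_signsgd_step(x, grad, lr, scale):
    """One StoSignSGD-style update: step along the randomized sign.
    In expectation this moves along -lr * grad / scale, whereas the
    deterministic sign(grad) can be a biased direction estimate."""
    return x - lr * stochastic_sign(grad, scale)
```

Unbiasedness is the property the summary emphasizes: in expectation the randomized sign behaves like a rescaled stochastic gradient, which the deterministic sign does not on noisy or non-smooth objectives.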
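For context on "matching the lower bound": the classical minimax regret for online convex optimization with G-Lipschitz convex losses over a domain of diameter D takes the form below. The announcement does not state the paper's exact constants or assumptions, so this is the textbook statement.

```latex
% Minimax regret for online convex optimization:
% no algorithm can beat \Omega(G D \sqrt{T}) in the worst case,
% so an O(G D \sqrt{T}) upper bound is "sharp".
\[
  \mathrm{Regret}_T
  = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)
  = \Theta\!\bigl(G D \sqrt{T}\bigr).
\]
```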
Key facts
- StoSignSGD is a new optimization algorithm that fixes SignSGD divergence issues
- SignSGD has shown remarkable performance in distributed learning and training large foundation models
- SignSGD diverges on non-smooth objectives common in modern machine learning (a toy illustration of this bias appears after this list)
- Non-smooth objectives come from ReLUs, max-pooling, and mixture-of-experts systems
- StoSignSGD injects structural stochasticity while maintaining unbiased updates
- Theoretical analysis shows StoSignSGD achieves a sharp convergence rate matching the lower bound for online convex optimization (see the regret bound above)
- Researchers introduced generalized stationarity measures for non-convex non-smooth optimization that subsume earlier definitions
- The algorithm specifically targets training of large language models
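As the toy illustration promised in the list above, the script below compares deterministic SignSGD with the stochastic sign on a one-dimensional quadratic whose gradient noise is zero-mean but skewed, so that the sign of the noisy gradient points the wrong way most of the time near the optimum. It reuses the stochastic_sign helper from the earlier sketch; the noise parameters are a hypothetical counterexample in the spirit of the sign-SGD literature, not one taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Stochastic gradient of f(x) = x^2 / 2 with zero-mean noise
    # (+9 w.p. 0.1, -1 w.p. 0.9). The mean is exactly x, but the
    # median is x - 1, so sign(noisy_grad) is biased near x = 0.
    noise = 9.0 if rng.random() < 0.1 else -1.0
    return x + noise

lr, scale, steps = 0.01, 20.0, 50_000
x_sign = x_sto = 5.0
for _ in range(steps):
    x_sign -= lr * np.sign(noisy_grad(x_sign))                 # deterministic sign
    x_sto -= lr * stochastic_sign(noisy_grad(x_sto), scale)    # unbiased sign (defined above)

print(f"SignSGD stalls near x = {x_sign:.2f} (the optimum is x = 0)")
print(f"Stochastic sign hovers near x = {x_sto:.2f}")
```

With these hypothetical noise parameters, deterministic SignSGD equilibrates where the gradient's median, not its mean, vanishes (around x = 1), while the unbiased stochastic sign keeps drifting toward the true optimum at x = 0.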
Entities
Institutions
- arXiv