StoSignSGD Algorithm Fixes SignSGD Divergence Issues for Large Language Model Training
A new optimization algorithm called StoSignSGD has been developed to address the divergence problems of SignSGD when training large foundation models. Sign-based methods such as SignSGD have shown impressive results in distributed learning and in training large language models, but they can fail to converge on the non-smooth objectives common in modern architectures. Such non-smoothness arises from components like ReLUs, max-pooling layers, and mixture-of-experts routing.

StoSignSGD injects structural stochasticity into the sign operator while preserving an unbiased update step; a minimal sketch of one such randomization is given below. Theoretical analysis shows that StoSignSGD achieves a sharp convergence rate matching the known lower bound for (online) convex optimization; the standard form of that bound is recalled after the code sketch. For the harder non-convex non-smooth setting, the researchers introduce generalized stationarity measures that subsume earlier definitions and prove convergence guarantees for StoSignSGD with respect to them.

The algorithm specifically targets the training of large language models, where earlier sign-based methods struggled to converge, and it addresses a fundamental limitation of optimization techniques for contemporary machine learning systems. The work was announced on arXiv under identifier 2604.15416v1 as a cross-listed submission.
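The announcement does not describe how the stochasticity is injected, so the following is a minimal sketch assuming one standard construction from the gradient-compression literature, 1-bit stochastic quantization: each coordinate is rounded to plus or minus one with probabilities chosen so that the output is unbiased up to a known scale factor. The helper names stochastic_sign and sto_signsgd_step are illustrative, not taken from the paper.

```python
import numpy as np

def stochastic_sign(g, scale, rng=np.random.default_rng()):
    """Round each coordinate of g to +/-1 with
    P(+1) = (1 + g_i / scale) / 2, so that
    E[output_i] = g_i / scale, i.e. unbiased up to the factor 1/scale.
    Coordinates are clipped, so exact unbiasedness needs |g_i| <= scale."""
    g = np.asarray(g, dtype=float)
    p_plus = 0.5 * (1.0 + np.clip(g / scale, -1.0, 1.0))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

def sto_signsgd_step(x, grad, lr, scale):
    """One StoSignSGD-style update: step along the randomized sign.
    In expectation this moves along -lr * grad / scale, whereas the
    deterministic sign(grad) can be a biased direction estimate."""
    return x - lr * stochastic_sign(grad, scale)
```

Unbiasedness is the property the summary emphasizes: in expectation the randomized sign behaves like a rescaled stochastic gradient, which the deterministic sign does not on noisy or non-smooth objectives.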
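For context on "matching the lower bound": the classical minimax regret for online convex optimization with G-Lipschitz convex losses over a domain of diameter D takes the form below. The announcement does not state the paper's exact constants or assumptions, so this is the textbook statement.

```latex
% Minimax regret for online convex optimization:
% no algorithm can beat \Omega(G D \sqrt{T}) in the worst case,
% so an O(G D \sqrt{T}) upper bound is "sharp".
\[
  \mathrm{Regret}_T
  = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)
  = \Theta\!\bigl(G D \sqrt{T}\bigr).
\]
```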
Key facts
- StoSignSGD is a new optimization algorithm that fixes SignSGD divergence issues
- SignSGD has shown remarkable performance in distributed learning and training large foundation models
- SignSGD diverges on non-smooth objectives common in modern machine learning (a toy illustration of this bias appears after this list)
- Non-smooth objectives come from ReLUs, max-pooling, and mixture-of-experts systems
- StoSignSGD injects structural stochasticity while maintaining unbiased updates
- Theoretical analysis shows StoSignSGD achieves a sharp convergence rate matching the lower bound for online convex optimization (see the regret bound above)
- Researchers introduced generalized stationarity measures for non-convex non-smooth optimization that subsume earlier definitions
- The algorithm specifically targets training of large language models
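As the toy illustration promised in the list above, the script below compares deterministic SignSGD with the stochastic sign on a one-dimensional quadratic whose gradient noise is zero-mean but skewed, so that the sign of the noisy gradient points the wrong way most of the time near the optimum. It reuses the stochastic_sign helper from the earlier sketch; the noise parameters are a hypothetical counterexample in the spirit of the sign-SGD literature, not one taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Stochastic gradient of f(x) = x^2 / 2 with zero-mean noise
    # (+9 w.p. 0.1, -1 w.p. 0.9). The mean is exactly x, but the
    # median is x - 1, so sign(noisy_grad) is biased near x = 0.
    noise = 9.0 if rng.random() < 0.1 else -1.0
    return x + noise

lr, scale, steps = 0.01, 20.0, 50_000
x_sign = x_sto = 5.0
for _ in range(steps):
    x_sign -= lr * np.sign(noisy_grad(x_sign))                 # deterministic sign
    x_sto -= lr * stochastic_sign(noisy_grad(x_sto), scale)    # unbiased sign (defined above)

print(f"SignSGD stalls near x = {x_sign:.2f} (the optimum is x = 0)")
print(f"Stochastic sign hovers near x = {x_sto:.2f}")
```

With these hypothetical noise parameters, deterministic SignSGD equilibrates where the gradient's median, not its mean, vanishes (around x = 1), while the unbiased stochastic sign keeps drifting toward the true optimum at x = 0.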
Entities
Institutions
- arXiv