Multi-Gate Residuals: Stabilizing Deep Networks Without Communication Overhead
A new machine learning method called Multi-Gate Residuals (MGR) addresses the problem of unbounded activation growth in deep residual networks. Unlike Attention Residuals, which reduce this issue but introduce significant communication overhead, MGR stabilizes activation scales without extra communication cost. It uses a scoring and gating mechanism to maintain multi-stream context and Attention Pooling to extract hidden states. Empirical experiments show MGR is practical for large-scale training and deployment, offering performance improvements over existing architectures.
Key facts
- Multi-Gate Residuals (MGR) is proposed to stabilize activation scales.
- MGR avoids the communication overhead of Attention Residuals.
- MGR uses a scoring and gating mechanism for multi-stream context.
- Attention Pooling extracts hidden states from stream states.
- Experiments show MGR is practical for large-scale training and deployment.
- MGR offers tangible performance improvements over existing architectures.
Entities
Institutions
- arXiv