Multi-Gate Residuals: Stabilizing Deep Networks Without Communication Overhead

other · 2026-05-25

A new machine learning method called Multi-Gate Residuals (MGR) addresses the problem of unbounded activation growth in deep residual networks. Unlike Attention Residuals, which reduce this issue but introduce significant communication overhead, MGR stabilizes activation scales without extra communication cost. It uses a scoring and gating mechanism to maintain multi-stream context and Attention Pooling to extract hidden states. Empirical experiments show MGR is practical for large-scale training and deployment, offering performance improvements over existing architectures.

Key facts

Multi-Gate Residuals (MGR) is proposed to stabilize activation scales.
MGR avoids the communication overhead of Attention Residuals.
MGR uses a scoring and gating mechanism for multi-stream context.
Attention Pooling extracts hidden states from stream states.
Experiments show MGR is practical for large-scale training and deployment.
MGR offers tangible performance improvements over existing architectures.

Multi-Gate Residuals: Stabilizing Deep Networks Without Communication Overhead

Key facts

Entities

Institutions

Sources