TBP-mHC: Full Expressivity for Manifold-Constrained Hyper Connections via Transportation Polytopes
The Transportation Birkhoff Polytope (TBP) parameterization and its recursive variant (RTBP) are introduced to mitigate training instability in Hyper-Connections (HC) for residual networks. HC improves residual networks by allowing learnable mixing across multiple residual streams, but unconstrained mixing makes training unstable. Manifold-Constrained Hyper-Connections (mHC) enforce approximate double stochasticity through Sinkhorn normalization, whereas mHC-lite satisfies the constraints exactly by using convex combinations of permutation matrices, at factorial cost. KromHC reduces that cost with Kronecker-product parameterizations but restricts mixing matrices to a structured submanifold of the Birkhoff polytope. TBP and RTBP construct exactly doubly stochastic n x n mixing matrices with (n-1)^2 degrees of freedom, avoiding both iterative normalization and combinatorial explosion while preserving the full expressivity of the Birkhoff polytope. Empirical results on language tasks are reported in support of the approach.
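The contrast between the two baseline constraints can be made concrete with a small sketch. It illustrates only the generic techniques named above (Sinkhorn normalization and a convex combination of permutation matrices), not the TBP or RTBP construction, which this summary does not specify. The function names, the softmax weighting, and the iteration count are assumptions chosen for illustration.

```python
import itertools
import math


def sinkhorn(logits, iters=20):
    """Alternately normalize rows and columns of exp(logits).

    After a finite number of iterations the result is only approximately
    doubly stochastic: the last step fixes the columns, the rows drift.
    """
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        # Row step: scale each row so it sums to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column step: scale each column so it sums to 1.
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m


def permutation_mixture(weight_logits, n):
    """Softmax-weighted convex combination of all n! permutation matrices.

    Doubly stochastic up to floating-point rounding, but the number of
    weights grows as n!, which is the factorial cost mentioned above.
    """
    perms = list(itertools.permutations(range(n)))
    if len(weight_logits) != len(perms):
        raise ValueError("need one weight per permutation (n! in total)")
    top = max(weight_logits)
    w = [math.exp(x - top) for x in weight_logits]  # stable softmax
    total = sum(w)
    m = [[0.0] * n for _ in range(n)]
    for wk, p in zip(w, perms):
        for i, j in enumerate(p):
            m[i][j] += wk / total  # permutation p sends row i to column j
    return m


def max_marginal_error(m):
    """Largest deviation of any row or column sum from 1."""
    n = len(m)
    rows = [abs(sum(m[i]) - 1.0) for i in range(n)]
    cols = [abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n)]
    return max(rows + cols)
```

Calling `max_marginal_error(sinkhorn(logits, iters=k))` for increasing `k` shows the approximation tightening, while `permutation_mixture` needs 6 weights at n = 3 and 24 at n = 4.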
Key facts
- TBP and RTBP parameterizations construct exactly doubly stochastic mixing matrices.
- They have (n-1)^2 degrees of freedom, which is the dimension of the n x n Birkhoff polytope.
- The approach avoids iterative normalization and combinatorial explosion.
- It preserves full expressivity of the Birkhoff polytope.
- Empirical results on language tasks are reported.
- Hyper-Connections improve residual networks via learnable mixing across multiple residual streams.
- Unconstrained mixing leads to training instability.
- mHC enforces approximate double stochasticity via Sinkhorn normalization.
- mHC-lite ensures exact constraints via convex combinations of permutation matrices at factorial cost.
- KromHC uses Kronecker-product parameterizations but restricts to a structured submanifold.
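The gap between a Kronecker-structured parameterization and the full polytope can be seen by counting parameters. The sketch below assumes 2 x 2 factors purely for illustration. The factor sizes KromHC actually uses are not stated here, and `ds2`, `kron`, and `birkhoff_dim` are hypothetical helper names.

```python
import math


def ds2(theta):
    """2 x 2 doubly stochastic matrix with a single free parameter."""
    a = 1.0 / (1.0 + math.exp(-theta))  # sigmoid keeps a in (0, 1)
    return [[a, 1.0 - a], [1.0 - a, a]]


def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    na, nb = len(a), len(b)
    size = na * nb
    return [[a[i // nb][j // nb] * b[i % nb][j % nb] for j in range(size)]
            for i in range(size)]


def birkhoff_dim(n):
    """n^2 entries minus the 2n - 1 independent row/column sum constraints."""
    return n * n - (2 * n - 1)


# A 4 x 4 mixing matrix built from two 2 x 2 factors (assumed sizes).
m = kron(ds2(0.3), ds2(-1.2))
row_sums = [sum(row) for row in m]
col_sums = [sum(m[i][j] for i in range(4)) for j in range(4)]

kron_free_params = 2             # one parameter per 2 x 2 factor
full_free_params = birkhoff_dim(4)  # equals (4 - 1) ** 2
```

The product is doubly stochastic because its row and column sums are products of the factors' sums, yet in this example it is described by 2 parameters where the full 4 x 4 polytope has 9 degrees of freedom.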