Gated DeltaNet-2 Decouples Erase and Write in Linear Attention
A recent arXiv paper presents Gated DeltaNet-2, a linear attention mechanism that decouples erasing from writing by giving each operation its own channel-wise gate. Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, which reduces sequence mixing to linear time and decoding to constant memory. The difficulty lies in editing this compact memory without disturbing the associations it already stores. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. In both cases, however, the active edit still relies on a single scalar gate for erasing and writing alike. Gated DeltaNet-2 generalizes Gated DeltaNet and KDA by splitting the two operations between a channel-wise erase gate b_t and a separate channel-wise write gate.
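The memory trade-off between softmax attention and linear attention can be illustrated with a minimal NumPy sketch. It is not code from the paper: the function names, dimensions, and the plain ungated update are assumptions chosen only to show that the softmax cache grows with every decoded token while the recurrent state keeps a fixed shape.

```python
import numpy as np

def softmax_decode_step(cache_k, cache_v, q, k, v):
    """One softmax-attention decode step; the cache grows by one entry per token."""
    cache_k.append(k)
    cache_v.append(v)
    K = np.stack(cache_k)                 # (t, d_k)
    V = np.stack(cache_v)                 # (t, d_v)
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # (d_v,)

def linear_decode_step(S, q, k, v):
    """One ungated linear-attention decode step; S keeps shape (d_v, d_k)."""
    S = S + np.outer(v, k)                # write the new key/value association
    return S, S @ q                       # read the state with the query

d_k, d_v, T = 4, 3, 16                    # illustrative sizes, not from the paper
rng = np.random.default_rng(0)
cache_k, cache_v = [], []
S = np.zeros((d_v, d_k))
for _ in range(T):
    q = rng.normal(size=d_k)
    k = rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    softmax_decode_step(cache_k, cache_v, q, k, v)
    S, _ = linear_decode_step(S, q, k, v)

# The cache holds T entries, the state is still (d_v, d_k).
print(len(cache_k), S.shape)
```

After 16 steps the cache holds 16 key/value pairs, whereas the state is still a single 3 by 4 matrix. Because every association is summed into that one matrix, the way it is edited matters, which is the problem the delta rule and its gates address.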
Key facts
- Gated DeltaNet-2 is introduced in arXiv paper 2605.22791.
- Linear attention replaces softmax attention's unbounded cache with a fixed-size recurrent state.
- It reduces sequence mixing to linear time and decoding to constant memory.
- Delta-rule models subtract the current read before writing a new value.
- Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay.
- Previous models used a single scalar gate to control both erasing and writing.
- Gated DeltaNet-2 separates erasing and writing with a channel-wise erase gate b_t and a channel-wise write gate.
- It generalizes both Gated DeltaNet and KDA.
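The difference between a shared scalar gate and decoupled channel-wise gates can be sketched as follows. This is a hedged illustration rather than the paper's update rule: the exact Gated DeltaNet-2 equations are not given in this summary, so `delta_step_decoupled`, its `erase` and `write` arguments, and the choice to apply the gates over key channels are all assumptions, and the decay term is left out for brevity.

```python
import numpy as np

def delta_step_scalar(S, k, v, beta):
    """Delta rule with one scalar gate beta shared by erase and write.

    S has shape (d_v, d_k); k is assumed to be unit-norm.
    """
    read = S @ k                           # value currently stored under key k
    return S - beta * np.outer(read, k) + beta * np.outer(v, k)

def delta_step_decoupled(S, k, v, erase, write):
    """Hypothetical decoupled variant (an assumption, not the paper's formula).

    erase and write are per-key-channel gates of shape (d_k,).
    """
    read = S @ k
    return S - np.outer(read, erase * k) + np.outer(v, write * k)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                            # illustrative sizes
S = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                     # unit-norm key
v = rng.normal(size=d_v)

# Setting both gates to the same constant recovers the scalar-gated update.
beta = 0.7
same = np.allclose(
    delta_step_scalar(S, k, v, beta),
    delta_step_decoupled(S, k, v, np.full(d_k, beta), np.full(d_k, beta)),
)

# Full erase with no write clears what was stored under k without adding v.
S_erased = delta_step_decoupled(S, k, v, np.ones(d_k), np.zeros(d_k))
erase_only = np.allclose(S_erased @ k, 0.0)
print(same, erase_only)
```

In this sketch, tying both gates to one constant reproduces the scalar update, which mirrors the sense in which a decoupled rule can generalize a single-gate one. The erase-only case shows an edit a shared scalar gate cannot express, since there any erasure also writes a proportional share of the new value.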
Entities
Institutions
- arXiv