ARTFEED — Contemporary Art Intelligence

Adaptive Memory Decay for Log-Linear Attention

ai-technology · 2026-05-11

Researchers propose learning the memory decay parameter in log-linear attention models directly from the input, replacing the fixed, content-independent parameter used previously. A lightweight two-layer MLP with a softplus activation produces a per-token, per-level decay, so each level of the Fenwick tree hierarchy can scale independently. This removes a rigidity of the original log-linear attention, which assigned uniform decay weights across hierarchy levels regardless of content, and aims to improve the tradeoff between memory capacity and computational efficiency in sequence models. The work builds on the log-linear attention architecture, which organizes memory across a Fenwick tree hierarchy at log-linear compute cost.
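
Since the mechanism is simple, a sketch may make it concrete. Below is a minimal, hypothetical PyTorch rendering of per-token, per-level learned decay; the name DecayMLP, the hidden width, the SiLU hidden activation, and the exp(-softplus) mapping into (0, 1) are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecayMLP(nn.Module):
    """Maps each token's hidden state to one decay value per Fenwick level.

    A hypothetical sketch of the learned-decay idea; not the authors' code.
    """

    def __init__(self, d_model: int, num_levels: int, d_hidden: int = 64):
        super().__init__()
        # Two-layer MLP, as described in the summary above.
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, num_levels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = F.silu(self.fc1(x))  # hidden activation is an assumption
        # Softplus keeps each level's rate positive, letting every
        # Fenwick level scale independently of the others.
        rate = F.softplus(self.fc2(h))  # (batch, seq_len, num_levels)
        # Mapping the positive rate to a decay factor in (0, 1) via
        # exp(-rate) is one common parameterization, assumed here.
        return torch.exp(-rate)
```

Calling this module on a (batch, seq_len, d_model) hidden-state tensor yields a (batch, seq_len, num_levels) tensor of decay factors, one λ per token per hierarchy level.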

Key facts

  • Log-linear attention uses a Fenwick tree hierarchy for memory organization (see the sketch after this list).
  • The memory decay parameter λ was previously fixed and input-independent.
  • The proposed method learns λ via a two-layer MLP.
  • Softplus activation enables independent scaling per Fenwick tree level.
  • The approach produces per-token, per-level decay.
  • It addresses rigidity in the original log-linear attention model.
  • The work is published on arXiv with ID 2605.06946.
  • The method aims to improve the memory-efficiency tradeoff in sequence models.
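
For the Fenwick tree fact above, a short sketch of the standard Fenwick (binary indexed tree) prefix decomposition shows how a prefix of length t splits into O(log t) power-of-two buckets, one per hierarchy level. The function below is illustrative of that structure, not taken from the paper.

```python
def fenwick_buckets(t: int) -> list[tuple[int, int]]:
    """Return the half-open bucket ranges covering the prefix [0, t).

    Standard Fenwick decomposition: bucket sizes are the set bits of t,
    so there are O(log t) buckets, one candidate summary per level.
    """
    buckets = []
    while t > 0:
        low = t & (-t)                # size of this bucket (lowest set bit)
        buckets.append((t - low, t))  # half-open range [t - low, t)
        t -= low
    return buckets

# Example: a prefix of length 13 (binary 1101) splits into three
# buckets of sizes 1, 4, and 8, i.e. one state per occupied level.
assert fenwick_buckets(13) == [(12, 13), (8, 12), (0, 8)]
```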

Entities

Institutions

  • arXiv

Sources