Hidden Failure of Gradient Modification Under Adam in Continual Learning
A recent arXiv preprint (2604.22407) reports that gradient-modification methods, such as penalty rescaling, projection, and replay mixing, break down when combined with the Adam optimizer in continual learning. On an 8-domain continual language-modeling stream (lower scores are better), all shared-routing projection baselines collapsed to near the vanilla-forgetting level (12.5–12.8 vs. 13.2). A 0.5% replay buffer was the strongest shared alternative at 11.6, while fixed-strength decoupling underperformed even vanilla at 14.1. Only adaptive decoupled routing stayed stable, scoring 9.4 and beating vanilla by 3.8 units; on a 16-domain stream, its margin over the top shared-routing projection baseline widened to 4.5–4.8 units. The failure is largely invisible on clean benchmarks. The paper traces it to Adam's second-moment pathway: projection shrinks the gradient, which shrinks the second-moment estimate and thereby inflates the effective learning rate in old directions by a factor of 1/(1 − α).
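To make the mechanism concrete, here is a minimal NumPy sketch (mine, not the paper's code) in which projection onto old-task directions is modeled, as a simplifying assumption, as attenuating the gradient by (1 − α). Adam's second moment then shrinks by (1 − α)², so its preconditioner inflates by 1/(1 − α) and the update in old directions comes out nearly as large as with no projection at all:

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update; returns (update, m, v)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
d, steps = 1000, 5000
alpha = 0.9                            # assumed fraction of gradient removed in old directions

m_raw, v_raw = np.zeros(d), np.zeros(d)
m_prj, v_prj = np.zeros(d), np.zeros(d)
for t in range(1, steps + 1):
    g = 1.0 + 0.1 * rng.normal(size=d)     # toy gradient with a persistent component
    g_prj = (1 - alpha) * g                # "projection" modeled as attenuation by (1 - alpha)
    upd_raw, m_raw, v_raw = adam_step(g, m_raw, v_raw, t)
    upd_prj, m_prj, v_prj = adam_step(g_prj, m_prj, v_prj, t)

# The second moment shrinks by (1 - alpha)^2, so lr / sqrt(v_hat) inflates
# by 1/(1 - alpha) and cancels the attenuation: the realized step barely changes.
print("effective-LR inflation:", np.sqrt(v_raw / v_prj).mean())                  # ~1/(1-alpha) = 10
print("step ratio projected/raw:", (np.abs(upd_prj) / np.abs(upd_raw)).mean())   # ~1.0
```

Under plain SGD the same attenuation would shrink the step by the full factor of 10, which is why the problem is specific to Adam's second-moment normalization.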
Key facts
- Gradient modification methods fail under Adam in continual learning
- 8-domain continual LM stream: shared-routing projection baselines collapse (12.5–12.8 vs. 13.2 vanilla; lower is better)
- 0.5% replay buffer strongest shared alternative at 11.6
- Fixed-strength decoupling underperforms vanilla (14.1 vs. 13.2)
- Adaptive decoupled routing stays stable at 9.4, beating vanilla by 3.8 units
- On a 16-domain stream, the gain over the top shared-routing projection baseline grows to 4.5–4.8 units
- Failure invisible on clean benchmarks
- Adam's second-moment pathway inflates the effective learning rate in old directions by 1/(1 − α) (see the derivation sketched after this list)
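As a back-of-the-envelope check on that factor, here is a one-line derivation under an assumption of mine, not spelled out in the summary above: that projection attenuates the gradient along old directions by (1 − α).

```latex
% Adam's step is \eta \hat m_t / (\sqrt{\hat v_t} + \epsilon).
% If projection attenuates the old-direction gradient, \tilde g = (1-\alpha)\, g, then
\hat v^{\text{proj}}_t \approx (1-\alpha)^2\, \hat v_t
\quad\Longrightarrow\quad
\frac{\eta}{\sqrt{\hat v^{\text{proj}}_t}} \approx \frac{1}{1-\alpha}\cdot\frac{\eta}{\sqrt{\hat v_t}}
% i.e. the effective learning rate in old directions inflates by 1/(1-\alpha),
% cancelling exactly the attenuation the projection was meant to impose.
```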
Entities
Institutions
- arXiv