TFGN: Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Researchers have introduced TFGN, an architectural overlay for transformer language models that enables continual pre-training without replay buffers, task identifiers, or regularization penalties. TFGN was evaluated on six text domains (Prose, Python, Math, Biomedical, Chinese, and JavaScript), with 1 billion tokens per phase, across three model scales (~398M, ~739M, ~9B) and two training regimes (From-Scratch and Retrofit). Reported results include a backward transfer of -0.007 for the LLaMA 3.1 8B Retrofit, HellaSwag retention of 0.506/0.504/0.510, and at least 99.59% L2-orthogonal gradient separation between domain pairs. The method provides input-dependent updates while preserving the transformer's overall structure, addressing catastrophic forgetting in large language models.
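The summary does not state how backward transfer is computed. The sketch below uses the standard continual-learning definition (the average change in per-domain score after the final phase relative to the score measured right after that domain's own phase); whether TFGN uses exactly this formula is an assumption, and the scores shown are illustrative only.

```python
# Hedged sketch: backward transfer (BWT) as commonly defined in continual
# learning. Whether the TFGN paper uses exactly this formula is an
# assumption; the accuracy values below are illustrative, not measurements.

def backward_transfer(acc: list[list[float]]) -> float:
    """acc[t][i] = evaluation score on domain i after finishing training phase t."""
    T = len(acc)
    # Compare final performance on each earlier domain with the performance
    # measured immediately after that domain's own training phase.
    diffs = [acc[T - 1][i] - acc[i][i] for i in range(T - 1)]
    return sum(diffs) / len(diffs)

# Illustrative 3-domain example (hypothetical scores):
acc = [
    [0.62, 0.00, 0.00],  # after phase 0 (e.g. Prose)
    [0.61, 0.58, 0.00],  # after phase 1 (e.g. Python)
    [0.61, 0.57, 0.55],  # after phase 2 (e.g. Math)
]
print(backward_transfer(acc))  # small negative value => mild forgetting
```

A value near zero, as reported for the 8B Retrofit, indicates that later phases barely degrade earlier-domain performance.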
Key facts
- TFGN is an architectural overlay for transformer language models.
- It enables continual pre-training without replay buffers, task identifiers, or regularization penalties.
- Tested on six domains: Prose, Python, Math, Biomedical, Chinese, JavaScript.
- 1B tokens per phase across three model scales: ~398M, ~739M, ~9B.
- Two regimes: From-Scratch and Retrofit.
- Backward transfer of -0.007 for LLaMA 3.1 8B Retrofit.
- HellaSwag retention: 0.506/0.504/0.510.
- At least 99.59% L2-orthogonal gradient separation between domain pairs (illustrated in the sketch below).
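The exact procedure behind the gradient-separation figure is not given in this summary. A plausible reading is the share of domain pairs whose gradients are near-orthogonal under the L2 inner product; the sketch below measures pairwise cosine similarity between flattened per-domain gradients and reports that share. The function names, threshold, and usage are assumptions for illustration, not the paper's method.

```python
# Hedged sketch: pairwise gradient orthogonality between domains.
# One plausible way to quantify "L2-orthogonal gradient separation";
# the paper's exact procedure, threshold, and granularity are not stated
# in this summary, so everything here is illustrative.
from itertools import combinations

import torch


def flat_grad(model: torch.nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Flatten gradients of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def orthogonal_fraction(domain_grads: dict[str, torch.Tensor], tol: float = 0.01) -> float:
    """Fraction of domain pairs whose gradient cosine similarity lies within +/- tol of zero."""
    pairs = list(combinations(domain_grads, 2))
    hits = 0
    for a, b in pairs:
        cos = torch.nn.functional.cosine_similarity(domain_grads[a], domain_grads[b], dim=0)
        hits += int(abs(cos.item()) <= tol)
    return hits / len(pairs)


# Hypothetical usage: compute one gradient vector per domain batch, then compare.
# domain_grads = {name: flat_grad(model, loss_fn(model, batch)) for name, batch in batches.items()}
# print(orthogonal_fraction(domain_grads))
```

Under this reading, a value of 0.9959 or higher would mean that nearly all domain pairs push the parameters in mutually orthogonal directions, which is consistent with the low forgetting reported above.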
Entities
Institutions
- arXiv