TFGN: Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Researchers have introduced TFGN, an architectural overlay for transformer language models that enables continual pre-training without replay buffers, task identifiers, or regularization penalties. TFGN was evaluated on six text domains (Prose, Python, Math, Biomedical, Chinese, and JavaScript), with 1 billion tokens per phase, across three model scales (~398M, ~739M, ~9B) and two training regimes (From-Scratch and Retrofit). Reported results include a backward transfer of -0.007 for the LLaMA 3.1 8B Retrofit, HellaSwag retention of 0.506/0.504/0.510, and at least 99.59% L2-orthogonal gradient separation between domain pairs. The method provides input-dependent updates while preserving the transformer's overall structure, addressing catastrophic forgetting in large language models.
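The summary does not state how backward transfer is computed. The sketch below uses the standard continual-learning definition (the average change in per-domain score after the final phase relative to the score measured right after that domain's own phase); whether TFGN uses exactly this formula is an assumption, and the scores shown are illustrative only.

```python
# Hedged sketch: backward transfer (BWT) as commonly defined in continual
# learning. Whether the TFGN paper uses exactly this formula is an
# assumption; the accuracy values below are illustrative, not measurements.

def backward_transfer(acc: list[list[float]]) -> float:
    """acc[t][i] = evaluation score on domain i after finishing training phase t."""
    T = len(acc)
    # Compare final performance on each earlier domain with the performance
    # measured immediately after that domain's own training phase.
    diffs = [acc[T - 1][i] - acc[i][i] for i in range(T - 1)]
    return sum(diffs) / len(diffs)

# Illustrative 3-domain example (hypothetical scores):
acc = [
    [0.62, 0.00, 0.00],  # after phase 0 (e.g. Prose)
    [0.61, 0.58, 0.00],  # after phase 1 (e.g. Python)
    [0.61, 0.57, 0.55],  # after phase 2 (e.g. Math)
]
print(backward_transfer(acc))  # small negative value => mild forgetting
```

A value near zero, as reported for the 8B Retrofit, indicates that later phases barely degrade earlier-domain performance.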
Key facts
- TFGN is an architectural overlay for transformer language models.
- It enables continual pre-training without replay buffers, task identifiers, or regularization penalties.
- Tested on six domains: Prose, Python, Math, Biomedical, Chinese, JavaScript.
- 1B tokens per phase across three model scales: ~398M, ~739M, ~9B.
- Two regimes: From-Scratch and Retrofit.
- Backward transfer of -0.007 for LLaMA 3.1 8B Retrofit.
- HellaSwag retention: 0.506/0.504/0.510.
- At least 99.59% L2-orthogonal gradient separation between domain pairs (illustrated in the sketch below).
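The exact procedure behind the gradient-separation figure is not given in this summary. A plausible reading is the share of domain pairs whose gradients are near-orthogonal under the L2 inner product; the sketch below measures pairwise cosine similarity between flattened per-domain gradients and reports that share. The function names, threshold, and usage are assumptions for illustration, not the paper's method.

```python
# Hedged sketch: pairwise gradient orthogonality between domains.
# One plausible way to quantify "L2-orthogonal gradient separation";
# the paper's exact procedure, threshold, and granularity are not stated
# in this summary, so everything here is illustrative.
from itertools import combinations

import torch


def flat_grad(model: torch.nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Flatten gradients of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def orthogonal_fraction(domain_grads: dict[str, torch.Tensor], tol: float = 0.01) -> float:
    """Fraction of domain pairs whose gradient cosine similarity lies within +/- tol of zero."""
    pairs = list(combinations(domain_grads, 2))
    hits = 0
    for a, b in pairs:
        cos = torch.nn.functional.cosine_similarity(domain_grads[a], domain_grads[b], dim=0)
        hits += int(abs(cos.item()) <= tol)
    return hits / len(pairs)


# Hypothetical usage: compute one gradient vector per domain batch, then compare.
# domain_grads = {name: flat_grad(model, loss_fn(model, batch)) for name, batch in batches.items()}
# print(orthogonal_fraction(domain_grads))
```

Under this reading, a value of 0.9959 or higher would mean that nearly all domain pairs push the parameters in mutually orthogonal directions, which is consistent with the low forgetting reported above.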
Entities
Institutions
- arXiv