ARTFEED — Contemporary Art Intelligence

TFGN: Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

ai-technology · 2026-05-16

Researchers have introduced TFGN, an architectural overlay for transformer language models that enables continual pre-training without replay buffers, task identifiers, or regularization penalties. TFGN was evaluated on six text domains (Prose, Python, Math, Biomedical, Chinese, and JavaScript) with 1 billion tokens per phase, across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit). LLaMA 3.1 8B Retrofit showed backward transfer of -0.007, HellaSwag retention of 0.506/0.504/0.510, and over 99.59% L2-orthogonal gradient separation between domain pairs. The method provides input-conditioned updates while preserving the transformer's structure, addressing catastrophic forgetting in large language models.
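The paper's exact metric definition is not given in this digest; as an illustration, a minimal sketch of the standard continual-learning backward-transfer metric (final accuracy on earlier phases minus accuracy measured right after each phase trained) — the values below are toy numbers, not results from the paper:

```python
import numpy as np

def backward_transfer(acc):
    """Standard backward transfer (BWT): average change in accuracy on
    earlier phases after all training phases finish.
    acc[i][j] = accuracy on phase j's eval set after training phase i."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    # Compare final-row accuracy on each earlier phase j to the accuracy
    # recorded immediately after phase j itself was trained.
    return float(np.mean([acc[T - 1, j] - acc[j, j] for j in range(T - 1)]))

# Toy 3-phase accuracy matrix (illustrative numbers only).
acc = [[0.60, 0.00, 0.00],
       [0.59, 0.55, 0.00],
       [0.59, 0.55, 0.50]]
print(backward_transfer(acc))  # ≈ -0.005, i.e. mild forgetting
```

A value near zero, like the reported -0.007, means earlier-phase performance barely degrades after later phases.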

Key facts

  • TFGN is an architectural overlay for transformer language models.
  • It enables continual pre-training without replay buffers, task identifiers, or regularization penalties.
  • Tested on six domains: Prose, Python, Math, Biomedical, Chinese, JavaScript.
  • 1B tokens per phase across three model scales: ~398M, ~739M, ~9B.
  • Two regimes: From-Scratch and Retrofit.
  • Backward transfer of -0.007 for LLaMA 3.1 8B Retrofit.
  • HellaSwag retention: 0.506/0.504/0.510.
  • >=99.59% L2-orthogonal gradient separation between domain pairs.
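The digest does not spell out how the L2-orthogonal separation figure is computed; as a hedged sketch, one common way to quantify it is via the cosine similarity of flattened per-domain gradient vectors (the function name and tolerance here are illustrative assumptions):

```python
import numpy as np

def orthogonal_separation(g_a, g_b):
    """Illustrative separation score between two domains' gradients:
    1 minus the absolute cosine similarity of the flattened vectors.
    A score near 1.0 means the update directions barely overlap."""
    g_a, g_b = np.ravel(g_a), np.ravel(g_b)
    cos = np.dot(g_a, g_b) / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
    return 1.0 - abs(cos)

rng = np.random.default_rng(0)
# Two random high-dimensional vectors are nearly orthogonal, so this
# stand-in for "Math" vs. "Code" gradients scores close to 1.0.
g_math = rng.normal(size=100_000)
g_code = rng.normal(size=100_000)
print(orthogonal_separation(g_math, g_code))
```

Under a definition like this, the reported >=99.59% figure would mean domain pairs share almost no gradient direction, which is the mechanism that lets updates for one domain leave the others untouched.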

Entities

Institutions

  • arXiv
