Self-Training Restructures Language, Not Just Flattens It

other · 2026-05-22

A recent study disputes the common belief that repeatedly training a language model on its own outputs simply reduces linguistic diversity and flattens language. The researchers ran eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B) and found that language is restructured rather than merely flattened. Surface markers such as discourse connectives, hedges, and em-dashes increase, while mid- and deep-syntactic structures such as questions, parentheticals, passives, and subjunctives collapse. The authors formalize this asymmetry as the Structural Depth Hypothesis (SDH): the decay rate of a linguistic feature is predicted primarily by its structural depth, with generation-zero output frequency as a secondary predictor. The analysis pools 17-feature panels from the five models. The paper is available on arXiv as 2605.20602.

Key facts

Self-training on a language model's own outputs restructures language rather than just flattening it.
Eleven generations of self-training were tested on five models: GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B.
Surface markers (discourse connectives, hedges, em-dashes) increase during self-training.
Mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse.
The Structural Depth Hypothesis (SDH) formalizes the asymmetric collapse.
Decay rate of a linguistic feature is predicted primarily by its structural depth.
Generation-zero output frequency is a secondary predictor of decay rate.
The study pooled 17-feature panels from five models.
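
To make the idea of a per-feature decay rate concrete, the following is a minimal Python sketch, not the study's method. It assumes a feature's rate is measured as occurrences per 1,000 word tokens and that its trend is summarized by the least-squares slope of log(rate) against generation index. The marker word lists, the function names feature_rates and log_slope, and the smoothing constant eps are all hypothetical choices made for illustration; the paper's 17-feature panel and its exact decay measure are not reproduced here.

```python
import math
import re

# Hypothetical marker lists for illustration only; not the study's feature panel.
CONNECTIVES = {"however", "moreover", "therefore", "furthermore"}
HEDGES = {"perhaps", "possibly", "arguably", "likely"}
EM_DASH = "\u2014"


def feature_rates(text):
    """Return occurrences per 1,000 word tokens for a few crude features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scale = 1000.0 / max(len(tokens), 1)
    return {
        "connectives": sum(t in CONNECTIVES for t in tokens) * scale,
        "hedges": sum(t in HEDGES for t in tokens) * scale,
        "em_dashes": text.count(EM_DASH) * scale,
        # A question mark is only a rough proxy for an interrogative clause.
        "questions": text.count("?") * scale,
    }


def log_slope(rates, eps=1e-6):
    """Least-squares slope of log(rate + eps) against generation index.

    A negative slope means the feature decays across generations;
    a positive slope means it grows. eps avoids log(0) (an assumption).
    """
    if len(rates) < 2:
        raise ValueError("need rates from at least two generations")
    xs = list(range(len(rates)))
    ys = [math.log(r + eps) for r in rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Made-up toy texts standing in for samples from successive generations.
    generations = [
        "Is it true? Perhaps. Who knows?",
        "However, it is perhaps true. Is it?",
        "However, it is true. Moreover, it is likely true.",
    ]
    per_gen = [feature_rates(g) for g in generations]
    for name in per_gen[0]:
        slope = log_slope([r[name] for r in per_gen])
        print(f"{name}: slope {slope:+.3f}")
```

Under these assumptions, a feature whose slope is negative shrinks from generation to generation and one whose slope is positive grows. Comparing slopes across features of different structural depth is the kind of analysis the SDH describes, though real measurements would need a syntactic parser rather than word lists and punctuation counts.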
