Self-Training Restructures Language Rather Than Just Flattening It
A recent study challenges the common belief that repeatedly training a language model on its own outputs simply reduces linguistic diversity and flattens language. The researchers ran eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B) and found that language is restructured rather than merely flattened. Surface markers such as discourse connectives, hedges, and em-dashes increased, while mid- and deep-syntactic structures such as questions, parentheticals, passives, and subjunctives collapsed. The authors formalize this asymmetry as the Structural Depth Hypothesis (SDH): the decay rate of a linguistic feature is predicted primarily by its structural depth, with its generation-zero output frequency as a secondary predictor. The analysis pools 17-feature panels from the five models. The paper is available on arXiv (2605.20602).
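To make the idea of tracking linguistic features concrete, the following is a minimal sketch of how per-generation feature rates might be measured. It is not the study's method: the word lists, the regex proxies, and the function name `feature_rates` are all assumptions made for illustration, and the summary does not describe the actual 17-feature extractors.

```python
import re

# Hypothetical, deliberately tiny word lists; the study's actual feature
# definitions are not given in this summary.
HEDGES = {"perhaps", "possibly", "arguably", "somewhat", "likely"}
CONNECTIVES = {"however", "moreover", "therefore", "furthermore", "thus"}


def feature_rates(text: str) -> dict[str, float]:
    """Return crude per-1,000-token rates for a few surface and syntactic proxies."""
    # Tokenize into lowercase alphabetic words (apostrophes kept inside words).
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = len(tokens)
    if n == 0:
        return {}
    counts = {
        # Surface markers: simple lexical or punctuation counts.
        "hedges": sum(t in HEDGES for t in tokens),
        "connectives": sum(t in CONNECTIVES for t in tokens),
        "em_dashes": text.count("\u2014"),
        # Rough proxies for deeper structures: question marks and
        # non-nested parenthesized spans.
        "questions": text.count("?"),
        "parentheticals": len(re.findall(r"\([^()]*\)", text)),
    }
    return {name: 1000.0 * c / n for name, c in counts.items()}
```

Applying such a function to samples from each self-training generation would give one rate series per feature. Passives and subjunctives are left out here because they generally need a syntactic parser rather than a regex.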
Key facts
- Self-training on a language model's own outputs restructures language rather than just flattening it.
- Eleven generations of self-training were tested on five models: GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B.
- Surface markers (discourse connectives, hedges, em-dashes) increase during self-training.
- Mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse.
- The Structural Depth Hypothesis (SDH) formalizes the asymmetric collapse.
- Decay rate of a linguistic feature is predicted primarily by its structural depth.
- Generation-zero output frequency is a secondary predictor of decay rate.
- The study pooled 17-feature panels from the five models.
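The notion of a per-feature decay rate in the facts above can be illustrated with a small sketch. The summary does not say how the study defines or estimates decay rate, so this assumes a simple exponential model, f_g = f_0 * exp(-k * g), and estimates k as the negated least-squares slope of log frequency against generation index. The function name `decay_rate` is hypothetical.

```python
import math


def decay_rate(freqs: list[float]) -> float:
    """Negated least-squares slope of log(frequency) against generation index.

    Assumes an exponential model f_g = f_0 * exp(-k * g); returns k.
    Positive values mean the feature decays, negative values mean it grows.
    """
    # Skip non-positive frequencies, whose logarithm is undefined.
    points = [(g, math.log(f)) for g, f in enumerate(freqs) if f > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive frequencies")
    n = len(points)
    mean_g = sum(g for g, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((g - mean_g) ** 2 for g, _ in points)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in points)
    return -sxy / sxx
```

Under this assumed model, a collapsing feature such as questions would yield a positive rate and a growing surface marker a negative one. The resulting rates could then be compared against structural depth and generation-zero frequency.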
Entities
Institutions
- arXiv