Information-Theoretic Criterion for Efficient Synthetic Data Generation

ai-technology · 2026-05-20

A recent paper posted on arXiv (2605.16379) offers an information-theoretic account of why synthetic data gives inconsistent results when used to train large language models. The authors argue that synthetic data improves a model only when the generation-training loop is 'information-open', meaning it admits external signals such as verifiers, environments, or rubrics that carry task-relevant information beyond the model's existing distribution. An 'information-closed' loop, which relies solely on the model's own outputs, cannot add such information: by the data processing inequality, task-relevant information declines from round to round, and the predicted outcome is model collapse. Within information-open loops, efficiency and generalization both depend on the granularity of supervision. For instance, a coarse signal such as binary correctness treats all acceptable outputs equally, which promotes generalization that is not confined to a specific domain or format.
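The closed-loop argument can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes a uniform binary task variable T and treats each round of regeneration as a binary symmetric channel with a fixed flip probability, so every name and number here is an illustrative assumption. Under those assumptions, the mutual information between T and the k-th generation of data has a closed form, and it never increases from one round to the next, as the data processing inequality requires.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def compose_flip(p: float, q: float) -> float:
    """Flip probability of two binary symmetric channels applied in series."""
    return p * (1 - q) + (1 - p) * q


def closed_loop_information(flip: float, generations: int) -> list[float]:
    """I(T; Y_k) in bits for k = 1..generations.

    Assumes T is a uniform binary variable and each regeneration step
    is a binary symmetric channel with flip probability `flip`.
    """
    infos = []
    effective = 0.0  # before any regeneration, the data equals T exactly
    for _ in range(generations):
        effective = compose_flip(effective, flip)
        # For a uniform input, a binary symmetric channel carries 1 - H(flip) bits.
        infos.append(1.0 - binary_entropy(effective))
    return infos


infos = closed_loop_information(flip=0.1, generations=5)
for k, bits in enumerate(infos, start=1):
    print(f"generation {k}: {bits:.3f} bits about T")
```

The effective flip probability drifts toward 0.5 with each round, so the information about T shrinks toward zero. Nothing inside the loop can reverse that drift without a signal that depends on T itself.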

Key facts

Paper is on arXiv with ID 2605.16379
Provides information-theoretic account of why synthetic data helps inconsistently
Synthetic data improves model only when loop is information-open
Information-open loop uses external signals (verifiers, environments, rubrics)
Information-closed loop relies on model's own outputs
By the data processing inequality, task-relevant information declines in a closed loop
Collapse is predicted outcome of information-closed loops
Coarse supervision like binary correctness aids generalization
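
As a minimal sketch of the distinction in the list above, the following toy example contrasts a closed loop, which keeps every model output, with an open loop, which keeps only outputs accepted by an external binary verifier. The `toy_model`, `verifier`, and `generate` functions are hypothetical stand-ins invented for illustration; the addition task and the 30% error rate are assumptions, not details from the paper.

```python
import random


def toy_model(a: int, b: int, rng: random.Random, error_rate: float = 0.3) -> int:
    """Hypothetical stand-in for a model: answers a + b, sometimes off by one."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer


def verifier(a: int, b: int, answer: int) -> bool:
    """External binary check: the source of task information outside the model."""
    return answer == a + b


def generate(n: int, seed: int = 0, use_verifier: bool = True) -> list[tuple[int, int, int]]:
    """Sample n problems and keep the model's answers, optionally filtered."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = toy_model(a, b, rng)
        # Closed loop keeps everything; open loop keeps only verified samples.
        if not use_verifier or verifier(a, b, answer):
            kept.append((a, b, answer))
    return kept


def accuracy(samples: list[tuple[int, int, int]]) -> float:
    """Fraction of kept samples whose answer is correct."""
    return sum(verifier(a, b, ans) for a, b, ans in samples) / len(samples)


closed = generate(1000, use_verifier=False)
opened = generate(1000, use_verifier=True)
print(f"closed-loop data: {len(closed)} samples, accuracy {accuracy(closed):.2f}")
print(f"open-loop data:   {len(opened)} samples, accuracy {accuracy(opened):.2f}")
```

The verifier returns only a pass or fail bit, so it says nothing about how an answer should be phrased or formatted. In this toy setting each problem has a single acceptable answer, so the sketch shows the filtering step only, not the generalization effect the paper attributes to coarse supervision.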
