Information-Theoretic Criterion for Efficient Synthetic Data Generation
A recent arXiv paper (2605.16379) gives an information-theoretic account of the inconsistent results often observed when large language models are trained on synthetic data. The authors argue that synthetic data improves a model only when the generation-training loop is 'information-open', meaning it is shaped by external signals such as verifiers, environments, or rubrics that supply task-relevant information beyond what the model's own distribution already contains. An 'information-closed' loop, which relies solely on the model's own outputs, cannot add such information: by the data processing inequality, task-relevant information cannot increase from one round to the next, and the predicted outcome is a steady loss of that information ending in model collapse. In information-open loops, the granularity of the supervision signal governs both efficiency and generalization. A coarse signal such as binary correctness, for instance, treats all acceptable outputs equally, so it generalizes naturally rather than tying the model to a specific domain or format.
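The data processing inequality argument can be written out in a few lines. The notation here is an assumption made for illustration and is not taken from the paper: T stands for the task-relevant variable, D_k for the synthetic data (or model) after round k, and V_k for an external signal such as a verifier's verdict.

```latex
% Closed loop: each round is produced only from the previous one,
% so the rounds form a Markov chain and information about T cannot grow.
\[
T \to D_0 \to D_1 \to \cdots \to D_k
\quad\Longrightarrow\quad
I(T; D_{k+1}) \le I(T; D_k) \le \cdots \le I(T; D_0).
\]

% Open loop: round k+1 is produced from D_k together with an external
% signal V_k that itself depends on T, so the bound is looser.
\[
I(T; D_{k+1}) \le I(T; D_k, V_k)
= I(T; D_k) + I(T; V_k \mid D_k) \ge I(T; D_k).
\]
```

Under these assumptions, the closed loop is capped by whatever information D_0 already carried, while the open loop can gain up to I(T; V_k | D_k) per round, which is positive only when the external signal says something about the task that the current data does not.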
Key facts
- Paper is on arXiv with ID 2605.16379
- Provides information-theoretic account of synthetic data inconsistency
- Synthetic data improves model only when loop is information-open
- Information-open loop uses external signals (verifiers, environments, rubrics)
- Information-closed loop relies on model's own outputs
- By the data processing inequality, task-relevant information in a closed loop cannot increase and is predicted to decline
- Collapse is predicted outcome of information-closed loops
- Coarse supervision like binary correctness aids generalization
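The contrast between closed and open loops in the list above can be illustrated with a toy simulation. This is a minimal sketch and not the paper's experimental setup: the "model" is just a categorical distribution over ten hypothetical outputs, "training" is a maximum-likelihood refit on sampled outputs, and the verifier is a binary correctness check. All names (`run_loop`, `refit`, `correct_mass`) are invented for this example.

```python
import random
from collections import Counter


def refit(samples, vocab):
    """Maximum-likelihood categorical fit: empirical frequencies of the samples."""
    counts = Counter(samples)
    n = len(samples)
    return {v: counts[v] / n for v in vocab}


def run_loop(probs, is_correct, rounds, n_samples, rng, use_verifier):
    """Repeatedly sample from the model and refit it on its own samples.

    use_verifier=False: information-closed loop (no external signal).
    use_verifier=True:  information-open loop (binary verifier filters samples).
    """
    vocab = list(probs)
    for _ in range(rounds):
        weights = [probs[v] for v in vocab]
        samples = rng.choices(vocab, weights=weights, k=n_samples)
        if use_verifier:
            kept = [s for s in samples if is_correct(s)]
            if kept:  # if nothing passes, fall back to the unfiltered batch
                samples = kept
        probs = refit(samples, vocab)
    return probs


def correct_mass(probs, is_correct):
    """Total probability the model assigns to outputs the verifier accepts."""
    return sum(p for v, p in probs.items() if is_correct(v))


def support_size(probs):
    """Number of outputs that still have nonzero probability."""
    return sum(1 for p in probs.values() if p > 0)


if __name__ == "__main__":
    rng = random.Random(0)
    initial = {f"y{i}": 0.1 for i in range(10)}  # uniform toy model
    accepted = {"y0", "y1"}                       # hypothetical correct outputs

    def is_correct(y):
        return y in accepted

    closed = run_loop(initial, is_correct, 200, 20, rng, use_verifier=False)
    opened = run_loop(initial, is_correct, 200, 20, rng, use_verifier=True)

    print("closed: support", support_size(closed),
          "correct mass", round(correct_mass(closed, is_correct), 3))
    print("open:   support", support_size(opened),
          "correct mass", round(correct_mass(opened, is_correct), 3))
```

In the closed loop, resampling leaves the expected mass on correct outputs unchanged from round to round, so there is no systematic improvement, while sampling noise can only remove outputs from the support. In the open loop, the binary verifier moves mass onto accepted outputs without preferring one accepted output over another, which is the sense in which a coarse signal treats all acceptable outputs equally.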
Entities
Institutions
- arXiv