ARTFEED — Contemporary Art Intelligence

Information-Theoretic Criterion for Efficient Synthetic Data Generation

ai-technology · 2026-05-20

A recent arXiv paper (2605.16379) offers an information-theoretic account of the inconsistent results often observed with synthetic data used to train large language models. The authors contend that such data improves a model only when the generation-training loop is 'information-open', meaning it is shaped by external signals such as verifiers, environments, or rubrics that supply task-relevant information beyond the model's existing distribution. Conversely, in an 'information-closed' loop, which depends solely on the model's own outputs, the data processing inequality implies that task-relevant information cannot increase and instead declines, resulting in model collapse. In information-open systems, both efficiency and generalization depend on the granularity of supervision; for instance, a coarse signal like binary correctness treats all acceptable outputs equally, promoting natural generalization without being confined to specific domains or formats.
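
As a toy illustration of the data processing inequality invoked above (not the paper's own model), the sketch below assumes a uniform binary task label T and treats each closed-loop generation as a binary symmetric channel with a hypothetical flip probability. For a Markov chain T → X_k → X_{k+1}, the inequality gives I(T; X_{k+1}) ≤ I(T; X_k), and the code checks this numerically. All function names and parameters are illustrative assumptions.

```python
import math


def mutual_information(joint):
    """Return I(A;B) in bits for a joint distribution joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (p_a[a] * p_b[b]))
    return total


def push_through(joint, channel):
    """Map joint P(t, x) to joint P(t, y) using channel[x][y] = P(y | x)."""
    n_out = len(channel[0])
    return [
        [sum(row[x] * channel[x][y] for x in range(len(row))) for y in range(n_out)]
        for row in joint
    ]


def closed_loop_information(generations, flip=0.1):
    """Track I(T; X_k) when each generation only resamples the previous one."""
    # Assumption: generation 0 carries the task label perfectly (1 bit).
    joint = [[0.5, 0.0], [0.0, 0.5]]
    # Assumption: each retrain-and-sample step acts as a binary symmetric channel.
    channel = [[1 - flip, flip], [flip, 1 - flip]]
    history = [mutual_information(joint)]
    for _ in range(generations):
        joint = push_through(joint, channel)
        history.append(mutual_information(joint))
    return history


info = closed_loop_information(generations=5)
for k, bits in enumerate(info):
    print(f"generation {k}: I(T; X_k) = {bits:.3f} bits")
```

Nothing in this toy loop injects new information about T, so the tracked quantity can only shrink from one generation to the next. An external signal that depends on T directly would break the chain and fall outside this bound.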

Key facts

  • Paper is on arXiv with ID 2605.16379
  • Provides information-theoretic account of synthetic data inconsistency
  • Synthetic data improves model only when loop is information-open
  • Information-open loop uses external signals (verifiers, environments, rubrics)
  • Information-closed loop relies on model's own outputs
  • Data processing inequality ensures task-relevant information decreases in closed loop
  • Collapse is predicted outcome of information-closed loops
  • Coarse supervision like binary correctness aids generalization
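
The information-open case with coarse supervision can be sketched in a similar toy form. The example below assumes a hypothetical external verifier for addition problems that returns only a binary verdict; the `verify` and `filter_with_verifier` names, the problems, and the candidate strings are all made up for illustration and are not taken from the paper.

```python
def verify(problem, answer):
    """Hypothetical external verifier: binary correctness for an addition problem."""
    a, b = problem
    try:
        return float(answer) == a + b
    except ValueError:
        # Unparseable answers are simply rejected.
        return False


def filter_with_verifier(problems, candidates_per_problem):
    """Keep every candidate the verifier accepts, whatever its surface format."""
    kept = []
    for problem, candidates in zip(problems, candidates_per_problem):
        for answer in candidates:
            if verify(problem, answer):
                kept.append((problem, answer))
    return kept


problems = [(2, 3), (10, 4)]
candidates = [["5", "5.0", "6"], ["14", "fourteen", "13"]]
training_pairs = filter_with_verifier(problems, candidates)
print(training_pairs)
```

Because the verdict is binary, "5" and "5.0" are accepted on equal terms: the signal says which outputs are acceptable without favoring one format over another.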

Entities

Institutions

  • arXiv
