ARTFEED — Contemporary Art Intelligence

Synthetic Data Risks for Causal Inference

other · 2026-04-29

A recent arXiv preprint (2604.23904) reports that generative tabular synthesizers, including GAN- and LLM-based models, can distort causal estimands such as the average treatment effect (ATE) even when they score well on train-on-synthetic, test-on-real predictive benchmarks. The authors formalize the problem through sensitivity analyses and tradeoff results, showing that preserving the ATE requires controlling both the synthetic covariate distribution and the treatment-effect contrast. To address this, they propose a hybrid framework that generates covariates separately from the treatment and outcome processes, combining distance-to-closest-record diagnostics with separately learned nuisance models.
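One way to see why both ingredients matter is a standard add-and-subtract decomposition. The notation below is an assumption for illustration and is not taken from the paper: W denotes covariates, A a binary treatment, Y the outcome, tau(w) the conditional treatment-effect contrast, P_W the real covariate law, and tildes mark the synthetic counterparts. The ATE is written as the covariate-averaged contrast, which presumes the usual identification conditions hold.

```latex
\begin{align*}
\tau(w) &= \mathbb{E}[Y \mid A = 1, W = w] - \mathbb{E}[Y \mid A = 0, W = w],
\qquad
\psi = \mathbb{E}_{P_W}\!\left[\tau(W)\right], \\[4pt]
\tilde{\psi} - \psi
&= \underbrace{\mathbb{E}_{\tilde{P}_W}\!\left[\tilde{\tau}(W) - \tau(W)\right]}_{\text{error in the treatment-effect contrast}}
 + \underbrace{\mathbb{E}_{\tilde{P}_W}\!\left[\tau(W)\right] - \mathbb{E}_{P_W}\!\left[\tau(W)\right]}_{\text{error in the covariate law}} .
\end{align*}
```

Under this reading, the synthetic ATE matches the real one only if the two terms vanish or cancel, and neither term is directly measured by predictive accuracy on Y.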

Key facts

  • arXiv paper 2604.23904
  • Generative tabular synthesizers distort ATE
  • GAN- and LLM-based models tested
  • Strong train-on-synthetic-test-on-real performance observed
  • ATE preservation requires control of covariate law and treatment-effect contrast
  • Hybrid framework proposed
  • Distance-to-closest-record diagnostics used
  • Separately learned nuisance models for (W, A, Y) triplets
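Two of the ingredients listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the distance-to-closest-record check uses plain Euclidean distance on numeric features, and the ATE is a plug-in estimate from separate linear outcome regressions per treatment arm.

```python
import numpy as np


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row.

    Assumption: all features are numeric and already on comparable scales.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Pairwise differences, shape (n_synthetic, n_real, n_features).
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


def plug_in_ate(W, A, Y):
    """Plug-in ATE from separate linear outcome regressions per treatment arm.

    Assumption: linear outcome models, chosen only to keep the sketch short.
    """
    W = np.asarray(W, dtype=float)
    A = np.asarray(A).astype(bool)
    Y = np.asarray(Y, dtype=float)
    X = np.column_stack([np.ones(len(W)), W])  # prepend an intercept column
    beta_treated, *_ = np.linalg.lstsq(X[A], Y[A], rcond=None)
    beta_control, *_ = np.linalg.lstsq(X[~A], Y[~A], rcond=None)
    # Average the predicted contrast over the full covariate sample.
    return float(np.mean(X @ beta_treated - X @ beta_control))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_W = rng.normal(size=(200, 2))
    # Hypothetical stand-in for a covariate generator: jittered resampling.
    synth_W = real_W[rng.integers(0, 200, size=200)] + rng.normal(scale=0.1, size=(200, 2))
    dcr = distance_to_closest_record(synth_W, real_W)
    print("median DCR:", np.median(dcr))

    # Simulated treatment and outcome with a true ATE of 2.
    A = rng.integers(0, 2, size=200)
    Y = real_W[:, 0] + 2.0 * A + rng.normal(scale=0.5, size=200)
    print("plug-in ATE on real data:", plug_in_ate(real_W, A, Y))
```

Comparing `plug_in_ate` on real and synthetic (W, A, Y) samples gives a direct check on whether a synthesizer preserved the estimand, while very small DCR values can flag synthetic rows that nearly copy real records.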

Entities

Institutions

  • arXiv
