Synthetic Data Risks for Causal Inference

other · 2026-04-29

A recent arXiv preprint (2604.23904) reports that generative tabular synthesizers, including GAN- and LLM-based models, can distort causal estimands such as the average treatment effect (ATE) even when they perform strongly on train-on-synthetic, test-on-real prediction. The authors formalize the problem through sensitivity analyses and tradeoff results, showing that preserving the ATE requires controlling both the synthetic covariate law and the treatment-effect contrast. To address this, they propose a hybrid framework that generates covariates separately from the treatment and outcome processes, combining distance-to-closest-record diagnostics with separately learned nuisance models for the (W, A, Y) triplets.
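
To see why both ingredients matter, it can help to write the ATE as an average of a conditional contrast over the covariate distribution. The decomposition below is a standard textbook-style sketch, not necessarily the one used in the paper. It assumes the usual identification conditions (no unmeasured confounding, overlap). The symbols P_W (real covariate law), Q_W (synthetic covariate law), and tau_Q (contrast implied by the synthetic data) are notation introduced here for illustration.

```latex
% Conditional treatment-effect contrast on the real data
\tau(w) = \mathbb{E}[Y \mid A = 1, W = w] - \mathbb{E}[Y \mid A = 0, W = w]

% ATE on real and synthetic data
\mathrm{ATE}_{\mathrm{real}} = \mathbb{E}_{P_W}[\tau(W)], \qquad
\mathrm{ATE}_{\mathrm{syn}} = \mathbb{E}_{Q_W}[\tau_Q(W)]

% The gap splits into a contrast error and a covariate-law error
\mathrm{ATE}_{\mathrm{syn}} - \mathrm{ATE}_{\mathrm{real}}
  = \underbrace{\mathbb{E}_{Q_W}\bigl[\tau_Q(W) - \tau(W)\bigr]}_{\text{contrast error}}
  + \underbrace{\mathbb{E}_{Q_W}[\tau(W)] - \mathbb{E}_{P_W}[\tau(W)]}_{\text{covariate-law error}}
```

Under this reading, a synthesizer can predict Y well and still shift the ATE if either term is nonzero, which is consistent with the summary's claim that both the covariate law and the contrast must be controlled.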

Key facts

arXiv paper 2604.23904
Generative tabular synthesizers distort ATE
GAN- and LLM-based models tested
Strong train-on-synthetic, test-on-real performance observed
ATE preservation requires control of covariate law and treatment-effect contrast
Hybrid framework proposed
Distance-to-closest-record diagnostics used
Separately learned nuisance models for (W, A, Y) triplets
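
As an illustration of the distance-to-closest-record (DCR) idea listed above, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the function name `distance_to_closest_record`, the Euclidean metric, the standardization step, and the toy data are all assumptions made for this example. DCR is commonly used to flag synthetic records that sit unusually close to real ones; how the paper computes or thresholds it is not stated in this summary.

```python
import numpy as np


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    Hypothetical helper for illustration; columns are standardized using
    real-data statistics so that no single covariate dominates the distance.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)

    mean = real.mean(axis=0)
    std = real.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns

    s = (synthetic - mean) / std
    r = (real - mean) / std

    # Pairwise differences via broadcasting: shape (n_synthetic, n_real, n_features).
    diffs = s[:, None, :] - r[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


# Toy covariates W (assumed data, not from the paper).
rng = np.random.default_rng(0)
real_w = rng.normal(size=(200, 3))

# "Synthetic" set that simply copies real rows: DCR should be exactly zero.
copied_w = real_w[:50].copy()
# "Synthetic" set drawn fresh from the same distribution: DCR should be positive.
fresh_w = rng.normal(size=(50, 3))

dcr_copied = distance_to_closest_record(copied_w, real_w)
dcr_fresh = distance_to_closest_record(fresh_w, real_w)

print("median DCR, copied rows:", np.median(dcr_copied))
print("median DCR, fresh draws:", np.median(dcr_fresh))
```

A DCR distribution concentrated at zero suggests the generator is reproducing real records, while a distribution similar to that of a held-out real sample suggests it is not. The pairwise computation here uses memory proportional to the product of the two sample sizes, so a nearest-neighbor index would be a more practical choice for large tables.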
