ARTFEED — Contemporary Art Intelligence

Benchmarking Synthetic Data Methods for Education

publication · 2026-04-25

A new study on arXiv presents the first systematic benchmark comparing traditional resampling techniques with deep generative models for synthetic data in education. Using a 10,000-record student performance dataset, the researchers evaluated SMOTE, Bootstrap, and Random Oversampling against an Autoencoder, a Variational Autoencoder, and Copula-GAN. Metrics covered distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility (Train-on-Synthetic-Test-on-Real, or TSTR, scores), and privacy preservation (Distance to Closest Record, or DCR). Results show that resampling methods achieve near-perfect utility (TSTR: 0.997) but fail on privacy (DCR ≈ 0.00), effectively reproducing real records, while deep generative models offer stronger privacy at some cost to utility. The study provides empirical guidance for practitioners selecting synthetic data methods in educational technology.
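The privacy finding hinges on Distance to Closest Record: for each synthetic row, the distance to its nearest real row. A DCR near zero means the "synthetic" record is essentially a copy of a real one, which is why resampling scores so poorly on privacy. Below is a minimal numpy sketch of that idea; the function name, toy data, and Euclidean distance choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row.

    DCR near 0 means the synthetic record is a near-copy of a real record
    (weak privacy); larger values indicate more novel records.
    """
    # Pairwise differences via broadcasting: shape (n_synthetic, n_real, n_features)
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Minimum over the real records for each synthetic record
    return dists.min(axis=1)

# Toy data: a resampling method that duplicates real rows yields DCR = 0,
# while a generative model producing novel points yields DCR > 0.
real = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
resampled = real[[0, 2]]                         # exact copies of real rows
generated = np.array([[0.4, 0.3], [1.5, 0.8]])   # novel points

print(distance_to_closest_record(resampled, real))  # -> [0. 0.]
print(distance_to_closest_record(generated, real))
```

This mirrors the study's headline contrast: duplicating records maximizes utility but drives DCR to zero.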

Key facts

  • First systematic benchmark comparing resampling and deep generative models for synthetic data in education
  • Dataset: 10,000-record student performance dataset
  • Resampling methods: SMOTE, Bootstrap, Random Oversampling
  • Deep learning models: Autoencoder, Variational Autoencoder, Copula-GAN
  • Evaluation metrics: Kolmogorov-Smirnov distance, Jensen-Shannon divergence, TSTR, Distance to Closest Record
  • Resampling methods achieved a TSTR of 0.997 but a DCR ≈ 0.00 (synthetic records nearly duplicate real ones)
  • Deep models offer better privacy but lower utility
  • Study provides empirical guidance for synthetic data selection

Entities

Institutions

  • arXiv

Sources