Benchmarking Synthetic Data Methods for Education
A new preprint posted to arXiv presents the first systematic benchmark comparing traditional resampling techniques and deep generative models for synthetic data in education. Using a 10,000-record student performance dataset, the researchers evaluated SMOTE, Bootstrap, and Random Oversampling against an Autoencoder, a Variational Autoencoder, and Copula-GAN. Metrics covered distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility (Train-on-Synthetic-Test-on-Real scores), and privacy preservation (Distance to Closest Record). Results show that resampling methods achieve near-perfect utility (TSTR: 0.997) but fail on privacy (DCR ~ 0.00, meaning synthetic records essentially duplicate real ones), while deep generative models offer stronger privacy at the cost of utility. The study provides empirical guidance for practitioners selecting synthetic data methods in educational technology.
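To make the utility metric concrete, here is a minimal sketch of the Train-on-Synthetic-Test-on-Real (TSTR) protocol as generally practiced: fit a model on synthetic records only, then score it on held-out real records. The toy data, logistic-regression model, and pass/fail label below are illustrative stand-ins, not the study's actual setup.

```python
# Hedged sketch of TSTR evaluation; data and model are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in "real" student data: two features and a binary pass/fail label.
X_real = rng.normal(size=(500, 2))
y_real = (X_real[:, 0] + X_real[:, 1] > 0).astype(int)

# Stand-in "synthetic" data drawn from a similar distribution.
X_syn = rng.normal(size=(500, 2))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)

# Train on synthetic, test on real: a high TSTR score means the
# synthetic data preserved the feature-label relationship.
model = LogisticRegression().fit(X_syn, y_syn)
tstr = accuracy_score(y_real, model.predict(X_real))
print(f"TSTR accuracy: {tstr:.3f}")
```

A TSTR score close to the train-on-real baseline (as with the 0.997 reported for resampling methods) indicates the synthetic data is nearly as useful for modeling as the original.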
Key facts
- First systematic benchmark comparing resampling and deep generative models for synthetic data in education
- Dataset: 10,000-record student performance dataset
- Resampling methods: SMOTE, Bootstrap, Random Oversampling
- Deep learning models: Autoencoder, Variational Autoencoder, Copula-GAN
- Evaluation metrics: Kolmogorov-Smirnov distance, Jensen-Shannon divergence, TSTR, Distance to Closest Record
- Resampling methods achieved TSTR of 0.997 but DCR ~ 0.00
- Deep models offer better privacy but lower utility
- Study provides empirical guidance for synthetic data selection
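The fidelity and privacy metrics listed above can be sketched in a few lines. This is an illustrative implementation on toy data, assuming the standard definitions of each metric (per-feature KS statistic, JS divergence over shared histogram bins, and DCR as each synthetic row's Euclidean distance to its nearest real row); it is not the paper's code.

```python
# Hedged sketch of KS distance, JS divergence, and Distance to Closest
# Record (DCR) on illustrative data; standard definitions assumed.
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(1000, 3))
syn = rng.normal(0.1, 1.0, size=(1000, 3))

# KS distance per feature: max gap between empirical CDFs (0 = identical).
ks = [ks_2samp(real[:, j], syn[:, j]).statistic for j in range(real.shape[1])]

# JS divergence of the first feature over shared histogram bins.
bins = np.histogram_bin_edges(np.concatenate([real[:, 0], syn[:, 0]]), bins=20)
p, _ = np.histogram(real[:, 0], bins=bins, density=True)
q, _ = np.histogram(syn[:, 0], bins=bins, density=True)
js = jensenshannon(p, q) ** 2  # scipy returns JS distance; square it

# DCR: each synthetic row's Euclidean distance to its nearest real row.
# A mean DCR near zero flags privacy risk, since synthetic rows nearly
# copy real ones -- which is why resampling methods score ~0.00 here.
dists = np.linalg.norm(syn[:, None, :] - real[None, :, :], axis=2)
dcr = dists.min(axis=1).mean()
print(f"mean KS: {np.mean(ks):.3f}, JS: {js:.3f}, mean DCR: {dcr:.3f}")
```

Together these capture the trade-off the study reports: resampling copies real records (perfect fidelity and utility, zero DCR), while generative models place synthetic points away from real ones at some cost to utility.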