Tabular Foundation Models: Comparing Real vs. Synthetic Training Data Distributions
A new study on arXiv (2605.06343) compares three types of pre-training corpora for tabular foundation models: web-scraped tables (the T4 dataset), curated benchmark tables (the TabFM dataset), and synthetic tables drawn from a parametric generative prior (the TabICL dataset). The authors characterize each corpus with aggregate features computed over whole tables, individual columns, and inter-column correlations, then compare the resulting feature distributions using discriminator AUCs and k-NN coverage metrics. Key finding: the TabICL synthetic prior occupies a narrow region of the distribution space, potentially limiting its representativeness. The study addresses a poorly understood question: how these corpora relate to one another distributionally, and how that relationship affects downstream performance.
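The discriminator-AUC idea can be sketched as follows: train a classifier to tell per-table feature vectors from corpus A apart from corpus B; an out-of-fold AUC near 0.5 means the two corpora look alike in feature space, while an AUC near 1.0 means they are easily separable. This is a minimal illustration with synthetic Gaussian features, not the paper's exact features or classifier.

```python
# Hedged sketch of a distribution-discriminator AUC between two corpora,
# each represented by per-table aggregate feature vectors (illustrative data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def discriminator_auc(feats_a, feats_b, seed=0):
    """AUC of a classifier distinguishing corpus A from corpus B.
    ~0.5 => corpora indistinguishable; ~1.0 => trivially separable."""
    X = np.vstack([feats_a, feats_b])
    y = np.concatenate([np.zeros(len(feats_a)), np.ones(len(feats_b))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Out-of-fold probabilities avoid rewarding training-set memorization.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. a mean-shifted distribution.
same = discriminator_auc(rng.normal(0, 1, (200, 8)), rng.normal(0, 1, (200, 8)))
shifted = discriminator_auc(rng.normal(0, 1, (200, 8)), rng.normal(2, 1, (200, 8)))
print(f"same-distribution AUC: {same:.2f}, shifted AUC: {shifted:.2f}")
```

In practice the feature vectors would be the study's aggregate table/column/correlation statistics rather than raw Gaussians; the AUC reading is the same.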
Key facts
- Study compares three tabular foundation model training corpora: T4 (web-scraped), TabFM (curated from Kaggle), TabICL (synthetic).
- TabICL is the only widely used synthetic prior with publicly available parameters.
- Corpora characterized using aggregate features over whole tables, columns, and correlations.
- Comparison methods: discriminator AUCs and k-NN coverage metrics.
- TabICL synthetic prior occupies a narrow region of distribution space.
- Research addresses the gap in understanding distributional relationships among pre-training corpora.
- The study emphasizes that pre-training data is central to model performance.
- Study appears on arXiv with identifier 2605.06343.
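The k-NN coverage idea behind the "narrow region" finding can be sketched as follows: count the fraction of real feature vectors that have at least one synthetic vector inside their k-nearest-neighbor radius (in the spirit of the coverage metric of Naeem et al., 2020). The function name, data, and exact protocol here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a k-NN coverage metric: how much of the "real" feature
# distribution is reached by the synthetic sample. Illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_coverage(real, synth, k=5):
    # Radius of each real point = distance to its k-th nearest real neighbor
    # (k + 1 neighbors because each point is its own 0-distance neighbor).
    nn_real = NearestNeighbors(n_neighbors=k + 1).fit(real)
    radii = nn_real.kneighbors(real)[0][:, -1]
    # Distance from each real point to its nearest synthetic point.
    nn_synth = NearestNeighbors(n_neighbors=1).fit(synth)
    d_synth = nn_synth.kneighbors(real)[0][:, 0]
    return float(np.mean(d_synth <= radii))

rng = np.random.default_rng(0)
real = rng.normal(0, 1, (300, 6))
broad = rng.normal(0, 1, (300, 6))     # synthetic data matching the real spread
narrow = rng.normal(0, 0.1, (300, 6))  # a prior stuck in a narrow region
cov_broad = knn_coverage(real, broad)
cov_narrow = knn_coverage(real, narrow)
print(f"broad coverage: {cov_broad:.2f}, narrow coverage: {cov_narrow:.2f}")
```

A synthetic prior confined to a narrow region of feature space, as the study reports for TabICL, yields low coverage of the real corpora under a metric like this.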
Entities
Institutions
- arXiv