MaskTab: A New Pre-Training Framework for Industrial Tabular Data
MaskTab is a unified self-supervised pre-training framework tailored to large-scale industrial tabular data, addressing high dimensionality, missing values, and label scarcity. It encodes missing values with dedicated learnable tokens, cleanly distinguishing structural absence from the random dropout applied during masking. Pre-training is hybrid-supervised: a dual-path architecture aligns masked reconstruction with task-oriented supervision, and an MoE-augmented loss adaptively routes features through specialized expert subnetworks. On industrial-scale benchmarks, MaskTab reports a +5.0 improvement over prior methods. The paper is available on arXiv under identifier 2605.11408.
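To make the missing-value tokens concrete, here is a minimal numpy sketch of the idea: each feature carries two dedicated learnable vectors, one substituted for structurally missing entries (NaN in the raw table) and one for entries hidden by the random pre-training mask. All names, shapes, and the value-projection scheme are assumptions for illustration, not MaskTab's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class MaskedTabularEmbedder:
    """Illustrative embedder (hypothetical, not the paper's code):
    per-feature value projections plus two learnable tokens per feature,
    one for structural absence and one for random pre-training dropout."""

    def __init__(self, n_features, dim):
        self.weight = rng.normal(size=(n_features, dim))       # per-feature value projection
        self.missing_tok = rng.normal(size=(n_features, dim))  # structural-absence tokens
        self.mask_tok = rng.normal(size=(n_features, dim))     # random-dropout tokens

    def __call__(self, x, pretrain_mask=None):
        # x: (batch, n_features), with NaN marking structurally missing entries
        missing = np.isnan(x)
        vals = np.where(missing, 0.0, x)               # neutralize NaNs before projecting
        emb = vals[..., None] * self.weight            # (batch, n_features, dim)
        emb = np.where(missing[..., None], self.missing_tok, emb)
        if pretrain_mask is not None:                  # random dropout for masked reconstruction
            emb = np.where(pretrain_mask[..., None], self.mask_tok, emb)
        return emb

x = np.array([[0.5, np.nan, 1.2]])                     # second feature structurally missing
mask = np.array([[False, False, True]])                # third feature randomly masked
emb = MaskedTabularEmbedder(3, 4)
out = emb(x, mask)                                     # shape (1, 3, 4)
```

Because the two token tables are separate parameters, the model can learn different behavior for "this value never existed" versus "this value was hidden and should be reconstructed".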
Key facts
- MaskTab is a unified pre-training framework for industrial tabular data.
- It uses dedicated learnable tokens to encode missing values.
- The framework employs a twin-path architecture for hybrid supervised pre-training.
- MaskTab incorporates an MoE-augmented loss for adaptive feature routing.
- It achieves a +5.0 improvement on industrial-scale benchmarks.
- The paper is published on arXiv with ID 2605.11408.
- Tabular data is foundational in finance, healthcare, and other high-stakes domains.
- Industrial tabular datasets are often high-dimensional, riddled with missing entries, and sparsely labeled.
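The MoE-augmented loss mentioned above relies on routing features through specialized subnetworks. The following hedged sketch shows only the generic routing mechanism: a gating network produces per-example weights over experts, and the output is the gate-weighted sum of expert outputs. How MaskTab folds this into its loss is not detailed here, and all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class TinyMoE:
    """Generic mixture-of-experts layer (illustrative, not MaskTab's code):
    a linear gate scores each expert per example; experts are independent
    linear maps; the output is the gate-weighted mixture."""

    def __init__(self, dim, n_experts):
        self.gate = rng.normal(size=(dim, n_experts))           # gating network
        self.experts = rng.normal(size=(n_experts, dim, dim))   # one linear map per expert

    def __call__(self, h):
        # h: (batch, dim) feature representations
        g = softmax(h @ self.gate)                              # (batch, n_experts) routing weights
        expert_out = np.einsum('bd,edk->bek', h, self.experts)  # (batch, n_experts, dim)
        return np.einsum('be,bek->bk', g, expert_out)           # gate-weighted mixture

moe = TinyMoE(8, 4)
h = rng.normal(size=(2, 8))
y = moe(h)                                                      # shape (2, 8)
```

The softmax gate makes routing differentiable, so the gating parameters are trained jointly with the experts; sparse top-k variants would zero out all but the highest-scoring experts per example.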