VT-Bench: First Unified Benchmark for Visual-Tabular Learning
Researchers have launched VT-Bench, the first unified benchmark for visual-tabular learning, standardizing both discriminative prediction and generative reasoning tasks. The benchmark comprises 14 datasets spanning 9 domains, including healthcare, pets, media, and transportation, and aggregates over 756,000 samples. The research team evaluated 23 representative models, covering unimodal specialists, dedicated visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented approaches. The findings reveal significant challenges in visual-tabular learning, an underexplored field critical to high-stakes sectors such as healthcare and industry. The benchmark is publicly available on GitHub.
Key facts
- VT-Bench is the first unified benchmark for visual-tabular learning.
- It covers discriminative prediction and generative reasoning tasks.
- The benchmark includes 14 datasets from 9 domains.
- Domains include healthcare, pets, media, and transportation.
- Over 756,000 samples are aggregated in VT-Bench.
- 23 models were evaluated, including unimodal and multimodal approaches.
- Visual-tabular learning is underexplored but critical for healthcare and industry.
- The benchmark is publicly available on GitHub.
Entities
Institutions
- arXiv