VT-Bench: First Unified Benchmark for Visual-Tabular Learning
Researchers have launched VT-Bench, the first unified benchmark for visual-tabular learning, standardizing both discriminative prediction and generative reasoning tasks. The benchmark comprises 14 datasets spanning 9 domains, including healthcare, pets, media, and transportation, and aggregates over 756,000 samples. The research team evaluated 23 representative models, covering unimodal specialists, dedicated visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented approaches. The findings reveal significant challenges in visual-tabular learning, an underexplored field critical to high-stakes sectors such as healthcare and industry. The benchmark is publicly available on GitHub.
Key facts
- VT-Bench is the first unified benchmark for visual-tabular learning.
- It covers discriminative prediction and generative reasoning tasks.
- The benchmark includes 14 datasets from 9 domains.
- Domains include healthcare, pets, media, and transportation.
- Over 756,000 samples are aggregated in VT-Bench.
- 23 models were evaluated, including unimodal and multimodal approaches.
- Visual-tabular learning is underexplored but critical for healthcare and industry.
- The benchmark is publicly available on GitHub.
Entities
Institutions
- arXiv