ARTFEED — Contemporary Art Intelligence

VT-Bench: First Unified Benchmark for Visual-Tabular Learning

other · 2026-05-12

Researchers have launched VT-Bench, the first unified benchmark for standardizing visual-tabular discriminative prediction and generative reasoning tasks. The benchmark aggregates 14 datasets from 9 distinct domains, including healthcare, pets, media, and transportation, totaling over 756,000 samples. The research team evaluated 23 representative models, spanning unimodal specialists, dedicated visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented approaches. The findings reveal significant open challenges in visual-tabular learning, an underexplored field critical to high-stakes sectors such as healthcare and industry. The benchmark is publicly available on GitHub.

Key facts

  • VT-Bench is the first unified benchmark for visual-tabular learning.
  • It covers discriminative prediction and generative reasoning tasks.
  • The benchmark includes 14 datasets from 9 domains.
  • Domains include healthcare, pets, media, and transportation.
  • Over 756,000 samples are aggregated in VT-Bench.
  • 23 models were evaluated, including unimodal and multimodal approaches.
  • Visual-tabular learning is underexplored but critical for healthcare and industry.
  • The benchmark is publicly available on GitHub.

Entities

Institutions

  • arXiv

Sources