ARTFEED — Contemporary Art Intelligence

SOUBench: New Benchmark Reveals MLLMs Struggle with Small Objects

ai-technology · 2026-04-29

Researchers have introduced SOUBench, the first comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) on Small Object Understanding (SOU) tasks. The benchmark includes SOU-VQA, an evaluation dataset of 18,204 visual question-answer pairs spanning six sub-tasks and three dominant scenarios: Driving, Aerial, and Underwater. The dataset was constructed with an automatic visual question-answer generation strategy. An evaluation of 15 state-of-the-art MLLMs revealed consistently weak small-object understanding. To address this gap, the team also released SOU-Train, a multimodal training dataset of 11,226 VQA pairs aimed at improving SOU performance. The study highlights a significant blind spot in current MLLM capabilities and provides resources for future research.
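To make the benchmark's structure concrete, here is a minimal sketch of what a scenario-wise VQA evaluation might look like. The record schema, field names, and sample entries below are illustrative assumptions; the actual SOUBench data format and scoring protocol are not specified in this summary.

```python
# Hypothetical SOU-VQA-style records. Field names ("scenario", "sub_task",
# "answer", "prediction") are assumptions for illustration only.
sou_vqa_samples = [
    {"scenario": "Driving", "sub_task": "counting",
     "question": "How many traffic cones are visible?",
     "answer": "3", "prediction": "3"},
    {"scenario": "Aerial", "sub_task": "recognition",
     "question": "What small object sits on the rooftop?",
     "answer": "antenna", "prediction": "water tank"},
]

def accuracy_by_scenario(samples):
    """Exact-match accuracy per scenario (a common, simple VQA metric)."""
    totals, correct = {}, {}
    for s in samples:
        scen = s["scenario"]
        totals[scen] = totals.get(scen, 0) + 1
        if s["prediction"].strip().lower() == s["answer"].strip().lower():
            correct[scen] = correct.get(scen, 0) + 1
    return {scen: correct.get(scen, 0) / n for scen, n in totals.items()}
```

Reporting accuracy per scenario, as sketched here, is one way a benchmark like this could surface which settings (e.g. Aerial vs. Driving) are hardest for a given model.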

Key facts

  • SOUBench is the first comprehensive benchmark for small object understanding in MLLMs.
  • The SOU-VQA dataset contains 18,204 VQA pairs.
  • Six relevant sub-tasks are included in the benchmark.
  • Three dominant scenarios: Driving, Aerial, and Underwater.
  • 15 state-of-the-art MLLMs were evaluated.
  • MLLMs showed weak capabilities in small object understanding.
  • SOU-Train training dataset has 11,226 VQA pairs.
  • An automatic visual question-answer generation strategy was used.
