SOUBench: New Benchmark Reveals MLLMs Struggle with Small Objects
Researchers have introduced SOUBench, the first comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) on Small Object Understanding (SOU). The benchmark centers on SOU-VQA, an evaluation dataset of 18,204 visual question-answer (VQA) pairs spanning six sub-tasks and three dominant scenarios: Driving, Aerial, and Underwater. The dataset was constructed with an automatic visual question-answer generation strategy. Evaluation of 15 state-of-the-art MLLMs revealed consistently weak capabilities in small object understanding. To address this, the team also developed SOU-Train, a multimodal training dataset of 11,226 VQA pairs aimed at improving SOU performance. The study highlights a significant gap in current MLLM abilities and provides resources for future research.
Key facts
- SOUBench is the first comprehensive benchmark for small object understanding in MLLMs.
- The SOU-VQA dataset contains 18,204 VQA pairs.
- Six relevant sub-tasks are included in the benchmark.
- Three dominant scenarios: Driving, Aerial, and Underwater.
- 15 state-of-the-art MLLMs were evaluated.
- MLLMs showed weak capabilities in small object understanding.
- SOU-Train training dataset has 11,226 VQA pairs.
- An automatic visual question-answer generation strategy was used.
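To make the benchmark structure concrete, a minimal sketch of how scenario-tagged VQA pairs might be scored against model predictions is shown below. The field names (`scenario`, `answer`), the exact-match metric, and the function `score_by_scenario` are illustrative assumptions, not the paper's actual evaluation protocol.

```python
from collections import defaultdict

def score_by_scenario(vqa_pairs, predictions):
    """Hypothetical sketch: exact-match accuracy per scenario
    (e.g. Driving, Aerial, Underwater)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, pred in zip(vqa_pairs, predictions):
        scen = pair["scenario"]
        total[scen] += 1
        # Assumed scoring rule: case-insensitive exact match on the answer string.
        if pred.strip().lower() == pair["answer"].strip().lower():
            correct[scen] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy usage with two made-up VQA pairs:
pairs = [
    {"scenario": "Driving", "answer": "pedestrian"},
    {"scenario": "Aerial", "answer": "boat"},
]
print(score_by_scenario(pairs, ["pedestrian", "car"]))
# {'Driving': 1.0, 'Aerial': 0.0}
```

A per-scenario breakdown like this is what allows a benchmark to report where models fail (e.g. aerial small objects) rather than a single aggregate score.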