SOUBench: New Benchmark Reveals MLLMs Struggle with Small Objects
Researchers have introduced SOUBench, the first comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) on Small Object Understanding (SOU). The benchmark centers on SOU-VQA, an evaluation dataset of 18,204 visual question-answer (VQA) pairs spanning six sub-tasks and three dominant scenarios: Driving, Aerial, and Underwater. The dataset was constructed with an automatic visual question-answer generation strategy. Evaluation of 15 state-of-the-art MLLMs revealed consistently weak capabilities in small object understanding. To address this, the team also developed SOU-Train, a multimodal training dataset of 11,226 VQA pairs aimed at improving SOU performance. The study highlights a significant gap in current MLLM abilities and provides resources for future research.
Key facts
- SOUBench is the first comprehensive benchmark for small object understanding in MLLMs.
- The SOU-VQA dataset contains 18,204 VQA pairs.
- Six relevant sub-tasks are included in the benchmark.
- Three dominant scenarios: Driving, Aerial, and Underwater.
- 15 state-of-the-art MLLMs were evaluated.
- MLLMs showed weak capabilities in small object understanding.
- SOU-Train training dataset has 11,226 VQA pairs.
- An automatic visual question-answer generation strategy was used.
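To make the benchmark structure concrete, a minimal sketch of how scenario-tagged VQA pairs might be scored against model predictions is shown below. The field names (`scenario`, `answer`), the exact-match metric, and the function `score_by_scenario` are illustrative assumptions, not the paper's actual evaluation protocol.

```python
from collections import defaultdict

def score_by_scenario(vqa_pairs, predictions):
    """Hypothetical sketch: exact-match accuracy per scenario
    (e.g. Driving, Aerial, Underwater)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, pred in zip(vqa_pairs, predictions):
        scen = pair["scenario"]
        total[scen] += 1
        # Assumed scoring rule: case-insensitive exact match on the answer string.
        if pred.strip().lower() == pair["answer"].strip().lower():
            correct[scen] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy usage with two made-up VQA pairs:
pairs = [
    {"scenario": "Driving", "answer": "pedestrian"},
    {"scenario": "Aerial", "answer": "boat"},
]
print(score_by_scenario(pairs, ["pedestrian", "car"]))
# {'Driving': 1.0, 'Aerial': 0.0}
```

A per-scenario breakdown like this is what allows a benchmark to report where models fail (e.g. aerial small objects) rather than a single aggregate score.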