FineBench Benchmark Tests VLMs on Fine-Grained Human Activity
Researchers have developed FineBench, a human-centric video question answering (VQA) benchmark for evaluating how well Vision-Language Models (VLMs) understand fine-grained human activity. It contains 199,420 multiple-choice question-answer pairs annotated across 64 long-form videos, each 15 minutes long. The questions target details of person movement, interaction, and object manipulation, including compositional actions. According to the authors, existing human-centric benchmarks do not combine long-form video, dense QA coverage, and frame-level spatial and temporal grounding at scale, and FineBench is intended to fill that gap. The study finds that VLMs often struggle with the fine-grained understanding that real-world human actions and interactions require.
Key facts
- FineBench is a human-centric VQA benchmark for fine-grained understanding.
- It includes 199,420 multiple-choice QA pairs.
- The benchmark uses 64 long-form videos, each 15 minutes long.
- Annotations cover person movement, interaction, and object manipulation.
- It addresses gaps in existing benchmarks lacking long-form videos and dense QA.
- VLMs currently struggle with fine-grained human activity comprehension.
- The questions also cover compositional actions.
- The study was published on arXiv with ID 2605.19846.
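To put the "dense QA" claim in concrete terms, the reported figures can be turned into a rough annotation density. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above; it assumes every video is exactly 15 minutes long, and the function name `qa_density` is made up for illustration rather than taken from the benchmark.

```python
# Figures as reported in the summary above.
QA_PAIRS = 199_420
NUM_VIDEOS = 64
MINUTES_PER_VIDEO = 15  # assumption: every video is exactly 15 minutes long


def qa_density(qa_pairs, num_videos, minutes_per_video):
    """Return rough annotation-density figures for a video QA benchmark."""
    total_minutes = num_videos * minutes_per_video
    return {
        "total_hours": total_minutes / 60,
        "pairs_per_video": qa_pairs / num_videos,
        "pairs_per_minute": qa_pairs / total_minutes,
    }


stats = qa_density(QA_PAIRS, NUM_VIDEOS, MINUTES_PER_VIDEO)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
# total_hours: 16.0
# pairs_per_video: 3115.9
# pairs_per_minute: 207.7
```

Under that assumption, the benchmark covers about 16 hours of video, with roughly 3,100 questions per video and a little over 200 per minute of footage.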
Entities
Institutions
- arXiv