ARTFEED — Contemporary Art Intelligence

FineBench Benchmark Tests VLMs on Fine-Grained Human Activity

ai-technology · 2026-05-20

Researchers have developed FineBench, a human-centric video question answering benchmark for evaluating how well Vision-Language Models (VLMs) understand fine-grained human activity. The benchmark contains 199,420 multiple-choice question-answer pairs annotated across 64 long-form videos, each 15 minutes long. Its questions target the details of person movement, interaction, and object manipulation, including compositional actions. FineBench addresses a gap in existing human-centric benchmarks, which do not combine long-form videos, dense QA coverage, and frame-level spatial and temporal grounding at scale. The research indicates that VLMs often struggle with the nuanced understanding that real-world human actions and interactions require.

Key facts

  • FineBench is a human-centric VQA benchmark for fine-grained understanding.
  • It includes 199,420 multiple-choice QA pairs.
  • The benchmark uses 64 long-form videos, each 15 minutes long.
  • Annotations cover person movement, interaction, and object manipulation.
  • It addresses a gap in existing benchmarks, which do not combine long-form videos, dense QA, and frame-level spatial/temporal grounding.
  • VLMs currently struggle with fine-grained human activity comprehension.
  • The benchmark includes compositional actions.
  • The study was published on arXiv with ID 2605.19846.
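
To give a sense of how dense the annotation is, the reported figures can be turned into average QA-pair rates. The short Python sketch below is illustrative arithmetic only, not part of the FineBench release: the function name `annotation_density` is hypothetical, and it assumes every video runs exactly 15 minutes. The results are averages and say nothing about how questions are actually distributed within a video.

```python
# Figures reported in the article summary.
TOTAL_QA_PAIRS = 199_420
NUM_VIDEOS = 64
MINUTES_PER_VIDEO = 15  # assumption: every video runs exactly 15 minutes


def annotation_density(total_qa: int, num_videos: int, minutes_per_video: float) -> dict:
    """Return total footage and average QA-pair counts per video, minute, and second."""
    total_minutes = num_videos * minutes_per_video
    return {
        "total_hours": total_minutes / 60,
        "qa_per_video": total_qa / num_videos,
        "qa_per_minute": total_qa / total_minutes,
        "qa_per_second": total_qa / (total_minutes * 60),
    }


density = annotation_density(TOTAL_QA_PAIRS, NUM_VIDEOS, MINUTES_PER_VIDEO)
for name, value in density.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the benchmark covers 16 hours of footage, averaging roughly 3,116 QA pairs per video and about 208 per minute of video.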

Entities

Institutions

  • arXiv
