MHPR Benchmark Introduced for Human-Centric LVLM Evaluation
A new evaluation framework named Multidimensional Human Perception and Reasoning (MHPR) has been introduced to assess large vision-language models (LVLMs) in human interaction scenarios. The framework addresses the need for thorough evaluation by covering single-person, multi-person, and human-object interactions. MHPR is organized into four distinct data components: Captioned Raw Data, Supervised Fine-Tuning Data, Reinforcement Learning Data, and Test Data. It also includes an automated pipeline for generating captions and visual question answering (VQA) pairs, intended to ensure high-quality annotations. The research is documented in the preprint repository arXiv under the identifier 2605.03485.
Key facts
- MHPR is a comprehensive benchmark for joint perception-reasoning over human-centric scenes.
- It spans individual, multi-person, and human-object interaction dimensions.
- The benchmark comprises four data levels: Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D).
- An automated pipeline (ACVG) generates captions and VQA data.
- ACVG uses category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting.
- The evaluation focuses on fine-grained attributes like appearance, clothing, pose, and parts.
- State-of-the-art vision-language models are assessed on these attributes.
- The research is published on arXiv with ID 2605.03485.
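The multi-model voting step in the ACVG pipeline is not detailed in this summary, but a common way to realize it is majority voting over candidate annotations produced by several judge models. The sketch below is a minimal, hypothetical illustration of that idea (the function name, the example answers, and the agreement metric are assumptions, not details from the paper):

```python
from collections import Counter

def majority_vote(candidates):
    """Return the annotation most judge models agree on, plus its support ratio.

    A hypothetical stand-in for the multi-model voting stage: each
    candidate is one model's answer for the same caption/VQA item.
    """
    counts = Counter(candidates)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(candidates)

# Hypothetical answers from three judge models for one VQA item.
answers = ["standing", "standing", "sitting"]
best, agreement = majority_vote(answers)
print(best, round(agreement, 2))  # -> standing 0.67
```

In practice, items whose agreement falls below a chosen threshold could be routed to attribute-specific rewriting or human review rather than accepted automatically.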
Entities
Venues
- arXiv (preprint repository)