GRASP Dataset Enhances Social Reasoning in AI via Gaze and Gesture
A team of researchers has released GRASP, a large-scale dataset intended to improve how multimodal large language models (MLLMs) reason about social interactions in multi-person videos. The dataset comprises 290,000 question-answer pairs drawn from 46,000 videos that together span 749 hours. It is organized under a 16-category taxonomy covering gaze, gesture, and joint gaze-gesture reasoning. Whereas prior resources focused on either isolated cues or high-level social QA, GRASP grounds its questions in identity-consistent gaze trajectories and deictic gestures, and in how the two combine in social contexts. The researchers also propose the Social Grounding Reward (SGR), a learning signal meant to encourage models to reason about the participants in these interactions. The paper is available on arXiv under the identifier 2605.15764.
Key facts
- The GRASP dataset contains 290K question-answer pairs over 46K videos totaling 749 hours.
- It uses a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning.
- GRASP focuses on identity-consistent gaze trajectories and deictic gestures.
- Social Grounding Reward (SGR) is a proposed learning signal.
- The dataset aims to improve MLLMs' social reasoning in multi-person videos.
- Published on arXiv with identifier 2605.15764.
- Prior resources focused on either isolated cues or high-level social QA.
- GRASP connects high-level social QA with fine-grained gaze and gesture events.
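
To put the headline figures in perspective, the short Python sketch below derives a few averages from them. It assumes the reported counts (290K, 46K, 749 hours) are rounded, so the results are rough estimates rather than statistics reported by the authors; the variable names are illustrative only.

```python
# Headline figures as reported in the summary (rounded values, so every
# derived number below is an approximation, not an official statistic).
QA_PAIRS = 290_000
VIDEOS = 46_000
TOTAL_HOURS = 749

# Average number of question-answer pairs attached to each video.
qa_per_video = QA_PAIRS / VIDEOS

# Average video length in seconds (total duration spread over all videos).
avg_video_seconds = TOTAL_HOURS * 3600 / VIDEOS

# Average number of question-answer pairs per hour of footage.
qa_per_hour = QA_PAIRS / TOTAL_HOURS

print(f"QA pairs per video: {qa_per_video:.1f}")
print(f"Average video length: {avg_video_seconds:.1f} s")
print(f"QA pairs per hour of video: {qa_per_hour:.0f}")
```

On these rounded inputs, the dataset works out to roughly six question-answer pairs per video, with videos averaging a little under one minute each.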
Entities
Institutions
- arXiv