SIV-Bench: New Video Benchmark Tests MLLMs on Social Interaction

ai-technology · 2026-05-01

A new benchmark, SIV-Bench, has been introduced to assess how well Multimodal Large Language Models (MLLMs) analyze social interaction in video. Grounded in social relation theory, it evaluates three skills: understanding social scenes, reasoning about social states, and predicting social dynamics. The benchmark comprises 2,792 unique video clips and 5,455 question-answer pairs. The research, which seeks to fill the gap in rigorous benchmarks for social interaction, is available on arXiv (ID 2506.05425v3). The goal is to improve how machines interpret human behavior and to support better human-machine interaction.

Key facts

SIV-Bench is a video benchmark for social interaction understanding.
It evaluates MLLMs on Social Scene Understanding, Social State Reasoning, and Social Dynamics Prediction.
The benchmark is based on social relation theory.
It includes 2,792 video clips and 5,455 question-answer pairs.
The paper is available on arXiv with ID 2506.05425v3.
The benchmark aims to fill a gap in evaluating MLLMs' social interaction abilities.
Social interaction involves multimodal cues, mental states, and behavior prediction.
The work supports advancements in human-machine interaction.
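
To make the three-part evaluation concrete, here is a minimal sketch of how per-dimension accuracy could be tallied for a benchmark organized this way. The record layout (`QAItem`), the function name, and the exact-match scoring are assumptions for illustration only; the article does not describe SIV-Bench's actual data schema or scoring protocol.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three dimensions named in the article.
DIMENSIONS = (
    "Social Scene Understanding",
    "Social State Reasoning",
    "Social Dynamics Prediction",
)


@dataclass
class QAItem:
    # Hypothetical record layout, not SIV-Bench's real schema.
    clip_id: str
    dimension: str
    answer: str


def accuracy_by_dimension(items, predictions):
    """Return {dimension: fraction correct}, pairing predictions with items by index."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.dimension] += 1
        # Assumed scoring rule: exact match against the reference answer.
        if predicted == item.answer:
            correct[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Figures reported in the article: average question-answer pairs per clip.
pairs_per_clip = 5455 / 2792  # roughly 1.95
```

Reporting accuracy separately for each dimension, rather than as one pooled number, shows whether a model that describes scenes well also handles state reasoning and dynamics prediction.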
