EgoCoT-Bench: New Benchmark for Egocentric Video Reasoning in MLLMs

other · 2026-05-20

Researchers have introduced EgoCoT-Bench, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on egocentric video understanding. The benchmark focuses on fine-grained, operation-centric reasoning: models must recognize hand-object interactions, track object state changes, and reason about manipulation processes from a first-person perspective. Existing benchmarks offer little evaluation of grounded rationales, and EgoCoT-Bench addresses this gap by providing explicit step-by-step rationale annotations. It comprises 3,172 verifiable QA pairs over 351 egocentric videos, with the benchmark organized into four task groups. The work is detailed in a paper on arXiv (2605.19559).
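To make the described structure concrete, here is a minimal Python sketch of what one QA record with a step-by-step rationale could look like, and how answers might be scored per task group. Everything in it is an assumption for illustration: the field names, the group labels (group_a, group_b), the sample questions, and the exact-match scoring rule are hypothetical and do not come from the paper, which does not specify its schema or metric here. Only the two counts (3,172 QA pairs, 351 videos) are taken from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QAItem:
    """One hypothetical benchmark record; field names are illustrative only."""
    video_id: str
    task_group: str  # placeholder label; the real group names are not given in the source
    question: str
    answer: str      # verifiable reference answer
    rationale_steps: List[str] = field(default_factory=list)  # step-by-step rationale


def accuracy_by_group(items: List[QAItem], predictions: Dict[int, str]) -> Dict[str, float]:
    """Case-insensitive exact-match accuracy per task group (an assumed metric)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for index, item in enumerate(items):
        total[item.task_group] += 1
        predicted = predictions.get(index, "")
        if predicted.strip().lower() == item.answer.strip().lower():
            correct[item.task_group] += 1
    return {group: correct[group] / total[group] for group in total}


# Made-up sample records, not taken from the benchmark.
items = [
    QAItem(
        video_id="vid_001",
        task_group="group_a",
        question="What is the state of the jar after the action?",
        answer="open",
        rationale_steps=[
            "The hand grips the lid.",
            "The lid is twisted off.",
            "The jar is open.",
        ],
    ),
    QAItem(
        video_id="vid_001",
        task_group="group_b",
        question="Which hand picks up the knife?",
        answer="right",
    ),
]

scores = accuracy_by_group(items, {0: "Open", 1: "left"})
print(scores)

# Average density implied by the reported counts: 3,172 QA pairs over 351 videos.
PAIRS_PER_VIDEO = 3172 / 351
print(round(PAIRS_PER_VIDEO, 2))
```

The reported counts work out to roughly nine QA pairs per video on average. Scoring the rationale steps themselves would need a separate method, which this sketch does not attempt.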

Key facts

EgoCoT-Bench is a new benchmark for egocentric video understanding.
It targets operation-centric reasoning in MLLMs.
It includes 3,172 verifiable QA pairs over 351 egocentric videos.
The benchmark is organized into four task groups.
Provides explicit step-by-step rationale annotations.
Addresses limited grounded rationale evaluation in existing benchmarks.
Focuses on fine-grained hand-object interactions and object state changes.
Published on arXiv with ID 2605.19559.
