ARTFEED — Contemporary Art Intelligence

EgoCoT-Bench: New Benchmark for Egocentric Video Reasoning in MLLMs

other · 2026-05-20

Researchers have introduced EgoCoT-Bench, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on egocentric video understanding. The benchmark focuses on fine-grained, operation-centric reasoning, requiring models to recognize hand-object interactions, track object state changes, and reason about manipulation processes from a first-person perspective. EgoCoT-Bench addresses the lack of grounded rationale evaluation in existing benchmarks by providing explicit step-by-step rationale annotations. It comprises 3,172 verifiable QA pairs across 351 egocentric videos, organized into four task groups. The work is detailed in a paper on arXiv (2605.19559).
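To make the described structure concrete, here is a minimal Python sketch of what a single QA record with step-by-step rationale annotations might look like, along with the average number of QA pairs per video implied by the reported counts. The class name, field names, the exact-match check, and the example record are all hypothetical and chosen for illustration; the article does not describe the benchmark's actual schema or scoring method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAItem:
    """Hypothetical record layout; the real EgoCoT-Bench schema may differ."""
    video_id: str
    task_group: str  # one of the four task groups (names are not given in the article)
    question: str
    answer: str
    rationale_steps: List[str] = field(default_factory=list)


def exact_match(item: QAItem, predicted: str) -> bool:
    """Illustrative check only: case-insensitive exact match (an assumption)."""
    return item.answer.strip().lower() == predicted.strip().lower()


# Counts reported in the article.
NUM_QA_PAIRS = 3172
NUM_VIDEOS = 351
avg_qa_per_video = NUM_QA_PAIRS / NUM_VIDEOS

# Made-up example record, not taken from the benchmark.
example = QAItem(
    video_id="demo_video",
    task_group="demo_group",
    question="What state is the cup in after the action?",
    answer="Filled",
    rationale_steps=[
        "The hand lifts the kettle.",
        "Water is poured into the cup.",
    ],
)

print(f"Average QA pairs per video: {avg_qa_per_video:.2f}")
print("Prediction correct:", exact_match(example, " filled "))
```

Under the reported counts, the benchmark averages roughly nine QA pairs per video.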

Key facts

  • EgoCoT-Bench is a new benchmark for egocentric video understanding.
  • It targets operation-centric reasoning in MLLMs.
  • Includes 3,172 verifiable QA pairs over 351 videos.
  • The benchmark is organized into four task groups.
  • Provides explicit step-by-step rationale annotations.
  • Addresses limited grounded rationale evaluation in existing benchmarks.
  • Focuses on fine-grained hand-object interactions and object state changes.
  • Published on arXiv with ID 2605.19559.

Entities

Institutions

  • arXiv
