EgoCoT-Bench: New Benchmark for Egocentric Video Reasoning in MLLMs
Researchers have introduced EgoCoT-Bench, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on egocentric video understanding. The benchmark focuses on fine-grained, operation-centric reasoning: models must recognize hand-object interactions, track object state changes, and reason about manipulation processes from a first-person perspective. Existing benchmarks offer little evaluation of grounded rationales, and EgoCoT-Bench addresses this gap by providing explicit step-by-step rationale annotations. It comprises 3,172 verifiable QA pairs over 351 egocentric videos and is organized into four task groups. The work is detailed in a paper on arXiv (2605.19559).
Key facts
- EgoCoT-Bench is a new benchmark for egocentric video understanding.
- It targets operation-centric reasoning in MLLMs.
- Includes 3,172 verifiable QA pairs over 351 egocentric videos.
- The benchmark is organized into four task groups.
- Provides explicit step-by-step rationale annotations.
- Addresses limited grounded rationale evaluation in existing benchmarks.
- Focuses on fine-grained hand-object interactions and object state changes.
- Published on arXiv with ID 2605.19559.
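To make the facts above concrete, here is a minimal Python sketch of how one QA pair with a step-by-step rationale might be represented, along with the average number of QA pairs per video implied by the reported totals. The class names, field names, and the exact-match check are assumptions made for illustration only; the summary does not describe the benchmark's actual file format, task-group names, or scoring method.

```python
from dataclasses import dataclass, field
from typing import List

# Totals reported in the summary.
NUM_QA_PAIRS = 3172
NUM_VIDEOS = 351


@dataclass
class RationaleStep:
    """One step of a step-by-step rationale (hypothetical layout)."""
    index: int
    text: str


@dataclass
class QARecord:
    """A single QA pair with its rationale (all field names are assumptions)."""
    video_id: str
    task_group: str
    question: str
    answer: str
    rationale: List[RationaleStep] = field(default_factory=list)


def average_qa_per_video(num_qa: int, num_videos: int) -> float:
    """Mean number of QA pairs per video."""
    return num_qa / num_videos


def answer_matches(record: QARecord, predicted: str) -> bool:
    """Case-insensitive exact match; an assumed stand-in, not the paper's metric."""
    return record.answer.strip().lower() == predicted.strip().lower()


if __name__ == "__main__":
    # Placeholder record; the values are invented purely to show the structure.
    example = QARecord(
        video_id="video_000",
        task_group="group_1",
        question="What state is the object in after the action?",
        answer="open",
        rationale=[
            RationaleStep(1, "The hand grasps the lid."),
            RationaleStep(2, "The lid is lifted away from the container."),
        ],
    )
    print(answer_matches(example, "Open"))
    print(round(average_qa_per_video(NUM_QA_PAIRS, NUM_VIDEOS), 2))
```

With the reported totals, the benchmark averages roughly nine QA pairs per video (3,172 / 351 ≈ 9.04).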
Entities
Institutions
- arXiv