EgoBench: New Benchmark Tests AI Agents in Real-World Tool Use
Researchers have introduced EgoBench, which they describe as the first interactive multimodal benchmark for evaluating tool-using AI agents in open, real-world environments. According to the authors, existing benchmarks fail to jointly assess multimodal perception, tool invocation with multi-hop reasoning, and dynamic user interaction, because tasks that couple several capabilities are hard to design and realistic user feedback is hard to simulate. EgoBench aims to close this gap with a strictly coupled evaluation framework. It comprises 1,045 tasks grounded in egocentric video across four daily scenarios, together with an interactive environment that connects user, agent, and tools. A three-stage synergistic pipeline is used to ensure that each task requires visual perception and tool-augmented multi-hop reasoning together. A multi-agent simulated user supplies natural, task-constrained feedback, which enables objective evaluation of dynamic interaction. The work is detailed in a paper on arXiv (2605.27820).
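To make the idea of a user-agent-tool loop concrete, here is a minimal toy sketch in Python. It is not EgoBench's code or API: every name in it (Task, Episode, lookup_price, simulated_user, toy_agent, run_episode) is a hypothetical stand-in, the "video" is reduced to a dictionary of already-perceived facts, and the simulated user is a single rule rather than a multi-agent system. It only illustrates how a task can require both perception and a tool call before the user's goal is met.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical; none come from the EgoBench paper or code.


@dataclass
class Task:
    video_facts: Dict[str, str]  # stand-in for what an agent perceives in the video
    goal: str                    # what the simulated user asks for
    expected_answer: str         # ground truth used for objective scoring


@dataclass
class Episode:
    transcript: List[str] = field(default_factory=list)
    tool_calls: int = 0


def lookup_price(item: str) -> str:
    """Toy tool: information that cannot be read from the video alone."""
    prices = {"oat milk": "3.50", "rice": "2.00"}
    return prices.get(item, "unknown")


def simulated_user(task: Task, agent_message: str) -> str:
    """Toy user: ends the episode on a correct answer, otherwise restates the goal."""
    if task.expected_answer in agent_message:
        return "DONE"
    return f"That does not answer my question. {task.goal}"


def toy_agent(task: Task, history: List[str], episode: Episode) -> str:
    """Toy agent: one perception hop followed by one tool hop."""
    item = task.video_facts["item_in_hand"]  # hop 1: visual perception (mocked)
    episode.tool_calls += 1
    price = lookup_price(item)               # hop 2: tool invocation
    return f"The {item} costs {price}."


def run_episode(
    task: Task,
    agent: Callable[[Task, List[str], Episode], str],
    max_turns: int = 3,
) -> Tuple[bool, Episode]:
    """Alternate user and agent turns until the user is satisfied or turns run out."""
    episode = Episode()
    user_message = task.goal
    for _ in range(max_turns):
        episode.transcript.append(f"user: {user_message}")
        agent_message = agent(task, episode.transcript, episode)
        episode.transcript.append(f"agent: {agent_message}")
        user_message = simulated_user(task, agent_message)
        if user_message == "DONE":
            return True, episode
    return False, episode


demo_task = Task(
    video_facts={"item_in_hand": "oat milk"},
    goal="How much does the item I am holding cost?",
    expected_answer="3.50",
)
success, demo_episode = run_episode(demo_task, toy_agent)
print(success, demo_episode.tool_calls)
```

In this toy setup the answer depends on both the perceived item and the tool result, so an agent that skips either step cannot satisfy the user. That is the sense in which the summary calls the tasks "coupled".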
Key facts
- EgoBench is the first interactive multimodal benchmark for tool-using agents
- Comprises 1,045 egocentric-video-grounded tasks
- Covers four daily scenarios
- Includes a user-agent-tool interactive environment
- Uses a three-stage synergistic pipeline for task design
- Employs a multi-agent simulated user for feedback
- Evaluates multimodal perception, tool invocation, and dynamic interaction
- Paper available on arXiv with ID 2605.27820
Entities
Institutions
- arXiv