MM-ToolBench: Benchmark for Omni-Modal Tool-Using Agents
MM-ToolBench is a benchmark and evaluation framework for assessing task-oriented omni-modal tool use. It comprises 100 executable tasks across two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and backed by 27 MCP servers that provide 324 tools. The benchmark aims to bridge the gap between existing evaluations, which treat tool use, computer use, and multimodal reasoning in isolation, and the integrated omni-modal tool use that real-world scenarios demand. Its central design element is closed-loop multimodal verification: agents must invoke tools, inspect the artifacts they generate or modify, and correct them when the results are inadequate. The work is described in arXiv preprint 2605.16909.
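The closed-loop verification idea can be illustrated with a minimal Python sketch. It is not taken from MM-ToolBench itself: the names `Verdict`, `run_closed_loop`, `act`, `inspect_artifact`, and `max_rounds` are all hypothetical, and the toy functions stand in for real tool calls and multimodal inspection.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Result of inspecting an artifact (hypothetical structure)."""
    ok: bool
    feedback: str = ""


def run_closed_loop(
    act: Callable[[Optional[str]], str],
    inspect_artifact: Callable[[str], Verdict],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Act, inspect the artifact, and retry with feedback until it passes.

    Returns the accepted artifact (or None if no round passed) together
    with every artifact produced along the way.
    """
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_rounds):
        artifact = act(feedback)              # use tools to produce or modify an artifact
        history.append(artifact)
        verdict = inspect_artifact(artifact)  # evaluate the artifact itself
        if verdict.ok:
            return artifact, history
        feedback = verdict.feedback           # feed the critique into the next attempt
    return None, history


# Toy stand-ins (assumptions for illustration only).
def toy_act(feedback: Optional[str]) -> str:
    return "final" if feedback else "draft"


def toy_inspect(artifact: str) -> Verdict:
    if artifact == "final":
        return Verdict(ok=True)
    return Verdict(ok=False, feedback="needs revision")


accepted, attempts = run_closed_loop(toy_act, toy_inspect)
```

In this toy run the first attempt is rejected and the second, produced with the inspector's feedback, is accepted. In practice the inspection step would examine an image, audio clip, or document rather than a string, and the round budget is a design choice rather than something specified in the source.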
Key facts
- MM-ToolBench contains 100 executable tasks.
- Tasks are drawn from two macro task families: Customer Service and Intelligent Creation.
- 20 subcategory slices are covered.
- 27 MCP servers with 324 tools support the benchmark.
- Closed-loop multimodal verification is the core design element.
- Agents must self-correct based on artifact inspection.
- Existing benchmarks evaluate tool use, computer use, and multimodal reasoning in isolation.
- The benchmark aims to bridge the gap to real-world omni-modal tool use.
Entities
Institutions
- arXiv