ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
A new paper on arXiv introduces ReTool-Video, a framework for a video agent that uses tools recursively to improve video understanding. The approach targets two limitations of existing video agents: a coarse tool space that lacks the fine-grained operations needed for complex reasoning, and a flat action space that reduces higher-level intentions to simple tool calls. To address these problems, the authors built the MetaAug-Video Tool Library (MVTL), which contains 134 tools in total: 26 base tools for general multimodal processing and 108 meta tools for operations such as filtering, aggregation, reranking, and formatting. MVTL gives the agent finer-grained access to structured video data, with the aim of improving temporal reasoning, cross-modal understanding, and question answering. The paper is available on arXiv under the identifier 2605.13228.
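To make the distinction between base tools and meta tools concrete, the sketch below shows one possible reading of the idea. It is an illustration only and is not taken from the paper: the tool names (`detect_segments`, `filter_by_keyword`, `rerank_by_score`, `format_as_text`), the `Call` structure, and the hard-coded segment data are all hypothetical. A single base tool returns structured records, meta tools refine those records, and a nested plan is evaluated recursively rather than as a flat list of calls.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical structured record that a base tool might return.
Segment = Dict[str, Any]


def detect_segments(video_id: str) -> List[Segment]:
    """Stand-in base tool: returns hard-coded captioned segments (no real video I/O)."""
    return [
        {"start": 0.0, "end": 4.0, "caption": "a man opens a door", "score": 0.62},
        {"start": 4.0, "end": 9.5, "caption": "a dog runs across the yard", "score": 0.91},
        {"start": 9.5, "end": 12.0, "caption": "the dog catches a ball", "score": 0.78},
    ]


def filter_by_keyword(segments: List[Segment], keyword: str) -> List[Segment]:
    """Hypothetical meta tool: keep segments whose caption mentions the keyword."""
    return [s for s in segments if keyword in s["caption"]]


def rerank_by_score(segments: List[Segment]) -> List[Segment]:
    """Hypothetical meta tool: order segments by descending score."""
    return sorted(segments, key=lambda s: s["score"], reverse=True)


def format_as_text(segments: List[Segment]) -> str:
    """Hypothetical meta tool: render segments as one line of text each."""
    return "\n".join(
        f"[{s['start']:.1f}-{s['end']:.1f}s] {s['caption']}" for s in segments
    )


# Toy registry mapping tool names to callables (assumed, not the MVTL interface).
TOOLS: Dict[str, Callable[..., Any]] = {
    "detect_segments": detect_segments,
    "filter_by_keyword": filter_by_keyword,
    "rerank_by_score": rerank_by_score,
    "format_as_text": format_as_text,
}


@dataclass
class Call:
    """A tool invocation whose arguments may themselves be tool invocations."""
    tool: str
    args: tuple = ()


def run(node: Any) -> Any:
    """Evaluate a plan recursively: resolve nested calls first, then apply the tool."""
    if not isinstance(node, Call):
        return node  # plain argument such as a string
    resolved = [run(arg) for arg in node.args]
    return TOOLS[node.tool](*resolved)


# Nested plan: detect -> filter("dog") -> rerank -> format.
plan = Call(
    "format_as_text",
    (Call("rerank_by_score",
          (Call("filter_by_keyword",
                (Call("detect_segments", ("demo.mp4",)), "dog")),)),),
)

answer = run(plan)
print(answer)
```

In this toy setup the higher-level intention ("list the dog-related moments, best first") is expressed as one nested plan instead of being flattened into unrelated single calls. How ReTool-Video actually represents and schedules its recursive tool use is described in the paper itself.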
Key facts
- Paper introduces ReTool-Video, a recursive tool-using video agent framework.
- Addresses limitations in existing video agents: coarse tool space and flat action space.
- Proposes MetaAug-Video Tool Library (MVTL) with 134 tools.
- MVTL includes 26 base tools and 108 meta tools.
- Tools support filtering, aggregation, reranking, formatting, and other operations.
- Framework aims to improve temporal reasoning, cross-modal understanding, and question answering.
- Paper available on arXiv with ID 2605.13228.
- Announced on arXiv as a cross-listing (announce type: cross).
Entities
Institutions
- arXiv