VLAs-as-Tools: A New Strategy for Long-Horizon Robot Tasks
Researchers have introduced VLAs-as-Tools, a framework that pairs a high-level vision-language model (VLM) agent, responsible for temporal reasoning, with specialized vision-language-action (VLA) tools for localized execution. The VLM analyzes the scene, plans globally, and manages recovery, while each VLA tool carries out a bounded subtask. A VLA tool-family interface supports event-triggered replanning, so the agent does not need to be polled continuously. Tool-Aligned Post-Training is intended to make the VLA tools follow the agent's invocations faithfully. Together, these components target two challenges of long-horizon tasks: sustained closed-loop planning and a wide variety of physical operations.
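To make the division of labor concrete, the following is a minimal Python sketch of event-triggered replanning. It is not the paper's implementation. Every name in it (Event, ToolCall, make_stub_tool, run_agent) is hypothetical, the VLA tools are replaced by stub functions, and the "recovery" step is a simple retry standing in for whatever the VLM agent would actually decide. The point it illustrates is that the agent is consulted only when a tool returns an event, not on every control step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Event(Enum):
    """Hypothetical events a tool reports back to the agent."""
    DONE = "done"      # bounded subtask finished; agent moves on
    FAILED = "failed"  # subtask failed; agent must plan a recovery


@dataclass
class ToolCall:
    tool: str          # name of the tool to invoke
    instruction: str   # description of the bounded subtask


# A tool runs its own low-level loop and returns only when an event occurs.
Tool = Callable[[str, dict], Event]


def make_stub_tool(fail_first: bool) -> Tool:
    """Stand-in for a VLA policy; fails on its first call if fail_first is set."""
    state = {"may_fail": fail_first}

    def run(instruction: str, world: dict) -> Event:
        if state["may_fail"]:
            state["may_fail"] = False
            return Event.FAILED
        world.setdefault("completed", []).append(instruction)
        return Event.DONE

    return run


def run_agent(plan: List[ToolCall], tools: Dict[str, Tool],
              world: dict, max_retries: int = 2) -> List[str]:
    """Toy agent loop: it acts only when a tool returns an event."""
    log: List[str] = []
    queue = list(plan)
    retries = 0
    while queue:
        call = queue[0]
        event = tools[call.tool](call.instruction, world)  # blocks until an event
        log.append(f"{call.tool}:{event.value}")
        if event is Event.DONE:
            queue.pop(0)       # advance to the next subtask
            retries = 0
        elif retries < max_retries:
            retries += 1       # toy recovery: retry the same subtask
        else:
            log.append("abort")
            break
    return log
```

Under these assumptions, a plan of two calls, "pick" followed by "place", in which the pick tool fails once, causes the agent to be consulted three times in total (one failure, two completions), regardless of how many low-level control steps each tool takes internally.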
Key facts
- VLAs-as-Tools distributes planning and execution across a VLM agent and specialized VLA tools.
- The VLM handles scene analysis, global planning, and recovery.
- Each VLA tool executes a bounded subtask.
- A VLA tool-family interface enables event-triggered replanning without continuous agent polling.
- Tool-Aligned Post-Training ensures VLA tools follow agent invocations.
- The approach targets long-horizon tasks with diverse physical operations.
- The paper is available on arXiv with ID 2605.13119.
- The arXiv announcement type is "cross" (a cross-listing).
Entities
Institutions
- arXiv