VLAs-as-Tools: A New Strategy for Long-Horizon Robot Tasks

ai-technology · 2026-05-14

Researchers have introduced VLAs-as-Tools, a framework that pairs a high-level vision-language model (VLM) agent, responsible for temporal reasoning, with specialized vision-language-action (VLA) tools for localized tasks. The VLM analyzes the scene, plans globally, and manages recovery, while each VLA tool executes a bounded subtask. A VLA tool-family interface supports event-triggered replanning, so the agent does not have to be polled continuously. Tool-Aligned Post-Training is meant to ensure that the VLA tools follow agent invocations. The approach targets two difficulties of long-horizon tasks: sustained closed-loop planning and a wide variety of physical operations.
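To make the event-triggered idea concrete, here is a minimal Python sketch of the control flow described above. It is not the paper's implementation: the names Event, make_tool, and Agent, the fallback recovery rule, and the step counts are all hypothetical, and the "tools" are plain functions standing in for VLA policies. The point it illustrates is only that the agent is consulted once per tool event rather than once per low-level control step.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class EventKind(Enum):
    DONE = auto()    # the tool finished its bounded subtask
    FAILED = auto()  # the tool hit its step budget; the agent should replan


@dataclass
class Event:
    kind: EventKind
    subtask: str
    steps_used: int


# Hypothetical tool type: runs its own closed loop and returns exactly one Event.
Tool = Callable[[str], Event]


def make_tool(steps_needed: int, max_steps: int = 10) -> Tool:
    """Build a stand-in tool that needs `steps_needed` control steps to succeed."""
    def run(subtask: str) -> Event:
        for step in range(1, max_steps + 1):
            if step >= steps_needed:
                return Event(EventKind.DONE, subtask, step)
        return Event(EventKind.FAILED, subtask, max_steps)
    return run


class Agent:
    """Stand-in for the VLM agent: it reasons only when a tool emits an event."""

    def __init__(self, tools: Dict[str, Tool], plan: List[Tuple[str, str]],
                 max_replans: int = 3) -> None:
        self.tools = tools
        self.plan = list(plan)
        self.max_replans = max_replans
        self.replans = 0
        self.calls = 0  # how many times the agent was consulted

    def on_event(self, event: Event) -> None:
        self.calls += 1
        if event.kind is EventKind.FAILED and self.replans < self.max_replans:
            # Toy recovery rule (an assumption): retry with a fallback tool.
            self.replans += 1
            self.plan.insert(0, ("fallback", event.subtask))

    def run(self) -> List[Event]:
        log: List[Event] = []
        while self.plan:
            tool_name, subtask = self.plan.pop(0)
            event = self.tools[tool_name](subtask)  # tool runs without the agent
            log.append(event)
            self.on_event(event)                    # agent wakes up only here
        return log


tools = {
    "pick": make_tool(steps_needed=3),
    "insert": make_tool(steps_needed=50),   # exceeds the budget, so it fails
    "fallback": make_tool(steps_needed=2),
}
agent = Agent(tools, [("pick", "pick up peg"), ("insert", "insert peg")])
log = agent.run()
total_steps = sum(e.steps_used for e in log)
print(f"agent consulted {agent.calls} times over {total_steps} control steps")
```

In this toy run the agent is consulted three times while the tools take fifteen control steps in total, which is the kind of saving that event-triggered replanning aims for compared with querying the agent at every step.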

Key facts

VLAs-as-Tools distributes planning and execution across a VLM agent and specialized VLA tools.
The VLM handles scene analysis, global planning, and recovery.
Each VLA tool executes a bounded subtask.
A VLA tool-family interface enables event-triggered replanning without continuous agent polling.
Tool-Aligned Post-Training ensures VLA tools follow agent invocations.
The approach targets long-horizon tasks with diverse physical operations.
The paper is available on arXiv with ID 2605.13119.
The arXiv announcement type is "cross", meaning the paper was announced as a cross-listing.
