EvoScene-VLA: Action-Updated Scene Beliefs for Robot Control
EvoScene-VLA introduces a dynamic, continuously updated scene state designed for chunked robot control. Standard VLA policies predict each multi-step action chunk from the current visual observation alone. However, the robot's own actions cause contact, occlusion, and object motion, so the geometry that later decisions depend on changes before the next visual refresh. Spatial VLAs improve current-frame geometry and temporal VLAs aggregate information from past frames, but neither maintains an action-updated scene state across chunks. EvoScene-VLA addresses this with a recurrent scene prefix that carries a geometry-aware scene state across control calls. At each VLM call, the model combines scene information from the latest observation with the action-updated state produced by the previous chunk. The action decoder then outputs both the next action chunk and a compact scene update, which becomes the prior that the VLM corrects against the next observation. Scene beliefs thus evolve inside the action decoder, improving long-horizon manipulation.
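The recurrence described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the EvoScene-VLA implementation: the names `fuse_scene`, `decode`, and `run_episode`, the fixed blending weight, and the use of plain float lists as scene features are all hypothetical stand-ins for the VLM call, the action decoder, and the recurrent scene prefix.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vector = List[float]


@dataclass
class ChunkOutput:
    actions: List[Vector]  # multi-step action chunk
    scene_update: Vector   # compact action-updated scene state


def fuse_scene(observation: Vector, prior: Optional[Vector],
               weight: float = 0.5) -> Vector:
    """Hypothetical stand-in for the VLM call.

    Blends features of the current observation with the action-updated
    prior carried over from the previous chunk. On the first call there
    is no prior, so the observation is used as-is.
    """
    if prior is None:
        return list(observation)
    return [(1.0 - weight) * o + weight * p
            for o, p in zip(observation, prior)]


def decode(scene: Vector, chunk_len: int = 4) -> ChunkOutput:
    """Hypothetical stand-in for the action decoder.

    Emits a toy action chunk plus a scene update that reflects the
    summed effect of those actions on the scene state.
    """
    actions = [[-0.1 * s for s in scene] for _ in range(chunk_len)]
    total = [sum(a[i] for a in actions) for i in range(len(scene))]
    scene_update = [s + t for s, t in zip(scene, total)]
    return ChunkOutput(actions=actions, scene_update=scene_update)


def run_episode(observations: List[Vector],
                chunk_len: int = 4) -> Tuple[List[List[Vector]], Optional[Vector]]:
    """Run one control call per observation, carrying the scene prior forward."""
    prior: Optional[Vector] = None
    chunks: List[List[Vector]] = []
    for obs in observations:
        scene = fuse_scene(obs, prior)   # correct the prior with the new observation
        out = decode(scene, chunk_len)   # next action chunk + compact scene update
        chunks.append(out.actions)
        prior = out.scene_update         # becomes the prior for the next call
    return chunks, prior


if __name__ == "__main__":
    chunks, final_prior = run_episode([[1.0, 2.0], [1.0, 2.0]], chunk_len=2)
    print(len(chunks), final_prior)
```

The point of the sketch is the data flow: the scene update emitted alongside each action chunk is the only state that crosses chunk boundaries, and it is always reconciled with a fresh observation before the next chunk is decoded.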
Key facts
- EvoScene-VLA introduces a persistent action-updated scene state across control chunks.
- Standard VLA policies rely only on current visual observations for each multi-step action chunk.
- Robot actions cause contact, occlusion, and object motion, changing scene geometry.
- Spatial VLAs improve current-frame geometry; temporal VLAs aggregate past frames.
- Neither spatial nor temporal VLAs maintain an action-updated scene prior across chunks.
- EvoScene-VLA uses a recurrent scene prefix to carry a geometry-aware scene state.
- At each VLM call, the model combines the current observation with the action-updated prior from the previous chunk.
- The action decoder outputs both the next action chunk and a compact scene update.
Entities
Institutions
- arXiv