HELM Framework Addresses Long-Horizon Manipulation Deficiencies in Vision-Language-Action Models
A new framework called HELM tackles persistent failures of Vision-Language-Action (VLA) models on extended manipulation tasks. Despite strong performance on short sequences, these models systematically struggle with long-horizon operations. The research identifies three core execution-loop problems: a memory gap, a verification gap, and a recovery gap. HELM's model-agnostic design pairs a component with each gap. An Episodic Memory Module retrieves crucial task history using CLIP-indexed keyframes. A learned State Verifier predicts action failure before execution by jointly analyzing the current observation, the proposed action, the active subgoal, and memory-conditioned context; it consistently surpasses rule-based feasibility checks and ensemble uncertainty baselines. A Harness Controller performs rollback and replanning. The work further shows that merely extending context length does not resolve these deficiencies in reactive execution settings. The State Verifier is the framework's primary learning contribution, and its effectiveness depends on the integrated analysis of observation, action, subgoal, and memory.
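The keyframe-retrieval idea behind the Episodic Memory Module can be sketched as follows. This is a minimal illustration, not the paper's implementation: embeddings are plain vectors standing in for CLIP image/text features, and all class and payload names are hypothetical.

```python
import numpy as np

class EpisodicMemory:
    """Sketch of a keyframe memory: stores one embedding per keyframe and
    retrieves the payloads most similar to a query embedding.
    A real system would embed frames with CLIP; here vectors are given directly."""

    def __init__(self):
        self.keys = []    # keyframe embeddings
        self.frames = []  # associated payloads (frame ids, subgoal notes, ...)

    def add(self, embedding, payload):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.frames.append(payload)

    def retrieve(self, query, k=2):
        """Return the k payloads whose embeddings have the highest
        cosine similarity to the query embedding."""
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        sims = [float(key @ q / np.linalg.norm(key)) for key in self.keys]
        top = np.argsort(sims)[::-1][:k]
        return [self.frames[i] for i in top]

mem = EpisodicMemory()
mem.add([1.0, 0.0], "opened drawer")
mem.add([0.0, 1.0], "grasped mug")
mem.add([0.9, 0.1], "placed spoon in drawer")
print(mem.retrieve([1.0, 0.05], k=2))  # drawer-related history ranks first
```

Retrieval by cosine similarity over a small index is the standard pattern for CLIP-style lookups; the framework's actual indexing and keyframe-selection policy may differ.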
Key facts
- HELM is a model-agnostic framework for Vision-Language-Action models.
- It addresses three execution-loop deficiencies: memory gap, verification gap, recovery gap.
- The framework includes an Episodic Memory Module using CLIP-indexed keyframes.
- A learned State Verifier predicts action failure before execution.
- The State Verifier outperforms rule-based feasibility checks and ensemble uncertainty baselines.
- A Harness Controller performs rollback and replanning.
- VLA models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance.
- Extending context length alone does not resolve failures in reactive execution settings.
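The verify-then-act loop implied by the facts above can be sketched as a minimal controller: the State Verifier scores a proposed action before execution, and on predicted failure the Harness Controller rolls back to the last verified state and requests a replan. All names and signatures here are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class HarnessController:
    """Sketch of a rollback-and-replan harness (hypothetical interface).
    verifier(obs, action, subgoal, memory) returns a predicted
    failure probability; actions at or above `threshold` are blocked."""
    verifier: object
    threshold: float = 0.5
    checkpoints: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def step(self, obs, action, subgoal, memory, execute, replan):
        p_fail = self.verifier(obs, action, subgoal, memory)
        if p_fail >= self.threshold:
            # Predicted failure: roll back to the last verified state
            # (or the current observation if none) and replan.
            self.log.append(("rollback", action))
            state = self.checkpoints[-1] if self.checkpoints else obs
            return replan(state, subgoal)
        # Verified: checkpoint the state and execute the action.
        self.checkpoints.append(obs)
        self.log.append(("execute", action))
        return execute(action)

# Toy verifier: flags one action as likely to fail.
ctrl = HarnessController(verifier=lambda o, a, g, m: 0.9 if a == "pour" else 0.1)
out = ctrl.step(obs="at_table", action="pour", subgoal="fill cup", memory=[],
                execute=lambda a: f"did {a}",
                replan=lambda s, g: f"replanned from {s}")
print(out)
```

The design choice shown here matches the paper's framing: failure prediction is conditioned on observation, action, subgoal, and memory jointly, and the controller, not the policy, owns rollback and replanning.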
Source
- arXiv (preprint)