VIGIL Framework Decouples Task Completion from Self-Termination in Embodied AI

ai-technology · 2026-05-12

A new evaluation framework, VIGIL (Verification of Goal-completion In Lifelong agents), isolates an embodied agent's ability to stop appropriately once a task is done, referred to as 'terminal commitment.' Standard benchmarks conflate three distinct failures: not completing the task, completing it but failing to stop, and declaring success without adequate evidence. VIGIL separates them by restricting agents to egocentric RGB observations with no action-success feedback and by requiring each episode to end with a semantic report that is checked against a hidden world state. Each episode yields two scores, world-state completion (W) and benchmark success (B), where B requires a correct terminal report. Together the two scores distinguish four outcomes: missed execution, post-attainment drift, unsupported commitment, and verified success. The framework is described in an arXiv preprint (2605.08747v1).

Key facts

VIGIL stands for Verification of Goal-completion In Lifelong agents.
It measures terminal commitment independently of world-state completion.
Standard evaluations collapse three distinct failure types into one benchmark failure.
Agents observe only egocentric RGB and receive no action-success signals.
Episodes end with a semantic report checked against hidden world state.
Two scores are produced: world-state completion (W) and benchmark success (B).
Four outcome categories are distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success.
The framework was published on arXiv under ID 2605.08747v1.
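
The relationship between the two scores and the four outcome categories can be sketched in code. This is a minimal illustration under stated assumptions, not the framework's actual scoring procedure: the Episode record, its three boolean fields, and the exact decision rules are hypothetical, chosen as one plausible reading of the category names. In particular, it assumes W is 1 when the hidden world state satisfies the goal, and B is 1 only when W is 1 and the agent also ends the episode with a correct report.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MISSED_EXECUTION = "missed execution"
    POST_ATTAINMENT_DRIFT = "post-attainment drift"
    UNSUPPORTED_COMMITMENT = "unsupported commitment"
    VERIFIED_SUCCESS = "verified success"


@dataclass(frozen=True)
class Episode:
    # Hypothetical record; the source does not define an episode format.
    goal_attained: bool   # hidden world state satisfies the goal
    committed: bool       # agent ended the episode with a report claiming completion
    report_correct: bool  # that report matches the hidden world state


def scores(ep: Episode) -> tuple[int, int]:
    """Return (W, B) under the assumptions stated above."""
    w = int(ep.goal_attained)
    b = int(ep.goal_attained and ep.committed and ep.report_correct)
    return w, b


def classify(ep: Episode) -> Outcome:
    """Map an episode to one of the four outcome categories (assumed rules)."""
    w, b = scores(ep)
    if b:
        return Outcome.VERIFIED_SUCCESS
    if ep.committed:
        # The agent claimed completion, but the claim did not hold up.
        return Outcome.UNSUPPORTED_COMMITMENT
    if w:
        # The goal was reached, but the agent never committed to stopping.
        return Outcome.POST_ATTAINMENT_DRIFT
    return Outcome.MISSED_EXECUTION
```

Under these assumed rules, an episode with goal_attained=True and committed=False scores W = 1 and B = 0 and is classified as post-attainment drift, which a single pass/fail benchmark would report only as a failure.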
