ARTFEED — Contemporary Art Intelligence

IntentVLM: AI Framework for Human Intention Recognition in Robotics

ai-technology · 2026-04-29

Researchers have developed IntentVLM, a two-stage video-language framework that enhances human intention recognition in open-vocabulary settings, improving human-robot interaction. Inspired by forward-inverse modeling in cognitive science, the system decomposes intention understanding into two parts: generating goal candidates, then performing structured inference by selecting among them, which reduces hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM reaches up to 80% accuracy, a 30% improvement over the baseline and comparable to human performance. The framework combines multimodal signals, including text and visual information, to interpret user intent. The research is documented in a paper available on arXiv under ID 2604.24002.
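The generate-then-select decomposition described above can be illustrated with a minimal sketch. The function names and the toy keyword-based scoring below are hypothetical stand-ins, not the paper's actual models; the point is the structure: stage one proposes explicit goal candidates, and stage two selects among them rather than producing free-form text, which constrains what the system can "hallucinate".

```python
# Hypothetical sketch of a two-stage generate-then-select intention pipeline.
# Stage 1 proposes open-vocabulary goal candidates for an observation;
# stage 2 scores each candidate and picks the best one, so the final answer
# is always one of the explicit candidates.

def generate_goal_candidates(observation: str) -> list[str]:
    # Placeholder for a video-language model's candidate generator.
    # Here: a trivial lookup table, for illustration only.
    proposals = {
        "person reaches for a cup": [
            "drink water", "clear the table", "wash the cup",
        ],
    }
    return proposals.get(observation, ["unknown goal"])

def score_candidate(observation: str, goal: str) -> float:
    # Placeholder scoring step standing in for structured inference
    # (e.g. likelihood of the observation given the candidate goal).
    overlap = set(observation.split()) & set(goal.split())
    return len(overlap) / max(len(goal.split()), 1)

def infer_intention(observation: str) -> str:
    candidates = generate_goal_candidates(observation)
    # Selecting among explicit candidates, rather than generating
    # free-form text, limits hallucination in the final answer.
    return max(candidates, key=lambda g: score_candidate(observation, g))

print(infer_intention("person reaches for a cup"))  # → wash the cup
```

In a real system the two placeholder functions would be backed by a video-language model; the selection stage is what makes the inference "structured" in the sense described above.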

Key facts

  • IntentVLM is a two-stage video-language framework for open-vocabulary human intention recognition.
  • The approach is inspired by forward-inverse modeling in cognitive science.
  • It decomposes intention understanding into goal candidate generation and structured inference through selection.
  • The system reduces hallucinations in latent reasoning.
  • Evaluated on IntentQA and Inst-IT Bench datasets.
  • Achieves up to 80% accuracy, surpassing baseline by 30%.
  • Matches human performance.
  • Published on arXiv with ID 2604.24002.

Entities

Institutions

  • arXiv

Sources