IntentVLM: AI Framework for Human Intention Recognition in Robotics
Researchers have developed IntentVLM, a two-stage video-language framework that improves human intention recognition in open-vocabulary settings, strengthening interactions between humans and robots. Inspired by forward-inverse modeling in cognitive science, the system decomposes intention understanding into two stages: generating goal candidates, then performing structured inference through selection, which reduces hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM reaches up to 80% accuracy, surpassing baseline performance by 30% and approaching human-level results. The framework combines multimodal signals, including text and visual information, to interpret user intent accurately. The research is documented in a paper available on arXiv under ID 2604.24002.
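The two-stage decomposition described above can be sketched in miniature. The function names, templates, and scoring heuristic below are illustrative assumptions, not the paper's implementation: a real system would query a video-language model in both stages, whereas this toy version stubs them out to show the control flow of generate-then-select.

```python
# Hypothetical sketch of a two-stage intention-recognition pipeline in the
# spirit of IntentVLM. All names and heuristics are assumptions for
# illustration only.

def generate_goal_candidates(observed_objects):
    """Stage 1: propose an open-vocabulary set of candidate goals.
    A real system would prompt a video-language model on the video;
    here we expand simple templates over observed objects."""
    templates = ["pick up the {}", "hand over the {}", "inspect the {}"]
    return [t.format(obj) for obj in observed_objects for t in templates]

def score_candidate(candidate, observed_objects):
    """Stage 2 helper: structured inference scores each explicit candidate
    rather than reasoning free-form, which constrains hallucination.
    Toy heuristic: count observed objects mentioned by the candidate."""
    words = set(candidate.split())
    return sum(1 for obj in observed_objects if obj in words)

def recognize_intention(observed_objects):
    """Full pipeline: generate candidates, then select the best-scoring one.
    Ties break toward the earliest-generated candidate."""
    candidates = generate_goal_candidates(observed_objects)
    return max(candidates, key=lambda c: score_candidate(c, observed_objects))

print(recognize_intention(["cup"]))  # → "pick up the cup"
```

The key design point mirrored here is that the selector only ever chooses among explicitly generated candidates, so the final answer cannot drift into ungrounded free-form text.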
Key facts
- IntentVLM is a two-stage video-language framework for open-vocabulary human intention recognition.
- The approach is inspired by forward-inverse modeling in cognitive science.
- It decomposes intention understanding into goal candidate generation and structured inference through selection.
- The system reduces hallucinations in latent reasoning.
- Evaluated on IntentQA and Inst-IT Bench datasets.
- Achieves up to 80% accuracy, surpassing baseline by 30%.
- Performs comparably to humans.
- Published on arXiv with ID 2604.24002.
Entities
Institutions
- arXiv