EUEA Framework Enhances VLM Environmental Understanding for Embodied Agents
A new framework called Environmental Understanding Embodied Agent (EUEA) improves vision-language models (VLMs) for instruction-following embodied agents. Although VLMs show strong perception and reasoning, they often fall short on environmental understanding and depend on environment metadata. EUEA fine-tunes a VLM on four core skills: object perception, task planning, action understanding, and goal recognition. It also introduces a recovery step that uses group relative policy optimization (GRPO). The framework is intended to make task execution more reliable without environment metadata.
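The summary names GRPO without describing it. As general background (not a detail taken from the paper), GRPO scores each sampled response relative to the other responses in its group, rather than against a learned value baseline. The minimal sketch below shows only that group-relative advantage step. The function name, the `eps` constant, the choice of population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group it was sampled with.

    Assumption: population standard deviation; implementations differ on
    this choice and on the value of eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so the policy update favors the better samples within each group. How EUEA defines its rewards is not stated in this summary.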
Key facts
- EUEA fine-tunes four core skills: object perception, task planning, action understanding, goal recognition.
- Framework addresses VLM limitations in environmental understanding for embodied agents.
- Includes a recovery step that leverages the core skills and a GRPO stage.
- Aims to reduce reliance on environment metadata during execution.
- Proposed in arXiv paper 2604.19839.
- Focuses on instruction-following embodied agents.
- VLMs show strong perception and reasoning but often fail at environmental understanding during interactions.
- EUEA enables more reliable task execution.
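One way to picture how the four skills and a recovery step could fit together is the toy control loop below. It is an illustrative sketch only, not the paper's algorithm. `ToyEnv`, `perceive`, `plan`, `action_succeeded`, `goal_reached`, and `run_episode` are hypothetical stand-ins for VLM calls and a simulator, and the GRPO training stage is not shown. Action success is judged by comparing observations before and after the step, not by reading a success flag from the environment.

```python
class ToyEnv:
    """Hypothetical environment in which the first grasp silently fails."""

    def __init__(self):
        self.state = {"holding": None, "mug_in_sink": False}
        self.slipped = False

    def observe(self):
        return dict(self.state)

    def step(self, action):
        if action == "pick mug":
            if not self.slipped:
                self.slipped = True  # failed interaction: state unchanged
            else:
                self.state["holding"] = "mug"
        elif action == "put mug in sink" and self.state["holding"] == "mug":
            self.state = {"holding": None, "mug_in_sink": True}
        return self.observe()


def perceive(obs):
    """Object perception stand-in: passes the observation through."""
    return obs


def plan(instruction, scene):
    """Task planning stand-in: hard-coded plan for the toy task."""
    if scene["mug_in_sink"]:
        return []
    steps = [] if scene["holding"] == "mug" else ["pick mug"]
    return steps + ["put mug in sink"]


def action_succeeded(before, after):
    """Action understanding stand-in: did the observation change?"""
    return before != after


def goal_reached(obs):
    """Goal recognition stand-in."""
    return obs["mug_in_sink"]


def run_episode(instruction, env, max_recoveries=2):
    obs = env.observe()
    steps = plan(instruction, perceive(obs))
    recoveries = 0
    while steps:
        action = steps.pop(0)
        after = env.step(action)
        if not action_succeeded(obs, after):
            if recoveries == max_recoveries:
                return False, recoveries
            recoveries += 1
            # Recovery: re-perceive and replan from the current observation.
            steps = plan(instruction, perceive(after))
        obs = after
    return goal_reached(obs), recoveries


result = run_episode("put the mug in the sink", ToyEnv())  # (True, 1)
```

In this toy run the first grasp fails, the loop detects that nothing changed, replans once, and then completes the task.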
Entities
Institutions
- arXiv