Pelican-Unified 1.0: Embodied AI Model Unifies Understanding, Reasoning, Imagination, and Action
Researchers have introduced Pelican-Unified 1.0, described as the first embodied foundation model trained on the principle of unification. A single Vision-Language Model (VLM) serves as both the understanding and the reasoning module: it maps scenes, instructions, visual contexts, and action histories into a shared semantic space, then autoregressively produces task-oriented chains of thought in a single forward pass. The final hidden state is projected into a dense latent variable, which the Unified Future Generator (UFG) uses to jointly generate future videos and actions through specialized output heads. Because language, video, and action losses are all backpropagated into the shared representation, training jointly optimizes understanding, reasoning, imagination, and action.
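The described dataflow can be sketched as a minimal, illustrative pipeline. All class, function, and variable names below are assumptions for illustration, not the authors' actual API; real modules would be neural networks rather than the toy stand-ins used here.

```python
# Hypothetical sketch of the Pelican-Unified 1.0 dataflow: one model fuses
# multimodal inputs, reasons, emits a dense latent, and a single generator
# (UFG) decodes that latent into both future video and action.
# Names and shapes are illustrative assumptions, not the real system.

def vlm_understand_and_reason(scene, instruction, visual_context, action_history):
    """Stand-in for the single VLM: fuse inputs into a shared semantic
    space, then 'autoregressively' emit a chain of thought. Here the shared
    space is a merged dict and the chain of thought a list of strings."""
    shared = {
        "scene": scene,
        "instruction": instruction,
        "visual_context": visual_context,
        "action_history": action_history,
    }
    chain_of_thought = [
        f"task: {instruction}",
        f"plan next action given {len(action_history)} past action(s)",
        "imagine how the scene will evolve",
    ]
    # Project the "final hidden state" into a dense latent variable;
    # faked here as a fixed-size list of floats derived from the inputs.
    latent = [float(len(str(v))) for v in shared.values()]
    return chain_of_thought, latent

def unified_future_generator(latent):
    """Stand-in for the UFG: one latent, two specialized heads
    (future video frames + an action) decoded in one call."""
    future_video = [f"frame_{i}" for i in range(3)]
    action = {"dx": latent[1] * 0.01, "dy": latent[2] * 0.01}
    return future_video, action

cot, z = vlm_understand_and_reason(
    scene="kitchen", instruction="pick up the cup",
    visual_context="rgb_frame_t", action_history=["reach"],
)
video, action = unified_future_generator(z)
```

The key design point mirrored here is that video and action come from the *same* latent in one call, rather than from two separately trained decoders.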
Key facts
- Pelican-Unified 1.0 is the first embodied foundation model trained on the principle of unification.
- Uses a single VLM as both the understanding and the reasoning module.
- Maps scenes, instructions, visual contexts, and action histories into a shared semantic space.
- Autoregressively produces task-, action-, and future-oriented chains of thought in a single forward pass.
- Final hidden state projects into a dense latent variable.
- Unified Future Generator (UFG) jointly generates future videos and actions.
- Language, video, and action losses are backpropagated into the shared representation.
- Jointly optimizes understanding, reasoning, imagination, and action during training.
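The last two points amount to a joint multi-task objective: because all three losses flow into one shared representation, a single gradient step updates understanding, reasoning, imagination, and action together. A minimal sketch, with assumed (not reported) loss weights:

```python
# Illustrative joint objective: language, video, and action losses are
# combined so all three gradients backpropagate into the shared
# representation. The weights w_* are hypothetical; the paper summary
# does not specify how the losses are balanced.
def joint_loss(lang_loss, video_loss, action_loss,
               w_lang=1.0, w_video=1.0, w_action=1.0):
    return w_lang * lang_loss + w_video * video_loss + w_action * action_loss

# Example: equal weighting of per-modality losses.
total = joint_loss(lang_loss=0.7, video_loss=1.2, action_loss=0.4)
```

In an actual framework, calling backward on `total` would propagate all three error signals through the shared VLM parameters at once, which is what distinguishes this unified training from optimizing each capability separately.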