Pelican-Unified 1.0: Embodied AI Model Unifies Understanding, Reasoning, Imagination, and Action
Researchers have introduced Pelican-Unified 1.0, described as the first embodied foundation model trained on the principle of unification. A single Vision-Language Model (VLM) serves as both the understanding and the reasoning module: it maps scenes, instructions, visual contexts, and action histories into a shared semantic space, then autoregressively produces task-oriented chains of thought in a single forward pass. The final hidden state is projected into a dense latent variable, which the Unified Future Generator (UFG) uses to jointly generate future videos and actions through specialized output heads. Because language, video, and action losses are all backpropagated into the shared representation, training jointly optimizes understanding, reasoning, imagination, and action.
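The described dataflow can be sketched as a minimal, illustrative pipeline. All class, function, and variable names below are assumptions for illustration, not the authors' actual API; real modules would be neural networks rather than the toy stand-ins used here.

```python
# Hypothetical sketch of the Pelican-Unified 1.0 dataflow: one model fuses
# multimodal inputs, reasons, emits a dense latent, and a single generator
# (UFG) decodes that latent into both future video and action.
# Names and shapes are illustrative assumptions, not the real system.

def vlm_understand_and_reason(scene, instruction, visual_context, action_history):
    """Stand-in for the single VLM: fuse inputs into a shared semantic
    space, then 'autoregressively' emit a chain of thought. Here the shared
    space is a merged dict and the chain of thought a list of strings."""
    shared = {
        "scene": scene,
        "instruction": instruction,
        "visual_context": visual_context,
        "action_history": action_history,
    }
    chain_of_thought = [
        f"task: {instruction}",
        f"plan next action given {len(action_history)} past action(s)",
        "imagine how the scene will evolve",
    ]
    # Project the "final hidden state" into a dense latent variable;
    # faked here as a fixed-size list of floats derived from the inputs.
    latent = [float(len(str(v))) for v in shared.values()]
    return chain_of_thought, latent

def unified_future_generator(latent):
    """Stand-in for the UFG: one latent, two specialized heads
    (future video frames + an action) decoded in one call."""
    future_video = [f"frame_{i}" for i in range(3)]
    action = {"dx": latent[1] * 0.01, "dy": latent[2] * 0.01}
    return future_video, action

cot, z = vlm_understand_and_reason(
    scene="kitchen", instruction="pick up the cup",
    visual_context="rgb_frame_t", action_history=["reach"],
)
video, action = unified_future_generator(z)
```

The key design point mirrored here is that video and action come from the *same* latent in one call, rather than from two separately trained decoders.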
Key facts
- Pelican-Unified 1.0 is the first embodied foundation model trained on the principle of unification.
- Uses a single VLM as both the understanding and the reasoning module.
- Maps scenes, instructions, visual contexts, and action histories into a shared semantic space.
- Autoregressively produces task-, action-, and future-oriented chains of thought in a single forward pass.
- Final hidden state projects into a dense latent variable.
- Unified Future Generator (UFG) jointly generates future videos and actions.
- Language, video, and action losses are backpropagated into the shared representation.
- Jointly optimizes understanding, reasoning, imagination, and action during training.
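The last two points amount to a joint multi-task objective: because all three losses flow into one shared representation, a single gradient step updates understanding, reasoning, imagination, and action together. A minimal sketch, with assumed (not reported) loss weights:

```python
# Illustrative joint objective: language, video, and action losses are
# combined so all three gradients backpropagate into the shared
# representation. The weights w_* are hypothetical; the paper summary
# does not specify how the losses are balanced.
def joint_loss(lang_loss, video_loss, action_loss,
               w_lang=1.0, w_video=1.0, w_action=1.0):
    return w_lang * lang_loss + w_video * video_loss + w_action * action_loss

# Example: equal weighting of per-modality losses.
total = joint_loss(lang_loss=0.7, video_loss=1.2, action_loss=0.4)
```

In an actual framework, calling backward on `total` would propagate all three error signals through the shared VLM parameters at once, which is what distinguishes this unified training from optimizing each capability separately.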