NVIDIA Fine-Tunes Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation

ai-technology · 2026-05-18

NVIDIA has published a guide to fine-tuning its 2B-parameter Cosmos Predict 2.5 world model for robot video generation using LoRA and DoRA. The model can generate videos from text, images, or clips, and the guide adapts it to robotic manipulation. Full fine-tuning is costly and risks forgetting what the model already knows, so LoRA and DoRA instead train small adapters attached to frozen layers, which makes training feasible on a single GPU. The guide uses the diffusers and accelerate libraries and requires at least one 80 GB GPU, with 8× H100s recommended. The training set contains 92 robot manipulation videos with text prompts, the test set contains 50 (prompt, image) pairs, and metrics improve after 100 epochs of training. The guide is part of NVIDIA's Cosmos Cookbook, available on Hugging Face and GitHub.
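To make the adapter idea concrete, here is a minimal NumPy sketch of how LoRA and DoRA modify a single frozen weight matrix. It is an illustration under stated assumptions, not the guide's code: the toy sizes, the alpha scaling value, and the helper names lora_weight and dora_weight are hypothetical, and the DoRA normalization is taken per column, which real implementations may handle along a different axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; real layers are far larger.
d_out, d_in, rank, alpha = 6, 4, 2, 4.0

W0 = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init: update starts at 0


def lora_weight(W0, A, B, alpha, rank):
    """Effective weight under LoRA: W0 plus a scaled low-rank update B @ A."""
    return W0 + (alpha / rank) * (B @ A)


def dora_weight(W0, A, B, m, alpha, rank):
    """Effective weight under DoRA: a learned magnitude m times the
    unit-norm direction of the LoRA-updated weight (per column here)."""
    V = lora_weight(W0, A, B, alpha, rank)
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * V / col_norm


# DoRA's trainable magnitude starts at the column norms of the frozen weight.
m = np.linalg.norm(W0, axis=0, keepdims=True)

x = rng.normal(size=(d_in,))
y_base = W0 @ x
y_lora = lora_weight(W0, A, B, alpha, rank) @ x
y_dora = dora_weight(W0, A, B, m, alpha, rank) @ x

# Only A and B (plus m for DoRA) would receive gradients; W0 stays frozen.
trainable_lora = A.size + B.size
trainable_dora = trainable_lora + m.size
```

Because B starts at zero, both adapted layers reproduce the frozen layer's output before any training step, so fine-tuning begins from the pretrained behavior. DoRA adds only one extra magnitude value per column on top of the LoRA parameters.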

Key facts

NVIDIA Cosmos Predict 2.5 is a large-scale world model for generating physically plausible videos.
LoRA and DoRA enable parameter-efficient fine-tuning on a single GPU.
Training dataset: 92 robot manipulation videos with text prompts.
Test dataset: 50 (prompt, image) pairs.
Training for 100 epochs on 8× H100s takes ~2.5 hours.
Fine-tuning improves Sampson Error, physical plausibility, and instruction following.
LoRA rank 32 boosts instruction following; rank 8 suffices for geometric consistency.
DoRA may stabilize training at low ranks.
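The trade-off between rank 8 and rank 32 is easier to see as arithmetic. The sketch below counts the trainable parameters a LoRA adapter adds to one square linear layer; the layer width of 2048 is a hypothetical value chosen for illustration and is not taken from Cosmos Predict 2.5.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


d = 2048                  # hypothetical layer width, for illustration only
full_params = d * d       # weights in the frozen layer itself

counts = {rank: lora_param_count(d, d, rank) for rank in (8, 32)}
for rank, added in counts.items():
    share = added / full_params
    print(f"rank {rank}: {added} trainable parameters ({share:.4f} of the layer)")
```

Under this assumption, rank 8 adds 32,768 parameters per layer and rank 32 adds 131,072. Adapter size grows linearly with rank, so rank 32 costs four times as much as rank 8 while still being a small fraction of the frozen layer.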
