Bernini: MLLM-Based Semantic Planning for Video Diffusion
Bernini is a unified framework that combines multimodal large language models (MLLMs) with diffusion models for video generation and editing. It divides the work between the two: an MLLM-based planner predicts target semantic representations in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on that plan. The renderer is additionally conditioned on text features and, for editing, on source VAE features that help preserve detail. Because semantic representations serve as the interface between the two components, the planner and renderer can be trained separately. The paper is available on arXiv under the identifier 2605.22344.
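The planner/renderer split described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: `SemanticPlan`, `ToyPlanner`, `ToyRenderer`, and `generate` are hypothetical names, the "embeddings" are short lists of floats, and the arithmetic is a placeholder for the real MLLM and DiT. The sketch only shows the data flow, assuming the semantic plan is the sole object passed from planner to renderer.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class SemanticPlan:
    """Target semantic tokens; assumed here to stand in for ViT-space embeddings."""
    tokens: List[Vector]


class ToyPlanner:
    """Hypothetical stand-in for the MLLM-based planner."""

    def __init__(self, dim: int = 4, num_tokens: int = 3) -> None:
        self.dim = dim
        self.num_tokens = num_tokens

    def plan(self, prompt: str,
             source_semantics: Optional[List[Vector]] = None) -> SemanticPlan:
        rng = random.Random(prompt)  # deterministic per prompt, for illustration
        tokens = [[rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
                  for _ in range(self.num_tokens)]
        if source_semantics is not None:
            # Editing case (assumption): blend source semantics with the target.
            tokens = [[0.5 * s + 0.5 * t for s, t in zip(src, tok)]
                      for src, tok in zip(source_semantics, tokens)]
        return SemanticPlan(tokens)


class ToyRenderer:
    """Hypothetical stand-in for the DiT-based renderer."""

    def render(self, plan: SemanticPlan, text_features: Vector,
               source_vae: Optional[List[Vector]] = None) -> List[float]:
        frames = []
        for i, tok in enumerate(plan.tokens):
            # Condition on the plan and on text features.
            value = sum(tok) + sum(text_features)
            if source_vae is not None:
                # Editing case: source VAE features carry detail from the input.
                value += sum(source_vae[i % len(source_vae)])
            frames.append(value)
        return frames


def generate(planner: ToyPlanner, renderer: ToyRenderer, prompt: str,
             text_features: Vector,
             source_semantics: Optional[List[Vector]] = None,
             source_vae: Optional[List[Vector]] = None) -> List[float]:
    """Plan in semantic space first, then render conditioned on that plan."""
    plan = planner.plan(prompt, source_semantics)
    return renderer.render(plan, text_features, source_vae)


if __name__ == "__main__":
    frames = generate(ToyPlanner(), ToyRenderer(), "a cat on a sofa", [0.1, 0.2])
    print(len(frames), "toy frames rendered")
```

Since the renderer in this sketch depends only on a `SemanticPlan` and never on the planner itself, either side could be swapped or trained on its own, which mirrors the separation the summary attributes to Bernini.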
Key facts
- Bernini is a unified framework for video generation and editing.
- It uses an MLLM-based planner for semantic planning.
- The planner predicts target semantic representations in the ViT embedding space.
- A DiT-based renderer synthesizes pixels conditioned on the plan.
- The renderer is augmented by text features and source VAE features for editing.
- The planner and renderer can be trained separately.
- The paper is on arXiv with ID 2605.22344.
- The approach divides labor between MLLMs and diffusion models: the MLLM plans semantics and the diffusion model renders pixels.
Entities
Institutions
- arXiv