MMCORE: Unified Framework for Multimodal Image Generation and Editing
MMCORE is a unified framework for multimodal image generation and editing. It uses a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings through learnable query tokens; these embeddings condition a diffusion model, transferring the VLM's reasoning abilities into visual generation. The design avoids deep fusion of autoregressive and diffusion models and sidesteps training from scratch, reducing computational overhead while preserving high-fidelity synthesis. By integrating text-to-image synthesis with interleaved image generation, MMCORE demonstrates strong spatial reasoning and visual grounding, and evaluations show it consistently outperforms state-of-the-art baselines across tasks.
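To make the query-token bridge concrete, here is a minimal sketch of how such a module might look, assuming a HuggingFace-style VLM that accepts `inputs_embeds` and exposes `last_hidden_state`. The class name, dimensions, and projection head are illustrative assumptions, not MMCORE's actual implementation.

```python
import torch
import torch.nn as nn

class QueryTokenBridge(nn.Module):
    """Hypothetical sketch: learnable query tokens whose VLM outputs
    become conditioning embeddings for a diffusion model. Names and
    dimensions are illustrative, not MMCORE's published design."""

    def __init__(self, num_queries: int = 64, vlm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        # Learnable query tokens appended to the multimodal input sequence.
        self.query_tokens = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        # Light projection from the VLM hidden size to the diffusion model's
        # conditioning width -- a thin interface, so neither backbone needs
        # deep fusion or training from scratch.
        self.proj = nn.Linear(vlm_dim, cond_dim)

    def forward(self, vlm: nn.Module, multimodal_embeds: torch.Tensor) -> torch.Tensor:
        # multimodal_embeds: (batch, seq_len, vlm_dim), the already-embedded
        # text/image tokens of the (frozen) VLM.
        batch = multimodal_embeds.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(batch, -1, -1)
        seq = torch.cat([multimodal_embeds, queries], dim=1)
        hidden = vlm(inputs_embeds=seq).last_hidden_state
        # The hidden states at the query positions are the predicted
        # semantic visual embeddings.
        query_states = hidden[:, -self.query_tokens.size(0):, :]
        return self.proj(query_states)  # (batch, num_queries, cond_dim)
```

Under these assumptions, only the query tokens and the projection are new trainable parameters; the VLM and diffusion backbones can remain frozen or lightly tuned.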
Key facts
- MMCORE is a unified framework for multimodal image generation and editing.
- It uses a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens.
- The embeddings serve as conditioning signals for a diffusion model (see the sketch after this list).
- The design avoids deep fusion between autoregressive and diffusion models or training from scratch.
- It reduces computational overhead while maintaining high-fidelity synthesis.
- MMCORE integrates text-to-image synthesis with interleaved image generation.
- It demonstrates robust multimodal comprehension in spatial reasoning and visual grounding.
- Comprehensive evaluations show MMCORE consistently outperforms state-of-the-art baselines.
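As an illustration of the conditioning path noted above, the snippet below wires stand-in embeddings into a diffusers-style UNet via cross-attention. The UNet configuration and tensor shapes are assumptions for illustration, not MMCORE's published setup.

```python
import torch
from diffusers import UNet2DConditionModel

# Small UNet whose cross-attention width matches the bridge's cond_dim.
# All hyperparameters here are illustrative assumptions.
unet = UNet2DConditionModel(sample_size=64, cross_attention_dim=768)

latents = torch.randn(1, 4, 64, 64)   # noisy image latents
timestep = torch.tensor([500])        # diffusion timestep
cond = torch.randn(1, 64, 768)        # stand-in for the bridge's output

# The predicted embeddings enter the UNet exactly where text-encoder
# embeddings normally would: as encoder_hidden_states for cross-attention.
noise_pred = unet(latents, timestep, encoder_hidden_states=cond).sample
```

In this reading, the query embeddings simply replace (or augment) the text-encoder embeddings a conditional diffusion model already expects, which is what lets the framework avoid deep architectural fusion.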