JoyAI-Image: A Unified Multimodal Model for Visual Understanding and Generation
JoyAI-Image is an integrated multimodal foundation model that combines visual comprehension, text-to-image creation, and instruction-guided image editing. It couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), enabling perception and generation to interact through a shared multimodal interface. The model is trained with a scalable methodology spanning unified instruction tuning, long-text rendering supervision, spatially grounded data, and signals for both general and spatial editing, which together strengthen geometry-aware reasoning and controllable visual synthesis. Evaluations across benchmarks for understanding, generation, long-text rendering, and editing show state-of-the-art or highly competitive results, a notable step forward in spatial intelligence for multimodal AI.
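The summary specifies the coupling (a spatially enhanced MLLM feeding an MMDiT through a shared multimodal interface) but not its implementation. The sketch below shows one plausible wiring in PyTorch: MLLM hidden states are projected into the diffusion transformer's conditioning space and jointly attended with noisy image latents. Every class and dimension here (`SharedMultimodalInterface`, `ToyMMDiTBlock`, the placeholder encoder standing in for the MLLM) is a hypothetical stand-in, not JoyAI-Image's actual design.

```python
import torch
import torch.nn as nn

class SharedMultimodalInterface(nn.Module):
    """Projects MLLM hidden states into the diffusion model's conditioning
    space so perception features can steer generation (illustrative only)."""
    def __init__(self, mllm_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.LayerNorm(mllm_dim),
                                  nn.Linear(mllm_dim, cond_dim))

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, seq, mllm_dim) -> (batch, seq, cond_dim)
        return self.proj(mllm_hidden)

class ToyMMDiTBlock(nn.Module):
    """One joint-attention block over concatenated condition tokens and image
    latents, in the spirit of an MMDiT layer (stand-in, not the paper's design)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(x)

class UnifiedUnderstandGenerate(nn.Module):
    """Couples an 'MLLM' (a small placeholder encoder here) with a diffusion
    transformer through the shared interface; only the wiring is illustrative."""
    def __init__(self, mllm_dim: int = 768, cond_dim: int = 512, n_blocks: int = 2):
        super().__init__()
        self.mllm = nn.TransformerEncoder(  # placeholder for the real MLLM
            nn.TransformerEncoderLayer(mllm_dim, 8, batch_first=True), 2)
        self.interface = SharedMultimodalInterface(mllm_dim, cond_dim)
        self.blocks = nn.ModuleList([ToyMMDiTBlock(cond_dim) for _ in range(n_blocks)])

    def forward(self, text_image_tokens: torch.Tensor,
                noisy_latents: torch.Tensor) -> torch.Tensor:
        cond = self.interface(self.mllm(text_image_tokens))
        # Joint sequence: condition tokens and image latents attend to each other.
        x = torch.cat([cond, noisy_latents], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return x[:, cond.shape[1]:]  # keep only the denoised-latent positions

model = UnifiedUnderstandGenerate()
tokens = torch.randn(2, 16, 768)   # multimodal (text + vision) token embeddings
latents = torch.randn(2, 64, 512)  # noisy image latents
print(model(tokens, latents).shape)  # torch.Size([2, 64, 512])
```

Joint attention over the concatenated sequence is one common way such interfaces are built; the paper may instead use cross-attention or learned query tokens.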
Key facts
- JoyAI-Image is a unified multimodal foundation model.
- It handles visual understanding, text-to-image generation, and instruction-guided image editing.
- It couples a spatially enhanced MLLM with a Multimodal Diffusion Transformer (MMDiT).
- Perception and generation interact through a shared multimodal interface.
- Training includes unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals (see the data-mixing sketch after this list).
- The model achieves state-of-the-art or highly competitive performance on multiple benchmarks.
- The bidirectional loop between understanding and generation enhances spatial intelligence.
- The paper is available on arXiv under ID 2605.04128.
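The summary names five supervision sources but gives no mixing strategy or ratios. Below is a minimal sketch of how unified instruction tuning might interleave them into one training stream; the task names, weights, and `sample_batch` helper are all illustrative assumptions, not JoyAI-Image's recipe.

```python
import random
from dataclasses import dataclass

# Hypothetical task mixture: the summary lists these supervision sources,
# but the weights below are illustrative placeholders.
TASK_WEIGHTS = {
    "instruction_understanding": 0.35,  # unified instruction tuning
    "long_text_rendering": 0.15,        # long-text rendering supervision
    "spatial_grounding": 0.25,          # spatially grounded data
    "general_editing": 0.15,            # general editing signals
    "spatial_editing": 0.10,            # spatial editing signals
}

@dataclass
class Sample:
    task: str
    payload: dict

def sample_batch(datasets: dict, batch_size: int, rng: random.Random) -> list:
    """Draw one mixed batch so each step sees all supervision types in expectation."""
    tasks = rng.choices(list(TASK_WEIGHTS), weights=list(TASK_WEIGHTS.values()),
                        k=batch_size)
    return [Sample(t, rng.choice(datasets[t])) for t in tasks]

# Toy usage with stub datasets.
rng = random.Random(0)
datasets = {t: [{"id": f"{t}-{i}"} for i in range(100)] for t in TASK_WEIGHTS}
batch = sample_batch(datasets, batch_size=8, rng=rng)
print([s.task for s in batch])
```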