Image Generators as Generalist Vision Learners
A recent study published on arXiv (2604.20329) reveals that training on image generation, analogous to the pretraining of large language models (LLMs), equips models with robust visual representations. The researchers present Vision Banana, a versatile model created by instruction-tuning Nano Banana Pro (NBP) on a blend of its original training data and vision task data. By representing the outputs of vision tasks as RGB images, the model attains top performance across multiple vision tasks, demonstrating that generative vision models can develop substantial visual understanding.
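The paper's actual output encoding is not described here, but the core idea of parameterizing a vision task's output space as RGB images can be sketched with a hypothetical example: a semantic segmentation mask mapped to a color image via a palette, so an image generator can emit it as a normal picture, and decoded back by nearest-palette matching. The palette and class count below are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

# Hypothetical 3-class palette (assumption, not from the paper):
# each integer class id maps to one RGB color.
PALETTE = np.array([
    [0, 0, 0],      # class 0: background -> black
    [255, 0, 0],    # class 1 -> red
    [0, 255, 0],    # class 2 -> green
], dtype=np.uint8)

def mask_to_rgb(mask: np.ndarray) -> np.ndarray:
    """Encode an HxW integer class mask as an HxWx3 RGB image."""
    return PALETTE[mask]

def rgb_to_mask(rgb: np.ndarray) -> np.ndarray:
    """Decode an RGB image back to a class mask by nearest-palette color."""
    # Euclidean distance from every pixel to every palette color: HxWxK.
    diff = rgb[..., None, :].astype(np.int64) - PALETTE[None, None].astype(np.int64)
    dist = np.linalg.norm(diff, axis=-1)
    return dist.argmin(axis=-1)

# Round trip: the RGB encoding losslessly carries the task output.
mask = np.array([[0, 1], [2, 1]])
rgb = mask_to_rgb(mask)
assert (rgb_to_mask(rgb) == mask).all()
```

With an encoding like this, any dense-prediction task (segmentation, depth, edges) becomes just another image for the generator to produce, which is what lets a single generative model serve as a generalist vision learner.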
Key facts
- arXiv paper 2604.20329
- Image generators exhibit zero-shot visual understanding
- Vision Banana model introduced
- Built by instruction-tuning Nano Banana Pro (NBP)
- Output space of vision tasks parameterized as RGB images
- Achieves SOTA performance on various vision tasks
- Generative pretraining analogous to LLM pretraining
- Limited evidence previously existed for generative vision understanding