New 3D Autoregressive Diffusion Model Generates Complex Scenes from Text Descriptions
A novel generative model, 3D-ARD+, enables sequential text-to-scene generation by unifying autoregressive and diffusion processes. It addresses a limitation of current approaches, which often produce simple layouts or inconsistent objects: recent methods have largely focused on either layout generation or object generation, and few integrate both effectively. 3D-ARD+ generates both scene layouts and objects, handling non-trivial descriptions of shape, appearance, and spatial arrangement.

The core innovation is the generation of coarse-grained 3D latents in scene space, conditioned on the text input. This paradigm shift supports interactive scene creation, reducing the manual effort of 3D scene production, and marks a step toward more coherent and complex 3D scene synthesis from textual prompts. The research, detailed in arXiv preprint 2604.16552v1, was announced as a cross-disciplinary study.
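The summary does not spell out the sampling procedure, but the stated unification of autoregressive and diffusion generation suggests a structure like the following toy sketch: an outer autoregressive loop over scene objects, where each object's latent is drawn by an inner diffusion-style process conditioned on the text and on all latents generated so far. Every name, dimension, and the trivial "denoiser" below is a placeholder for illustration, not the authors' implementation.

```python
import numpy as np

def denoise_step(x, cond):
    # Toy "denoising": nudge the noisy latent toward a condition-derived mean.
    # A real diffusion model would predict noise with a learned network.
    target = cond.mean() * np.ones_like(x)
    return x + 0.5 * (target - x)

def diffusion_sample(cond, dim, steps, rng):
    # Start from Gaussian noise and iteratively denoise, conditioned on `cond`.
    x = rng.standard_normal(dim)
    for _ in range(steps):
        x = denoise_step(x, cond)
    return x

def generate_scene(text_embedding, num_objects, latent_dim=8, steps=20, seed=0):
    """Autoregressive outer loop: each object's diffusion process is
    conditioned on the text embedding plus the previously generated
    object latents (the growing scene context)."""
    rng = np.random.default_rng(seed)
    scene = []
    for _ in range(num_objects):
        context = np.concatenate([text_embedding] + scene) if scene else text_embedding
        scene.append(diffusion_sample(context, latent_dim, steps, rng))
    return scene

# Example: three object latents from a dummy 4-dim "text embedding".
scene = generate_scene(np.ones(4), num_objects=3)
```

The key point this illustrates is the conditioning flow: each new object sees both the text and the scene built so far, which is what lets a sequential process keep objects mutually consistent.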
Key facts
- A new paradigm for sequential text-to-scene generation is introduced.
- The model 3D-ARD+ unifies autoregressive and diffusion generation.
- It generates both scene layouts and objects from text descriptions.
- Current approaches often produce simple layouts or inconsistent objects.
- The research is detailed in arXiv preprint 2604.16552v1.
- The announcement type is cross-disciplinary.
- The model conditions on text input for shape, appearance, and spatial arrangement.
- It aims to reduce manual effort in 3D scene creation.
Entities
Institutions
- arXiv