VGGT-Edit: Feed-Forward Native 3D Scene Editing with Residual Field Prediction
Researchers have introduced VGGT-Edit, a feed-forward framework for text-conditioned editing of native 3D scenes. The method addresses the shortcomings of current 2D-lifting techniques through depth-synchronized text injection, which aligns semantic guidance with the backbone's spatial poses. As a result, 3D scenes can be edited directly, without per-scene optimization, avoiding the blurry textures and inconsistent geometry that plague lifting-based pipelines. The findings are documented in a publication on arXiv (2605.15186).
Key facts
- VGGT-Edit is a feed-forward framework for text-conditioned native 3D scene editing.
- It introduces depth-synchronized text injection for spatial alignment.
- Existing editing methods rely on 2D-lifting, which leads to blurry textures and inconsistent geometry.
- VGGT-Edit enables direct 3D editing without per-scene optimization.
- Published on arXiv with ID 2605.15186.
- The framework targets interactive applications requiring dynamic human instructions.
- It builds on recent advances in generalizable feed-forward 3D reconstruction.
- Depth-synchronized text injection aligns semantic guidance with the backbone's spatial poses.
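The summary describes depth-synchronized text injection as aligning text guidance with the backbone's spatial structure. One plausible reading is a cross-attention step from scene features to text tokens whose injection strength is gated by depth. The sketch below illustrates that idea only; the function name, shapes, binning scheme, and the placeholder random gate are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_synced_text_injection(feats, text, depth, num_bins=4, rng=None):
    """Hypothetical sketch: inject text-token information into per-point
    scene features, gating injection strength by quantized depth so the
    semantic guidance stays tied to the scene's spatial layout.

    feats : (N, D) per-point scene features
    text  : (T, D) text token embeddings
    depth : (N,)   per-point depth values
    """
    rng = rng or np.random.default_rng(0)
    # Cross-attention: each scene point attends over all text tokens.
    attn = softmax(feats @ text.T / np.sqrt(feats.shape[1]), axis=-1)
    injected = attn @ text  # (N, D) text-conditioned update per point
    # Depth gate: quantize depth into bins; in a real model the per-bin
    # weights would be learned, here they are random placeholders.
    span = np.ptp(depth) + 1e-8
    bins = np.clip(((depth - depth.min()) / span * num_bins).astype(int),
                   0, num_bins - 1)
    gate = rng.uniform(0.2, 0.8, num_bins)[bins][:, None]  # (N, 1)
    return feats + gate * injected
```

Gating per depth bin rather than uniformly is what would make the injection "depth-synchronized": points at different depths receive different amounts of semantic guidance, keeping edits spatially coherent.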
Entities
Institutions
- arXiv