Ilov3Splat: Open-Vocabulary 3D Scene Understanding via Gaussian Splatting
Researchers introduced Ilov3Splat, a framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Unlike prior methods that rely on 2D rendering-based matching or point-level semantic association, Ilov3Splat jointly optimizes geometry and semantics by augmenting Gaussian splats with view-consistent feature fields. It uses a multi-resolution hash embedding to encode language-aligned CLIP features for dense language grounding in 3D space, and it trains an instance feature field with a contrastive loss over SAM masks for fine-grained object distinction. At inference, CLIP-encoded queries are matched against the learned features using two-stage 3D clustering. The paper is available on arXiv.
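To make the contrastive training idea concrete, here is a minimal sketch of a loss that pulls together features falling in the same SAM mask and pushes apart features from different masks. It assumes a supervised-contrastive form with cosine similarity and a temperature. The function name `contrastive_loss`, the temperature value, and the toy inputs are hypothetical, and the paper's exact formulation may differ.

```python
import math


def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(features, mask_ids, temperature=0.1):
    """Supervised-contrastive-style loss (an assumed form, not the paper's).

    features: list of feature vectors, e.g. rendered per-pixel instance features.
    mask_ids: one integer per feature, e.g. the id of the SAM mask it falls in.
    """
    feats = [_normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and mask_ids[j] == mask_ids[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        # Similarities to every other sample form the softmax denominator.
        logits = [_dot(feats[i], feats[k]) / temperature for k in range(n) if k != i]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        for j in positives:
            sim = _dot(feats[i], feats[j]) / temperature
            total += (log_denom - sim) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy example: two well-separated pairs of features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_consistent = contrastive_loss(feats, [0, 0, 1, 1])  # masks agree with features
loss_shuffled = contrastive_loss(feats, [0, 1, 0, 1])    # masks disagree with features
```

In this toy setup the loss is small when mask membership agrees with feature similarity and large when it does not, which is the signal that would drive the instance feature field toward separating objects.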
Key facts
- Ilov3Splat is a framework for instance-level open-vocabulary 3D scene understanding.
- It is built on 3D Gaussian Splatting (3D-GS).
- Prior work depends on 2D rendering-based matching or point-level semantic association.
- The method jointly optimizes scene geometry and semantic representations.
- It augments Gaussian splats with view-consistent feature fields.
- A multi-resolution hash embedding encodes language-aligned CLIP features.
- An instance feature field is trained using a contrastive loss over SAM masks.
- Inference uses CLIP-encoded queries matched against learned features with two-stage 3D clustering.
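The query-matching step in the last bullet can be sketched as a cosine-similarity lookup between a query embedding and per-Gaussian language features. This is only an illustration under stated assumptions: the function name `match_query`, the threshold, and the three-dimensional toy vectors are hypothetical (real CLIP embeddings are much larger), and the two-stage 3D clustering is not reproduced here because this summary does not describe its stages.

```python
import math


def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def match_query(query_embedding, gaussian_features, threshold=0.5):
    """Return indices of Gaussians whose feature matches the query.

    Indices are ordered from most to least similar. Both the threshold and
    the use of plain cosine similarity are assumptions made for this sketch.
    """
    scored = [(cosine(query_embedding, f), i) for i, f in enumerate(gaussian_features)]
    scored.sort(key=lambda t: -t[0])  # stable sort, highest similarity first
    return [i for s, i in scored if s >= threshold]


# Toy stand-ins for a text-query embedding and per-Gaussian features.
query = [1.0, 0.0, 0.0]
features = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [1.0, 0.0, 0.0],
]
matches = match_query(query, features)
```

A full pipeline would pass the selected Gaussians on to the clustering stages to separate individual object instances.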
Entities
Institutions
- arXiv