Token Selection Strategy for Efficient 3D Reconstruction

ai-technology · 2026-05-25

A new method reduces the computational cost of visual geometry transformers for multi-view 3D reconstruction. It uses a two-stage token selection framework: inter-frame selection first identifies key frames with a diversity-based strategy, and intra-frame selection then discards redundant tokens within those frames. Together, the two stages limit the number of key/value tokens each query attends to during global attention, which addresses the quadratic growth of attention cost with input sequence length. The work is published on arXiv (2605.23892) and aims to improve the scalability and efficiency of feed-forward 3D attribute prediction.
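To make the quadratic-cost argument concrete, the sketch below counts the query/key pairs a single global attention layer would score, with and without restricting the key/value set. The function name, the `key_frames` and `keep_ratio` parameters, and all numbers are illustrative assumptions, not figures from the paper.

```python
def attention_pairs(num_frames, tokens_per_frame, key_frames=None, keep_ratio=1.0):
    """Count query/key pairs scored by one global attention layer.

    key_frames and keep_ratio are hypothetical knobs used only for this
    illustration; the paper's actual selection budgets are not stated here.
    """
    # Every token in every frame still acts as a query.
    num_queries = num_frames * tokens_per_frame
    # Without selection, every frame contributes keys/values.
    if key_frames is None:
        key_frames = num_frames
    # Intra-frame selection keeps only a fraction of each key frame's tokens.
    kept_per_frame = int(tokens_per_frame * keep_ratio)
    num_keys = key_frames * kept_per_frame
    return num_queries * num_keys


# Hypothetical workload: 100 frames with 1000 tokens each.
full = attention_pairs(100, 1000)
reduced = attention_pairs(100, 1000, key_frames=20, keep_ratio=0.5)
print(full, reduced, full // reduced)  # 10000000000 1000000000 10
```

With full attention the pair count scales with the square of the sequence length. Once the key/value set is capped, the count grows linearly in the number of queries for a fixed selection budget.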

Key facts

Visual geometry transformers enable joint prediction of multiple 3D attributes in a feed-forward manner.
Computational cost grows quadratically with input sequence length due to global attention layers.
The proposed strategy restricts key/value tokens per query during global attention.
Two-stage framework: inter-frame selection at frame level, intra-frame selection within selected frames.
Inter-frame selection uses a diversity-based strategy to ensure broad coverage.
Intra-frame selection discards redundant tokens within selected frames.
Published on arXiv with identifier 2605.23892.
Aims to improve scalability and efficiency of multi-view 3D reconstruction.
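The two-stage selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the summary does not say which diversity criterion or redundancy test is used, so the sketch substitutes greedy farthest-point sampling over mean-pooled frame descriptors for the inter-frame stage and a cosine-similarity threshold for the intra-frame stage. All function names and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_key_frames(frame_descriptors, k):
    """Inter-frame stage (assumed: greedy farthest-point sampling).

    Each new pick is the frame whose closest already-selected frame is
    least similar to it, which spreads the picks across the sequence.
    """
    selected = [0]  # assumption: seed with the first frame
    while len(selected) < min(k, len(frame_descriptors)):
        best, best_score = None, None
        for i, desc in enumerate(frame_descriptors):
            if i in selected:
                continue
            score = max(cosine(desc, frame_descriptors[j]) for j in selected)
            if best_score is None or score < best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)


def prune_tokens(tokens, threshold=0.95):
    """Intra-frame stage (assumed: drop near-duplicates of kept tokens)."""
    kept = []
    for idx, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < threshold for j in kept):
            kept.append(idx)
    return kept


def select_kv_tokens(frames, k, threshold=0.95):
    """Return (frame_index, token_index) pairs to use as keys/values."""
    # Mean-pool each frame's tokens into one descriptor (an assumption).
    descriptors = [[sum(col) / len(frame) for col in zip(*frame)] for frame in frames]
    kv = []
    for f in select_key_frames(descriptors, k):
        kv.extend((f, t) for t in prune_tokens(frames[f], threshold))
    return kv


# Toy input: frames 0 and 1 look alike, frame 2 differs;
# frame 0 also contains two nearly identical tokens.
frames = [
    [[1.0, 0.0], [1.0, 0.01]],
    [[1.0, 0.1], [0.9, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
print(select_kv_tokens(frames, k=2))  # [(0, 0), (2, 0), (2, 1)]
```

In the toy input, the frame stage skips frame 1 because it resembles frame 0, and the token stage drops the second token of frame 0 as a near-duplicate. Every query would then attend to three key/value tokens instead of six.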
