LocateAnything: Parallel Box Decoding Boosts VLM Grounding Speed and Accuracy
Researchers have introduced LocateAnything, a framework that unifies generative grounding and detection through Parallel Box Decoding (PBD). PBD decodes each bounding box or point as an atomic unit in a single step, which preserves geometric coherence within a box and opens up substantial parallelism. Conventional coordinate-token generation, by contrast, serializes each 2D box into several 1D tokens and emits them one at a time. According to the authors, PBD improves both decoding throughput and localization accuracy over that approach. The team also built a scalable data engine and used it to curate LocateAnything-Data, a dataset of more than 138 million training samples. The work is described in arXiv paper 2605.27365v1.
Key facts
- LocateAnything uses Parallel Box Decoding (PBD) to decode geometric elements as atomic units in a single step.
- PBD preserves intra-box geometric coherence and unlocks substantial parallelism.
- The framework improves both decoding throughput and localization accuracy.
- A scalable data engine was developed to curate LocateAnything-Data.
- LocateAnything-Data contains more than 138 million training samples.
- The paper is available on arXiv with ID 2605.27365v1.
- Traditional VLMs serialize 2D boxes into multiple 1D tokens for decoding.
- Token-by-token decoding mismatches the coupled structure of box geometry.
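The contrast between serialized coordinate tokens and atomic box decoding can be illustrated with a toy step count. This is only a sketch and not the paper's implementation: the 1000-bin quantization, the four-tokens-per-box layout, and the function names `serialize_box`, `sequential_steps`, and `atomic_steps` are assumptions made for illustration, and real decoders also emit separators and class tokens that are ignored here.

```python
from typing import List, Tuple

# (x1, y1, x2, y2), each normalized to [0, 1]
Box = Tuple[float, float, float, float]

NUM_BINS = 1000  # assumed quantization resolution, not taken from the paper


def serialize_box(box: Box, num_bins: int = NUM_BINS) -> List[int]:
    """Quantize each coordinate into a bin index: one 1D token per coordinate."""
    return [min(num_bins - 1, int(c * num_bins)) for c in box]


def sequential_steps(boxes: List[Box]) -> int:
    """Token-by-token decoding: one decoding step per coordinate token."""
    return sum(len(serialize_box(b)) for b in boxes)


def atomic_steps(boxes: List[Box]) -> int:
    """Box-as-atomic-unit decoding (assumed): one step emits all four coordinates."""
    return len(boxes)


boxes = [(0.25, 0.5, 0.75, 1.0), (0.0, 0.0, 0.5, 0.5)]
print(serialize_box(boxes[0]))   # four separate 1D tokens for one 2D box
print(sequential_steps(boxes))   # steps when coordinates are emitted one by one
print(atomic_steps(boxes))       # steps when each box is emitted as a unit
```

Under these assumptions, the serialized scheme needs four steps per box while the atomic scheme needs one, which is the kind of saving the summary attributes to PBD. The sketch says nothing about accuracy, and it does not model any further parallelism across boxes.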
Entities
Institutions
- arXiv