ATLAS: A Single Word for Agentic and Latent Visual Reasoning
Researchers propose ATLAS, a framework that uses a single discrete 'word' called a functional token to combine agentic and latent visual reasoning. Agentic reasoning through code or tool calls incurs context-switching latency from external execution, while latent reasoning with learnable embeddings lacks task generalization and is difficult to train with autoregressive parallelization. ATLAS addresses both limitations by associating each functional token with an internalized visual operation that requires no visual supervision. The framework aims to combine the strengths of both approaches without their drawbacks. The paper is available on arXiv under identifier 2605.15198.
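The contrast described above can be made concrete with a toy sketch. It is not the paper's implementation: the token name `<crop>`, the `Trace` record, and both `run_*` functions are hypothetical, and no model is involved. It only illustrates the claim that an agentic step leaves the decoding loop to run an external tool, whereas a functional token is emitted like any other vocabulary item.

```python
from dataclasses import dataclass, field

# Hypothetical functional token; the summary does not name any specific token.
FUNC_TOKEN = "<crop>"


@dataclass
class Trace:
    """Records the emitted tokens and how often decoding was interrupted."""
    tokens: list = field(default_factory=list)
    context_switches: int = 0


def run_agentic(plan):
    """Agentic style: every tool step pauses decoding for an external call."""
    trace = Trace()
    for step in plan:
        if step == "TOOL":
            # Leave the decoding loop, run the tool, then resume with its result.
            trace.context_switches += 1
            trace.tokens.append("<tool_result>")
        else:
            trace.tokens.append(step)
    return trace


def run_functional(plan):
    """Functional-token style: the operation is one in-vocabulary token."""
    trace = Trace()
    for step in plan:
        # No external call, so decoding is never interrupted.
        trace.tokens.append(FUNC_TOKEN if step == "TOOL" else step)
    return trace


plan = ["look", "TOOL", "answer"]
agentic = run_agentic(plan)
functional = run_functional(plan)
print(agentic.context_switches, functional.context_switches)
```

In this toy setting the agentic trace records one interruption per tool step and the functional trace records none, which is the latency argument in the summary reduced to a counter.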
Key facts
- ATLAS is a framework for visual reasoning.
- It uses a single discrete 'word' called a functional token.
- The functional token serves as both an agentic operation and a latent visual reasoning unit.
- Agentic reasoning incurs context-switching latency from external execution.
- Latent reasoning lacks task generalization and is difficult to train with autoregressive parallelization.
- Each functional token is associated with an internalized visual operation.
- The internalized visual operations require no visual supervision.
- The paper is available on arXiv under identifier 2605.15198.
Entities
Institutions
- arXiv