HyperEyes: Parallel Multimodal Search Agent with Efficiency-Aware Training
HyperEyes is a parallel multimodal search agent that handles multiple entities within a single interaction round, whereas conventional sequential agents address one entity per tool call. It fuses visual grounding and retrieval into a single atomic action and treats inference efficiency as a first-class training objective. Training proceeds in two phases. In the first, a Parallel-Amenable Data Synthesis Pipeline produces cold-start supervision data for visual multi-entity and textual multi-constraint queries, with efficiency-oriented trajectories curated through Progressive Rejection Sampling. The central contribution is a Dual-Grained mechanism, which enhances both fine-grained and coarse-grained efficiency. The work is documented in arXiv:2605.07177.
Key facts
- HyperEyes is a parallel multimodal search agent.
- It processes multiple entities concurrently within one round.
- It fuses visual grounding and retrieval into a single atomic action.
- Inference efficiency is a first-class training objective.
- Training uses a Parallel-Amenable Data Synthesis Pipeline.
- Progressive Rejection Sampling curates efficiency-oriented trajectories.
- The central contribution is a Dual-Grained mechanism.
- Published as arXiv:2605.07177.
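The difference between a sequential agent and a parallel one can be illustrated with a small sketch. This is not the HyperEyes implementation: the `Entity` type, the `ground_and_retrieve` function, and the simulated latency are all hypothetical stand-ins, chosen only to show how a fused grounding-plus-retrieval action might be issued once per entity in sequence versus for all entities in one round.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical entity: a label plus a bounding box in the image."""
    name: str
    box: tuple  # (x0, y0, x1, y1), illustrative only


async def ground_and_retrieve(entity: Entity) -> str:
    """Hypothetical fused action: ground the region and retrieve in one call."""
    await asyncio.sleep(0.01)  # stands in for tool latency
    return f"result for {entity.name}"


async def sequential_agent(entities):
    """One entity per tool call, so one interaction round per entity."""
    results, rounds = [], 0
    for entity in entities:
        results.append(await ground_and_retrieve(entity))
        rounds += 1
    return results, rounds


async def parallel_agent(entities):
    """All entities dispatched concurrently within a single round."""
    results = await asyncio.gather(*(ground_and_retrieve(e) for e in entities))
    return list(results), 1


def run_demo():
    entities = [
        Entity("tower", (0, 0, 50, 120)),
        Entity("bridge", (60, 40, 200, 90)),
        Entity("statue", (210, 10, 250, 100)),
    ]
    seq = asyncio.run(sequential_agent(entities))
    par = asyncio.run(parallel_agent(entities))
    return seq, par


if __name__ == "__main__":
    (seq_results, seq_rounds), (par_results, par_rounds) = run_demo()
    print(seq_rounds, par_rounds)
```

Both variants return the same results in the same order; under these assumptions, only the number of interaction rounds differs.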
Entities
Institutions
- arXiv