RAR: Retrieving and Ranking Augmented MLLMs for Visual Recognition

ai-technology · 2026-05-18

Researchers introduce RAR, a method that combines CLIP and Multimodal Large Language Models (MLLMs) to improve few-shot and zero-shot visual recognition, particularly on datasets with large, fine-grained vocabularies. CLIP recognizes a broad range of candidates but struggles with fine-grained distinctions. MLLMs classify fine-grained categories well, but their performance declines as the number of categories grows, because of task complexity and context window limits. RAR uses a CLIP-based multi-modal retriever to select relevant candidates, then has an MLLM rank them, combining the strengths of both. The paper is available on arXiv (2403.13805).

Key facts

RAR stands for Retrieving And Ranking augmented method for MLLMs.
CLIP uses contrastive learning from noisy image-text pairs.
CLIP excels at recognizing a wide array of candidates.
MLLMs excel at classifying fine-grained categories.
MLLM performance declines as the number of categories increases.
RAR combines CLIP and MLLMs for few-shot/zero-shot recognition.
RAR uses a multi-modal retriever based on CLIP.
The paper is published on arXiv with ID 2403.13805.
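
The retrieve-then-rank flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy 2-D vectors stand in for CLIP embeddings, the `ranker` callable stands in for an MLLM that reorders the retrieved names, and every function and variable name here is hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: Vector, memory: Dict[str, Vector], k: int) -> List[str]:
    """Retrieval step: return the k category names most similar to the query.

    Assumption: `memory` maps category names to precomputed embeddings; in a
    CLIP-based retriever these would come from CLIP, here they are toy values.
    """
    by_similarity = sorted(
        memory, key=lambda name: cosine(query, memory[name]), reverse=True
    )
    return by_similarity[:k]


def retrieve_and_rank(
    query: Vector,
    memory: Dict[str, Vector],
    k: int,
    ranker: Callable[[List[str]], List[str]],
) -> List[str]:
    """Retrieve k candidates, then let `ranker` (an MLLM stand-in) reorder them."""
    candidates = retrieve_top_k(query, memory, k)
    ordered: List[str] = []
    for name in ranker(candidates):
        # Ignore names the ranker invented, and skip duplicates.
        if name in candidates and name not in ordered:
            ordered.append(name)
    # Append any candidates the ranker dropped, keeping retrieval order.
    ordered += [name for name in candidates if name not in ordered]
    return ordered


# Toy memory: two near-identical fine-grained classes plus two unrelated ones.
toy_memory = {
    "husky": [1.0, 0.0],
    "malamute": [0.9, 0.2],
    "tabby cat": [0.0, 1.0],
    "sedan": [-1.0, 0.0],
}
toy_query = [1.0, 0.1]

# Placeholder ranker: a real system would prompt an MLLM with the image and
# the candidate names. Reversing the list only demonstrates the reordering.
top_two = retrieve_and_rank(toy_query, toy_memory, 2, lambda names: names[::-1])
print(top_two)
```

The point of the split is that the retriever narrows a large vocabulary to a short candidate list, so the ranker only has to separate a few similar categories that fit comfortably in its context window.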
