ORCA Framework Enhances Vision-Language Model Accuracy and Adversarial Robustness
The ORCA framework introduces an agentic reasoning approach that tackles reliability challenges in Large Vision-Language Models (LVLMs). Although these models exhibit impressive multimodal abilities, they remain prone to hallucinations arising from intrinsic errors and are vulnerable to adversarial attacks. ORCA improves factual accuracy and adversarial robustness through structured reasoning at inference time, drawing on small vision models with fewer than 3B parameters. It operates via an Observe-Reason-Critique-Act loop: it queries multiple visual tools with evidential questions, validates inconsistencies across models, and iteratively refines its predictions, all without access to model internals or retraining. The framework also stores intermediate reasoning traces for auditable decision-making. The work is detailed on arXiv under the identifier arXiv:2509.15435v2.
Key facts
- ORCA is an agentic reasoning framework for Large Vision-Language Models
- LVLMs exhibit strong multimodal capabilities but have reliability limitations
- The framework improves factual accuracy and adversarial robustness
- ORCA uses small vision models with fewer than 3B parameters
- It operates via an Observe-Reason-Critique-Act loop
- The system queries multiple visual tools with evidential questions
- ORCA validates cross-model inconsistencies and refines predictions iteratively
- The framework stores intermediate reasoning traces for auditable decision-making
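The Observe-Reason-Critique-Act loop described above can be sketched in outline. The function and parameter names below (`orca_loop`, `tools`, `lvlm`, `max_steps`) are illustrative assumptions, not the paper's actual interfaces; the sketch only conveys the control flow of querying tools, checking cross-model inconsistencies, and refining iteratively while recording a trace.

```python
# Hypothetical sketch of ORCA's Observe-Reason-Critique-Act loop.
# All names are illustrative; the paper's real interfaces are not shown here.

def orca_loop(image, question, tools, lvlm, max_steps=4):
    prediction = lvlm(image, question)           # Observe: initial LVLM answer
    trace = [("observe", prediction)]            # auditable reasoning trace
    for _ in range(max_steps):
        # Reason: pose evidential questions to small vision tools
        evidence = {name: tool(image, question) for name, tool in tools.items()}
        trace.append(("reason", evidence))
        # Critique: flag cross-model inconsistencies with the current answer
        conflicts = [v for v in evidence.values() if v != prediction]
        trace.append(("critique", conflicts))
        if not conflicts:                        # evidence agrees: stop early
            break
        # Act: refine the prediction using the gathered evidence
        prediction = lvlm(image, question, evidence=evidence)
        trace.append(("act", prediction))
    return prediction, trace
```

Because the loop treats the LVLM and the vision tools as black-box callables, it matches the paper's claim of working without internal model access or retraining.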
Entities
Institutions
- arXiv