ORCA Framework Enhances Vision-Language Model Accuracy and Adversarial Robustness
The ORCA framework introduces an agentic reasoning approach that tackles reliability challenges in Large Vision-Language Models (LVLMs). Although these models exhibit impressive multimodal abilities, they remain prone to hallucinations arising from intrinsic errors and are vulnerable to adversarial attacks. ORCA improves factual accuracy and adversarial robustness through structured reasoning at inference time, drawing on small vision models with fewer than 3B parameters. It operates via an Observe-Reason-Critique-Act loop: it queries multiple visual tools with evidential questions, validates inconsistencies across models, and iteratively refines its predictions, all without access to model internals or retraining. The framework also stores intermediate reasoning traces for auditable decision-making. The work is detailed on arXiv under the identifier arXiv:2509.15435v2.
Key facts
- ORCA is an agentic reasoning framework for Large Vision-Language Models
- LVLMs exhibit strong multimodal capabilities but have reliability limitations
- The framework improves factual accuracy and adversarial robustness
- ORCA uses small vision models with fewer than 3B parameters
- It operates via an Observe-Reason-Critique-Act loop
- The system queries multiple visual tools with evidential questions
- ORCA validates cross-model inconsistencies and refines predictions iteratively
- The framework stores intermediate reasoning traces for auditable decision-making
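The Observe-Reason-Critique-Act loop described above can be sketched in outline. The function and parameter names below (`orca_loop`, `tools`, `lvlm`, `max_steps`) are illustrative assumptions, not the paper's actual interfaces; the sketch only conveys the control flow of querying tools, checking cross-model inconsistencies, and refining iteratively while recording a trace.

```python
# Hypothetical sketch of ORCA's Observe-Reason-Critique-Act loop.
# All names are illustrative; the paper's real interfaces are not shown here.

def orca_loop(image, question, tools, lvlm, max_steps=4):
    prediction = lvlm(image, question)           # Observe: initial LVLM answer
    trace = [("observe", prediction)]            # auditable reasoning trace
    for _ in range(max_steps):
        # Reason: pose evidential questions to small vision tools
        evidence = {name: tool(image, question) for name, tool in tools.items()}
        trace.append(("reason", evidence))
        # Critique: flag cross-model inconsistencies with the current answer
        conflicts = [v for v in evidence.values() if v != prediction]
        trace.append(("critique", conflicts))
        if not conflicts:                        # evidence agrees: stop early
            break
        # Act: refine the prediction using the gathered evidence
        prediction = lvlm(image, question, evidence=evidence)
        trace.append(("act", prediction))
    return prediction, trace
```

Because the loop treats the LVLM and the vision tools as black-box callables, it matches the paper's claim of working without internal model access or retraining.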
Entities
Institutions
- arXiv