Ultrasound VQA Enhanced by Active Zooming and Uncertainty Awareness
A new framework for ultrasound visual question answering (VQA) improves Vision-Language Model (VLM) performance by mimicking the cognitive workflow of sonographers. It introduces a Zoom-then-Diagnose paradigm in which the model first interactively focuses on the lesion region and then makes a diagnosis, addressing the lack of structured, lesion-focused reasoning in existing VLMs. It also incorporates uncertainty-aware rewards into the Group Relative Policy Optimization (GRPO) framework, so that the inherent subjectivity and ambiguity of medical annotations are accounted for rather than treated as unbiased ground truth. The work, published as arXiv:2605.21652, targets the suboptimal performance of VLMs on ultrasound by replicating the interactive search process used in clinical practice.
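To make the idea of an uncertainty-aware reward inside GRPO more concrete, here is a minimal sketch in Python. The summary above does not specify the paper's reward, so `soft_reward` (the fraction of annotators who agree with the predicted label) is a hypothetical stand-in chosen purely for illustration, and the labels and votes are made-up example data. The only part taken from GRPO itself is the group-relative advantage, which standardizes each sampled answer's reward against the other answers in its group; the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def soft_reward(predicted, annotator_votes):
    """Hypothetical uncertainty-aware reward: share of annotators agreeing
    with the prediction, instead of a hard 0/1 match against one label."""
    if not annotator_votes:
        raise ValueError("need at least one annotator vote")
    return annotator_votes.count(predicted) / len(annotator_votes)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group.
    Population standard deviation is used here as an assumption."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative data only: three annotators disagree on one image.
votes = ["malignant", "malignant", "benign"]
# A group of four answers sampled from the model for the same question.
group_answers = ["malignant", "benign", "malignant", "benign"]

rewards = [soft_reward(answer, votes) for answer in group_answers]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

With a soft reward of this kind, an answer that matches the minority annotation still earns partial credit, so the policy is not penalized as if the majority label were certain.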
Key facts
- Proposes a Zoom-then-Diagnose paradigm for lesion-focused reasoning
- Uses uncertainty-aware rewards in the GRPO framework
- Addresses subjectivity in medical annotations
- Targets ultrasound VQA performance improvement
- Published as arXiv:2605.21652
- Replicates sonographers' cognitive workflow
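The zooming step of Zoom-then-Diagnose can be pictured as cropping a candidate lesion box and enlarging it before the diagnostic question is asked. The sketch below is an assumption-laden illustration, not the paper's implementation: the function `zoom`, the normalized `(x0, y0, x1, y1)` box format, and nearest-neighbor upscaling are all hypothetical choices, and the image is a plain nested list so the example has no dependencies.

```python
def zoom(image, box, scale=2):
    """Crop a normalized (x0, y0, x1, y1) box from a 2D image (list of rows)
    and enlarge it by repeating each pixel `scale` times in both directions."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    # Convert normalized coordinates to pixel indices, keeping at least 1 pixel.
    col0 = int(x0 * width)
    col1 = max(col0 + 1, int(x1 * width))
    row0 = int(y0 * height)
    row1 = max(row0 + 1, int(y1 * height))
    crop = [row[col0:col1] for row in image[row0:row1]]
    # Nearest-neighbor upscaling: repeat pixels within a row, then repeat rows.
    return [
        [pixel for pixel in row for _ in range(scale)]
        for row in crop
        for _ in range(scale)
    ]


# Toy 4x4 "image" whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
# Hypothetical box proposed by the model: the lower-right quadrant.
zoomed = zoom(image, (0.5, 0.5, 1.0, 1.0), scale=2)
for row in zoomed:
    print(row)
```

In a full pipeline the zoomed view would be passed back to the VLM together with the question; that second stage is omitted here.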
Entities
Institutions
- arXiv