ReXSonoVQA: Benchmarking VLMs on Ultrasound Video Understanding
Researchers have released ReXSonoVQA, a benchmark for assessing vision-language models (VLMs) on ultrasound video understanding. The dataset pairs 514 video clips with 514 questions: 249 multiple-choice and 265 open-ended. It targets three competencies: action-goal reasoning, artifact resolution and optimization, and procedure context and planning. Evaluations of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro found that while the models show some procedural understanding, they struggle with troubleshooting scenarios and gain little over text-only baselines on causal reasoning.
Key facts
- ReXSonoVQA is a video QA benchmark for procedure-centric ultrasound understanding.
- It contains 514 video clips and 514 questions (249 MCQ, 265 free-response).
- Three competencies are targeted: Action-Goal Reasoning, Artifact Resolution & Optimization, Procedure Context & Planning.
- Models evaluated: Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, Seed 2.0 Pro.
- VLMs show limited performance on troubleshooting questions.
- Minimal gains over text-only baselines in causal reasoning.
- Benchmark aims to enable autonomous ultrasound systems.
- Published on arXiv (2604.10916).
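To make the benchmark's composition concrete, here is a minimal sketch of how such a video-QA benchmark could be represented and scored. All field and function names are hypothetical (the paper's actual data format is not given here); only the split sizes and competency names come from the summary above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical item schema for a procedure-centric ultrasound video-QA
# benchmark; field names are illustrative, not taken from ReXSonoVQA.
@dataclass
class SonoVQAItem:
    video_id: str
    question: str
    question_type: str        # "mcq" (249 items) or "open" (265 items)
    competency: str           # e.g. "Action-Goal Reasoning",
                              # "Artifact Resolution & Optimization",
                              # "Procedure Context & Planning"
    choices: Optional[list]   # answer options for MCQ items, None for open
    answer: str               # gold answer (option label or reference text)

def mcq_accuracy(items, predictions):
    """Fraction of multiple-choice items answered correctly.

    `predictions` maps video_id -> predicted option label.
    Open-ended items are skipped; they would need a separate
    free-text scoring scheme (e.g. LLM-as-judge or string match).
    """
    mcq = [it for it in items if it.question_type == "mcq"]
    if not mcq:
        return 0.0
    correct = sum(1 for it in mcq if predictions.get(it.video_id) == it.answer)
    return correct / len(mcq)
```

Scoring per competency would simply filter `items` on the `competency` field before calling `mcq_accuracy`, which is how a per-axis breakdown (e.g. troubleshooting vs. planning) could be reported.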