ReXSonoVQA: Benchmarking VLMs on Ultrasound Video Understanding
Researchers have released ReXSonoVQA, a benchmark for assessing vision-language models (VLMs) on ultrasound video understanding. The dataset pairs 514 video clips with 514 questions: 249 multiple-choice and 265 open-ended. It targets three competencies: action-goal reasoning, artifact resolution and optimization, and procedure context and planning. Evaluations of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro found that while the models show some procedural understanding, they struggle with troubleshooting scenarios and gain little over text-only baselines on causal reasoning.
Key facts
- ReXSonoVQA is a video QA benchmark for procedure-centric ultrasound understanding.
- It contains 514 video clips and 514 questions (249 MCQ, 265 free-response).
- Three competencies are targeted: Action-Goal Reasoning, Artifact Resolution & Optimization, Procedure Context & Planning.
- Models evaluated: Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, Seed 2.0 Pro.
- VLMs show limited performance on troubleshooting questions.
- Minimal gains over text-only baselines in causal reasoning.
- Benchmark aims to enable autonomous ultrasound systems.
- Published on arXiv (2604.10916).
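To make the benchmark's composition concrete, here is a minimal sketch of how such a video-QA benchmark could be represented and scored. All field and function names are hypothetical (the paper's actual data format is not given here); only the split sizes and competency names come from the summary above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical item schema for a procedure-centric ultrasound video-QA
# benchmark; field names are illustrative, not taken from ReXSonoVQA.
@dataclass
class SonoVQAItem:
    video_id: str
    question: str
    question_type: str        # "mcq" (249 items) or "open" (265 items)
    competency: str           # e.g. "Action-Goal Reasoning",
                              # "Artifact Resolution & Optimization",
                              # "Procedure Context & Planning"
    choices: Optional[list]   # answer options for MCQ items, None for open
    answer: str               # gold answer (option label or reference text)

def mcq_accuracy(items, predictions):
    """Fraction of multiple-choice items answered correctly.

    `predictions` maps video_id -> predicted option label.
    Open-ended items are skipped; they would need a separate
    free-text scoring scheme (e.g. LLM-as-judge or string match).
    """
    mcq = [it for it in items if it.question_type == "mcq"]
    if not mcq:
        return 0.0
    correct = sum(1 for it in mcq if predictions.get(it.video_id) == it.answer)
    return correct / len(mcq)
```

Scoring per competency would simply filter `items` on the `competency` field before calling `mcq_accuracy`, which is how a per-axis breakdown (e.g. troubleshooting vs. planning) could be reported.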