New AI Benchmark SGMRI-VQA Introduces Multi-Frame Spatial Reasoning for Medical Imaging
Spatially Grounded MRI Visual Question Answering (SGMRI-VQA) is a new benchmark for evaluating vision-language models on volumetric medical imaging. It comprises 41,307 question-answer pairs built from expert radiologist annotations in the fastMRI+ dataset, covering brain and knee MRI studies. Unlike traditional benchmarks that evaluate models on isolated 2D images, SGMRI-VQA addresses the volumetric nature of clinical imaging, where a finding may span multiple frames or appear in only a few slices. Each QA pair carries a clinician-aligned reasoning trace with frame-indexed bounding box coordinates, grounding every reasoning step in explicit spatial and frame context. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, so models must jointly reason about what is present, where it is located, and across which frames it appears. The benchmark, announced on arXiv under identifier arXiv:2604.15808v1, aims to advance spatial reasoning and visual grounding in medical vision-language models.
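The announcement does not publish a record schema, but the structure it describes (a question, an answer, a hierarchical task type, and a reasoning trace whose steps carry frame-indexed bounding boxes) maps naturally onto a small data model. The Python sketch below is illustrative only; every class and field name is a hypothetical assumption, not the benchmark's actual format.

```python
from dataclasses import dataclass, field
from enum import Enum

class Task(Enum):
    # Hierarchical task levels named in the announcement
    DETECTION = "detection"
    LOCALIZATION = "localization"
    COUNTING_CLASSIFICATION = "counting/classification"
    CAPTIONING = "captioning"

@dataclass
class FrameBox:
    """A bounding box tied to one slice (frame) of the MRI volume."""
    frame: int                               # slice index within the volume
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels

@dataclass
class ReasoningStep:
    """One step of the clinician-aligned reasoning trace."""
    text: str
    boxes: list[FrameBox] = field(default_factory=list)

@dataclass
class QAPair:
    """One SGMRI-VQA example (all field names are hypothetical)."""
    study_id: str        # a fastMRI+ brain or knee study
    task: Task
    question: str
    answer: str
    trace: list[ReasoningStep]

# Example instance for a localization-style question
example = QAPair(
    study_id="brain_0001",
    task=Task.LOCALIZATION,
    question="Where is the lesion, and on which slices is it visible?",
    answer="A lesion is visible on slices 11-13 in the left hemisphere.",
    trace=[
        ReasoningStep(
            text="A hyperintense region appears on slice 11.",
            boxes=[FrameBox(frame=11, box=(84.0, 60.0, 132.0, 110.0))],
        ),
        ReasoningStep(
            text="The same finding persists on slices 12 and 13.",
            boxes=[
                FrameBox(frame=12, box=(86.0, 62.0, 130.0, 108.0)),
                FrameBox(frame=13, box=(88.0, 64.0, 128.0, 104.0)),
            ],
        ),
    ],
)
```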
Key facts
- SGMRI-VQA is a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI
- Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies
- Each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates
- Tasks are organized hierarchically across detection, localization, counting/classification, and captioning
- Addresses limitations of existing benchmarks that evaluate VLMs on isolated 2D images
- Cross-listed on arXiv under identifier arXiv:2604.15808v1
- Requires models to jointly reason about what is present, where it is, and across which frames (see the scoring sketch after this list)
- Targets spatial reasoning and visual grounding capabilities for vision-language models in medical contexts
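The announcement does not specify how answers are scored, but the joint what/where/which-frames requirement implies that a grounded answer must match the reference boxes on the correct slices, not merely somewhere in the volume. As one plausible, entirely assumed scoring rule, the sketch below counts a predicted box as correct only if it lands on the annotated frame and overlaps the reference box above an IoU threshold; this is a standard IoU-matching recipe, not the benchmark's published metric.

```python
def iou(a: tuple[float, float, float, float],
        b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def frame_grounded_recall(
    predicted: list[tuple[int, tuple[float, float, float, float]]],
    reference: list[tuple[int, tuple[float, float, float, float]]],
    threshold: float = 0.5,
) -> float:
    """Fraction of reference (frame, box) annotations matched by a
    prediction on the SAME frame with IoU >= threshold. A box on the
    wrong slice scores zero, which enforces multi-frame grounding.
    (Assumed metric, not the benchmark's published one.)"""
    matched = 0
    used = set()  # each prediction may match at most one reference
    for r_frame, r_box in reference:
        for i, (p_frame, p_box) in enumerate(predicted):
            if i in used or p_frame != r_frame:
                continue
            if iou(p_box, r_box) >= threshold:
                matched += 1
                used.add(i)
                break
    return matched / len(reference) if reference else 1.0

# Example: one prediction on the right frame, one on the wrong frame
ref = [(11, (84.0, 60.0, 132.0, 110.0)), (12, (86.0, 62.0, 130.0, 108.0))]
pred = [(11, (80.0, 58.0, 130.0, 112.0)), (14, (86.0, 62.0, 130.0, 108.0))]
print(frame_grounded_recall(pred, ref))  # 0.5: the slice-14 box misses
```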
Entities
Institutions
- arXiv