EuropeMedQA: Multilingual Medical Exam Dataset for LLM Evaluation

ai-technology · 2026-04-27

The EuropeMedQA study protocol presents the first comprehensive multilingual and multimodal medical examination dataset, drawn from official regulatory exams in Italy, France, Spain, and Portugal. The dataset is designed to test large language models (LLMs) on non-English text and visual diagnostic tasks, and it follows FAIR data principles and the SPIRIT-AI guidelines. The protocol describes a rigorous curation process and an automated translation pipeline for comparative analysis. Current multimodal LLMs are evaluated zero-shot with strictly constrained prompting to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practice and improves the generalizability of medical AI.
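To make "zero-shot, strictly constrained prompting" concrete, the following is a minimal sketch of how such an evaluation could be wired up for text-only multiple-choice items. It is not the protocol's actual implementation: the function names (`build_prompt`, `parse_answer`, `accuracy`), the instruction wording, and the single-letter answer format are all assumptions made for illustration, and image inputs are left out entirely.

```python
import re
from typing import Optional


def build_prompt(question: str, options: dict[str, str]) -> str:
    """Build a zero-shot prompt: no worked examples, only the item itself.

    The instruction wording is a hypothetical stand-in, not the protocol's.
    """
    lines = [question, ""]
    for label in sorted(options):
        lines.append(f"{label}. {options[label]}")
    lines.append("")
    lines.append("Answer with exactly one option letter and nothing else.")
    return "\n".join(lines)


def parse_answer(response: str, valid_labels: set[str]) -> Optional[str]:
    """Accept a reply only if it is a single valid option letter.

    Anything else (explanations, several letters, unknown labels) is
    treated as a non-answer and returns None.
    """
    candidate = response.strip().upper()
    match = re.fullmatch(r"([A-Z])\.?", candidate)
    if match and match.group(1) in valid_labels:
        return match.group(1)
    return None


def accuracy(predictions: list[Optional[str]], gold: list[str]) -> float:
    """Fraction of items answered correctly; None counts as wrong."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Under this assumed scheme, a reply that does not follow the required format is scored as incorrect rather than being repaired by looser parsing, which keeps scores comparable across models and languages.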

Key facts

EuropeMedQA is the first comprehensive multilingual and multimodal medical examination dataset.
The dataset is sourced from official regulatory exams in Italy, France, Spain, and Portugal.
Follows FAIR data principles and SPIRIT-AI guidelines.
Includes an automated translation pipeline for comparative analysis.
Evaluates multimodal LLMs using zero-shot, strictly constrained prompting.
Aims to assess cross-lingual transfer and visual reasoning.
Designed as a contamination-resistant benchmark.
Reflects the complexity of European clinical practice.
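One way the comparative analysis enabled by the translation pipeline might be summarized is as a per-language accuracy gap between original and translated items. The sketch below assumes a hypothetical record layout (`language`, `variant`, `correct`) and the variant labels "original" and "translated"; none of these names come from the protocol.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per (language, variant) pair.

    Each record is a dict with the hypothetical keys 'language',
    'variant' ("original" or "translated"), and 'correct' (bool).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = (record["language"], record["variant"])
        totals[key] += 1
        hits[key] += int(record["correct"])
    return {key: hits[key] / totals[key] for key in totals}


def translation_gap(acc, language):
    """Accuracy on translated items minus accuracy on original items.

    A positive gap means the model scored higher on the translated
    version of the exam than on the source-language version.
    """
    return acc[(language, "translated")] - acc[(language, "original")]


# Toy, made-up records purely to show the call pattern.
records = [
    {"language": "it", "variant": "original", "correct": True},
    {"language": "it", "variant": "original", "correct": False},
    {"language": "it", "variant": "translated", "correct": True},
    {"language": "it", "variant": "translated", "correct": True},
]
acc = accuracy_by_group(records)
gap = translation_gap(acc, "it")
```

The records above are invented for illustration and carry no information about any real model's performance.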

Entities

Locations

Italy
France
Spain
Portugal
Europe

Sources

arXiv cs.AI — 2026-04-27