Dutch medical language corpus released on Hugging Face

other · 2026-04-30

Researchers have created the first large-scale Dutch medical language corpus, comprising approximately 35 billion tokens across 100 million documents. The corpus was built by translating English datasets, identifying medical text in generic corpora, and extracting open Dutch medical resources. It is freely available on Hugging Face for pre-training and downstream NLP tasks, addressing the scarcity of Dutch medical corpora.

Key facts

Dutch medical corpora are scarce, limiting NLP development.
The corpus was built by translating English datasets, identifying medical text in generic corpora, and extracting open Dutch medical resources.
The corpus comprises approximately 35 billion tokens.
The corpus spans about 100 million documents.
The corpus is freely available on Hugging Face.
This is the first large-scale Dutch medical language corpus.
The corpus is intended for pre-training and downstream NLP tasks.
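
To make the "identifying medical text in generic corpora" step concrete, here is a minimal sketch of one possible approach: a keyword-ratio filter. This is an assumption for illustration only; the summary does not say how the researchers actually identified medical text. The term list, the threshold, and the function names below are all hypothetical.

```python
import re

# Hypothetical list of Dutch medical terms, chosen only for illustration.
MEDICAL_TERMS = {"arts", "diagnose", "behandeling", "medicatie", "symptomen", "ziekenhuis"}

# Split text into word tokens.
TOKEN_RE = re.compile(r"\w+")


def medical_term_ratio(text: str) -> float:
    """Return the fraction of tokens that appear in MEDICAL_TERMS."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in MEDICAL_TERMS)
    return hits / len(tokens)


def is_medical(text: str, threshold: float = 0.1) -> bool:
    """Flag a document as medical when enough of its tokens are medical terms.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return medical_term_ratio(text) >= threshold


docs = [
    "De arts stelde een diagnose en startte de behandeling in het ziekenhuis.",
    "Het weer in Amsterdam is vandaag zonnig en warm.",
]

# Keep only the documents the filter flags as medical.
medical_docs = [d for d in docs if is_medical(d)]
```

A filter like this is cheap enough to run over a large generic corpus, but it would miss medical text that uses vocabulary outside the list; a trained classifier is a common alternative when labeled examples are available.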