LLMs as Data Factories for Clinical Code Retrieval in Non-English Languages

ai-technology · 2026-06-01

A recent study published on arXiv (2605.30529) examines whether large generative language models can produce synthetic training data that improves clinical code retrieval in non-English languages. The researchers built a two-stage retrieval system, a bi-encoder followed by a cross-encoder reranker, by fine-tuning a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on synthetic data generated by Gemini. Without any prior English biomedical training, the bi-encoder alone reached a Mean Reciprocal Rank (MRR) of 0.876 against 0.866 for BioBERT-ST, and it also led on Recall@3 (0.650 vs. 0.626) and Recall@5 (0.804 vs. 0.790). Adding the cross-encoder reranker raised aggregate Recall@5 to 0.822. The work targets the drop in recall for ICD-10-CM/CIE-10 codes outside English, a shortfall that aggregate benchmarks often obscure.
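To make the two-stage design concrete, here is a minimal sketch of the retrieve-then-rerank pattern. It is not the paper's implementation: the tiny code catalogue, the character-trigram `embed` function (standing in for the bi-encoder) and the token-overlap `pair_score` function (standing in for the cross-encoder) are all hypothetical placeholders chosen so the example runs without downloading any model.

```python
import math
from collections import Counter

# Hypothetical toy catalogue: code -> Spanish description (illustrative only).
CODES = {
    "E11.9": "diabetes mellitus tipo 2 sin complicaciones",
    "E10.9": "diabetes mellitus tipo 1 sin complicaciones",
    "I10": "hipertension esencial primaria",
    "J45.909": "asma no especificada sin complicaciones",
}


def embed(text):
    """Stand-in for a bi-encoder: character-trigram counts, not a neural model.
    The key property is that query and candidate are encoded independently."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_score(query, description):
    """Stand-in for a cross-encoder: scores the (query, candidate) pair jointly.
    Here it is simply the fraction of query tokens found in the description."""
    tokens = query.lower().split()
    desc_tokens = set(description.lower().split())
    return sum(t in desc_tokens for t in tokens) / len(tokens) if tokens else 0.0


def retrieve(query, k=3):
    """Stage 1: rank every code by embedding similarity and keep the top k."""
    q_vec = embed(query)
    scored = [(code, cosine(q_vec, embed(desc))) for code, desc in CODES.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [code for code, _ in scored[:k]]


def rerank(query, candidates):
    """Stage 2: reorder only the shortlisted candidates with the pair scorer."""
    return sorted(candidates, key=lambda c: pair_score(query, CODES[c]), reverse=True)


def search(query, k=3):
    """Full pipeline: cheap retrieval over everything, costly scoring on k items."""
    return rerank(query, retrieve(query, k))


if __name__ == "__main__":
    print(search("diabetes tipo 2"))
```

The split matters for cost: the first stage can precompute candidate embeddings once, while the pairwise scorer must run once per (query, candidate) pair and is therefore applied only to the shortlist.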

Key facts

arXiv paper 2605.30529 studies clinical code retrieval in non-English languages
Uses Gemini-generated synthetic data for training
Two-stage retriever: bi-encoder followed by cross-encoder reranker
Base model: PlanTL-GOB-ES/bsc-bio-ehr-es (Spanish biomedical encoder)
Languages covered: English, Spanish, Catalan, Italian, Portuguese, French
Bi-encoder alone achieves MRR 0.876 vs BioBERT-ST 0.866
Bi-encoder Recall@3: 0.650 vs BioBERT-ST 0.626
Bi-encoder Recall@5: 0.804 vs BioBERT-ST 0.790
Cross-encoder reranker lifts aggregate Recall@5 to 0.822
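For readers unfamiliar with the metrics quoted above, the following sketch shows one common way to compute MRR and Recall@k. The exact definitions used in the paper are not given in this summary, so the choices here are assumptions: each query may have several gold codes, the reciprocal rank uses the first relevant code in the ranking, and Recall@k is the fraction of a query's gold codes found in the top k, averaged over queries. The example runs and code lists are invented for illustration.

```python
def reciprocal_rank(ranked, gold):
    """1 / rank of the first relevant code, or 0.0 if none is retrieved."""
    for position, code in enumerate(ranked, start=1):
        if code in gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, gold_set) pairs."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


def recall_at_k(runs, k):
    """Average fraction of each query's gold codes that appear in the top k."""
    total = 0.0
    for ranked, gold in runs:
        total += len(gold & set(ranked[:k])) / len(gold)
    return total / len(runs)


# Hypothetical runs: (system ranking, set of gold codes) per query.
runs = [
    (["E11.9", "I10", "E10.9", "J45.909"], {"E11.9", "I10"}),
    (["I10", "J45.909", "E11.9", "E10.9"], {"J45.909", "E10.9"}),
    (["I10", "E10.9", "E11.9"], {"J45.909"}),
    (["I10", "E10.9", "E11.9", "J45.909"], {"J45.909"}),
]

if __name__ == "__main__":
    print(mean_reciprocal_rank(runs))  # 0.4375
    print(recall_at_k(runs, 3))        # 0.375
    print(recall_at_k(runs, 5))        # 0.75
```

Under these definitions MRR can sit well above Recall@3, as in the reported figures, because MRR rewards finding any one correct code early while recall requires finding all of them.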
