LLMs as Data Factories for Clinical Code Retrieval in Non-English Languages
A recent study posted to arXiv (2605.30529) explores whether large generative language models can produce synthetic training data that improves clinical code retrieval in non-English languages. The researchers built a two-stage retrieval system, a bi-encoder followed by a cross-encoder reranker, by fine-tuning a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on synthetic data generated by Gemini. Despite having no prior English biomedical training, the bi-encoder alone reached a Mean Reciprocal Rank (MRR) of 0.876 against BioBERT-ST's 0.866, and also led on Recall@3 (0.650 vs. 0.626) and Recall@5 (0.804 vs. 0.790). Adding the cross-encoder reranker raised overall Recall@5 to 0.822. The work targets the drop in recall for ICD-10-CM/CIE-10 codes outside English, a shortfall that aggregate benchmarks often obscure.
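For readers unfamiliar with the metrics quoted above, the sketch below shows one common way to compute MRR and Recall@k. The summary does not say exactly how the paper defines them, so the multi-label definitions used here (reciprocal rank of the first correct code, and the share of gold codes found in the top k) are an assumption. The function names and the toy rankings are illustrative and are not taken from the paper.

```python
from typing import Iterable, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """Return 1/position of the first gold code in the ranking, or 0.0 if none appears."""
    for position, code in enumerate(ranked, start=1):
        if code in gold:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Return the fraction of gold codes that appear among the top-k ranked codes."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean that returns 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy evaluation set (hypothetical): each entry is (ranked codes, gold codes).
queries = [
    (["E11.9", "I10", "J45.909"], {"E11.9", "J45.909", "N18.3"}),
    (["I10", "E11.9", "J45.909"], {"E11.9"}),
]

mrr = mean(reciprocal_rank(ranked, gold) for ranked, gold in queries)
recall_at_3 = mean(recall_at_k(ranked, gold, 3) for ranked, gold in queries)
print(f"MRR={mrr:.3f}  Recall@3={recall_at_3:.3f}")
```

Under these definitions, a query with several gold codes can score a perfect reciprocal rank while still missing some codes in the top k, as the first toy query does.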
Key facts
- arXiv paper 2605.30529 studies clinical code retrieval in non-English languages
- Uses Gemini-generated synthetic data for training
- Two-stage retriever: bi-encoder followed by cross-encoder reranker
- Base model: PlanTL-GOB-ES/bsc-bio-ehr-es (Spanish biomedical encoder)
- Languages covered: English, Spanish, Catalan, Italian, Portuguese, French
- Bi-encoder alone achieves MRR 0.876 vs BioBERT-ST 0.866
- Bi-encoder Recall@3: 0.650 vs BioBERT-ST 0.626
- Bi-encoder Recall@5: 0.804 vs BioBERT-ST 0.790
- Cross-encoder reranker lifts aggregate Recall@5 to 0.822
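The two-stage design listed above (a fast bi-encoder that shortlists codes, then a cross-encoder that rescores the shortlist) can be sketched roughly as follows. This is a minimal illustration of the retrieve-then-rerank control flow only: the character-trigram cosine and the token-overlap scorer are dependency-free stand-ins for the fine-tuned models, and the catalog, function names, and cutoffs are all hypothetical rather than taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List


def trigram_vector(text: str) -> Counter:
    """Stand-in 'embedding': character-trigram counts, computed independently per text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_score(query: str, description: str) -> float:
    """Stand-in for a cross-encoder: scores the (query, description) pair jointly."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


def retrieve_then_rerank(query: str, catalog: Dict[str, str],
                         first_stage_k: int = 3, final_k: int = 2) -> List[str]:
    # Stage 1 (bi-encoder role): embed query and descriptions separately, keep the top candidates.
    query_vec = trigram_vector(query)
    candidates = sorted(
        catalog,
        key=lambda code: cosine(query_vec, trigram_vector(catalog[code])),
        reverse=True,
    )[:first_stage_k]
    # Stage 2 (cross-encoder role): rescore only the shortlisted candidates with the pair scorer.
    reranked = sorted(candidates, key=lambda code: pair_score(query, catalog[code]), reverse=True)
    return reranked[:final_k]


# Hypothetical mini-catalog of code descriptions in Spanish.
catalog = {
    "E11.9": "diabetes mellitus tipo 2 sin complicaciones",
    "I10": "hipertensión esencial primaria",
    "J45.909": "asma no especificada sin complicaciones",
    "E10.9": "diabetes mellitus tipo 1 sin complicaciones",
}
top_codes = retrieve_then_rerank("paciente con diabetes mellitus tipo 2", catalog)
print(top_codes)
```

The split matters because the first stage can precompute code embeddings once and search them cheaply, while the more expensive pairwise scorer only runs on a handful of candidates per query.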
Entities
Institutions
- arXiv
- PlanTL-GOB-ES
- BioBERT-ST
- Gemini