Research Compares Continual Pretraining and GraphRAG for Biomedical Language Models
A study explores two distinct methods for integrating structured biomedical knowledge into language models, comparing continual pretraining with Graph Retrieval-Augmented Generation. Researchers constructed a large-scale biomedical knowledge graph from the UMLS Metathesaurus, containing 3.4 million concepts and 34.2 million relations, stored in Neo4j for efficient querying. From this graph, they derived a textual corpus of approximately 100 million tokens to continually pretrain two models: BERTUMLS (starting from BERT) and BioBERTUMLS (starting from BioBERT). The research evaluates these models on six datasets from the Biomedical Language Understanding and Reasoning Benchmark (BLURB), spanning five different task types. The work investigates how systematic injection of structured UMLS knowledge, rather than reliance on unstructured text corpora alone, can improve the performance of language models on specialized biomedical language understanding tasks.
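The paper's export pipeline is not described here, but a minimal sketch of how concept-relation-concept triples could be pulled from Neo4j and verbalized into pretraining text might look like the following. The connection settings, the node label and property names, and the sentence template are all assumptions for illustration, not the authors' implementation.

```python
from neo4j import GraphDatabase

# Hypothetical connection settings; the study's actual Neo4j deployment is not described.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed schema: (:Concept {name}) nodes joined by relationships whose type carries
# the UMLS relation label (e.g. ISA, MAY_TREAT). Real labels and properties may differ.
QUERY = """
MATCH (h:Concept)-[r]->(t:Concept)
RETURN h.name AS head, type(r) AS relation, t.name AS tail
LIMIT $limit
"""

def verbalize(head: str, relation: str, tail: str) -> str:
    """Turn one (head, relation, tail) triple into a plain-text sentence."""
    return f"{head} {relation.replace('_', ' ').lower()} {tail}."

def build_corpus(limit: int = 1_000_000, path: str = "umls_corpus.txt") -> None:
    """Stream triples from the graph and write one verbalized sentence per line."""
    with driver.session() as session, open(path, "w", encoding="utf-8") as out:
        for record in session.run(QUERY, limit=limit):
            out.write(verbalize(record["head"], record["relation"], record["tail"]) + "\n")

if __name__ == "__main__":
    build_corpus()
```

Richer templates (per-relation phrasings, concept definitions, synonym expansion) would plausibly be needed to reach roughly 100 million tokens; the triple-per-line scheme above is only the simplest variant.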
Key facts
- Study explores two strategies for injecting structured biomedical knowledge into language models: continual pretraining and Graph Retrieval-Augmented Generation (GraphRAG)
- Research uses structured knowledge from UMLS Metathesaurus
- Constructed biomedical knowledge graph contains 3.4 million concepts and 34.2 million relations
- Knowledge graph stored in Neo4j for efficient querying
- Derived ~100-million-token textual corpus from knowledge graph
- Continually pretrained two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT); see the pretraining sketch after this list
- Evaluation conducted on six BLURB datasets spanning five task types
- arXiv paper identifier: 2604.16422v1
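The study's exact continual-pretraining setup is not reproduced here, but the step amounts to resuming masked-language-model training from an existing checkpoint on the graph-derived corpus. The sketch below uses Hugging Face Transformers; the checkpoint names (bert-base-uncased, dmis-lab/biobert-v1.1), the corpus filename, and every hyperparameter are assumptions, not values reported in the study.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from BERT for BERTUMLS, or from a BioBERT checkpoint for BioBERTUMLS.
checkpoint = "bert-base-uncased"  # e.g. "dmis-lab/biobert-v1.1" for the BioBERT variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Verbalized UMLS corpus (one sentence per line), as in the earlier hypothetical sketch.
dataset = load_dataset("text", data_files={"train": "umls_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modeling objective drives the continual pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert_umls",            # hypothetical output directory
    per_device_train_batch_size=32,    # assumed batch size
    num_train_epochs=1,                # assumed number of passes over the corpus
    learning_rate=5e-5,                # assumed learning rate
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

After this step, the resulting checkpoint would be fine-tuned and evaluated on the six BLURB datasets in the usual task-specific fashion; GraphRAG, by contrast, leaves the base model unchanged and instead retrieves relevant graph facts at inference time.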