Research Compares Continual Pretraining and GraphRAG for Biomedical Language Models
A study explores two distinct methods for integrating structured biomedical knowledge into language models, comparing continual pretraining with Graph Retrieval-Augmented Generation. Researchers constructed a large-scale biomedical knowledge graph from the UMLS Metathesaurus, containing 3.4 million concepts and 34.2 million relations, stored in Neo4j for efficient querying. From this graph, they derived a textual corpus of approximately 100 million tokens to continually pretrain two models: BERTUMLS (starting from BERT) and BioBERTUMLS (starting from BioBERT). The research evaluates these models on six datasets from the Biomedical Language Understanding and Reasoning Benchmark (BLURB), spanning five different task types. The work investigates how systematic injection of structured UMLS knowledge, rather than reliance on unstructured text corpora alone, can improve the performance of language models on specialized biomedical language understanding tasks.
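The paper's export pipeline is not described here, but a minimal sketch of how concept-relation-concept triples could be pulled from Neo4j and verbalized into pretraining text might look like the following. The connection settings, the node label and property names, and the sentence template are all assumptions for illustration, not the authors' implementation.

```python
from neo4j import GraphDatabase

# Hypothetical connection settings; the study's actual Neo4j deployment is not described.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed schema: (:Concept {name}) nodes joined by relationships whose type carries
# the UMLS relation label (e.g. ISA, MAY_TREAT). Real labels and properties may differ.
QUERY = """
MATCH (h:Concept)-[r]->(t:Concept)
RETURN h.name AS head, type(r) AS relation, t.name AS tail
LIMIT $limit
"""

def verbalize(head: str, relation: str, tail: str) -> str:
    """Turn one (head, relation, tail) triple into a plain-text sentence."""
    return f"{head} {relation.replace('_', ' ').lower()} {tail}."

def build_corpus(limit: int = 1_000_000, path: str = "umls_corpus.txt") -> None:
    """Stream triples from the graph and write one verbalized sentence per line."""
    with driver.session() as session, open(path, "w", encoding="utf-8") as out:
        for record in session.run(QUERY, limit=limit):
            out.write(verbalize(record["head"], record["relation"], record["tail"]) + "\n")

if __name__ == "__main__":
    build_corpus()
```

Richer templates (per-relation phrasings, concept definitions, synonym expansion) would plausibly be needed to reach roughly 100 million tokens; the triple-per-line scheme above is only the simplest variant.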
Key facts
- Study explores two strategies for injecting structured biomedical knowledge into language models: continual pretraining and Graph Retrieval-Augmented Generation (GraphRAG)
- Research uses structured knowledge from UMLS Metathesaurus
- Constructed biomedical knowledge graph contains 3.4 million concepts and 34.2 million relations
- Knowledge graph stored in Neo4j for efficient querying
- Derived ~100-million-token textual corpus from knowledge graph
- Continually pretrained two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT); see the pretraining sketch after this list
- Evaluation conducted on six BLURB datasets spanning five task types
- arXiv paper identifier: 2604.16422v1
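The study's exact continual-pretraining setup is not reproduced here, but the step amounts to resuming masked-language-model training from an existing checkpoint on the graph-derived corpus. The sketch below uses Hugging Face Transformers; the checkpoint names (bert-base-uncased, dmis-lab/biobert-v1.1), the corpus filename, and every hyperparameter are assumptions, not values reported in the study.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from BERT for BERTUMLS, or from a BioBERT checkpoint for BioBERTUMLS.
checkpoint = "bert-base-uncased"  # e.g. "dmis-lab/biobert-v1.1" for the BioBERT variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Verbalized UMLS corpus (one sentence per line), as in the earlier hypothetical sketch.
dataset = load_dataset("text", data_files={"train": "umls_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modeling objective drives the continual pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert_umls",            # hypothetical output directory
    per_device_train_batch_size=32,    # assumed batch size
    num_train_epochs=1,                # assumed number of passes over the corpus
    learning_rate=5e-5,                # assumed learning rate
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

After this step, the resulting checkpoint would be fine-tuned and evaluated on the six BLURB datasets in the usual task-specific fashion; GraphRAG, by contrast, leaves the base model unchanged and instead retrieves relevant graph facts at inference time.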