Graph-Based Evaluation Harness for Domain-Specific LLMs
The paper introduces a graph-based evaluation harness for domain-specific language models. It converts structured clinical guidelines into a queryable knowledge graph and instantiates evaluation queries dynamically through graph traversal. The design provides three guarantees: complete coverage of guideline relationships, resistance to surface-form contamination through combinatorial variation, and validity inherited from the expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions on symptom recognition, treatment, severity classification, and follow-up care. An evaluation of five language models reveals systematic capability gaps: the models perform well on symptom recognition but fall short in other areas.
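The traversal-to-question step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triple representation, the relation labels, the placeholder node names, the templates, and the `generate_questions` helper are hypothetical and are not taken from the paper or from IMCI content. The sketch visits every edge once (coverage), varies the stem and the distractor set per edge (combinatorial variation), and takes the correct answer directly from the graph (validity).

```python
import itertools
import random
from dataclasses import dataclass

# Toy guideline graph as (head, relation, tail) triples.
# All names are placeholders, not real clinical content.
TRIPLES = [
    ("condition_A", "HAS_SYMPTOM", "symptom_1"),
    ("condition_A", "TREATED_WITH", "treatment_1"),
    ("condition_A", "HAS_SEVERITY", "severity_1"),
    ("condition_B", "HAS_SYMPTOM", "symptom_2"),
    ("condition_B", "TREATED_WITH", "treatment_2"),
    ("condition_B", "HAS_SEVERITY", "severity_2"),
    ("condition_C", "HAS_SYMPTOM", "symptom_3"),
    ("condition_C", "TREATED_WITH", "treatment_3"),
    ("condition_C", "HAS_SEVERITY", "severity_3"),
]

# Several surface forms per relation, so one edge yields several questions.
TEMPLATES = {
    "HAS_SYMPTOM": [
        "Which symptom is associated with {head}?",
        "A child presents with {head}. Which sign is expected?",
    ],
    "TREATED_WITH": [
        "Which treatment is indicated for {head}?",
        "What should be given to a child with {head}?",
    ],
    "HAS_SEVERITY": [
        "How is {head} classified by severity?",
        "Which severity class applies to {head}?",
    ],
}


@dataclass(frozen=True)
class Question:
    edge: tuple      # the graph edge this question was generated from
    stem: str
    options: tuple
    answer: str


def generate_questions(triples, templates, n_distractors=2, seed=0):
    """Generate multiple-choice questions by visiting every edge of the graph."""
    rng = random.Random(seed)
    questions = []
    for head, relation, tail in triples:
        # Tails that are also correct for this head must not become distractors.
        valid = {t for h, r, t in triples if h == head and r == relation}
        # Distractors: same relation type, attached to other heads.
        pool = sorted({t for _, r, t in triples if r == relation} - valid)
        for template in templates[relation]:
            for distractors in itertools.combinations(pool, n_distractors):
                options = [tail, *distractors]
                rng.shuffle(options)
                questions.append(
                    Question(
                        edge=(head, relation, tail),
                        stem=template.format(head=head),
                        options=tuple(options),
                        answer=tail,
                    )
                )
    return questions


questions = generate_questions(TRIPLES, TEMPLATES)
first = questions[0]
print(first.stem, first.options, "answer:", first.answer)
```

Because each question records the edge it came from, coverage can be checked by comparing the set of edges seen in the output with the set of edges in the graph. With a larger graph, the number of distractor combinations per edge grows quickly, which is what makes memorizing fixed surface forms less useful.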
Key facts
- arXiv:2508.20810v3
- Graph-based evaluation harness
- Transforms structured clinical guidelines into queryable knowledge graph
- Dynamically instantiates evaluation queries via graph traversal
- Three guarantees: complete coverage, contamination resistance, validity
- Applied to WHO IMCI guidelines
- Generates multiple-choice questions on symptom recognition, treatment, severity classification, follow-up care
- Evaluated across five language models
- Models perform well on symptom recognition but show systematic capability gaps
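A per-category breakdown is one simple way to surface the kind of gap noted in the last item. The sketch below is hypothetical: the `accuracy_by_category` helper, the category labels, and the toy records are assumptions made for illustration, and the outcomes are invented to exercise the function; they are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute accuracy per question category.

    records: iterable of (category, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Invented outcomes, used only to show the call; not measured data.
toy_records = [
    ("symptom_recognition", True),
    ("symptom_recognition", True),
    ("treatment", True),
    ("treatment", False),
    ("severity_classification", False),
    ("follow_up_care", True),
]

for category, acc in accuracy_by_category(toy_records).items():
    print(f"{category}: {acc:.2f}")
```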
Entities
Institutions
- World Health Organization
- WHO IMCI (Integrated Management of Childhood Illness)