Graph-Based Evaluation Harness for Domain-Specific LLMs

ai-technology · 2026-05-18

A graph-based evaluation harness for domain-specific language models converts structured clinical guidelines into a queryable knowledge graph and instantiates evaluation queries dynamically through graph traversal. The framework offers three guarantees: complete coverage of guideline relationships, resistance to surface-form contamination through combinatorial variation, and validity inherited from an expert-authored graph structure. Applied to the WHO IMCI guidelines, it generates clinically grounded multiple-choice questions on symptom recognition, treatment, severity classification, and follow-up care. An evaluation of five language models reveals systematic capability gaps: the models perform well on symptom recognition but are markedly weaker in other areas.
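
The traversal idea can be illustrated with a minimal sketch. This is not the paper's implementation: the triples, relation names, question templates, and the generate_questions helper below are all hypothetical placeholders (deliberately abstract, not real IMCI content). The sketch assumes the guideline is stored as (source, relation, target) triples, visits every edge once so that each relationship yields at least one question, and varies the templates and sampled distractors so the surface form of a question changes between runs.

```python
import random

# Hypothetical toy graph: (source, relation, target) triples.
# Placeholder names only; these are NOT real clinical guideline entries.
EDGES = [
    ("condition_A", "has_symptom", "symptom_1"),
    ("condition_A", "treated_with", "treatment_1"),
    ("condition_A", "has_severity", "severity_high"),
    ("condition_B", "has_symptom", "symptom_2"),
    ("condition_B", "treated_with", "treatment_2"),
    ("condition_B", "has_severity", "severity_low"),
    ("condition_C", "has_symptom", "symptom_3"),
    ("condition_C", "treated_with", "treatment_3"),
    ("condition_C", "has_severity", "severity_medium"),
]

# Hypothetical question templates, several per relation type, so the same
# edge can be phrased in more than one way.
TEMPLATES = {
    "has_symptom": [
        "Which sign is associated with {src}?",
        "A patient with {src} would most likely present with which sign?",
    ],
    "treated_with": [
        "Which treatment is indicated for {src}?",
        "What should be given to a patient with {src}?",
    ],
    "has_severity": [
        "How is {src} classified?",
        "Which severity level applies to {src}?",
    ],
}


def generate_questions(edges, templates, n_distractors=2, seed=None):
    """Traverse every edge and emit one multiple-choice item per template."""
    rng = random.Random(seed)
    questions = []
    for src, rel, tgt in edges:
        # Every valid answer for this (source, relation) pair.
        correct = {t for s, r, t in edges if s == src and r == rel}
        # Distractors: targets of the same relation that are wrong for src.
        pool = sorted({t for s, r, t in edges if r == rel} - correct)
        for template in templates[rel]:
            distractors = rng.sample(pool, min(n_distractors, len(pool)))
            options = distractors + [tgt]
            rng.shuffle(options)
            questions.append({
                "edge": (src, rel, tgt),
                "question": template.format(src=src),
                "options": options,
                "answer": tgt,
            })
    return questions


if __name__ == "__main__":
    for item in generate_questions(EDGES, TEMPLATES, seed=0)[:2]:
        print(item["question"], item["options"])
```

Because the outer loop runs over the edge list itself, coverage of every stored relationship follows by construction, and changing the seed reshuffles distractors and option order without touching the underlying graph.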

Key facts

arXiv:2508.20810v3
Graph-based evaluation harness
Transforms structured clinical guidelines into queryable knowledge graph
Dynamically instantiates evaluation queries via graph traversal
Three guarantees: complete coverage, contamination resistance, validity
Applied to WHO IMCI guidelines
Generates multiple-choice questions on symptom recognition, treatment, severity classification, follow-up care
Evaluated across five language models
Models perform well on symptom recognition but show systematic capability gaps
