ARTFEED — Contemporary Art Intelligence

Graph-Based Evaluation Harness for Domain-Specific LLMs

ai-technology · 2026-05-18

A graph-based evaluation framework for domain-specific language models converts structured clinical guidelines into a queryable knowledge graph and generates evaluation queries dynamically by traversing it. The system offers three guarantees: complete coverage of guideline relationships, resistance to surface-form contamination through combinatorial variation, and validity inherited from the expert-authored graph structure. Applied to the WHO IMCI (Integrated Management of Childhood Illness) guidelines, it generates clinically relevant multiple-choice questions on symptom recognition, treatment, severity classification, and follow-up care. An evaluation of five language models reveals systematic capability gaps: the models perform well on symptom recognition but fall short in other areas.
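
To make the traversal idea concrete, here is a minimal sketch of how multiple-choice questions might be instantiated from a guideline graph. It is an illustration under stated assumptions, not the paper's implementation: the relation names, the question templates, the `generate_questions` function, and the placeholder nodes are all hypothetical, and none of the toy data is actual IMCI content. Each edge yields one question (coverage), the stem is drawn from several phrasings and the distractors are resampled from other edges of the same relation (surface-form variation), and the correct answer is read directly from the graph (validity).

```python
import random
from collections import defaultdict

# Hypothetical toy graph as (source, relation, target) triples.
# Placeholder names only; this is NOT real clinical guideline content.
EDGES = [
    ("condition_A", "HAS_SYMPTOM", "symptom_1"),
    ("condition_A", "TREATED_WITH", "treatment_1"),
    ("condition_A", "HAS_SEVERITY", "severity_1"),
    ("condition_A", "FOLLOW_UP", "follow_up_1"),
    ("condition_B", "HAS_SYMPTOM", "symptom_2"),
    ("condition_B", "TREATED_WITH", "treatment_2"),
    ("condition_B", "HAS_SEVERITY", "severity_2"),
    ("condition_B", "FOLLOW_UP", "follow_up_2"),
    ("condition_C", "HAS_SYMPTOM", "symptom_3"),
    ("condition_C", "TREATED_WITH", "treatment_3"),
    ("condition_C", "HAS_SEVERITY", "severity_3"),
    ("condition_C", "FOLLOW_UP", "follow_up_3"),
]

# Assumed templates: several phrasings per relation, so the same edge
# can surface as differently worded questions.
TEMPLATES = {
    "HAS_SYMPTOM": ["Which sign is associated with {src}?",
                    "A child presents with {src}. Which sign is expected?"],
    "TREATED_WITH": ["What is the recommended treatment for {src}?",
                     "How should {src} be treated?"],
    "HAS_SEVERITY": ["How is {src} classified by severity?",
                     "Which severity class applies to {src}?"],
    "FOLLOW_UP": ["What follow-up is advised for {src}?",
                  "When should a child with {src} return?"],
}


def generate_questions(edges, templates, n_distractors=2, seed=0):
    """Build one multiple-choice question per edge of the graph."""
    rng = random.Random(seed)
    targets_by_relation = defaultdict(set)
    answers_by_source = defaultdict(set)
    for src, rel, dst in edges:
        targets_by_relation[rel].add(dst)
        answers_by_source[(src, rel)].add(dst)

    questions = []
    for src, rel, dst in edges:
        # Distractors: targets of the same relation that are not a
        # correct answer for this source node.
        pool = sorted(targets_by_relation[rel] - answers_by_source[(src, rel)])
        if len(pool) < n_distractors:
            continue  # too few wrong answers; this edge would go uncovered
        options = rng.sample(pool, n_distractors) + [dst]
        rng.shuffle(options)
        questions.append({
            "edge": (src, rel, dst),
            "category": rel,
            "stem": rng.choice(templates[rel]).format(src=src),
            "options": options,
            "answer": options.index(dst),
        })
    return questions


questions = generate_questions(EDGES, TEMPLATES)
print(questions[0]["stem"], questions[0]["options"])
```

Changing `seed` produces a differently worded and differently ordered question set from the same graph, which is the sense in which regenerated items resist memorized surface forms in this sketch. Grouping model accuracy by the `category` field would give the kind of per-skill breakdown the evaluation reports.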

Key facts

  • arXiv:2508.20810v3
  • Graph-based evaluation harness
  • Transforms structured clinical guidelines into queryable knowledge graph
  • Dynamically instantiates evaluation queries via graph traversal
  • Three guarantees: complete coverage, contamination resistance, validity
  • Applied to WHO IMCI guidelines
  • Generates multiple-choice questions on symptom recognition, treatment, severity classification, follow-up care
  • Evaluated across five language models
  • Models perform well on symptom recognition but show systematic capability gaps

Entities

Institutions

  • World Health Organization
  • WHO IMCI (Integrated Management of Childhood Illness)

Sources