MedFact: Chinese Medical Fact-Checking Benchmark for LLMs

other · 2026-06-01

MedFact is a benchmark for assessing how well large language models (LLMs) fact-check Chinese medical texts. It contains 2,116 expert-annotated instances drawn from varied real-world contexts, spanning 13 medical specialties, 8 error types, 4 writing styles, and 5 difficulty levels. The dataset was built with a hybrid AI-human framework in which iterative expert feedback was used to maintain quality. The authors evaluated 20 leading LLMs on two tasks, veracity classification and error localization. The models can often tell that a text contains an error but frequently fail to pinpoint where it is, and even the top performers fall short of human performance. The study also identifies an "over-criticism" phenomenon, in which models flag correct information as erroneous.

Key facts

MedFact is a Chinese medical fact-checking benchmark.
It contains 2,116 expert-annotated instances.
It covers 13 medical specialties, 8 error types, 4 writing styles, and 5 difficulty levels.
Construction used a hybrid AI-human framework.
20 leading LLMs were evaluated on veracity classification and error localization.
Models often detect errors but struggle to localize them precisely.
Top performers fall short of human performance.
"Over-criticism" phenomenon: models misidentify correct information as erroneous.
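
The difference between detecting an error, localizing it, and over-criticizing can be made concrete with a small scoring sketch. This is only an illustration under stated assumptions, not the benchmark's official evaluation code: the `Example` record, the `score` function, the exact-match rule for spans, and the toy data are all hypothetical, and the paper may define its metrics differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One scored instance (hypothetical schema, not MedFact's actual format)."""
    gold_has_error: bool              # expert label: does the text contain an error?
    pred_has_error: bool              # model verdict
    gold_span: Optional[str] = None   # annotated erroneous span (None if text is correct)
    pred_span: Optional[str] = None   # span the model blames (None if it reports no error)


def score(examples):
    """Compute three illustrative rates over a list of Example records."""
    erroneous = [e for e in examples if e.gold_has_error]
    clean = [e for e in examples if not e.gold_has_error]

    # Veracity classification: the verdict matches the expert label.
    correct_verdicts = sum(e.gold_has_error == e.pred_has_error for e in examples)
    # Error localization: error detected AND the blamed span matches exactly
    # (exact string match is an assumption made for simplicity).
    localized = sum(e.pred_has_error and e.pred_span == e.gold_span for e in erroneous)
    # Over-criticism: a correct text is flagged as erroneous.
    flagged_clean = sum(e.pred_has_error for e in clean)

    return {
        "veracity_accuracy": correct_verdicts / len(examples) if examples else 0.0,
        "localization_rate": localized / len(erroneous) if erroneous else 0.0,
        "over_criticism_rate": flagged_clean / len(clean) if clean else 0.0,
    }


# Toy data, invented for illustration only.
toy = [
    Example(True, True, "500 mg", "500 mg"),       # detected and localized
    Example(True, True, "500 mg", "twice daily"),  # detected, wrong span
    Example(False, True, None, "aspirin"),         # over-criticism
    Example(False, False),                         # correctly accepted
]
result = score(toy)
print(result)
```

In the toy data the second instance counts as a correct verdict but a failed localization, which is the gap the benchmark reports between the two tasks; the third instance is what the over-criticism rate captures.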

Entities

—

Sources

arXiv cs.AI — 2026-06-01