BioConCal: Supervised Scorer for Biomedical NER Candidate Verification

other · 2026-06-01

BioConCal is a new benchmark and supervised scoring system for verifying biomedical named entity recognition (NER) candidates proposed by a panel of large language models (LLMs). The benchmark aligns the predictions of eight LLMs over five public biomedical NER datasets into a candidate master table. BioConCal then scores each candidate using inference-time features: agreement, mention, surface-availability, and document features. In-domain, it raises AUROC from 0.753 for raw agreement to 0.910. The work argues that multi-LLM agreement is a salience signal but does not guarantee correctness under a corpus's conventions, because corpora differ in annotation practices, span boundaries, entity granularity, and type schemas.
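
To make the raw-agreement baseline concrete, here is a minimal sketch of how an agreement score and its AUROC could be computed. The function names, the toy votes, and the toy labels are illustrative assumptions, not the paper's code or data; AUROC is computed in its pairwise form (the probability that a random correct candidate outranks a random incorrect one, with ties counted as 0.5).

```python
from typing import Sequence


def agreement_score(votes: Sequence[bool]) -> float:
    """Fraction of panel models that proposed this candidate (hypothetical helper)."""
    return sum(votes) / len(votes)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): one row per candidate, one vote per model
# in an eight-model panel.
toy_votes = [
    [True] * 8,
    [True] * 6 + [False] * 2,
    [True] * 6 + [False] * 2,
    [True] * 2 + [False] * 6,
    [True] * 1 + [False] * 7,
]
toy_labels = [1, 1, 0, 1, 0]  # 1 = candidate matches the corpus annotation

toy_scores = [agreement_score(v) for v in toy_votes]
toy_auroc = auroc(toy_scores, toy_labels)
print(f"raw-agreement AUROC on toy data: {toy_auroc:.3f}")
```

A supervised scorer in the spirit of BioConCal would replace `toy_scores` with the output of a model trained on the full feature set; the AUROC computation stays the same.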

Key facts

BioConCal is an in-domain supervised scorer for panel-surfaced candidate verification.
The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets.
BioConCal improves in-domain AUROC from 0.753 (raw agreement) to 0.910.
Multi-LLM agreement is a salience signal, not a guarantee of corpus-convention correctness.
Features include agreement, mention, surface-availability, and document features.
The benchmark uses a candidate master table built from the aligned predictions.
Biomedical NER looks deceptively simple for modern LLMs.
Corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas.
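
As a rough illustration of the candidate master table, the sketch below unions per-model predictions into one row per candidate and records who proposed it. The alignment key (exact match on document id, character offsets, and entity type), the model names, and the toy spans are all assumptions made for this example; the source does not specify how the benchmark aligns predictions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed candidate key: (doc_id, start, end, entity_type)
Candidate = Tuple[str, int, int, str]


def build_master_table(
    predictions: Dict[str, Set[Candidate]], gold: Set[Candidate]
) -> List[dict]:
    """Union all model predictions into one row per distinct candidate."""
    models = sorted(predictions)
    voters: Dict[Candidate, Set[str]] = defaultdict(set)
    for model, cands in predictions.items():
        for cand in cands:
            voters[cand].add(model)
    rows = []
    for cand in sorted(voters):
        rows.append({
            "candidate": cand,
            "votes": {m: m in voters[cand] for m in models},
            "agreement": len(voters[cand]) / len(models),
            "label": int(cand in gold),  # 1 = matches the corpus annotation
        })
    return rows


# Hypothetical three-model panel on one document.
preds = {
    "model_a": {("doc1", 0, 7, "Chemical"), ("doc1", 20, 34, "Disease")},
    "model_b": {("doc1", 0, 7, "Chemical"), ("doc1", 20, 27, "Disease")},
    "model_c": {("doc1", 0, 7, "Chemical"), ("doc1", 20, 34, "Disease")},
}
gold = {("doc1", 0, 7, "Chemical"), ("doc1", 20, 27, "Disease")}

rows = build_master_table(preds, gold)
for row in rows:
    print(row["candidate"], round(row["agreement"], 2), row["label"])
```

In this toy case two of three models agree on the longer disease span, yet the corpus annotates the shorter one, so the majority candidate is labeled incorrect. That is the kind of span-boundary mismatch the key facts describe.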

Entities

—

Sources

arXiv cs.AI — 2026-06-01