BAGEL Benchmark Introduced to Evaluate Animal Knowledge in Language Models
A new benchmark called BAGEL has been introduced to test how well language models understand specialized animal knowledge. It draws on several scientific sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia. BAGEL uses a closed-book format: models must rely solely on their internal knowledge, with no external retrieval during the assessment. The benchmark spans a range of animal-related topics, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and interactions with other species. The study, documented in arXiv preprint 2604.16241v1, examines how effectively large language models handle detailed biological information, highlighting the knowledge stored in their parameters.
Key facts
- BAGEL is a benchmark for evaluating animal knowledge expertise in language models
- The benchmark uses a closed-book evaluation protocol without external retrieval
- BAGEL covers taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions
- Sources include bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia
- The benchmark combines curated examples and automatically generated question-answer pairs
- Research addresses specialized animal knowledge in language models
- Documented in arXiv preprint 2604.16241v1
- The announcement is classified as cross-disciplinary
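To make the closed-book protocol concrete, here is a minimal sketch of how such an evaluation loop might look. Everything below is illustrative: the question-answer pairs, the `model_answer` stub, and the exact-match scoring are assumptions for demonstration, not BAGEL's actual data or grading method.

```python
# Hypothetical sketch of a closed-book QA evaluation in the style BAGEL
# describes: the model answers from its internal knowledge only, with no
# retrieval step. All names and items here are illustrative.

BENCHMARK = [
    # (question, reference answer) pairs; BAGEL mixes curated and
    # automatically generated items of this general shape.
    ("What family does the red fox belong to?", "canidae"),
    ("On which continent are emus found in the wild?", "australia"),
]

def model_answer(question: str) -> str:
    """Stand-in for a language model call; no external lookup allowed."""
    canned = {
        "What family does the red fox belong to?": "Canidae",
        "On which continent are emus found in the wild?": "Australia",
    }
    return canned.get(question, "")

def evaluate(benchmark) -> float:
    """Exact-match accuracy over the closed-book QA pairs."""
    correct = sum(
        model_answer(question).strip().lower() == reference
        for question, reference in benchmark
    )
    return correct / len(benchmark)

print(evaluate(BENCHMARK))  # accuracy between 0.0 and 1.0
```

In a real harness, the stub would be replaced by a model API call and exact match would likely be supplemented with more tolerant answer matching, but the key constraint stays the same: no retrieval between question and answer.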
Entities
Platforms and data sources
- bioRxiv
- Global Biotic Interactions
- Xeno-canto
- Wikipedia
- arXiv