BAGEL Benchmark Introduced to Evaluate Animal Knowledge in Language Models
A new benchmark called BAGEL has been introduced to test how well language models understand specialized animal knowledge. It draws on several scientific sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia. BAGEL uses a closed-book format: models must rely solely on their internal knowledge, with no external retrieval during the assessment. The benchmark spans a range of animal-related topics, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and interactions with other species. The study, documented in arXiv preprint 2604.16241v1, examines how effectively large language models handle detailed biological information, highlighting the knowledge stored in their parameters.
Key facts
- BAGEL is a benchmark for evaluating animal knowledge expertise in language models
- The benchmark uses a closed-book evaluation protocol without external retrieval
- BAGEL covers taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions
- Sources include bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia
- The benchmark combines curated examples and automatically generated question-answer pairs
- Research addresses specialized animal knowledge in language models
- Documented in arXiv preprint 2604.16241v1
- The announcement is classified as cross-disciplinary
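To make the closed-book protocol concrete, here is a minimal sketch of how such an evaluation loop might look. Everything below is illustrative: the question-answer pairs, the `model_answer` stub, and the exact-match scoring are assumptions for demonstration, not BAGEL's actual data or grading method.

```python
# Hypothetical sketch of a closed-book QA evaluation in the style BAGEL
# describes: the model answers from its internal knowledge only, with no
# retrieval step. All names and items here are illustrative.

BENCHMARK = [
    # (question, reference answer) pairs; BAGEL mixes curated and
    # automatically generated items of this general shape.
    ("What family does the red fox belong to?", "canidae"),
    ("On which continent are emus found in the wild?", "australia"),
]

def model_answer(question: str) -> str:
    """Stand-in for a language model call; no external lookup allowed."""
    canned = {
        "What family does the red fox belong to?": "Canidae",
        "On which continent are emus found in the wild?": "Australia",
    }
    return canned.get(question, "")

def evaluate(benchmark) -> float:
    """Exact-match accuracy over the closed-book QA pairs."""
    correct = sum(
        model_answer(question).strip().lower() == reference
        for question, reference in benchmark
    )
    return correct / len(benchmark)

print(evaluate(BENCHMARK))  # accuracy between 0.0 and 1.0
```

In a real harness, the stub would be replaced by a model API call and exact match would likely be supplemented with more tolerant answer matching, but the key constraint stays the same: no retrieval between question and answer.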
Entities
Platforms and data sources
- bioRxiv
- Global Biotic Interactions
- Xeno-canto
- Wikipedia
- arXiv