KnowledgeBerg Benchmark Reveals LLM Limitations in Systematic Knowledge and Compositional Reasoning
A new benchmark called KnowledgeBerg evaluates large language models (LLMs) on questions that require systematic knowledge coverage and compositional set-based reasoning. The benchmark comprises 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, all grounded in authoritative sources to ensure reproducibility. The open-source LLMs tested show significant limitations, achieving F1 scores of only 5.26-36.88 on universe enumeration tasks and accuracies of only 16.00-44.19 on knowledge-grounded reasoning. The research formalizes this challenge along two dimensions: knowledge width (the cardinality of the required universe) and reasoning depth (the number of compositional set operations). Diagnostic analyses identify three stages of failure: completeness (missing knowledge), awareness (failure to recognize which knowledge is required), and reasoning (inability to perform compositional operations). Many real-world questions appear deceptively simple yet implicitly demand these capabilities, a phenomenon the authors describe as "the tip of the iceberg." The benchmark was announced on arXiv under identifier arXiv:2604.17621v1.
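To make the two dimensions concrete, here is a minimal sketch (not the paper's official evaluation code) of set-level F1 for a universe-enumeration task, together with a depth-2 compositional query; the entity names and universes are hypothetical placeholders.

```python
# Illustrative sketch, not KnowledgeBerg's actual metric implementation.
# Universe enumeration is scored here as F1 between the model's
# enumerated set and the gold universe; all entities are hypothetical.

def set_f1(predicted: set, gold: set) -> float:
    """F1 between a model's enumerated set and the gold universe."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # correctly enumerated members
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Knowledge width: the cardinality of the gold universe to be covered.
gold_universe = {"A", "B", "C", "D", "E"}
model_output = {"A", "B", "X"}          # partial, with one hallucination
print(round(set_f1(model_output, gold_universe), 2))  # 0.5

# Reasoning depth: the number of composed set operations.
# A depth-2 query: members of U1 that are also in U2 but not in U3.
u1, u2, u3 = {"A", "B", "C"}, {"B", "C", "D"}, {"C"}
print((u1 & u2) - u3)  # {'B'}
```

Under this framing, a wide question stresses completeness of enumeration, while a deep question stresses correct composition of operations over already-enumerated sets.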
Key facts
- KnowledgeBerg is a benchmark of 4,800 multiple-choice questions
- Questions derived from 1,183 enumeration seeds across 10 domains and 17 languages
- Open-source LLMs achieved F1 scores of 5.26-36.88 on universe enumeration
- LLMs achieved accuracies of 16.00-44.19 on knowledge-grounded reasoning
- Benchmark formalizes challenge through knowledge width and reasoning depth
- Diagnostic analyses reveal three stages of failure: completeness, awareness, reasoning
- Questions are grounded in authoritative sources for reproducibility
- Announced on arXiv with identifier arXiv:2604.17621v1
Entities
Institutions
- arXiv