KnowledgeBerg Benchmark Reveals LLM Limitations in Systematic Knowledge and Compositional Reasoning
A new benchmark called KnowledgeBerg evaluates large language models (LLMs) on questions that require systematic knowledge coverage and compositional set-based reasoning. The benchmark comprises 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, all grounded in authoritative sources to ensure reproducibility. The open-source LLMs tested show significant limitations, achieving F1 scores of only 5.26-36.88 on universe enumeration tasks and accuracies of only 16.00-44.19 on knowledge-grounded reasoning. The research formalizes this challenge along two dimensions: knowledge width (the cardinality of the required universe) and reasoning depth (the number of compositional set operations). Diagnostic analyses identify three stages of failure: completeness (missing knowledge), awareness (failure to recognize which knowledge is required), and reasoning (inability to perform compositional operations). Many real-world questions appear deceptively simple yet implicitly demand these capabilities, a phenomenon the authors describe as "the tip of the iceberg." The benchmark was announced on arXiv under identifier arXiv:2604.17621v1.
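To make the two dimensions concrete, here is a minimal sketch (not the paper's official evaluation code) of set-level F1 for a universe-enumeration task, together with a depth-2 compositional query; the entity names and universes are hypothetical placeholders.

```python
# Illustrative sketch, not KnowledgeBerg's actual metric implementation.
# Universe enumeration is scored here as F1 between the model's
# enumerated set and the gold universe; all entities are hypothetical.

def set_f1(predicted: set, gold: set) -> float:
    """F1 between a model's enumerated set and the gold universe."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # correctly enumerated members
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Knowledge width: the cardinality of the gold universe to be covered.
gold_universe = {"A", "B", "C", "D", "E"}
model_output = {"A", "B", "X"}          # partial, with one hallucination
print(round(set_f1(model_output, gold_universe), 2))  # 0.5

# Reasoning depth: the number of composed set operations.
# A depth-2 query: members of U1 that are also in U2 but not in U3.
u1, u2, u3 = {"A", "B", "C"}, {"B", "C", "D"}, {"C"}
print((u1 & u2) - u3)  # {'B'}
```

Under this framing, a wide question stresses completeness of enumeration, while a deep question stresses correct composition of operations over already-enumerated sets.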
Key facts
- KnowledgeBerg is a benchmark of 4,800 multiple-choice questions
- Questions derived from 1,183 enumeration seeds across 10 domains and 17 languages
- Open-source LLMs achieved F1 scores of 5.26-36.88 on universe enumeration
- LLMs achieved accuracies of 16.00-44.19 on knowledge-grounded reasoning
- Benchmark formalizes challenge through knowledge width and reasoning depth
- Diagnostic analyses reveal three stages of failure: completeness, awareness, reasoning
- Questions are grounded in authoritative sources for reproducibility
- Announced on arXiv with identifier arXiv:2604.17621v1
Entities
Institutions
- arXiv