CRAFT Benchmark Tests AI Pragmatic Communication under Partial Information
Researchers have introduced CRAFT, a multi-agent benchmark that evaluates how well large language models communicate pragmatically when each has only partial information. Multiple agents, each with a complementary but incomplete view, must coordinate in natural language to build a shared 3D structure that no single agent can fully observe. The authors formalize the task as a multi-sender Bounded Pragmatic Speaker problem. They also provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, along with a taxonomy of behavioral failure profiles for both frontier and open-weight models. Tests on 8 open-weight and 7 frontier models, including reasoning models, showed that stronger reasoning ability does not reliably lead to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee group success.
Key facts
- CRAFT is a multi-agent benchmark for pragmatic communication under partial information.
- Multiple agents with complementary but incomplete views must coordinate via natural language.
- The task is to construct a shared 3D structure unobservable by any single agent.
- The problem is formalized as a multi-sender Bounded Pragmatic Speaker problem.
- Failures are decomposed into spatial grounding, belief modeling, and pragmatic communication errors.
- A taxonomy of behavioral failure profiles is provided for frontier and open-weight models.
- 8 open-weight and 7 frontier models were tested, including reasoning models.
- Stronger reasoning ability does not reliably lead to better coordination.
- Smaller open-weight models often match or outperform frontier systems.
- Improved individual communication does not guarantee group success.
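The partial-information setup described above can be illustrated with a toy sketch. This is not the CRAFT implementation: the block grid, the `visible_to` helper, and the "left"/"right" view model are all assumptions made for illustration. It only shows the core property that no single agent's view covers the target, while the pooled views do.

```python
from typing import Dict, Set, Tuple

Block = Tuple[int, int, int]  # (x, y, z) cell in a small grid

# Hypothetical target: a 2x2x2 cube with one corner block missing.
TARGET: Set[Block] = {
    (x, y, z) for x in range(2) for y in range(2) for z in range(2)
} - {(1, 1, 1)}


def visible_to(axis: int, side: int) -> Set[Block]:
    """Assumed view model: an agent sees only the blocks on one slice."""
    return {block for block in TARGET if block[axis] == side}


# Two agents with complementary but incomplete views (assumed names).
views: Dict[str, Set[Block]] = {
    "left": visible_to(0, 0),   # blocks with x == 0
    "right": visible_to(0, 1),  # blocks with x == 1
}

# No single agent observes the whole structure...
no_single_agent_sees_all = not any(view == TARGET for view in views.values())

# ...but the pooled observations cover it, which is why the agents
# have to communicate what they see in order to build it.
pooled: Set[Block] = set().union(*views.values())

if __name__ == "__main__":
    print("no single agent sees all:", no_single_agent_sees_all)
    print("pooled views cover target:", pooled == TARGET)
```

In the benchmark itself the agents exchange natural-language messages instead of merging sets directly; the set union here only stands in for what successful communication would have to achieve.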
Entities
Institutions
- arXiv