LLMs Struggle with Combinatorial Solvers: New Benchmark CP-SynC-XL
A new study introduces CP-SynC-XL, a benchmark of 100 combinatorial problems with 4,577 instances that evaluates how Large Language Models (LLMs) synthesize executable solvers. Three paradigms are compared: native Python, Python with the OR-Tools API, and MiniZinc with OR-Tools. Results show that Python + OR-Tools achieves the highest correctness, while MiniZinc + OR-Tools has lower coverage despite using the same back-end. Native Python most often returns solutions that are schema-valid but fail verification. The paper highlights a "heuristic trap": models tend to optimize ad-hoc search procedures rather than formalize the problem in a solver-ready representation.
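To make the paradigm distinction concrete, here is a minimal sketch of what the Python + OR-Tools route looks like on a hypothetical toy instance (graph 3-coloring; the instance and variable names are illustrative, not drawn from CP-SynC-XL). In this paradigm the LLM's job is to emit a declarative CP-SAT model like this, rather than hand-written search:

```python
# Minimal sketch: the "Python + OR-Tools" paradigm on a toy instance
# (graph 3-coloring; hypothetical example, not taken from CP-SynC-XL).
from ortools.sat.python import cp_model

# Toy instance: a 4-node graph; nodes 0, 1, 2 form a triangle.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]
num_nodes, num_colors = 4, 3

model = cp_model.CpModel()
# One decision variable per node: its color.
color = [model.NewIntVar(0, num_colors - 1, f"c{i}") for i in range(num_nodes)]
# Declarative constraints: adjacent nodes must receive different colors.
for u, v in edges:
    model.Add(color[u] != color[v])

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(c) for c in color])  # e.g. [0, 1, 2, 1]
```

In the native-Python paradigm, by contrast, the model would have to hand-write backtracking or heuristic search for the same instance, which is where the paper's "heuristic trap" arises.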
Key facts
- CP-SynC-XL benchmark contains 100 combinatorial problems and 4,577 instances.
- Three solver-construction paradigms tested: native Python, Python + OR-Tools, and MiniZinc + OR-Tools.
- Python + OR-Tools attains the highest correctness across LLMs.
- MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end.
- Native Python is the most likely to return a schema-valid solution that fails verification (see the sketch after this list).
- Study appears on arXiv with ID 2605.12421.
- LLMs struggle with direct reasoning for complex combinatorial problems.
- Neuro-symbolic systems use LLMs to synthesize executable solvers.
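As noted above, a solution can be schema-valid yet wrong. A minimal hypothetical checker for the toy coloring instance from the earlier sketch (the function and instance are illustrative, not from the paper) shows the distinction:

```python
# Hypothetical verifier: schema validity (right shape and types) is not correctness.
def verify_coloring(assignment, edges, num_colors):
    """Accept only assignments that satisfy every constraint."""
    if any(not (0 <= c < num_colors) for c in assignment):
        return False  # a color is out of range
    return all(assignment[u] != assignment[v] for u, v in edges)

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]
candidate = [0, 1, 2, 2]                     # a well-formed list of ints: schema-valid
print(verify_coloring(candidate, edges, 3))  # False: nodes 2 and 3 share a color
```

An explicit verification step like this is what catches outputs that merely look plausible.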
Entities
Institutions
- arXiv