LLMs Struggle with Combinatorial Solvers: New Benchmark CP-SynC-XL
A new study introduces CP-SynC-XL, a benchmark of 100 combinatorial problems with 4,577 instances that evaluates how Large Language Models (LLMs) synthesize executable solvers. Three paradigms are compared: native Python, Python with the OR-Tools API, and MiniZinc with OR-Tools. Results show that Python + OR-Tools achieves the highest correctness, while MiniZinc + OR-Tools has lower coverage despite using the same back-end. Native Python most often returns solutions that are schema-valid but fail verification. The paper highlights a "heuristic trap": models tend to optimize ad-hoc search procedures rather than formalize the problem in a solver-ready representation.
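To make the paradigm distinction concrete, here is a minimal sketch of what the Python + OR-Tools route looks like on a hypothetical toy instance (graph 3-coloring; the instance and variable names are illustrative, not drawn from CP-SynC-XL). In this paradigm the LLM's job is to emit a declarative CP-SAT model like this, rather than hand-written search:

```python
# Minimal sketch: the "Python + OR-Tools" paradigm on a toy instance
# (graph 3-coloring; hypothetical example, not taken from CP-SynC-XL).
from ortools.sat.python import cp_model

# Toy instance: a 4-node graph; nodes 0, 1, 2 form a triangle.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]
num_nodes, num_colors = 4, 3

model = cp_model.CpModel()
# One decision variable per node: its color.
color = [model.NewIntVar(0, num_colors - 1, f"c{i}") for i in range(num_nodes)]
# Declarative constraints: adjacent nodes must receive different colors.
for u, v in edges:
    model.Add(color[u] != color[v])

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(c) for c in color])  # e.g. [0, 1, 2, 1]
```

In the native-Python paradigm, by contrast, the model would have to hand-write backtracking or heuristic search for the same instance, which is where the paper's "heuristic trap" arises.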
Key facts
- CP-SynC-XL benchmark contains 100 combinatorial problems and 4,577 instances.
- Three solver-construction paradigms tested: native Python, Python + OR-Tools, and MiniZinc + OR-Tools.
- Python + OR-Tools attains the highest correctness across LLMs.
- MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end.
- Native Python is the most likely to return a schema-valid solution that fails verification (see the sketch after this list).
- Study appears on arXiv with ID 2605.12421.
- LLMs struggle with direct reasoning for complex combinatorial problems.
- Neuro-symbolic systems use LLMs to synthesize executable solvers.
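As noted above, a solution can be schema-valid yet wrong. A minimal hypothetical checker for the toy coloring instance from the earlier sketch (the function and instance are illustrative, not from the paper) shows the distinction:

```python
# Hypothetical verifier: schema validity (right shape and types) is not correctness.
def verify_coloring(assignment, edges, num_colors):
    """Accept only assignments that satisfy every constraint."""
    if any(not (0 <= c < num_colors) for c in assignment):
        return False  # a color is out of range
    return all(assignment[u] != assignment[v] for u, v in edges)

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]
candidate = [0, 1, 2, 2]                     # a well-formed list of ints: schema-valid
print(verify_coloring(candidate, edges, 3))  # False: nodes 2 and 3 share a color
```

An explicit verification step like this is what catches outputs that merely look plausible.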
Entities
Institutions
- arXiv