ARTFEED — Contemporary Art Intelligence

LLMs Struggle with Combinatorial Solvers: New Benchmark CP-SynC-XL

other · 2026-05-13

A new study introduces CP-SynC-XL, a benchmark of 100 combinatorial problems spanning 4,577 instances that evaluates how well Large Language Models (LLMs) synthesize executable solvers. Three paradigms are compared: native Python, Python with the OR-Tools API, and MiniZinc with OR-Tools. Python + OR-Tools achieves the highest correctness across models, while MiniZinc + OR-Tools shows lower coverage despite sharing the same back-end. Native Python solvers most often return solutions that are schema-valid but fail verification. The paper highlights a heuristic trap: models tend to optimize search procedures rather than formalize the solver representation.

Key facts

  • CP-SynC-XL benchmark contains 100 combinatorial problems and 4,577 instances.
  • Three solver-construction paradigms tested: native Python, Python + OR-Tools, MiniZinc + OR-Tools.
  • Python + OR-Tools attains the highest correctness across LLMs.
  • MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end.
  • Native Python is the most likely to return a schema-valid solution that fails verification.
  • Study appears on arXiv with ID 2605.12421.
  • LLMs struggle with direct reasoning for complex combinatorial problems.
  • Neuro-symbolic systems use LLMs to synthesize executable solvers.
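The gap between schema validity and verified correctness, the failure mode noted for native Python above, can be sketched with a minimal checker. The graph-coloring problem and all function names below are illustrative assumptions, not details from the paper.

```python
# Illustrative distinction: a candidate solution can match the expected
# output schema (right keys, right types) yet still violate the problem's
# constraints. Problem and names are assumptions for illustration.

def is_schema_valid(solution, nodes):
    # Checks shape only: a dict mapping every node to an integer color.
    return (isinstance(solution, dict)
            and set(solution) == set(nodes)
            and all(isinstance(c, int) for c in solution.values()))


def verify_coloring(solution, edges):
    # Checks semantics: no edge may join two same-colored nodes.
    return all(solution[u] != solution[v] for u, v in edges)


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]  # a triangle needs 3 colors

candidate = {"a": 0, "b": 1, "c": 0}  # well-formed, but a and c clash
print(is_schema_valid(candidate, nodes))   # True
print(verify_coloring(candidate, edges))   # False
```

A benchmark that only validates output schemas would score this candidate as a success; instance-level verification, as used by CP-SynC-XL per the summary, catches the constraint violation.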

Entities

Institutions

  • arXiv

Sources