ARTFEED — Contemporary Art Intelligence

QuanBench+ Benchmark Evaluates LLMs Across Three Quantum Frameworks

ai-technology · 2026-04-24

Researchers have launched QuanBench+, a benchmark that assesses large language models (LLMs) on quantum code generation across three frameworks: Qiskit, PennyLane, and Cirq. The benchmark comprises 42 tasks aligned across the frameworks, covering quantum algorithms, gate decomposition, and state preparation. Models are evaluated with executable functional tests that report Pass@1 and Pass@5 scores, using a KL-divergence criterion to accept probabilistic outputs. The study also measures Pass@1 after models repair their code in response to runtime-error feedback. The best one-shot scores are 59.5% for Qiskit, 54.8% for Cirq, and 42.9% for PennyLane; with feedback-based repair these rise to 83.3%, 76.2%, and 66.7%, respectively, which shows clear progress but leaves multi-framework quantum code generation an open challenge.
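The KL-divergence acceptance criterion for probabilistic outputs can be illustrated with a minimal sketch: compare the measurement histogram produced by a generated circuit against a reference distribution and accept when the divergence falls below a threshold. The threshold value, histogram format, and function names here are assumptions for illustration, not QuanBench+'s actual implementation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) over a shared set of measurement outcomes.

    p and q map basis-state labels (e.g. "00") to probabilities.
    eps guards against log(0) when an outcome is missing from q.
    """
    keys = set(p) | set(q)
    return sum(
        p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
        for k in keys
        if p.get(k, 0.0) > 0.0
    )

def accept(measured, reference, threshold=0.05):
    """Accept a probabilistic circuit output close to the reference.

    threshold is a hypothetical tolerance chosen for illustration.
    """
    return kl_divergence(reference, measured) < threshold

# Example: measurement probabilities for a 2-qubit Bell state,
# with small sampling noise in the measured histogram.
reference = {"00": 0.5, "11": 0.5}
measured = {"00": 0.49, "11": 0.50, "01": 0.01}
print(accept(measured, reference))  # True: divergence is well under threshold
```

Comparing full output distributions rather than single expected strings is what makes this kind of test robust to shot noise in probabilistic quantum programs.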

Key facts

  • QuanBench+ spans Qiskit, PennyLane, and Cirq.
  • 42 aligned tasks cover quantum algorithms, gate decomposition, and state preparation.
  • Models evaluated with executable functional tests.
  • Pass@1 and Pass@5 reported.
  • KL-divergence-based acceptance for probabilistic outputs.
  • Feedback-based repair studied after runtime error or wrong answer.
  • Best one-shot scores: 59.5% Qiskit, 54.8% Cirq, 42.9% PennyLane.
  • Best repair scores: 83.3% Qiskit, 76.2% Cirq, 66.7% PennyLane.
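Pass@1 and Pass@5 are most often computed with the standard unbiased estimator introduced for code-generation benchmarks: generate n samples per task, count the c that pass the functional tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch, assuming QuanBench+ follows this common definition (the source does not specify its exact estimator):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations with c correct,
    passes the task's functional tests."""
    if n - c < k:
        # Too few failures to fill k draws: a correct sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations for a task, 3 of which pass the tests.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3  (pass@1 equals the raw pass rate)
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

Per-benchmark scores are then the mean of pass@k over all 42 tasks, which is why Pass@5 is always at least as high as Pass@1.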
