QuanBench+ Benchmark Evaluates LLMs Across Three Quantum Frameworks
Researchers have launched QuanBench+, a benchmark that assesses large language models (LLMs) on quantum code generation across three frameworks: Qiskit, PennyLane, and Cirq. It comprises 42 tasks, aligned across the frameworks, covering quantum algorithms, gate decomposition, and state preparation. Models are evaluated with executable functional tests, reported as Pass@1 and Pass@5 scores, with probabilistic outputs accepted via a KL-divergence check. The study also measures Pass@1 after models repair their code using feedback from runtime errors or wrong answers. The best one-shot scores are 59.5% on Qiskit, 54.8% on Cirq, and 42.9% on PennyLane; with feedback-based repair these rise to 83.3%, 76.2%, and 66.7%, respectively, showing clear gains but also persistent gaps in multi-framework quantum code generation.
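The summary does not spell out how Pass@1 and Pass@5 are computed; the standard approach in code-generation benchmarks is the unbiased estimator of Chen et al. (2021), sketched below for illustration. The sample counts in the example are assumptions, not figures from QuanBench+.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: number of samples that pass the functional tests
    k: number of samples the user is allowed to draw
    """
    if n - c < k:
        return 1.0  # failures cannot fill all k draws
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical example: 10 samples per task, 3 pass the tests
print(pass_at_k(10, 3, 1))  # 0.3
print(pass_at_k(10, 3, 5))  # ~0.9167
```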
Key facts
- QuanBench+ spans Qiskit, PennyLane, and Cirq.
- 42 aligned tasks cover quantum algorithms, gate decomposition, and state preparation.
- Models evaluated with executable functional tests.
- Pass@1 and Pass@5 reported.
- KL-divergence-based acceptance for probabilistic outputs (a sketch follows this list).
- Feedback-based repair studied after a runtime error or wrong answer (a sketch also follows this list).
- Best one-shot scores: 59.5% Qiskit, 54.8% Cirq, 42.9% PennyLane.
- Best repair scores: 83.3% Qiskit, 76.2% Cirq, 66.7% PennyLane.
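The exact acceptance threshold for probabilistic outputs is not stated here; a minimal sketch of a KL-divergence check between a task's reference measurement distribution and a generated circuit's empirical counts might look like the following. The smoothing constant, threshold, and helper names are illustrative assumptions, not the benchmark's actual harness.

```python
import math

def kl_divergence(p: dict[str, float], q: dict[str, float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) over bitstring outcome distributions.

    eps-smoothing avoids log(0) when the generated circuit never
    produces an outcome the reference distribution expects.
    """
    keys = set(p) | set(q)
    return sum(
        p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
        for k in keys if p.get(k, 0.0) > 0.0
    )

def counts_to_probs(counts: dict[str, int]) -> dict[str, float]:
    """Normalize raw shot counts (e.g. from a simulator) to probabilities."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Illustrative check: a Bell-state preparation task should yield a
# near-uniform distribution over '00' and '11'.
reference = {"00": 0.5, "11": 0.5}
generated = counts_to_probs({"00": 498, "11": 502})  # hypothetical shot counts
THRESHOLD = 0.05                                     # assumed tolerance
print(kl_divergence(reference, generated) < THRESHOLD)  # True -> accepted
```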
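The feedback-based repair setup is described only at a high level; one plausible reading is a single-round loop in which the failing output plus the runtime error or failing-test message is fed back to the model for a second attempt. The callables below are hypothetical stand-ins, not QuanBench+'s actual API, and the benchmark may well allow more than one repair round.

```python
from typing import Callable, Tuple

def evaluate_with_repair(
    generate: Callable[[str], str],          # wraps the LLM
    run_tests: Callable[[str], Tuple[bool, str]],  # wraps the test harness
    prompt: str,
) -> bool:
    """One-shot attempt, then a single feedback-based repair round.

    run_tests returns (passed, feedback), where feedback is a runtime
    error or wrong-answer message from the executable functional tests.
    """
    code = generate(prompt)
    passed, feedback = run_tests(code)
    if passed:
        return True
    # Feed the failing code and its error message back for one repair.
    repair_prompt = (
        f"{prompt}\n\nYour previous attempt:\n{code}\n\n"
        f"It failed with:\n{feedback}\n\nPlease fix the code."
    )
    return run_tests(generate(repair_prompt))[0]
```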