QuanBench+ Benchmark Evaluates LLMs Across Three Quantum Frameworks
Researchers have launched QuanBench+, a benchmark that assesses large language models (LLMs) on quantum code generation across three frameworks: Qiskit, PennyLane, and Cirq. It comprises 42 tasks, aligned across the frameworks, covering quantum algorithms, gate decomposition, and state preparation. Models are evaluated with executable functional tests, reported as Pass@1 and Pass@5 scores, with probabilistic outputs accepted via a KL-divergence check. The study also measures Pass@1 after models repair their code using feedback from runtime errors or wrong answers. The best one-shot scores are 59.5% on Qiskit, 54.8% on Cirq, and 42.9% on PennyLane; with feedback-based repair these rise to 83.3%, 76.2%, and 66.7%, respectively, showing clear gains but also persistent gaps in multi-framework quantum code generation.
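The summary does not spell out how Pass@1 and Pass@5 are computed; the standard approach in code-generation benchmarks is the unbiased estimator of Chen et al. (2021), sketched below for illustration. The sample counts in the example are assumptions, not figures from QuanBench+.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: number of samples that pass the functional tests
    k: number of samples the user is allowed to draw
    """
    if n - c < k:
        return 1.0  # failures cannot fill all k draws
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical example: 10 samples per task, 3 pass the tests
print(pass_at_k(10, 3, 1))  # 0.3
print(pass_at_k(10, 3, 5))  # ~0.9167
```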
Key facts
- QuanBench+ spans Qiskit, PennyLane, and Cirq.
- 42 aligned tasks cover quantum algorithms, gate decomposition, and state preparation.
- Models evaluated with executable functional tests.
- Pass@1 and Pass@5 reported.
- KL-divergence-based acceptance for probabilistic outputs (a sketch follows this list).
- Feedback-based repair studied after a runtime error or wrong answer (a sketch also follows this list).
- Best one-shot scores: 59.5% Qiskit, 54.8% Cirq, 42.9% PennyLane.
- Best repair scores: 83.3% Qiskit, 76.2% Cirq, 66.7% PennyLane.
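The exact acceptance threshold for probabilistic outputs is not stated here; a minimal sketch of a KL-divergence check between a task's reference measurement distribution and a generated circuit's empirical counts might look like the following. The smoothing constant, threshold, and helper names are illustrative assumptions, not the benchmark's actual harness.

```python
import math

def kl_divergence(p: dict[str, float], q: dict[str, float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) over bitstring outcome distributions.

    eps-smoothing avoids log(0) when the generated circuit never
    produces an outcome the reference distribution expects.
    """
    keys = set(p) | set(q)
    return sum(
        p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
        for k in keys if p.get(k, 0.0) > 0.0
    )

def counts_to_probs(counts: dict[str, int]) -> dict[str, float]:
    """Normalize raw shot counts (e.g. from a simulator) to probabilities."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Illustrative check: a Bell-state preparation task should yield a
# near-uniform distribution over '00' and '11'.
reference = {"00": 0.5, "11": 0.5}
generated = counts_to_probs({"00": 498, "11": 502})  # hypothetical shot counts
THRESHOLD = 0.05                                     # assumed tolerance
print(kl_divergence(reference, generated) < THRESHOLD)  # True -> accepted
```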
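The feedback-based repair setup is described only at a high level; one plausible reading is a single-round loop in which the failing output plus the runtime error or failing-test message is fed back to the model for a second attempt. The callables below are hypothetical stand-ins, not QuanBench+'s actual API, and the benchmark may well allow more than one repair round.

```python
from typing import Callable, Tuple

def evaluate_with_repair(
    generate: Callable[[str], str],          # wraps the LLM
    run_tests: Callable[[str], Tuple[bool, str]],  # wraps the test harness
    prompt: str,
) -> bool:
    """One-shot attempt, then a single feedback-based repair round.

    run_tests returns (passed, feedback), where feedback is a runtime
    error or wrong-answer message from the executable functional tests.
    """
    code = generate(prompt)
    passed, feedback = run_tests(code)
    if passed:
        return True
    # Feed the failing code and its error message back for one repair.
    repair_prompt = (
        f"{prompt}\n\nYour previous attempt:\n{code}\n\n"
        f"It failed with:\n{feedback}\n\nPlease fix the code."
    )
    return run_tests(generate(repair_prompt))[0]
```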