PennyLang Dataset Released to Improve LLM-Based Quantum Code Generation
A new dataset named PennyLang has been released to address the difficulty of using large language models (LLMs) for quantum software development. It contains 3,347 PennyLane-specific quantum code samples, each accompanied by a contextual description, curated from textbooks, official documentation, and open-source repositories. PennyLang is designed to serve both as training data for LLMs and as a dependable resource for quantum programming tasks. The work makes three contributions: the dataset itself, an automated framework for constructing quantum code datasets, and baseline evaluations across several open models. The dataset is released as open source to support research and development in quantum computing. The work is described in arXiv preprint 2503.02497v4.
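To make the idea of a code sample paired with a contextual description more concrete, here is a minimal Python sketch of what one such record might look like and how it could be turned into a prompt/completion pair for training. This is an illustration only, not the actual PennyLang schema: the class name `QuantumCodeSample`, its fields (`description`, `code`, `source`), and the helper `to_training_pair` are all hypothetical. The embedded snippet uses standard PennyLane calls, but it is stored as a string, so PennyLane does not need to be installed to run the sketch.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class QuantumCodeSample:
    """Hypothetical record layout; the real PennyLang schema may differ."""
    description: str  # contextual description of what the code does
    code: str         # PennyLane source code, stored as plain text
    source: str       # assumed label: "textbook", "documentation", or "repository"


# A small PennyLane program kept as text (it is not executed here).
PENNYLANE_SNIPPET = '''import pennylane as qml

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def bell_state():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.probs(wires=[0, 1])
'''

sample = QuantumCodeSample(
    description="Prepare a two-qubit Bell state and return the measurement probabilities.",
    code=PENNYLANE_SNIPPET,
    source="documentation",
)


def to_training_pair(s: QuantumCodeSample) -> dict:
    """Map a record to a prompt/completion pair (one possible training format)."""
    return {"prompt": s.description, "completion": s.code}


# Serialize to a single JSON line and read it back, as in a JSON Lines file.
line = json.dumps(asdict(sample))
restored = QuantumCodeSample(**json.loads(line))
```

Storing each record as one JSON line keeps the dataset easy to stream and filter; the same records could equally be indexed as reference material rather than used for training.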
Key facts
- PennyLang dataset contains 3,347 PennyLane-specific quantum code samples
- Dataset curated from textbooks, official documentation, and open-source repositories
- Designed to improve LLM-based quantum code generation
- Includes contextual descriptions for code samples
- Released as open-source resource
- Framework enables automated quantum code dataset construction
- Addresses lack of high-quality datasets for quantum software development
- Research documented in arXiv preprint 2503.02497v4
Entities
Institutions
- arXiv