HWE-Bench: New Benchmark Tests LLMs on Hardware Bug Repair
A team of researchers has released HWE-Bench, the first large-scale, repository-level benchmark for evaluating large language models (LLMs) on real-world hardware bug repair. The benchmark comprises 417 task instances drawn from historical bug-fix pull requests across six prominent open-source projects, including RISC-V cores, SoCs, and security roots-of-trust, written in the Verilog/SystemVerilog and Chisel hardware description languages. Each task runs in a fully containerized environment: the LLM agent is given a genuine bug report, and its proposed fix is validated against the project's native simulation and regression flows. A largely automated construction pipeline makes it straightforward to extend the benchmark to additional repositories. The authors evaluated seven LLMs across four agent configurations. The work addresses a notable gap: existing hardware design benchmarks focus on isolated, component-level tasks and neglect repository-scale evaluation. The study is available on arXiv under identifier 2604.14709.
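The article does not include the benchmark's data format, but a rough sketch helps illustrate what a repository-level task instance might contain. All field names below are illustrative assumptions, not HWE-Bench's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """Hypothetical shape of one HWE-Bench task (field names are assumptions)."""
    instance_id: str        # e.g. a "<repo>__<pr-number>"-style identifier
    repo: str               # source project: RISC-V core, SoC, root-of-trust, ...
    hdl: str                # "verilog", "systemverilog", or "chisel"
    base_commit: str        # commit at which the historical bug is present
    bug_report: str         # the bug report shown to the LLM agent
    gold_patch: str         # the historical bug-fix diff (withheld from the agent)
    test_cmd: str           # project-native simulation/regression command
    container_image: str    # image for the fully containerized environment
```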
Key facts
- HWE-Bench is the first large-scale, repository-level benchmark for LLM agents on hardware bug repair.
- It contains 417 task instances from real historical bug-fix pull requests.
- Covers six major open-source projects including RISC-V cores, SoCs, and security roots-of-trust.
- Projects span Verilog/SystemVerilog and Chisel hardware description languages.
- Each task runs in a containerized environment and is validated via the project's native simulation and regression flows (see the harness sketch after this list).
- Built with a largely automated pipeline that allows efficient expansion to new repositories.
- Seven LLMs were evaluated with four agent configurations.
- Addresses lack of repository-scale evaluation in existing hardware design benchmarks.
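As a rough illustration of the containerized validation step, the sketch below applies an agent's candidate patch inside a task's container and treats success of the project's native regression/simulation command as the pass signal. The Docker invocation, file layout, and `TaskInstance` fields are assumptions, not the authors' published harness.

```python
import os
import subprocess

def evaluate(instance, agent_patch: str, timeout_s: int = 3600) -> bool:
    """Apply a candidate fix in the task's container and run the native tests.

    A minimal sketch under assumed conventions; the real HWE-Bench harness
    may stage patches and collect results differently.
    """
    # Write the agent's diff to the host so it can be mounted read-only.
    with open("agent.diff", "w") as f:
        f.write(agent_patch)

    # Inside the container: check out the buggy commit, apply the patch,
    # then run the project's own simulation/regression command.
    script = (
        f"git checkout {instance.base_commit} && "
        "git apply /patch/agent.diff && "
        f"{instance.test_cmd}"
    )
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{os.getcwd()}/agent.diff:/patch/agent.diff:ro",
         instance.container_image, "bash", "-lc", script],
        capture_output=True, timeout=timeout_s,
    )
    return result.returncode == 0  # pass iff the regression suite succeeds
```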