HWE-Bench: New Benchmark Tests LLMs on Hardware Bug Repair
A team of researchers has released HWE-Bench, the first large-scale, repository-level benchmark for evaluating large language models (LLMs) on real-world hardware bug repair. The benchmark comprises 417 task instances drawn from historical bug-fix pull requests across six prominent open-source projects, including RISC-V cores, SoCs, and security roots-of-trust, written in the Verilog/SystemVerilog and Chisel hardware description languages. Each task runs in a fully containerized environment: the LLM agent is given a genuine bug report, and its proposed fix is validated against the project's native simulation and regression flows. A largely automated construction pipeline makes it straightforward to extend the benchmark to additional repositories. The authors evaluated seven LLMs across four agent configurations. The work addresses a notable gap: existing hardware design benchmarks focus on isolated, component-level tasks and neglect repository-scale evaluation. The study is available on arXiv under identifier 2604.14709.
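The article does not include the benchmark's data format, but a rough sketch helps illustrate what a repository-level task instance might contain. All field names below are illustrative assumptions, not HWE-Bench's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """Hypothetical shape of one HWE-Bench task (field names are assumptions)."""
    instance_id: str        # e.g. a "<repo>__<pr-number>"-style identifier
    repo: str               # source project: RISC-V core, SoC, root-of-trust, ...
    hdl: str                # "verilog", "systemverilog", or "chisel"
    base_commit: str        # commit at which the historical bug is present
    bug_report: str         # the bug report shown to the LLM agent
    gold_patch: str         # the historical bug-fix diff (withheld from the agent)
    test_cmd: str           # project-native simulation/regression command
    container_image: str    # image for the fully containerized environment
```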
Key facts
- HWE-Bench is the first large-scale, repository-level benchmark for LLM agents on hardware bug repair.
- It contains 417 task instances from real historical bug-fix pull requests.
- Covers six major open-source projects including RISC-V cores, SoCs, and security roots-of-trust.
- Projects span Verilog/SystemVerilog and Chisel hardware description languages.
- Each task runs in a containerized environment and is validated via the project's native simulation and regression flows (see the harness sketch after this list).
- Built with a largely automated pipeline that allows efficient expansion to new repositories.
- Seven LLMs were evaluated with four agent configurations.
- Addresses lack of repository-scale evaluation in existing hardware design benchmarks.
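As a rough illustration of the containerized validation step, the sketch below applies an agent's candidate patch inside a task's container and treats success of the project's native regression/simulation command as the pass signal. The Docker invocation, file layout, and `TaskInstance` fields are assumptions, not the authors' published harness.

```python
import os
import subprocess

def evaluate(instance, agent_patch: str, timeout_s: int = 3600) -> bool:
    """Apply a candidate fix in the task's container and run the native tests.

    A minimal sketch under assumed conventions; the real HWE-Bench harness
    may stage patches and collect results differently.
    """
    # Write the agent's diff to the host so it can be mounted read-only.
    with open("agent.diff", "w") as f:
        f.write(agent_patch)

    # Inside the container: check out the buggy commit, apply the patch,
    # then run the project's own simulation/regression command.
    script = (
        f"git checkout {instance.base_commit} && "
        "git apply /patch/agent.diff && "
        f"{instance.test_cmd}"
    )
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{os.getcwd()}/agent.diff:/patch/agent.diff:ro",
         instance.container_image, "bash", "-lc", script],
        capture_output=True, timeout=timeout_s,
    )
    return result.returncode == 0  # pass iff the regression suite succeeds
```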