ARTFEED — Contemporary Art Intelligence

HWE-Bench: New Benchmark Tests LLMs on Hardware Bug Repair

other · 2026-04-25

A team of researchers has launched HWE-Bench, the first large-scale, repository-level benchmark for evaluating large language models (LLMs) on real-world hardware bug repair. The benchmark comprises 417 task instances derived from actual historical bug-fix pull requests across six prominent open-source projects, including RISC-V cores, SoCs, and security roots-of-trust, spanning the Verilog/SystemVerilog and Chisel hardware description languages. Each task runs in a fully containerized environment: the LLM agent must resolve a genuine bug report, and its patch is validated through the project's native simulation and regression flows (a minimal sketch of what such a setup could look like appears below). The benchmark was built with a largely automated pipeline that allows for seamless expansion to additional repositories, and the evaluation covers seven LLMs across four agent configurations. The work fills a notable gap: existing hardware design benchmarks mostly target isolated, component-level tasks and neglect repository-scale evaluation. The study is available on arXiv under identifier 2604.14709.
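The article does not include the benchmark's actual schema or harness code, so the following is a minimal, hypothetical Python sketch of what a containerized task instance and its validation step could look like. All names here (TaskInstance, validate_patch, the field names, and the docker invocation) are illustrative assumptions, not HWE-Bench's real interface.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """One benchmark task: a real bug report pinned to a repo snapshot."""
    repo_url: str        # e.g. an open-source RISC-V core repository
    buggy_commit: str    # commit immediately before the historical fix
    bug_report: str      # issue / PR description shown to the agent
    image: str           # container image with the project's toolchain
    regression_cmd: str  # project-native simulation/regression entry point

def validate_patch(task: TaskInstance, patched_workdir: str) -> bool:
    """Run the project's own regression flow inside the task container.

    The agent's patch counts as correct only if the project's native
    simulation and regression suite passes, mirroring how the original
    fix was accepted upstream.
    """
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{patched_workdir}:/work", "-w", "/work",
         task.image, "sh", "-c", task.regression_cmd],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```

Judging correctness with each project's own regression suite, rather than hand-written oracles, is plausibly what lets the benchmark scale across heterogeneous Verilog/SystemVerilog and Chisel codebases.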

Key facts

  • HWE-Bench is the first large-scale, repository-level benchmark for LLM agents on hardware bug repair.
  • It contains 417 task instances from real historical bug-fix pull requests.
  • Covers six major open-source projects including RISC-V cores, SoCs, and security roots-of-trust.
  • Projects span Verilog/SystemVerilog and Chisel hardware description languages.
  • Each task is in a containerized environment with validation via native simulation and regression flows.
  • Built with an automated pipeline for efficient expansion to new repositories (see the mining sketch after this list).
  • Seven LLMs were evaluated with four agent configurations.
  • Addresses lack of repository-scale evaluation in existing hardware design benchmarks.
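
The summary says task instances were mined largely automatically from historical bug-fix pull requests, but gives no pipeline details. Below is a hedged sketch of how such mining might begin: filtering git history for merge commits that look like bug fixes and touch HDL sources. The function names, keyword list, and file-extension filters are assumptions for illustration, not the paper's method.

```python
import subprocess

BUGFIX_KEYWORDS = ("fix", "bug", "repair", "correct")
HDL_EXTENSIONS = (".v", ".sv", ".scala")  # Verilog/SystemVerilog and Chisel sources

def candidate_bugfix_commits(repo_dir: str) -> list[str]:
    """Scan history for merged commits whose messages suggest a bug fix."""
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--merges", "--format=%H %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = []
    for line in log.splitlines():
        sha, _, subject = line.partition(" ")
        if any(k in subject.lower() for k in BUGFIX_KEYWORDS):
            hits.append(sha)
    return hits

def touches_hdl(repo_dir: str, sha: str) -> bool:
    """Keep only commits that modify HDL sources, not docs or CI config."""
    files = subprocess.run(
        ["git", "-C", repo_dir, "show", "--name-only", "--format=", sha],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return any(f.endswith(HDL_EXTENSIONS) for f in files)
```

In a real pipeline, surviving candidates would still need the pre-fix snapshot, the bug report, and a reproducing regression run packaged per task, as described in the summary above.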
