Phoenix-bench: Benchmarking Agentic AI for Hardware Engineering
Phoenix-bench is a benchmark that evaluates how well agentic AI systems built for software engineering handle real-world hardware engineering tasks. Existing hardware LLM benchmarks examine sub-tasks in isolation; Phoenix-bench instead requires them jointly, combining elements such as repository navigation and Electronic Design Automation (EDA) verification. It comprises 511 verified Verilator instances drawn from 114 GitHub repositories, each with a developer patch and testbenches. The researchers evaluated four commercial agents and eight open-source frameworks across four LLM backbones, and tested two diagnostic interventions: file-level oracle localization and one round of testbench-log feedback. The study asks whether agents designed for software engineering carry over to complex hardware engineering tasks.
Key facts
- Phoenix-bench is a new benchmark for agentic AI in hardware engineering.
- It includes 511 verified Verilator instances from 114 GitHub repositories.
- Each instance includes a developer patch, design-flow labels, testbenches, and a Docker-pinned EDA environment.
- The study evaluated four commercial agents and eight open-source agent frameworks across four LLM backbones.
- Two diagnostic interventions were tested: file-level oracle localization and one round of testbench-log feedback.
- Existing hardware LLM benchmarks isolate sub-tasks and do not require them jointly, for example repository navigation together with EDA verification.
- The benchmark ensures resolved-rate differences reflect agent behavior, not toolchain availability.
- The study is published on arXiv with identifier 2605.15226.
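To make the instance contents and the resolved-rate comparison above concrete, here is a minimal Python sketch. It is not the benchmark's actual schema or scoring code: every field name, the `resolved_rate` and `intervention_gain` helpers, and the toy outcomes are hypothetical. It also assumes that an instance counts as "resolved" when its testbenches pass after the agent's patch is applied, which the summary does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchInstance:
    """One benchmark instance. All field names are hypothetical."""
    instance_id: str
    repo: str                            # source GitHub repository
    developer_patch: str                 # reference fix written by the developer
    design_flow_labels: tuple[str, ...]  # design-flow labels for the task
    testbenches: tuple[str, ...]         # paths of the testbench files
    docker_image: str                    # pinned EDA environment


def resolved_rate(outcomes: dict[str, bool]) -> float:
    """Fraction of instances marked resolved (assumed: testbenches pass)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes.values()) / len(outcomes)


def intervention_gain(baseline: dict[str, bool], treated: dict[str, bool]) -> float:
    """Resolved-rate change, measured only on instances present in both runs."""
    shared = baseline.keys() & treated.keys()
    if not shared:
        raise ValueError("runs share no instances")
    base = sum(baseline[i] for i in shared) / len(shared)
    new = sum(treated[i] for i in shared) / len(shared)
    return new - base


# Toy outcomes, invented for illustration only (not results from the study).
baseline = {"a": True, "b": False, "c": False, "d": False}
with_oracle_files = {"a": True, "b": True, "c": False, "d": False}

print(resolved_rate(baseline))                        # 0.25
print(intervention_gain(baseline, with_oracle_files)) # 0.25
```

Restricting the comparison to shared instances keeps a diagnostic intervention, such as oracle localization or a round of testbench-log feedback, from looking better or worse simply because a different subset of instances was run.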
Entities
Institutions
- arXiv