SecureVibeBench: Benchmarking AI Code Security via Real Vulnerabilities
Researchers have introduced SecureVibeBench, a benchmark of 105 secure coding tasks in C/C++ drawn from 41 projects in OSS-Fuzz. It evaluates code agents built on large language models by reconstructing scenarios in which human developers inadvertently introduced vulnerabilities. Tasks require realistic multi-file edits in large repositories, are grounded in real open-source vulnerabilities with precisely identified introduction points, and are scored by an evaluation pipeline that combines functionality testing with security checking via static and dynamic oracles. Five popular code agents were evaluated. The benchmark fills a gap in existing benchmarks, which overlook human-introduced vulnerability scenarios, enabling fair comparison between human developers and AI agents.
Key facts
- SecureVibeBench includes 105 C/C++ secure coding tasks
- Tasks sourced from 41 projects in OSS-Fuzz
- Reconstructs scenarios in which human developers introduced vulnerabilities
- Requires multi-file edits in large repositories
- Uses real-world open-source vulnerabilities with precisely identified introduction points
- Evaluation combines functionality testing and security checking with static and dynamic oracles
- Five popular code agents were evaluated
- Addresses gap in existing benchmarks for fair human-AI comparison
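The combined pass criterion implied above can be sketched as follows. This is a hypothetical illustration, not the authors' actual harness: the `OracleResult` type and oracle names are assumptions, standing in for the benchmark's functionality tests and its static and dynamic security oracles (OSS-Fuzz projects are typically exercised under sanitizers such as ASan).

```python
# Hypothetical sketch (assumed names, not SecureVibeBench's real harness):
# an agent's edit counts as a pass only if the repository's functional
# tests still pass AND neither security oracle flags a vulnerability.

from dataclasses import dataclass

@dataclass
class OracleResult:
    functional_tests_pass: bool  # repository test suite still passes
    static_oracle_clean: bool    # e.g. no new static-analysis finding
    dynamic_oracle_clean: bool   # e.g. no sanitizer crash under fuzzing

def task_passes(r: OracleResult) -> bool:
    """A task is solved only when all three checks succeed."""
    return (r.functional_tests_pass
            and r.static_oracle_clean
            and r.dynamic_oracle_clean)

# A functionally correct edit that reintroduces a memory bug still fails:
print(task_passes(OracleResult(True, True, False)))  # False
print(task_passes(OracleResult(True, True, True)))   # True
```

Requiring all three checks is what distinguishes this setup from functionality-only benchmarks: an edit can pass every unit test yet still reintroduce the original vulnerability.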
Entities
Institutions
- arXiv
- OSS-Fuzz