LLM Safety Benchmarks: Code Quality and Community Adoption Study
A comprehensive evaluation of 31 LLM safety benchmarks covering prompt injection, jailbreak, and hallucination, with 382 non-benchmark papers as a control group, uncovers notable shortcomings in code quality and functionality. The study combines automated static analysis, more than 220 person-hours of human runnability testing, and bibliometric analysis. Only 39% of benchmark repositories run without modification, 16% provide flawless installation guides, and just 6% address ethical considerations, even though the benchmarks contain potentially dangerous content. The study aims to fill a gap in the systematic evaluation of benchmark code quality and of the factors that influence community adoption.
Key facts
- 31 LLM safety benchmarks analyzed
- 382 non-benchmark papers as control group
- 220+ person-hours of human runnability testing
- 39% of repositories run without modification
- 16% have flawless installation guides
- 6% include ethical considerations
- Covers prompt injection, jailbreak, and hallucination
- Combines automated static analysis, human testing, and bibliometric analysis
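As a rough sanity check on the percentages above, the sketch below converts each reported share into an approximate repository count. It assumes every percentage is taken over all 31 benchmarks, which the summary does not state explicitly; the counts are inferred by rounding and are not figures reported by the study. The names `reported_shares` and `implied_count` are illustrative only.

```python
# Assumption: each percentage is a share of all 31 benchmark repositories.
TOTAL_BENCHMARKS = 31

# Shares as reported in the key facts above.
reported_shares = {
    "runs without modification": 0.39,
    "flawless installation guide": 0.16,
    "includes ethical considerations": 0.06,
}


def implied_count(share: float, total: int = TOTAL_BENCHMARKS) -> int:
    """Return the nearest whole number of repositories for a reported share."""
    return round(share * total)


# Inferred counts (not reported by the study).
implied = {label: implied_count(share) for label, share in reported_shares.items()}

for label, count in implied.items():
    # Convert the rounded count back to a percentage to check it matches the reported value.
    print(f"{label}: about {count} of {TOTAL_BENCHMARKS} ({count / TOTAL_BENCHMARKS:.0%})")
```

If the study computed any of these shares over a different denominator (for example, only repositories with public code), the implied counts would differ.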
Entities
Institutions
- arXiv