LLM Safety Benchmarks: Code Quality and Community Adoption Study

ai-technology · 2026-05-18

A comprehensive evaluation of 31 LLM safety benchmarks covering prompt injection, jailbreak, and hallucination, with 382 non-benchmark papers as a control group, finds notable shortcomings in code quality and functionality. The study combines automated static analysis, more than 220 person-hours of human runnability testing, and bibliometric analysis. Only 39% of benchmark repositories run without modification, 16% provide flawless installation guides, and just 6% include ethical considerations despite containing potentially dangerous content. The study aims to address the lack of systematic evaluation of benchmark code quality and of the factors that influence community adoption.

Key facts

31 LLM safety benchmarks analyzed
382 non-benchmark papers as control group
220+ person-hours of human runnability testing
39% of repositories run without modification
16% have flawless installation guides
6% include ethical considerations
Covers prompt injection, jailbreak, and hallucination
Combines automated static analysis, human testing, and bibliometric analysis
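
To give a sense of scale, the reported percentages can be converted into rough repository counts. The sketch below is only a back-of-the-envelope illustration: it assumes each percentage is a share of all 31 benchmarks, and the resulting counts are estimates from rounding, not figures stated in the study. The names `reported_shares` and `approximate_counts` are hypothetical.

```python
# Back-of-the-envelope conversion of the reported percentages into
# approximate repository counts. Assumption: each percentage is a share
# of all 31 benchmarks; the study summary does not state the raw counts.
TOTAL_BENCHMARKS = 31

reported_shares = {
    "runs without modification": 0.39,
    "flawless installation guide": 0.16,
    "includes ethical considerations": 0.06,
}


def approximate_counts(total, shares):
    """Return the nearest whole-number count for each reported share."""
    return {label: round(total * share) for label, share in shares.items()}


if __name__ == "__main__":
    counts = approximate_counts(TOTAL_BENCHMARKS, reported_shares)
    for label, count in counts.items():
        print(f"{label}: about {count} of {TOTAL_BENCHMARKS}")
```

Under that assumption, the shares correspond to roughly 12, 5, and 2 repositories respectively, which underlines how small the fully working subset is.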
