Delulu: Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Researchers introduced Delulu, a verified multi-lingual benchmark of 1,951 samples across 7 languages and 4 hallucination types for detecting code hallucinations in Fill-in-the-Middle (FIM) tasks. The benchmark targets hallucinations such as invented API methods, invalid parameters, undefined variables, and non-existent imports, which pass superficial review but cause runtime errors. Samples were curated through an adversarial pipeline: a frontier LLM generated plausible hallucinations, four diverse judge models evaluated them, embedding-based clustering mined harder examples, Docker containers verified that golden completions compile while hallucinated variants produce the expected errors, and human-expert review removed biased or trivially decidable samples. The study evaluated 11 open-weight FIM models from five families spanning 0.5B to 32B parameters, addressing a gap in the evaluation of FIM code-generation reliability.
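The execution-based verification step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: it assembles a FIM sample into runnable source and accepts it only if the golden middle executes cleanly while the hallucinated middle raises the expected error class (e.g., an invented API method raising `AttributeError`). The function names (`fill`, `verify_sample`) and the inline example are hypothetical.

```python
def fill(prefix: str, middle: str, suffix: str) -> str:
    """Assemble a Fill-in-the-Middle sample into a runnable snippet."""
    return prefix + middle + suffix

def verify_sample(prefix, suffix, golden, hallucinated, expected_error):
    """Accept a sample only if the golden middle runs and the
    hallucinated middle fails with the expected error class."""
    try:
        exec(fill(prefix, golden, suffix), {})   # golden must run cleanly
    except Exception:
        return False
    try:
        exec(fill(prefix, hallucinated, suffix), {})
        return False                             # hallucination slipped through
    except expected_error:
        return True                              # fails in the expected way
    except Exception:
        return False                             # wrong failure mode

# Hypothetical sample: an invented API method should raise AttributeError.
ok = verify_sample(
    prefix="import math\nvalue = ",
    suffix="\nprint(value)",
    golden="math.sqrt(9)",
    hallucinated="math.square_root(9)",  # invented method, not in math
    expected_error=AttributeError,
)
```

Running real benchmark samples this way inside a container (as the pipeline does with Docker) isolates side effects of executing untrusted generated code.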
Key facts
- Delulu benchmark contains 1,951 FIM samples
- Covers 7 programming languages
- Includes 4 hallucination types
- Uses adversarial pipeline with frontier LLM
- Four judge models evaluate samples
- Docker containers verify compilation and errors
- Human-expert review serves as the final filtering step
- Evaluated 11 open-weight FIM models from 5 families
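One plausible way to score the listed FIM models on such a benchmark is pairwise preference: check whether a model's scorer rates the golden middle above the hallucinated one for each sample. The sketch below assumes this framing; `Sample`, `detection_accuracy`, and the toy scorer are illustrative stand-ins, not the paper's actual metric or API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prefix: str
    suffix: str
    golden: str        # verified completion
    hallucinated: str  # adversarial variant

def detection_accuracy(samples, score_middle):
    """Fraction of samples where the scorer prefers the golden middle."""
    correct = sum(
        score_middle(s.prefix, s.golden, s.suffix)
        > score_middle(s.prefix, s.hallucinated, s.suffix)
        for s in samples
    )
    return correct / len(samples)

# Toy stand-in scorer (a real run would use model log-likelihoods):
# rewards middles that reference names actually defined in the prefix.
def toy_scorer(prefix, middle, suffix):
    defined = {ln.split("=")[0].strip() for ln in prefix.splitlines() if "=" in ln}
    return sum(1 for name in defined if name in middle)

samples = [
    Sample("total = 0\n", "\nprint(total)", "total + 1", "cuont + 1"),
    Sample("items = []\n", "\nprint(items)", "items", "itmes"),
]
acc = detection_accuracy(samples, toy_scorer)
```

Swapping `toy_scorer` for a model-specific log-likelihood over the middle span given prefix and suffix turns this into a likelihood-based detection evaluation.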