Delulu: Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Researchers introduced Delulu, a verified multi-lingual benchmark of 1,951 samples across 7 languages and 4 hallucination types for detecting code hallucinations in Fill-in-the-Middle (FIM) tasks. The benchmark targets hallucinations such as invented API methods, invalid parameters, undefined variables, and non-existent imports, which pass superficial review but cause runtime errors. Samples were curated through an adversarial pipeline: a frontier LLM generated plausible hallucinations, four diverse judge models evaluated them, embedding-based clustering mined harder examples, Docker containers verified that golden completions compile while hallucinated variants produce the expected errors, and human-expert review removed biased or trivially decidable samples. The study evaluated 11 open-weight FIM models from five families spanning 0.5B to 32B parameters, addressing a gap in the evaluation of FIM code-generation reliability.
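The execution-based verification step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: it assembles a FIM sample into runnable source and accepts it only if the golden middle executes cleanly while the hallucinated middle raises the expected error class (e.g., an invented API method raising `AttributeError`). The function names (`fill`, `verify_sample`) and the inline example are hypothetical.

```python
def fill(prefix: str, middle: str, suffix: str) -> str:
    """Assemble a Fill-in-the-Middle sample into a runnable snippet."""
    return prefix + middle + suffix

def verify_sample(prefix, suffix, golden, hallucinated, expected_error):
    """Accept a sample only if the golden middle runs and the
    hallucinated middle fails with the expected error class."""
    try:
        exec(fill(prefix, golden, suffix), {})   # golden must run cleanly
    except Exception:
        return False
    try:
        exec(fill(prefix, hallucinated, suffix), {})
        return False                             # hallucination slipped through
    except expected_error:
        return True                              # fails in the expected way
    except Exception:
        return False                             # wrong failure mode

# Hypothetical sample: an invented API method should raise AttributeError.
ok = verify_sample(
    prefix="import math\nvalue = ",
    suffix="\nprint(value)",
    golden="math.sqrt(9)",
    hallucinated="math.square_root(9)",  # invented method, not in math
    expected_error=AttributeError,
)
```

Running real benchmark samples this way inside a container (as the pipeline does with Docker) isolates side effects of executing untrusted generated code.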
Key facts
- Delulu benchmark contains 1,951 FIM samples
- Covers 7 programming languages
- Includes 4 hallucination types
- Uses adversarial pipeline with frontier LLM
- Four judge models evaluate samples
- Docker containers verify compilation and errors
- Human-expert review serves as the final filtering step
- Evaluated 11 open-weight FIM models from 5 families
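One plausible way to score the listed FIM models on such a benchmark is pairwise preference: check whether a model's scorer rates the golden middle above the hallucinated one for each sample. The sketch below assumes this framing; `Sample`, `detection_accuracy`, and the toy scorer are illustrative stand-ins, not the paper's actual metric or API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prefix: str
    suffix: str
    golden: str        # verified completion
    hallucinated: str  # adversarial variant

def detection_accuracy(samples, score_middle):
    """Fraction of samples where the scorer prefers the golden middle."""
    correct = sum(
        score_middle(s.prefix, s.golden, s.suffix)
        > score_middle(s.prefix, s.hallucinated, s.suffix)
        for s in samples
    )
    return correct / len(samples)

# Toy stand-in scorer (a real run would use model log-likelihoods):
# rewards middles that reference names actually defined in the prefix.
def toy_scorer(prefix, middle, suffix):
    defined = {ln.split("=")[0].strip() for ln in prefix.splitlines() if "=" in ln}
    return sum(1 for name in defined if name in middle)

samples = [
    Sample("total = 0\n", "\nprint(total)", "total + 1", "cuont + 1"),
    Sample("items = []\n", "\nprint(items)", "items", "itmes"),
]
acc = detection_accuracy(samples, toy_scorer)
```

Swapping `toy_scorer` for a model-specific log-likelihood over the middle span given prefix and suffix turns this into a likelihood-based detection evaluation.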