ARTFEED — Contemporary Art Intelligence

Delulu: Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

ai-technology · 2026-05-12

Researchers introduced Delulu, a verified multilingual benchmark of 1,951 samples across 7 languages and 4 hallucination types for detecting code hallucinations in Fill-in-the-Middle (FIM) tasks. The benchmark targets hallucinations that pass superficial review but cause runtime errors: invented API methods, invalid parameters, undefined variables, and non-existent imports. Samples were curated through an adversarial pipeline: a frontier LLM generated plausible hallucinations, four diverse judge models evaluated them, embedding-based clustering mined harder examples, Docker containers verified that golden completions compile while hallucinated variants produce the expected errors, and human-expert review removed biased or trivially decidable samples. The study evaluated 11 open-weight FIM models from five families spanning 0.5B–32B parameters, addressing a critical gap in code generation reliability.
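To make the "invented API method" category concrete, here is a minimal hypothetical Python FIM sample of our own construction (not taken from the benchmark): the hallucinated middle looks plausible at a glance but calls a method that does not exist, so it only fails at runtime.

```python
# Hypothetical illustration of an "invented API method" FIM hallucination.
# The prefix/suffix frame the gap; the model fills in the middle.
PREFIX = "def collect(items):\n    result = []\n    for item in items:\n"
SUFFIX = "\n    return result"

golden_middle = "        result.append(item)"      # real list method
hallucinated_middle = "        result.push(item)"  # plausible but invented

def runs_cleanly(middle: str) -> bool:
    """Assemble and run the completed function on sample input;
    the invented method surfaces as an AttributeError at runtime."""
    namespace = {}
    exec(PREFIX + middle + SUFFIX, namespace)
    try:
        namespace["collect"]([1, 2, 3])
        return True
    except AttributeError:
        return False

print(runs_cleanly(golden_middle))        # True
print(runs_cleanly(hallucinated_middle))  # False
```

Both completions are syntactically valid Python, which is exactly why such hallucinations pass superficial review.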

Key facts

  • Delulu benchmark contains 1,951 FIM samples
  • Covers 7 programming languages
  • Includes 4 hallucination types
  • Uses adversarial pipeline with frontier LLM
  • Four judge models evaluate samples
  • Docker containers verify compilation and errors
  • Human-expert review final step
  • Evaluated 11 open-weight FIM models from 5 families
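The Docker verification step listed above can be sketched in simplified form, using the local interpreter instead of per-language containers (an assumption for brevity; function and variable names are ours): a sample is kept only if the golden variant executes cleanly and the hallucinated variant fails with the expected error type.

```python
# Simplified sketch of the pipeline's verification filter (no containers):
# accept a sample only when golden runs cleanly and the hallucinated
# variant raises exactly the expected error class.
def verify_sample(golden_src: str, hallucinated_src: str,
                  expected_error: type) -> bool:
    try:
        exec(golden_src, {})        # golden completion must execute cleanly
    except Exception:
        return False
    try:
        exec(hallucinated_src, {})  # hallucinated variant must fail...
    except expected_error:
        return True                 # ...with the expected error type
    except Exception:
        return False
    return False                    # a variant that runs cleanly is rejected

golden = "nums = [1, 2]\nnums.append(3)"
bad = "nums = [1, 2]\nnums.push(3)"  # invented method -> AttributeError
print(verify_sample(golden, bad, AttributeError))  # True
```

Rejecting hallucinated variants that run cleanly is what makes the remaining samples trustworthy ground truth: every kept hallucination demonstrably breaks at runtime.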
