LLMs Finetuned to Detect Machine-Generated Code at SemEval-2026
A team submitted systems to SemEval-2026 Task 13, which focuses on detecting machine-generated code snippets across multiple programming languages. The task frames the problem both as binary detection and as source attribution: its three subtasks cover binary detection, generator LLM family identification, and detection of hybrid or adversarially modified code. The team adapted the existing mdok approach, originally designed for machine-generated text, by exploring base models better suited to code understanding. Their systems were competitive in all three subtasks, but the top-performing systems still led by significant margins, leaving room for improvement.
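The task structure can be read as three classification problems over the same kind of input (a code snippet), differing only in their label sets. The following is a minimal Python sketch of that framing, not the team's implementation: the label names, the `Subtask` class, and the `predict` helper are all hypothetical, since this summary does not list the task's actual label inventory, and the per-label scores stand in for whatever a finetuned model would output.

```python
from dataclasses import dataclass

# Hypothetical label sets: placeholders only, not the task's real labels.
BINARY_LABELS = ("human", "machine")
FAMILY_LABELS = ("human", "family_a", "family_b")
HYBRID_LABELS = ("human", "machine", "hybrid", "adversarial")


@dataclass(frozen=True)
class Subtask:
    """One classification problem: a name plus an ordered label set."""
    name: str
    labels: tuple

    def encode(self, label: str) -> int:
        # Map a label string to its class index.
        return self.labels.index(label)

    def decode(self, index: int) -> str:
        # Map a class index back to its label string.
        return self.labels[index]


SUBTASKS = {
    "binary": Subtask("binary detection", BINARY_LABELS),
    "family": Subtask("generator family identification", FAMILY_LABELS),
    "hybrid": Subtask("hybrid/adversarial detection", HYBRID_LABELS),
}


def predict(subtask: Subtask, scores: list) -> str:
    """Return the label with the highest score (one score per label)."""
    if len(scores) != len(subtask.labels):
        raise ValueError("expected one score per label")
    best = max(range(len(scores)), key=scores.__getitem__)
    return subtask.decode(best)


if __name__ == "__main__":
    # Scores here are made up; a finetuned classifier would supply them.
    print(predict(SUBTASKS["binary"], [0.2, 0.8]))  # machine
```

Under this reading, adapting a text detector to code mainly means swapping the base model that produces the scores, while the label-set handling per subtask stays the same.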
Key facts
- SemEval-2026 Task 13 addresses multi-domain detection of machine-generated code.
- The task covers binary detection and source attribution: its three subtasks are binary detection, generator LLM family detection, and hybrid/adversarial code detection.
- The submitted systems adapted the mdok approach for code-specific detection.
- Different base models were explored for better code understanding.
- Systems were competitive in all three subtasks.
- The gap to the top-performing systems remains significant.
- The work is published on arXiv under Computer Science > Machine Learning.
- The paper is titled 'mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code'.
Entities
Institutions
- arXiv
- SemEval