
DeepSeek-R1 Knowledge Distillation for Cross-Language Code Clone Detection

ai-technology · 2026-05-06

Researchers have introduced a knowledge distillation framework that transfers the reasoning abilities of DeepSeek-R1 into compact open-source models for cross-language code clone detection (X-CCD). The approach addresses the drawbacks of using large language models (LLMs) as black-box systems: cost, reproducibility, privacy, and inconsistent output formatting. Using cross-language code pairs from Project CodeNet, the authors generate reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA adapters. They also introduce response stabilization techniques to keep the mapping to binary clone labels consistent. The goal is to enable compact models to detect semantic clones across programming languages effectively.
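
As a rough illustration of the data-construction step, the sketch below assembles one reasoning-oriented training record from a cross-language code pair. The prompt template, the `query_teacher` stub standing in for a DeepSeek-R1 API call, and the record fields are all assumptions for illustration, not the paper's actual pipeline.

```python
import re

def build_prompt(code_a: str, lang_a: str, code_b: str, lang_b: str) -> str:
    """Ask the teacher to reason step by step, then end with a fixed verdict line."""
    return (
        "Are these two programs semantic clones?\n\n"
        f"--- {lang_a} ---\n{code_a}\n\n"
        f"--- {lang_b} ---\n{code_b}\n\n"
        "Think step by step, then answer on the last line as "
        "'VERDICT: CLONE' or 'VERDICT: NOT_CLONE'."
    )

def query_teacher(prompt: str) -> str:
    """Hypothetical teacher call; replace with a real DeepSeek-R1 client."""
    raise NotImplementedError

def make_record(code_a: str, lang_a: str, code_b: str, lang_b: str):
    """Build one synthetic training example: prompt, teacher reasoning, label."""
    prompt = build_prompt(code_a, lang_a, code_b, lang_b)
    response = query_teacher(prompt)
    match = re.search(r"VERDICT:\s*(CLONE|NOT_CLONE)", response)
    if match is None:
        return None  # drop teacher outputs that lack a parseable verdict
    return {"prompt": prompt, "reasoning": response, "label": match.group(1)}
```

Records like these, accumulated over Project CodeNet pairs, would serve as the supervised fine-tuning set for the student models.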

Key facts

  • Cross-language code clone detection (X-CCD) is challenging due to low surface similarity between semantically equivalent programs in different languages.
  • Large language models (LLMs) used as black-box systems raise concerns about cost, reproducibility, privacy, and unreliable output formatting.
  • Compact open-source models often struggle with reasoning-oriented prompts and consistent binary clone label mapping.
  • A knowledge distillation framework transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models.
  • Cross-language code pairs from Project CodeNet are used to construct reasoning-oriented synthetic training data.
  • Phi3 and Qwen-Coder are fine-tuned with LoRA adapters (a configuration sketch follows this list).
  • Response stabilization methods are introduced to improve output consistency (one possible label normalizer is sketched after this list).
  • The framework aims to enable compact models for effective semantic clone detection across languages.
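
For the LoRA fine-tuning step, a minimal configuration using the PEFT library is sketched below. The base checkpoint, rank, and target modules are assumptions for illustration; the paper's actual hyperparameters are not reproduced here.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed base checkpoint; the paper fine-tunes Phi3 and Qwen-Coder students.
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Hypothetical LoRA hyperparameters: low-rank adapters are injected into the
# attention projections, so only a small fraction of weights is trained.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # module names vary per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # adapters are typically <1% of all weights
```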

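The response stabilization step is described only at a high level. One plausible reading, sketched below, is a post-hoc normalizer that maps free-form student output onto a consistent binary clone label; the keyword patterns and the None fallback are assumptions, not the paper's method.

```python
import re
from typing import Optional

# Hypothetical patterns; negatives are checked first so "not a clone" and
# "NOT_CLONE" are not misread as positive verdicts.
NEGATIVE = re.compile(r"not[_\s]*(a\s+)?clones?|\bdifferent\b|\bno\b", re.IGNORECASE)
POSITIVE = re.compile(r"\bclones?\b|\bequivalent\b|\byes\b", re.IGNORECASE)

def stabilize(raw: str) -> Optional[str]:
    """Map free-form model output to 'CLONE', 'NOT_CLONE', or None if unusable."""
    last_line = raw.strip().splitlines()[-1] if raw.strip() else ""
    if NEGATIVE.search(last_line):
        return "NOT_CLONE"
    if POSITIVE.search(last_line):
        return "CLONE"
    return None  # caller may retry with a stricter prompt

print(stabilize("Step 1: compare loops...\nVERDICT: CLONE"))  # -> CLONE
print(stabilize("The two programs are not clones."))          # -> NOT_CLONE
```
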
Entities

Institutions

  • DeepSeek

Models & datasets

  • DeepSeek-R1
  • Phi3
  • Qwen-Coder
  • Project CodeNet

Sources