Code Language Models Fine-Tuned to Detect Cross-Language Programming Bugs
A new research paper explores using pre-trained code language models (CodeLMs) to identify cross-language bugs, which arise when multiple programming languages interact within a single project. The study fine-tuned 13 CodeLMs on a specially constructed dataset covering three programming language combinations: Python-C/C++, Java-C/C++, and Python-Java. The researchers developed CLCFinder, a tool for identifying cross-language code, and built a dataset covering nine distinct types of interaction between languages. After fine-tuning, all models showed performance improvements, with UniXcoder-base achieving the highest F1 score of 0.7407. The study also analyzed how factors such as dataset size, token sequence length, and code comments affect detection performance. Multilingual programming has become increasingly common because of its advantages, but it introduces bugs that traditional single-language tools struggle to detect. The paper was announced as arXiv:2507.21954v2 with a replace-cross announcement type.
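To make the notion of a cross-language bug concrete, here is a minimal sketch of one common Python-C/C++ failure mode: the type contract across the language boundary is declared incorrectly on the Python side. This example uses Python's standard ctypes module and the C math library purely for illustration; it is not taken from the paper or from CLCFinder.

```python
import ctypes
import ctypes.util

# Locate the C math library (falls back to a common Linux soname).
libm_path = ctypes.util.find_library("m") or "libm.so.6"

# Buggy binding: ctypes assumes C functions return int by default,
# so the double returned by sqrt() is misinterpreted and lost.
# Each language sees a valid program; the bug lives at the boundary.
libm_buggy = ctypes.CDLL(libm_path)
libm_buggy.sqrt.argtypes = [ctypes.c_double]
wrong = libm_buggy.sqrt(9.0)  # misread as an int; the real result is lost

# Fixed binding: declare the full cross-language signature explicitly.
libm_fixed = ctypes.CDLL(libm_path)
libm_fixed.sqrt.argtypes = [ctypes.c_double]
libm_fixed.sqrt.restype = ctypes.c_double
right = libm_fixed.sqrt(9.0)  # 3.0
```

Single-language tools struggle here because the Python code and the C code are each internally consistent; only a tool that analyzes the interaction between them can flag the missing `restype` declaration.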
Key facts
- 13 code language models were fine-tuned for cross-language bug detection
- UniXcoder-base achieved the best F1 score of 0.7407
- The dataset included three programming language combinations: Python-C/C++, Java-C/C++, and Python-Java
- Researchers developed CLCFinder for cross-language code identification
- The dataset contained nine different interaction types between programming languages
- All models showed performance improvements after fine-tuning
- Multilingual programming is increasingly common but introduces cross-language bugs
- The paper was announced as arXiv:2507.21954v2 with replace-cross type
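For reference, the F1 score reported above is the harmonic mean of precision and recall. The sketch below uses made-up precision and recall values chosen only to show how a score near 0.7407 can arise; the paper's actual precision/recall breakdown is not given here.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical values for illustration, not the paper's figures:
score = f1(0.75, 0.7317)
print(round(score, 4))  # prints 0.7407
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two components, so a 0.7407 implies both precision and recall are reasonably balanced rather than one being very high and the other very low.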