TRACER: Semantic-Aware Framework for Code LLM Contamination Detection
Researchers have introduced TRACER, a semantic-aware framework for detecting data contamination in code large language models (LLMs). Unlike conventional approaches that rely on exact matches, TRACER models contamination at three semantic levels (Functionally Identical, Nearly Identical, and Shared Logic) and detects it with a coarse-to-fine pipeline. The researchers also built what they describe as the first benchmark for fine-grained code contamination detection, covering three widely used benchmarks and three representative post-training datasets. TRACER performed well across different LLMs: with GPT-5 it reached an F1 score of 0.91 for fine-grained detection and 0.92 for binary detection, outperforming existing methods by 42%-217%. Ablation studies and error analysis further support its effectiveness.
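The summary gives no implementation details, so the following is only a minimal sketch of what a coarse-to-fine contamination check could look like, not TRACER's actual method. It assumes a cheap token-overlap filter as the coarse stage and a pluggable judge as the fine stage. All names here (Level, coarse_filter, placeholder_judge, detect), the thresholds, the severity ordering of the levels, and the extra CLEAN label are hypothetical. In particular, placeholder_judge uses surface similarity as a stand-in and does not reflect how the paper defines the three levels.

```python
import re
from difflib import SequenceMatcher
from enum import IntEnum
from typing import Callable, List


class Level(IntEnum):
    # The three contamination levels come from the summary; the CLEAN label
    # and the severity ordering are assumptions made for this sketch.
    CLEAN = 0
    SHARED_LOGIC = 1
    NEARLY_IDENTICAL = 2
    FUNCTIONALLY_IDENTICAL = 3


def tokens(code: str) -> set:
    """Split code into identifiers, numbers, and single punctuation symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code snippets, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def coarse_filter(item: str, corpus: List[str],
                  top_k: int = 3, min_sim: float = 0.3) -> List[str]:
    """Coarse stage: keep at most top_k training samples with enough overlap."""
    scored = sorted(((jaccard(item, c), c) for c in corpus),
                    key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score >= min_sim]


def placeholder_judge(item: str, candidate: str) -> Level:
    """Fine stage stand-in. A real judge would reason about semantics;
    these similarity bands are arbitrary and purely illustrative."""
    ratio = SequenceMatcher(None, item.split(), candidate.split()).ratio()
    if ratio >= 0.95:
        return Level.FUNCTIONALLY_IDENTICAL
    if ratio >= 0.70:
        return Level.NEARLY_IDENTICAL
    if ratio >= 0.40:
        return Level.SHARED_LOGIC
    return Level.CLEAN


def detect(item: str, corpus: List[str],
           judge: Callable[[str, str], Level] = placeholder_judge) -> Level:
    """Run the coarse filter, then report the most severe fine-stage verdict."""
    candidates = coarse_filter(item, corpus)
    return max((judge(item, c) for c in candidates), default=Level.CLEAN)


def is_contaminated(level: Level) -> bool:
    """Binary view: any level other than CLEAN counts as contaminated
    (an assumption about how the two tasks relate)."""
    return level != Level.CLEAN
```

Because the judge is passed in as a callable, the placeholder can be swapped for a model-backed classifier without touching the coarse stage; the filter only exists to keep the number of expensive fine-stage comparisons small.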
Key facts
- TRACER is a semantic-aware framework for fine-grained code contamination detection in code LLMs.
- It models contamination at three levels: Functionally Identical, Nearly Identical, and Shared Logic.
- Detection uses a coarse-to-fine pipeline.
- Introduces the first benchmark for fine-grained code contamination detection.
- Benchmark spans three widely used benchmarks and three post-training datasets.
- With GPT-5, TRACER achieved an F1 of 0.91 in fine-grained detection.
- Binary detection F1 reached 0.92, outperforming existing methods by 42%-217%.
- Ablation studies and error analysis conducted.
Entities
Institutions
- arXiv