LLM-Generated Code Often Contains Vulnerable Library Versions
A large-scale measurement study of 10 large language models (LLMs) on the PinTrace benchmark reveals that LLM-specified library versions in Python code frequently include known vulnerabilities. When directly prompted, models specified version identifiers 26.83% to 95.18% of the time, but only 6.45% to 59.19% of the time when creating a manifest file. Among tasks where versions were specified, 36.70% to 55.70% contained at least one known CVE, with 62.75% to 74.51% of those CVEs rated Critical or High severity. In 72.27% to 91.37% of cases, the vulnerabilities were publicly disclosed before the model's knowledge cutoff. The study, published on arXiv (2605.06279), is the first systematic measurement of version-level risk in LLM-generated code.
Key facts
- Study evaluated 10 LLMs on PinTrace benchmark of 1,000 Stack Overflow tasks
- LLMs specified version identifiers 26.83%-95.18% when directly prompted
- Only 6.45%-59.19% specified versions when creating a manifest file
- 36.70%-55.70% of tasks had at least one known CVE
- 62.75%-74.51% of CVEs were Critical or High severity
- 72.27%-91.37% of CVEs disclosed before model's knowledge cutoff
- First large-scale measurement of version-level risk in LLM-generated Python code
- Paper published on arXiv with ID 2605.06279
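The risk the study measures can be illustrated with a minimal audit sketch: extract `==`-pinned versions from LLM-generated requirements text and flag any pin that appears in a vulnerability list. The advisory map below is hypothetical example data, not the study's dataset; a real pipeline would query a vulnerability database such as OSV or the NVD instead.

```python
import re

# Hypothetical advisory map for illustration only:
# package name -> set of versions with known CVEs.
KNOWN_VULNERABLE = {
    "requests": {"2.19.1", "2.25.0"},
    "pyyaml": {"5.1"},
}

# Matches exact pins like "requests==2.19.1"; ignores ranges like "flask>=2.0".
PIN_RE = re.compile(r"^\s*([A-Za-z0-9_.-]+)\s*==\s*([A-Za-z0-9_.!+*-]+)")

def audit_pins(requirements_text):
    """Return (package, version) pairs whose pinned version is flagged."""
    flagged = []
    for line in requirements_text.splitlines():
        m = PIN_RE.match(line)
        if not m:
            continue  # unpinned dependency or non-requirement line
        name, version = m.group(1).lower(), m.group(2)
        if version in KNOWN_VULNERABLE.get(name, set()):
            flagged.append((name, version))
    return flagged
```

Run against a sample manifest, `audit_pins("requests==2.19.1\nflask>=2.0\npyyaml==5.1")` returns `[("requests", "2.19.1"), ("pyyaml", "5.1")]`: the range-specified `flask` line is skipped, mirroring the study's observation that manifest-style output often leaves versions unpinned and therefore unauditable at this level.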