LCC-LLM: Code-Centric LLM Framework for Malware Attribution

other · 2026-05-09

Researchers have introduced LCC-LLM, a code-centric benchmark dataset and framework for malware attribution and multi-task static malware analysis. The LCCD dataset comprises roughly 34,000 reverse-engineered PE samples, each represented by decompiled C code, assembly code, CFG/FCG artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features. The framework combines LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge to support evidence-based conclusions about malware. It uses a seven-layer retrieval-augmented generation (RAG) approach to address two limitations of current LLM-based malware attribution: unsupported indicators and insufficient code-level grounding when identifying malicious or vulnerable code.

Key facts

LCC-LLM is a code-centric benchmark dataset and framework for malware attribution.
The LCCD dataset contains approximately 34,000 PE samples.
Samples are represented using decompiled C code, assembly code, control-flow graph (CFG) and function-call graph (FCG) artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features.
The framework uses LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge.
It employs a seven-layer retrieval-augmented generation approach.
Current LLM-based malware attribution is limited by unsupported indicators and insufficient code-level grounding.
The research aims to improve identification of malicious and vulnerable code segments.
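
To make the sample representation and the "unsupported indicator" problem concrete, here is a minimal Python sketch. It is an illustration only and is not the LCCD schema or the authors' method: the `SampleRecord` class, its field names, and the `split_by_grounding` helper are all hypothetical, and the plain substring match stands in for whatever grounding check the framework actually performs.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical record for one reverse-engineered PE sample.

    Field names are assumptions chosen to mirror the representations
    listed above; they are not taken from the LCCD dataset itself.
    """
    sha256: str
    decompiled_c: str = ""
    assembly: str = ""
    cfg_edges: list = field(default_factory=list)    # control-flow graph edges
    fcg_edges: list = field(default_factory=list)    # function-call graph edges
    hex_dump: str = ""
    pe_metadata: dict = field(default_factory=dict)
    suspicious_apis: list = field(default_factory=list)


def split_by_grounding(record, indicators):
    """Split claimed indicators into (supported, unsupported).

    An indicator counts as supported only if it literally appears in the
    sample's decompiled code, assembly, or suspicious-API list. This is a
    deliberately naive stand-in for code-level grounding.
    """
    evidence = "\n".join(
        [record.decompiled_c, record.assembly, *record.suspicious_apis]
    )
    supported = [i for i in indicators if i in evidence]
    unsupported = [i for i in indicators if i not in evidence]
    return supported, unsupported


# Usage: one indicator is backed by the code, the other is not.
record = SampleRecord(
    sha256="0" * 64,
    decompiled_c="buf = VirtualAllocEx(proc, 0, size, 0x3000, 0x40);",
    suspicious_apis=["VirtualAllocEx"],
)
supported, unsupported = split_by_grounding(
    record, ["VirtualAllocEx", "CreateRemoteThread"]
)
```

In this toy case `CreateRemoteThread` ends up in `unsupported` because nothing in the record references it, which is the kind of claim an evidence-grounded attribution pipeline would need to flag or drop.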
