LCC-LLM: Code-Centric LLM Framework for Malware Attribution
Researchers have introduced LCC-LLM, a code-centric benchmark dataset and framework for malware attribution and multi-task static malware analysis. Its dataset, LCCD, comprises roughly 34,000 reverse-engineered PE samples, each represented by decompiled C code, assembly code, control-flow and function-call graph (CFG/FCG) artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features. The framework combines LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge and a seven-layer retrieval-augmented generation approach, so that conclusions about a sample are tied to evidence. It targets two limitations of current LLM-based malware attribution: unsupported indicators and insufficient code-level grounding when identifying malicious or vulnerable code segments.
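To make the multi-view sample representation concrete, here is a minimal Python sketch of what one record might look like. The class name SampleRecord, every field name, and the available_views helper are hypothetical and chosen for illustration only; the summary does not describe the dataset's actual schema or file layout.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical per-sample record (illustrative names, not the dataset's schema)."""

    sample_id: str                       # assumed identifier, e.g. a file hash
    decompiled_c: str = ""               # decompiled C code
    assembly: str = ""                   # disassembly listing
    cfg_edges: list[tuple[str, str]] = field(default_factory=list)  # control-flow graph edges
    fcg_edges: list[tuple[str, str]] = field(default_factory=list)  # function-call graph edges
    hex_dump: str = ""                   # hexadecimal data
    pe_metadata: dict[str, str] = field(default_factory=dict)       # PE header fields
    suspicious_apis: list[str] = field(default_factory=list)        # suspicious API evidence

    def available_views(self) -> list[str]:
        """Return the names of the representations that are populated."""
        views = {
            "decompiled_c": self.decompiled_c,
            "assembly": self.assembly,
            "cfg_edges": self.cfg_edges,
            "fcg_edges": self.fcg_edges,
            "hex_dump": self.hex_dump,
            "pe_metadata": self.pe_metadata,
            "suspicious_apis": self.suspicious_apis,
        }
        return [name for name, value in views.items() if value]


record = SampleRecord(
    sample_id="sample-0001",
    decompiled_c="int main(void) { return 0; }",
    suspicious_apis=["VirtualAlloc"],
)
print(record.available_views())
```

Keeping every view on one record would let a downstream analysis step check which kinds of evidence exist for a sample before drawing on them.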
Key facts
- LCC-LLM is a code-centric benchmark dataset and framework for malware attribution.
- The LCCD dataset contains approximately 34,000 PE samples.
- Samples are represented using decompiled C code, assembly code, CFG/FCG artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features.
- The framework uses LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge.
- It employs a seven-layer retrieval-augmented generation approach.
- Current LLM-based malware attribution is limited by unsupported indicators and insufficient code-level grounding.
- The research aims to improve identification of malicious and vulnerable code segments.
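The "unsupported indicators" problem can be illustrated with a simple grounding check: an indicator named in a model's output counts as supported only if it actually appears in the sample's extracted evidence. The sketch below is not the LCC-LLM method. The function name split_by_grounding, the evidence dictionary, and the whole-word matching rule are assumptions made for illustration, and the matching rule suits identifier-like indicators such as API names.

```python
import re


def split_by_grounding(claimed_indicators, evidence_texts):
    """Partition claimed indicators by whether any evidence view mentions them.

    claimed_indicators: list of identifier-like strings (e.g. API names).
    evidence_texts: dict mapping a view name to its text (hypothetical layout).
    Returns (supported, unsupported), where supported pairs each indicator
    with the list of views in which it was found.
    """
    supported, unsupported = [], []
    for indicator in claimed_indicators:
        # Whole-word match so "WriteFile" does not match "WriteFileEx".
        pattern = re.compile(r"\b" + re.escape(indicator) + r"\b")
        hits = [name for name, text in evidence_texts.items() if pattern.search(text)]
        if hits:
            supported.append((indicator, hits))
        else:
            unsupported.append(indicator)
    return supported, unsupported


# Illustrative evidence and claims, invented for this example.
evidence = {
    "decompiled_c": "h = OpenProcess(0x1F0FFF, 0, pid); WriteProcessMemory(h, addr, buf, n, 0);",
    "assembly": "call ds:CreateRemoteThread",
}
claimed = ["WriteProcessMemory", "CreateRemoteThread", "InternetOpenUrlA"]

supported, unsupported = split_by_grounding(claimed, evidence)
print(supported)
print(unsupported)
```

In this example the third indicator appears in no evidence view, so it is the kind of claim that would be flagged as unsupported rather than reported.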
Entities
Institutions
- arXiv