MoleCode: A graph-explicit molecular language for LLMs
Researchers have introduced MoleCode, a training-free, graph-explicit molecular language designed for large language models (LLMs). Unlike SMILES, which compresses molecular topology into a linear string, MoleCode represents atoms, bonds, branches, and rings as typed entities with persistent identifiers and explicit relations. Molecular structure is therefore directly readable, editable, and auditable in the model's context, so an LLM can operate on the structure itself rather than reconstruct it from syntax. The approach improves frontier LLMs on molecular reasoning, editing, generation, and analysis tasks, with the largest gains where access to structure is the limiting factor, for example on unfamiliar molecules. The paper is available on arXiv (ID 2605.16480).
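To make the contrast with SMILES concrete, here is a minimal Python sketch of the general idea behind a graph-explicit representation: typed atom and bond records, persistent identifiers, and explicit relations between them. It is not MoleCode's actual syntax, which this summary does not reproduce. The class names `Atom`, `Bond`, and `Molecule`, the `a1`/`b1` identifier scheme, and the `render` text format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Atom:
    id: str       # persistent identifier (hypothetical scheme: "a1", "a2", ...)
    element: str  # element symbol; hydrogens are left implicit in this sketch


@dataclass(frozen=True)
class Bond:
    id: str       # persistent identifier (hypothetical scheme: "b1", "b2", ...)
    a: str        # id of the first atom
    b: str        # id of the second atom
    order: int = 1


@dataclass
class Molecule:
    atoms: Dict[str, Atom] = field(default_factory=dict)
    bonds: Dict[str, Bond] = field(default_factory=dict)

    def add_atom(self, atom_id: str, element: str) -> None:
        if atom_id in self.atoms:
            raise ValueError(f"duplicate atom id: {atom_id}")
        self.atoms[atom_id] = Atom(atom_id, element)

    def add_bond(self, bond_id: str, a: str, b: str, order: int = 1) -> None:
        if bond_id in self.bonds:
            raise ValueError(f"duplicate bond id: {bond_id}")
        # Audit step: a bond may only reference atoms that exist.
        for end in (a, b):
            if end not in self.atoms:
                raise KeyError(f"unknown atom id: {end}")
        self.bonds[bond_id] = Bond(bond_id, a, b, order)

    def neighbors(self, atom_id: str) -> List[str]:
        # Connectivity is looked up directly, not parsed out of a string.
        found = []
        for bond in self.bonds.values():
            if bond.a == atom_id:
                found.append(bond.b)
            elif bond.b == atom_id:
                found.append(bond.a)
        return sorted(found)

    def set_element(self, atom_id: str, element: str) -> None:
        # A local edit: the atom keeps its id, so existing bonds stay valid.
        self.atoms[atom_id].element = element

    def render(self) -> str:
        # One line per entity, in insertion order (hypothetical text format).
        lines = [f"atom {a.id} {a.element}" for a in self.atoms.values()]
        lines += [
            f"bond {b.id} {b.a} {b.b} order={b.order}"
            for b in self.bonds.values()
        ]
        return "\n".join(lines)


# Ethanol heavy atoms; the equivalent SMILES string is "CCO".
mol = Molecule()
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    mol.add_atom(atom_id, element)
mol.add_bond("b1", "a1", "a2")
mol.add_bond("b2", "a2", "a3")

print(mol.neighbors("a2"))

# Swap the oxygen for nitrogen (SMILES "CCN") without touching anything else.
mol.set_element("a3", "N")
print(mol.render())
```

In this sketch the edit addresses atom `a3` by its identifier and leaves every other record untouched, whereas the same change in SMILES means rewriting the string and trusting that the rest of it still parses to the intended graph.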
Key facts
- MoleCode is a training-free, graph-explicit molecular language for LLMs.
- It represents molecular components as typed entities with persistent identifiers and explicit relations.
- MoleCode makes molecular topology directly readable, editable, and auditable.
- It improves frontier LLMs in molecular reasoning, editing, generation, and analysis tasks.
- The improvement is strongest where structural access is the limiting factor, such as on unfamiliar molecules.
- The paper is available on arXiv with ID 2605.16480.
- MoleCode is an alternative to the SMILES representation, which compresses molecular topology into a linear string.
- It requires no additional training for LLMs.
Entities
Institutions
- arXiv