OpenClassGen: 324,843 Python Classes for LLM Training
OpenClassGen is a dataset of 324,843 real-world Python classes drawn from 2,970 open-source projects. Earlier class-level benchmarks are far smaller: ClassEval contains 100 synthetic classes and RealClassEval contains 400 classes. Each OpenClassGen entry pairs a human-written class with its skeleton (method signatures and docstrings), so a class can be generated and assessed without resolving repository-level context. Each entry also carries 27 static code metrics covering complexity, coupling, cohesion, and inheritance. The dataset is intended for both training and evaluating LLMs; a subset of 300 executable classes was used to evaluate GPT-o4-mini, Claude-4-Sonnet, and Qwen-3-Coder. The work is published on arXiv (ID 2504.15564).
Key facts
- OpenClassGen contains 324,843 Python classes from 2,970 open-source projects.
- Each entry includes a human-written class and its skeleton with signatures and docstrings.
- 27 static code metrics are provided per class.
- Prior benchmarks: ClassEval (100 synthetic classes) and RealClassEval (400 classes).
- Three LLMs evaluated: GPT-o4-mini, Claude-4-Sonnet, Qwen-3-Coder.
- Subset of 300 executable classes used for evaluation.
- No repository-level context resolution needed.
- Published on arXiv with ID 2504.15564.