LLMs Achieve 89% Accuracy in Early-Stage Product Line Validation
A recent study released on arXiv investigates whether Large Language Models (LLMs) can perform feature model analysis operations on semi-formal textual blueprints, enabling early validation during Software Product Line scoping. The researchers evaluated 12 state-of-the-art LLMs on 16 standard analysis operations, comparing their outputs against the solver-based oracle FLAMA. Reasoning-optimized models such as Grok 4 Fast Reasoning and Gemini 2.5 Pro reached an average accuracy of 88-89% across all operations and blueprints, approaching solver-level correctness. The study also identified systematic errors in structural parsing and constraint reasoning, and highlighted accuracy-cost trade-offs to guide model selection. While LLMs show promise as lightweight assistants for early variability validation, they are not yet substitutes for formal solvers.
Key facts
- Study tests LLMs on feature model analysis operations using semi-formal textual blueprints.
- 12 state-of-the-art LLMs and 16 standard analysis operations were evaluated.
- Outputs were compared against the solver-based oracle FLAMA.
- Grok 4 Fast Reasoning and Gemini 2.5 Pro achieved 88-89% average accuracy.
- Systematic errors in structural parsing and constraint reasoning were identified.
- Accuracy-cost trade-offs inform model selection.
- LLMs are positioned as lightweight assistants for early variability validation.
- Published on arXiv under Computer Science > Software Engineering.
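The analysis operations benchmarked in the study (for example, checking that a feature model is not void, or counting its valid products) can be sketched with a toy feature model. This is an illustrative sketch only: the example model, feature names, and brute-force satisfiability check below are assumptions for exposition, not the study's blueprints or FLAMA's actual API.

```python
from itertools import product

# Toy feature model (hypothetical, not from the study): root "Car" with
# mandatory "Engine", optional "GPS", an XOR group "Gas"/"Electric" under
# Engine, and a cross-tree constraint "GPS requires Electric".
FEATURES = ["Car", "Engine", "GPS", "Gas", "Electric"]

def is_valid(c):
    """Check one configuration (dict: feature -> bool) against the model."""
    return (
        c["Car"]                                        # root always selected
        and c["Engine"] == c["Car"]                     # Engine is mandatory
        and (not c["GPS"] or c["Car"])                  # GPS optional under Car
        and (not c["Gas"] or c["Engine"])               # group members need parent
        and (not c["Electric"] or c["Engine"])
        and (c["Gas"] != c["Electric"]) == c["Engine"]  # XOR group under Engine
        and (not c["GPS"] or c["Electric"])             # cross-tree: GPS => Electric
    )

def valid_configurations():
    """Brute-force enumeration, standing in for a solver-based oracle."""
    for values in product([False, True], repeat=len(FEATURES)):
        cfg = dict(zip(FEATURES, values))
        if is_valid(cfg):
            yield cfg

configs = list(valid_configurations())
print("void model:", len(configs) == 0)   # "void model" analysis operation
print("number of products:", len(configs))
```

A solver-based oracle such as FLAMA answers these questions symbolically rather than by enumeration; the study's point is that LLMs attempt the same operations directly from the textual blueprint, trading solver-grade correctness for a lightweight workflow.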
Entities
Tools
- FLAMA (solver-based analysis oracle)
Models
- Grok 4 Fast Reasoning
- Gemini 2.5 Pro
Venues
- arXiv (Computer Science > Software Engineering)