StepFly Framework Automates IT Troubleshooting Guides Using LLMs
StepFly is a new agentic framework that automates the execution of troubleshooting guides (TSGs) for incident management in large-scale IT systems. Manual TSG execution is slow and error-prone, and existing LLM-based solutions struggle with four problems: TSG quality issues, interpretation of complex control flow, data-intensive queries, and execution parallelism. To inform the design, the researchers conducted an empirical study of 92 real-world TSGs. StepFly implements a three-stage workflow. The first stage provides TSG Mentor, a tool that helps site reliability engineers (SREs) improve guide quality. The second stage performs offline preprocessing, using LLMs to extract structured execution information. The framework is presented as an end-to-end solution that aims to make incident management more efficient in complex IT environments. The research was published on arXiv under identifier 2510.10074v2.
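To make "structured execution information" more concrete, the following is a minimal sketch of one possible representation. It is an illustration only, not StepFly's actual schema. The `Step` class, its fields, the `parallel_batches` helper, and the sample steps are all hypothetical. The sketch assumes each step records the steps it depends on, so that steps with no unmet dependencies can be grouped and run in parallel.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Step:
    """One troubleshooting step (hypothetical schema, not taken from the paper)."""
    step_id: str
    action: str
    depends_on: list[str] = field(default_factory=list)


def parallel_batches(steps: list[Step]) -> list[list[str]]:
    """Group step ids into batches whose dependencies are already satisfied."""
    sorter = TopologicalSorter({s.step_id: s.depends_on for s in steps})
    sorter.prepare()  # raises graphlib.CycleError if dependencies are cyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that could run concurrently
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical guide: two independent queries follow the alert check.
steps = [
    Step("check_alert", "Read the incident alert"),
    Step("query_logs", "Query error logs", depends_on=["check_alert"]),
    Step("query_metrics", "Query CPU metrics", depends_on=["check_alert"]),
    Step("decide", "Choose a mitigation", depends_on=["query_logs", "query_metrics"]),
]

print(parallel_batches(steps))
# [['check_alert'], ['query_logs', 'query_metrics'], ['decide']]
```

In this sketch, the two query steps land in the same batch because neither depends on the other, which is the kind of execution parallelism the summary says existing solutions struggle with.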
Key facts
- StepFly is an agentic framework for automating troubleshooting guides
- Existing LLM-based solutions struggle with TSG quality issues, complex control flow, data-intensive queries, and execution parallelism
- Researchers analyzed 92 real-world TSGs in an empirical study
- The framework features a three-stage workflow
- First stage includes TSG Mentor tool for SREs to improve guide quality
- Second stage performs offline preprocessing using LLMs
- Research published on arXiv with identifier 2510.10074v2
- Framework addresses slow and error-prone manual TSG execution
Entities
Institutions
- arXiv