STAR Framework Repairs LLM-Based RCA Agents in Microservices
Researchers have introduced STAR (Stage-attributed Triage and Repair), a framework for repairing LLM-based root cause analysis (RCA) agents used for incident diagnosis in microservice AIOps. STAR decomposes the RCA workflow into four stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). It treats agent failures as stage-localizable bugs rather than monolithic errors. Built on LangGraph, it performs stage-wise auditing, budget-aware Fast/Slow Routing, and decisive stage localization via counterfactual candidate evaluation. The framework aims to improve reliability by stopping an error made in one stage from propagating through the rest of the reasoning trace. The paper is available on arXiv under ID 2605.15581.
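To make the idea of stage-wise auditing with budget-aware Fast/Slow Routing concrete, here is a minimal Python sketch. It is not the paper's implementation and does not use LangGraph. The names `audit_trace`, `fast_check`, `slow_check`, and `slow_budget`, the score thresholds, and the escalation rule are all assumptions made for illustration. Only the four stage names come from the summary above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Stage names as given in the article; everything else is hypothetical.
STAGES = ["EP", "HS", "AS", "DR"]


@dataclass
class AuditResult:
    stage: str
    passed: bool
    route: str  # "fast" or "slow"


def audit_trace(
    trace: Dict[str, str],
    fast_check: Callable[[str, str], float],
    slow_check: Callable[[str, str], bool],
    slow_budget: int,
    low: float = 0.3,
    high: float = 0.7,
) -> List[AuditResult]:
    """Audit every stage output of one RCA trace.

    Assumed routing rule: a cheap fast check returns a score in [0, 1].
    Scores strictly between `low` and `high` count as ambiguous and are
    escalated to the expensive slow check while budget remains. Otherwise
    the fast score decides, with 0.5 as the pass threshold.
    """
    results: List[AuditResult] = []
    for stage in STAGES:
        score = fast_check(stage, trace[stage])
        ambiguous = low < score < high
        if ambiguous and slow_budget > 0:
            slow_budget -= 1  # spend one unit of the slow-audit budget
            passed = slow_check(stage, trace[stage])
            results.append(AuditResult(stage, passed, "slow"))
        else:
            results.append(AuditResult(stage, score >= 0.5, "fast"))
    return results
```

In this sketch the slow check would typically be a stronger model or a deeper verification step, and the budget caps how often it is called per trace. Stages that fail the audit become candidates for localization and repair.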
Key facts
- STAR stands for Stage-attributed Triage and Repair
- Targets LLM-based RCA agents in microservice AIOps
- Decomposes RCA into four stages: EP, HS, AS, DR
- Built on top of LangGraph
- Uses stage-wise auditing and budget-aware Fast/Slow Routing
- Employs counterfactual candidate evaluation for stage localization
- Aims to prevent error propagation in reasoning traces
- Published on arXiv with ID 2605.15581
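The counterfactual candidate evaluation listed among the key facts can be illustrated with a short Python sketch. It rests on assumptions: the function `localize_stage`, the `rerun_from` callback that recomputes downstream stages, and the `is_correct` oracle are hypothetical names, and returning the earliest stage whose repair fixes the outcome is one plausible reading of "decisive stage localization", not a confirmed detail of the paper.

```python
from typing import Callable, Dict, Optional

# Stage names as given in the article; everything else is hypothetical.
STAGES = ["EP", "HS", "AS", "DR"]


def localize_stage(
    trace: Dict[str, str],
    candidates: Dict[str, str],
    rerun_from: Callable[[str, Dict[str, str]], Dict[str, str]],
    is_correct: Callable[[Dict[str, str]], bool],
) -> Optional[str]:
    """Return the earliest stage whose candidate repair fixes the outcome.

    For each stage with a candidate replacement, swap that one stage
    output into a copy of the trace, recompute the downstream stages,
    and check the result. The original trace is never modified.
    """
    for stage in STAGES:
        if stage not in candidates:
            continue
        patched = dict(trace)  # copy so the original trace stays intact
        patched[stage] = candidates[stage]
        outcome = rerun_from(stage, patched)
        if is_correct(outcome):
            return stage
    return None  # no single-stage repair changed the outcome
```

The design choice here is to intervene on one stage at a time while holding upstream stages fixed, so a changed outcome can be attributed to that stage alone.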
Entities
Institutions
- arXiv