Metric Selection for Root Cause Analysis of Cloud Infrastructure
Summary
Cloud-native systems composed of microservices produce large volumes of telemetry, including log messages and time-series metrics. Interpreting this data accurately enough to locate performance problems remains a challenge. Because services emit many metrics and log lines at the pod level, identifying the trends that matter is essential to keeping the system responsive and reliable. In this work, we use the LEMMA-RCA dataset, a benchmark containing structured logs and resource metrics from Kubernetes pods, to pursue two objectives: first, predicting service latency from historical pod-level metrics, and second, attributing spikes in resource usage, such as CPU and memory, to specific log templates. We propose supervised learning methods, primarily tree-based models, to link telemetry data to performance outcomes in an interpretable way. The models show which indicators and log patterns are most closely associated with infrastructure load. These contributions support proactive diagnostics in complex microservice environments and show how explainable AI can help with automated root cause analysis.
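
To make the first objective concrete, the following is a minimal sketch of how latency prediction from historical pod-level metrics can be framed as a supervised problem with a tree-based model. It is an illustration under stated assumptions, not the pipeline used in this work: the metric names (`cpu`, `mem`), the lag count, the helper names `make_lagged_rows` and `fit_stump`, and all data values are hypothetical and synthetic, and a depth-1 regression tree (a decision stump) stands in for the larger tree-based models.

```python
from statistics import mean


def make_lagged_rows(metrics, latency, lags=2):
    """Turn time series into supervised rows.

    The features for time t are each metric's values at t-1 .. t-lags;
    the target is the latency observed at time t.
    """
    names = sorted(metrics)
    rows, targets = [], []
    for t in range(lags, len(latency)):
        row = {}
        for name in names:
            for k in range(1, lags + 1):
                row[f"{name}_lag{k}"] = metrics[name][t - k]
        rows.append(row)
        targets.append(latency[t])
    return rows, targets


def sum_squared_error(values):
    """Squared error of predicting the mean for every value in a leaf."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(rows, targets):
    """Fit a depth-1 regression tree.

    Tries every feature and every midpoint between consecutive distinct
    values, and keeps the split with the lowest total squared error.
    """
    best = None
    for feature in rows[0]:
        distinct = sorted({r[feature] for r in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2
            left = [y for r, y in zip(rows, targets) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, targets) if r[feature] > threshold]
            error = sum_squared_error(left) + sum_squared_error(right)
            if best is None or error < best["sse"]:
                best = {
                    "feature": feature,
                    "threshold": threshold,
                    "sse": error,
                    "left_mean": mean(left),
                    "right_mean": mean(right),
                }
    return best


# Synthetic, illustrative values only (NOT taken from LEMMA-RCA).
# Latency is constructed to be high one step after a CPU spike.
metrics = {
    "cpu": [0.2, 0.3, 0.2, 0.9, 0.8, 0.3, 0.2, 0.9, 0.3, 0.2],
    "mem": [0.5, 0.5, 0.6, 0.5, 0.6, 0.5, 0.5, 0.6, 0.5, 0.6],
}
latency = [10, 10, 10, 10, 50, 50, 10, 10, 50, 10]

rows, targets = make_lagged_rows(metrics, latency, lags=2)
stump = fit_stump(rows, targets)
print(stump["feature"], stump["left_mean"], stump["right_mean"])
```

The chosen split feature is what makes such a model interpretable: it names the lagged metric that best separates low-latency from high-latency intervals. Deeper trees or tree ensembles extend the same idea by splitting recursively and aggregating the splits into feature importances.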