Metric Selection for Root Cause Analysis of Cloud Infrastructure

Lozovska, Alona

Publication date

2025

Summary

Cloud-native systems composed of microservices produce large volumes of telemetry data, such as log messages and time-series metrics. Interpreting this data accurately to locate performance problems remains a challenge. Because services emit many metrics and log records at the pod level, identifying the relevant trends is important for keeping the system responsive and reliable. In this work, we use the LEMMA-RCA dataset, a benchmark dataset containing structured logs and resource metrics from Kubernetes pods, to explore two main objectives: first, predicting service latency from historical pod-level metrics, and second, attributing spikes in resource usage, such as CPU and memory, to specific log templates. We propose supervised learning methods, primarily tree-based models, to link telemetry data to performance outcomes in an interpretable way. The models show which indicators and log patterns are most closely linked to infrastructure load. These contributions support proactive diagnostics in complex microservice environments and show how explainable AI can help with automated root cause analysis.
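
To make the first objective more concrete, the following is a minimal sketch of the general idea only, not the thesis's actual pipeline. It pairs pod metrics from one time step with the latency observed at the next step, then fits a single regression split (a "stump", the building block of tree-based models) and reports which metric yields the largest error reduction. The sample rows, the metric names `cpu`, `mem`, and `latency_ms`, and the helper functions are all hypothetical; none of the values come from LEMMA-RCA.

```python
from statistics import mean

# Hypothetical pod-level samples in time order (NOT taken from LEMMA-RCA).
rows = [
    {"cpu": 0.20, "mem": 0.5, "latency_ms": 100},
    {"cpu": 0.90, "mem": 0.4, "latency_ms": 105},
    {"cpu": 0.30, "mem": 0.6, "latency_ms": 300},
    {"cpu": 0.80, "mem": 0.5, "latency_ms": 110},
    {"cpu": 0.10, "mem": 0.4, "latency_ms": 290},
    {"cpu": 0.85, "mem": 0.6, "latency_ms": 95},
    {"cpu": 0.25, "mem": 0.5, "latency_ms": 310},
]


def make_lagged(rows, target, lag=1):
    """Pair the metrics at time t - lag with the target value at time t."""
    X = [
        {k: v for k, v in rows[i - lag].items() if k != target}
        for i in range(lag, len(rows))
    ]
    y = [rows[i][target] for i in range(lag, len(rows))]
    return X, y


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_stump(X, y):
    """Return (feature, threshold, sse_reduction) for the best single split."""
    base = sse(y)
    best = (None, None, 0.0)
    for feat in X[0]:
        values = sorted({x[feat] for x in X})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for x, t in zip(X, y) if x[feat] <= thr]
            right = [t for x, t in zip(X, y) if x[feat] > thr]
            gain = base - sse(left) - sse(right)
            if gain > best[2]:
                best = (feat, thr, gain)
    return best


X, y = make_lagged(rows, target="latency_ms", lag=1)
feature, threshold, gain = best_stump(X, y)
print(f"most informative metric: {feature} (split at {threshold:.2f})")
```

In this toy data the previous step's CPU usage separates low-latency from high-latency samples, so the stump selects `cpu`. A full tree-based model repeats such splits recursively, and aggregating the error reductions per metric is one common way to obtain an interpretable feature ranking.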

URI

https://studenttheses.uu.nl/handle/20.500.12932/50487

Collections

Theses