SLO-aware IAC Automation Framework for Dynamic Cloud Deployment.
Summary
According to O’Reilly’s 2021 report [1], over 90% of companies worldwide utilize cloud computing, highlighting its critical role in the IT industry. To aid developers in managing cloud infrastructure, the paradigm of
Infrastructure as Code (IaC) has emerged [2], allowing infrastructure to
be defined and maintained through code. Designing infrastructure, however, requires extensive expertise, and in most business scenarios the resulting infrastructure must
adhere to certain constraints known as Service Level Objectives (SLOs).
These SLOs impose limits on Service Level Indicators (SLIs) such as CPU
usage, memory consumption, and uptime.
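
To make the relationship between SLIs and SLOs concrete, the following minimal sketch checks a set of measured SLIs against SLO limits. The metric names, the threshold values, the `Slo` class, and the helper `find_violations` are all hypothetical and chosen only for illustration; none of them are taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A bound on a single SLI (all names here are illustrative)."""
    sli: str
    limit: float
    upper_bound: bool = True  # True: SLI must stay <= limit; False: SLI must stay >= limit


def find_violations(slos, measured):
    """Return the names of SLIs whose measured value breaks its SLO."""
    violations = []
    for slo in slos:
        value = measured[slo.sli]
        ok = value <= slo.limit if slo.upper_bound else value >= slo.limit
        if not ok:
            violations.append(slo.sli)
    return violations


# Hypothetical limits and measurements, not values from the thesis.
slos = [
    Slo("cpu_usage_percent", 80.0),
    Slo("memory_usage_mib", 512.0),
    Slo("uptime_percent", 99.9, upper_bound=False),
]
measured = {
    "cpu_usage_percent": 91.5,
    "memory_usage_mib": 430.0,
    "uptime_percent": 99.95,
}

print(find_violations(slos, measured))  # ['cpu_usage_percent']
```

Modelling each SLO as a bound with a direction keeps "at most" indicators (CPU, memory) and "at least" indicators (uptime) in one structure.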
This thesis explores two frameworks aimed at automating IaC creation
while meeting defined SLOs, leveraging Large Language Models (LLMs)
and statistical prediction methods. The first framework uses manually
defined SLOs to guide the LLM in adjusting CPU and memory allocations,
with the goal of achieving target performance while predicting potential
SLO violations. The second framework uses statistical methods to derive
SLOs from observed metrics, which are then used to iteratively refine the
IaC through an LLM in pursuit of desired performance levels.
Both frameworks were evaluated against a baseline. In no case did
adjusting the infrastructure for SLO compliance result in performance
matching that of the baseline. The first framework, which relies on manual
SLO definitions, experienced several SLO violations after three LLM adjustments. Its best performance reached only 22% of the target throughput
(131 RPS vs. 600 RPS). In contrast, the second framework, based on
metric-driven SLOs, achieved up to 79% of the target throughput (476 RPS vs. 600 RPS) without violating any SLOs after three LLM-guided code
adjustments. However, this improvement came at the cost of increased
average response times and a significant rise in failed requests.
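
The percentages reported above follow from dividing the achieved throughput by the 600 RPS target. A minimal sketch of that arithmetic is given below; the function name `throughput_attainment` is a hypothetical helper introduced here, not part of either framework.

```python
def throughput_attainment(achieved_rps: float, target_rps: float) -> float:
    """Return achieved throughput as a percentage of the target."""
    if target_rps <= 0:
        raise ValueError("target_rps must be positive")
    return 100.0 * achieved_rps / target_rps


TARGET_RPS = 600  # target throughput stated in the summary

# Best results stated in the summary for each framework.
manual_slo = throughput_attainment(131, TARGET_RPS)
metric_slo = throughput_attainment(476, TARGET_RPS)

print(f"manual SLOs: {manual_slo:.1f}%")  # 21.8%, reported as 22%
print(f"metric SLOs: {metric_slo:.1f}%")  # 79.3%, reported as 79%
```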
Additionally, it was observed that prompt design greatly impacts the
quality of the IaC output. When specific SLOs are provided for individual
services, the LLM tends to overemphasize those services while neglecting
others. What initially appears to be helpful information can quickly overwhelm the LLM and degrade output quality. Nevertheless, the findings
suggest that with the insights gained, both frameworks can be further
refined to yield improved results.