In this chapter we survey related systems that solve similar problems to our problem of tuning cloud based workloads. In §2.1 we first introduce relevant work from the cloud resource management community. Then in §2.2 we explain related work that uses representation learning to characterize workloads. Finally, in §2.3 we cover three related systems from the machine learning (ML) community that solve a similar problem when tuning hyper-parameters of ML models. We discuss in the next subsections the similarities and diﬀerences between these systems and our work.
Learning based cloud resource management systems
Ottertune. Ottertune [61, 68] is a state of the art tuning tool for database management systems (DBMs) published in 2017. It oﬀers a range of ML techniques with the purpose of automating the tedious task of tuning the large database knobs and cutting the expensive cost of manual tuning. We give a detailed description of Ottertune in this section, because it’s the main relevant system to which we compare quantitatively our work on both modeling performance and in an end-to-end manner in Chapter 7 (in particular under §7.6).
Ottertune goes beyond previous work in the literature that consists of heuristics and practices for tuning and usually yields non-reusable configurations. The core idea in Ottertune is centered around leveraging runtime traces from past workloads submitted to the DBMS, and training ML models that allow better predictions on new workloads. It combines a set of supervised and unsupervised machine learning techniques in its processing pipeline to serve diﬀerent purposes ranging from (1) selecting the important knobs it wants to tune, to for example a good ranking amongst all features while the two features from which this polynomial feature was generated have separately lower contributions. In short, the feature importance should be done and analyzed on the final algorithm used for the regression task not using a diﬀerent algorithm.
(b) Single vs global model training: Ottertune trains a global model when it performs the Lasso knob selection, but when it comes to regression on the end target it trains separate GP models, one per workload. If for some few workloads a particular subset of knobs have an important eﬀect on the regression target, but do not seem to aﬀect the other remaining majority of workloads, then these knobs won’t be selected by the algorithm. Hence, a tuning session for a new workload similar to one of the workloads within the particular set, will probably disregard important knobs that may hugely aﬀect its performances.
(c) Characteristics of sampled traces: The knob selection procedure depends heavily on the characteristics of the data collected for training such as how many knobs are tuned in total, and how many distinct values per knob are sampled. For instance, increasing the number of knobs to be tuned requires exponentially increasing the number of sampled traces in order to cover the space of knobs in a fair manner. Since it’s diﬃcult and expensive to aﬀord for sampling a lot of points for a particular workload, it is unlikely that the sampling procedure will cover the diﬀerent knobs in a fair manner as soon as the number of knobs increase. The data skew will be directly reflected in an unfair feature selection which will immediately drop the knobs for which only few values have been sampled across the trace dataset
2. Metric selection. Instead of using the full set of metrics collected while profiling workloads, Ottertune first runs factor analysis to get factors representing the contribution of each metric. Then in the space defined by the diﬀerent factors, it tries to detect the clusters that can be found among the diﬀerent metrics by applying the k-means clustering algorithm as shown in the diagram within Figure 2. Since the k-means algorithm requires setting the number of clusters k, Ottertune resorts to a heuristic to automatically set this value. Then, after the clustering is done, Ottertune retains one metric per cluster, and this metric is chosen to be the closest to the cluster’s centroid. Thus, Ottertune retains in total a number metrics equal to the number of clusters k.
3. Training models Using each of the diﬀerent k retained metrics, and on top of each training workload, Ottertune trains a GP model using data from available configurations. Hence, if we have a total number n of training workloads and k retained metrics, the total number of models trained by Ottertune is n k. This directly implies a scalability issue as soon as the total number of workloads increases, because in order to maintain good predictive power for models for which it has collected additional traces, Ottertune will have to periodically retrain them. We show later in §4.2 of Chapter 4 how by training diﬀerent models, Ottertune can be seen as a workload-centered content based recommendation approach.
4. Workload mapping and tuning Upon the admission of a new workload under some initial configurations, Ottertune tries to use the previously trained n k models in order to predict the k retained metrics for each of the n training workloads with the same input configurations observed for the new workload. It then tries to discretize the values of these predictions (also called binning) before proceeding with mapping the new workload to its nearest neighbor workload. We also explain in Chapter 4 that this mapping step allows us to consider Ottertune as a memory based collaborative approach as well. Once the new workload has been mapped to its nearest neighbor, Ottertune trains a GP model using traces collected for the mapped workload alongside the new traces observed for the new workload. This GP model defines a predictive distribution that Ottertune uses for recommending a configuration vector v^ that minimizes a lower-confidence-bound (LCB) acquisition function given by: v^ = argmin ~(v) ~(v) (1) where ~(v) is the mean of the predicted distribution of target objective (for example runtime latency) over the knobs v and ~(v) is the square root of the variance of the predicted distribution. and are multiplier coeﬃcients that are constants within Ottertune with default values: = 1 and = 3. These parameters usually balance the exploration against the exploitation over the course of optimization. Ottertune resorts to a simple gradient descent in order to minimize this acquisition function.
This mapping scheme suﬀers from a major drawback of thrashing the precious information other models built from training workloads maintain. We introduce in Chapter 5 another paradigm for training/inference that makes use of recent advances in representation learning and that focuses on training global models so that all traces from past training workloads can be leveraged while doing inference. We then show later in our experiments in Chapter 7 of this manuscript how our paradigm achieves lower error rates on model prediction than Ottertune. We also show how these better modeling results translate into better latency improvements when it comes to end-to-end performance.
Paragon and Quasar. Paragon  and Quasar  cast the tuning problem into a recommender system that uses matrix factorization techniques. These systems do not make extensive use of the full trace of runtime metrics, and instead record the runtime latency as the only interaction between configurations and diﬀerent workloads. Upon the submission of a new workload to the system, such techniques are not able to recommend configurations beyond those seen within training data. The reason is that the underlying model learns an embedding vector for the knob configuration alongside learning an embedding vector for the workload, and thus does not allow to predict performances for a custom configuration not present within training data. In our work, we propose instead a neural network based recommender approach that allows exploring configurations beyond those observed within training data by only learning an embedding vector for the workloads while representing configurations by a vector of knob values.
CDBTune. CDBTune  was the pioneer in casting the tuning problem into a reinforcement learning framework. Although it leverages the runtime traces of the running workloads in order to tune their knobs in a blackbox manner, it couples both the modeling and optimization steps while formalizing the tuning problem. This is in contrast to our work that has these two steps separated.
It considers the database system as the reinforcement learning agent and defines the state of the agent by the runtime metrics collected for a past configuration. It also defines the actions as a vector describing the amount by which each knob value should be increased or decreased (if the knob value is numerical). The policy that consists of mapping from a state to a particular action is modeled using a neural network. This type of modeling is eﬀective in multi-tenant environments in which an action consisting of setting a particular value of knobs depends on the actual state of the system. In our use case of tuning workloads alone, we don’t need to have a state, because no matter what is the current state of the system, setting the knobs to a particular value will yield the same results. Finally, CDBTune doesn’t leverage past workload information when a new job is submitted for a tuning session. This makes this approach miss an opportunity of leveraging past traces when a new workload that bears similarity to previously tuned workloads is admitted by the optimizer.
Resource Central. Resource Central  is a new system that aims at predicting the runtime latency of virtual machines (VMs) running in private or public clouds. Accurately predicting the lifetime of a VM helps in increasing utilization of physical machines and preventing the exhaustion of physical resources by colocating VMs that are likely to terminate at the same time. Such a modeling can also help the health management system to schedule non-urgent server maintenance without producing VM unavailability or requiring live migration. The core problem that this paper addresses is the scheduling of virtual machines. It consists of a first step of assigning that requires making predictions before the VM can be profiled. The early predictions can be made by taking as input information collected from the VM booking such as the service name, the deployment time, the operating system, the VM size, the VM roles (IaaS or PaaS), etc. This approach tries to classify workloads into interactive or non-interactive workloads, then to further characterize the workloads it runs Fast Fourier Transform (FFT) to extract features that account for periodicity and other temporal features. We do not keep in our work any temporal information since we are interested in modeling the average performances of Spark workloads.
PerfOrator. PerfOrator  is a system that focuses on finding the best hardware resources for a given execution plan of a cloud query. It is a whitebox approach that builds a resource-to-performance model after analyzing the query plan through operator overloading and query rewriting. For instance, it collects data sizes at the input and output of each stage of the query execution plan in order to train a non-linear regression model that relates the input to output data size at each stage, and thus enables resource optimization. This is in sharp contrast to our work which resorts to a blackbox modeling of the cloud workloads.
WiseDB. WiseDB [42,44] proposes learning-based techniques for cloud resource management. A decision tree is trained on a set of performance and cost related features collected from minimum cost schedules of sample workloads. Such minimum cost schedules are not available in our problem setting.
Other work. Recent work has used neural networks to predict latency of SQL query plans  or learn from existing query plans to generate better query plans . These methods, however, are applicable to SQL queries only. Finally, performance modeling tools [37, 64] use handcrafted models, hence remain hard to generalize.
Workload characterization using representation learning
The Dacapo benchmarks. The work in  has used PCA as a way to validate diversity in benchmarks for the Java programming language. The main contribution in this work is introducing Java benchmarks and not characterization of diﬀerent workloads introduced. Nevertheless, this work uses PCA as a sanity check to visualize diﬀerent modules within the benchmark.
This work outlines that diversity in workloads is achieved because the workloads’ represen-tations are scattered in the 2-dimensional space spanned by the first two components of the PCA. Hence the benchmark can be adopted by researchers wishing to analyze Java programs. Our work, in contrast, introduces representation learning methods in Chapter 5 in order to learn meaningful encodings for an end regression task that serves to predict the runtime latency of workloads. We also provide comparative results with PCA in Chapter 7.
The study in  reveals that trace results are not only sensitive to the benchmarks but also to the underlying architecture, the choice of heap size, and the virtual machine that is used for running the workloads. In our work, we don’t need to worry about the architecture because we are running our workloads on identical hardware, and we are using the same operating system and same Spark versions for our experiments.
Characterizing and subsetting big data workloads. A more recent study  on characterizing big data workloads focused on inferring workloads properties from traces collected from BigDataBench benchmark. The purpose of this study was twofold: to come up with a way to do workload subsetting (selecting representative workloads from a set of workloads) as well as to guide the design of future distributed systems architectures. The study consists of collecting traces from diﬀerent workloads on two big data software stacks (Hadoop and Spark), and then it uses these collected traces in order to analyze the workloads after doing PCA followed by a k-means clustering.
An interesting finding within this study is that PCA was unable to disentangle the algorithm (the workload) from the softare stacks on which it was run. For instance, representations learned from the same algorithm but on two diﬀerent software stacks are not usually clusterd together, while representations learned from two diﬀerent algorithms on the same software stack (either Spark or Hadoop) are close to each other. While in our study we focus on modeling workloads on top of a single software stack (Spark), we have run our workloads with diﬀerent configurations on the same distributed infrastructure in order to study the impact of the configuration on diﬀerent workloads. We had a similar finding in our work regarding PCA as a representation learning technique: PCA couldn’t disentangle the workload characteristics (the algorithm) from the configuration with which the workload was submitted (the configuration is determined by the number of cores and memory allocated for the workload, the degree of parallelism, and other knobs introduced later in Table 2 of Chapter 6)
Table of contents :
1.1 Technical challenges
1.2 Contributions and most significant results achieved
1.3 Outline of the thesis
2 Related Work
2.1 Learning based cloud resource management systems
2.2 Workload characterization using representation learning
2.4 Existing cloud offerings
3 Problem Settings, Environment and System Design
3.1 Real world requirements
3.2 System Design
3.3 Problem Statement
3.4 Workload embeddings properties
4 Recommender based architectures
4.1 Collaborative filtering methods
4.1.1 Memory based methods
4.1.2 Model based methods
4.2 Content based methods
4.3 Hybrid methods
5 Representation learning based architectures
5.1 The need for representation learning
5.2 Architectures overview
5.3 Encoder-decoder architectures
5.4 Siamese Neural Networks
5.5 Hybrid architectures
6 Workloads, Runtime Environment, Sampling and Traces
6.1 Workloads description
6.1.1 Streaming benchmark
6.1.2 TPCx-BB workloads
6.2 Distributed environment
6.3 Trace datasets
6.3.1 Sampling heuristics and knob selection
6.3.2 Collected metrics and preprocessing
7.1 Experimental setup
7.2 Evaluation methodology
7.3 Comparative results of recommender architectures
7.4 Comparative results of representation learning based techniques
7.5 Ablation, Scalability and Mapping studies
7.6 End-to-End experiments
8 Extended Discussions
9.1 Contributions of this thesis
9.1.1 Recommender systems architectures
9.1.2 Representation learning based architectures
9.1.3 Comparative results for modeling
9.1.4 End-to-end comparison to Ottertune
9.2 Future directions