

Virtual Machines: Hardware-level virtualization

Hardware-level virtualization was achieved by introducing a hypervisor: a new software layer between a guest operating system and the hosting hardware. As A. Tanenbaum explains, there are two types of hypervisors [45]: type 1 hypervisors run on bare metal, while type 2 hypervisors rely on the abstractions offered by an underlying operating system (OS) (see Figure 2.1). Either way, the goal of the hypervisor is to host multiple virtual machines (VMs) on a single computer. The VMs are accurate, isolated and efficient duplicates of the real machine [38]. Several interesting properties emerged from the use of VMs:
1. Emulation: Hypervisors create the illusion that each VM is in charge of the hardware. This ability to host any OS separately leads to many applications, ranging from running legacy software to running OS-exclusive software, and from debugging kernel development to debugging multi-OS software development.
2. Safety: Hypervisors are less prone to bugs than operating systems since they do one thing exclusively: emulate multiple copies of the bare metal. Errors and failures are therefore unlikely to propagate from one VM to another or to the hypervisor.
3. Security: Hypervisors usually do not allow multiple VMs to access a given physical resource at the same time. The attack surface at the VM level is small compared to the one at the process level. Unfortunately, both surfaces remain sensitive to hardware designs [28, 26].
4. Economy: Hypervisors save money on hardware, electricity and rack space in data centers because fewer physical machines are needed when a single machine is multiplexed into multiple VMs.
Companies specialized in data center management and staffed by experts in the area took advantage of these properties and gave birth to the Cloud by allowing clients to remotely access their physical resources through virtualization. VMs in the Cloud are undeniably appealing to clients because, in contrast to physical machines, they are resizable and already powered, cooled, maintained and upgraded by the provider. On the one hand, economies of scale are achieved by deploying multiple clients on the same machine; on the other hand, their privacy and their quality of service are put at risk.

Cost of machine virtualization

Tremendous efforts have been made to improve the efficiency of machine virtualization as much as possible. Prior to hardware-assisted virtualization, the trap-and-emulate technique used to prevent a VM from executing sensitive instructions was not enough, and hypervisors had to dynamically translate these instructions. Today, thanks to hardware-assisted features (Intel VT-x, AMD-V), the cost of machine virtualization is acceptable: i) CPU-wise, additional instructions are incorporated to meet the formal requirements of Popek and Goldberg [38], namely that all sensitive instructions (those that can affect the hypervisor) must be privileged instructions (those that can be trapped by the hypervisor). ii) Memory-wise, additional registers are incorporated to let the MMU access the nested page tables needed for the double address translation (Adams and Agesen [1]). iii) IO-wise, the addition of an IOMMU can provide device isolation [9], and SR-IOV devices can now appear as multiple separate devices [15]. Unfortunately, it has been reported that, despite being small on machines with few cores, the overhead becomes unacceptable on large NUMA machines [48]. Moreover, VMs do not maximize resource utilization, and countless dedicated schemes had to be conceived to tackle this limitation.
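These hardware-assisted features can be probed from user space. The following Python sketch is an illustration only, not part of the thesis: it assumes the standard Linux /proc/cpuinfo flag names (vmx/svm for VT-x/AMD-V, ept/npt for nested page tables, whose exposure varies with the kernel version) and the /sys/class/iommu directory that is populated when an IOMMU is enabled.

    import os

    def cpu_flags():
        """Collect feature tokens from every '*flags*' line in /proc/cpuinfo."""
        tokens = set()
        with open("/proc/cpuinfo") as f:
            for line in f:
                key, _, value = line.partition(":")
                if "flags" in key:
                    tokens.update(value.split())
        return tokens

    flags = cpu_flags()
    print("CPU virtualization (VT-x / AMD-V):", bool(flags & {"vmx", "svm"}))
    print("Nested page tables (EPT / NPT):   ", bool(flags & {"ept", "npt"}))

    # An IOMMU exposes itself under /sys/class/iommu when enabled at boot.
    iommu_dir = "/sys/class/iommu"
    print("IOMMU active:", os.path.isdir(iommu_dir) and bool(os.listdir(iommu_dir)))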

Improving utilization in a full virtualization context

Very few schemes respect the full virtualization paradigm, in which VMs are non-cooperative black boxes that cannot be modified. The most common scheme targets the duplication of identical data caused by deploying multiple instances of the same guest OS: thanks to works such as Linux KSM [5], the hypervisor strives to de-duplicate data to save memory. VSwapper [3] is another scheme that respects full virtualization and tries to address the double paging anomaly [20]. This anomaly occurs when the hypervisor and the VM both run eviction policies that end up contradicting each other. Goldberg showed that increasing the memory size of the virtual machine without a corresponding increase in its real memory size can lead to a significant increase in the number of page faults. Memory hot-plug emulation by the hypervisor was also suggested as a means of dynamically balancing memory between VMs [127].
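As an aside, KSM's de-duplication can be observed through its standard sysfs interface. The minimal Python sketch below is only illustrative: it assumes root privileges and the /sys/kernel/mm/ksm counters documented by the kernel.

    import os

    KSM = "/sys/kernel/mm/ksm"
    PAGE = os.sysconf("SC_PAGE_SIZE")

    def read_counter(name):
        with open(os.path.join(KSM, name)) as f:
            return int(f.read())

    # 1 starts the KSM scanner thread, 0 stops it (writing requires root).
    with open(os.path.join(KSM, "run"), "w") as f:
        f.write("1")

    shared = read_counter("pages_shared")    # unique pages kept after merging
    sharing = read_counter("pages_sharing")  # mappings pointing to those pages
    print(f"{sharing} mappings collapsed onto {shared} shared pages")
    # per the kernel documentation, pages_sharing approximates the savings
    print(f"approx. memory saved: {sharing * PAGE / 2**20:.1f} MiB")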

Improving utilization in a paravirtualization context

The remaining majority of schemes fall into the paravirtualization domain, because the absence of cooperation between the VM and the hypervisor otherwise creates too much complexity. The most extreme form of paravirtualization requires the guest OS to be explicitly ported to communicate with the hypervisor through hypercalls. At the other end of the paravirtualization spectrum, less intrusive schemes take advantage of existing interfaces in the guest OS to insert additional communication logic. Ballooning [46, 84] uses a Linux virtio driver [41] and allows the hypervisor to ask the guest to free its memory. PUMA [30, 29] uses the Cleancache API of Linux [90] and allows a VM to lend its unused memory to another, remote VM.
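In practice, ballooning is usually triggered through a hypervisor management API rather than by hand. The Python sketch below is only an illustration: it assumes the libvirt Python bindings and a hypothetical running domain named vm1, and simply lowers the balloon target so that the guest's virtio-balloon driver releases memory back to the host.

    import libvirt

    conn = libvirt.open("qemu:///system")   # connect to the local hypervisor
    dom = conn.lookupByName("vm1")          # hypothetical running guest

    stats = dom.memoryStats()               # balloon and guest memory counters
    print("current balloon target (KiB):", stats.get("actual"))

    # Lower the target to 1 GiB: the guest frees pages until it reaches this
    # target, handing the memory back to the host. libvirt values are in KiB.
    dom.setMemory(1024 * 1024)

    conn.close()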
To sum up, machine virtualization ends up increasing execution time and memory space, and burdens software and hardware development. Fortunately, containers are less subject to these drawbacks and open the door to a more efficient Cloud.


Containers: Operating System-level virtualization

The virtualization property most requested by ordinary users is the ability to encapsulate an entire environment (software dependencies, libraries, runtime, code and data) and run it anywhere. Singularity is one of the container engines that focuses solely on this idea [27], but most container engines take advantage of every kernel feature available to enforce isolation. There is no such thing as a container kernel object [104]; therefore, one can define a container as an assembly of kernel isolation features. For example, Docker sells its container engine as a solution to "Build, Ship, and Run Any App, Anywhere" [63], but behind the scenes it makes use of cgroups, namespaces and security features.
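Since there is no container kernel object, a container's identity can only be read back as the set of kernel features a process has been placed into. The short Python sketch below is a simple illustration using the standard /proc interface, which is precisely what engines such as Docker manipulate under the hood.

    import os

    pid = os.getpid()

    # Namespaces: one symlink per namespace type, e.g. 'pid -> pid:[4026531836]'.
    ns_dir = f"/proc/{pid}/ns"
    print("namespaces:")
    for ns in sorted(os.listdir(ns_dir)):
        print(f"  {ns:10s} -> {os.readlink(os.path.join(ns_dir, ns))}")

    # Cgroups: one line per hierarchy, 'hierarchy-ID:controllers:cgroup-path'.
    print("cgroup membership:")
    with open(f"/proc/{pid}/cgroup") as f:
        print(f.read(), end="")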

Isolating physical resources with cgroups

Prior to cgroups, utilities such as nice, ionice, mlock, madvise, fadvise, taskset, numactl, trickle [16] and setrlimit could be used to control a single process, but no equivalent mechanism existed to control a group of processes.
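For illustration, these single-process knobs remain available through standard system calls. The Python sketch below (using only the standard os and resource modules) lowers one process's scheduling priority and caps its address space, the per-process granularity that cgroups generalize to groups of processes.

    import os
    import resource

    os.nice(10)  # lower this process's CPU priority, like the `nice` utility

    # Cap this process's address space at 512 MiB, like `setrlimit`/ulimit.
    limit = 512 * 1024 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    print("niceness: ", os.nice(0))
    print("RLIMIT_AS:", resource.getrlimit(resource.RLIMIT_AS))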
Cgroup stands for "control group". It is a Linux kernel feature that groups processes hierarchically and distributes system resources along the hierarchy in a controlled and configurable manner [147, 125]. Containers use cgroups to limit, account for, and isolate physical resource usage. As N. Brown explains [82], there has been some disagreement over whether this grouping of processes should be seen as an organization hierarchy or as a classification hierarchy, but both views are correct: in a classification hierarchy, members can only occupy leaf nodes, whereas in an organization hierarchy, members in charge of managing others are placed in internal nodes. The current cgroup API is messy and inconsistent across resources, but these issues are being addressed in version 2 [126, 150].
The cgroup API is exposed as a virtual filesystem mounted at /sys/fs/cgroup, and processes can look up their membership in /proc/PID/cgroup. The remainder of this subsection details the cgroup subsystems listed in Table 2.1. As cgroups are still under development, additional subsystems will be added, such as the rdma cgroup [149] and the memory bank and CPU cache cgroup [51]. More cgroup subsystems could be conceived; for instance, it could be wise to implement a cgroup that controls memory bandwidth [11].
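To make the filesystem API concrete, the following Python sketch is an illustration assuming root privileges, a cgroup-v1 memory controller mounted at /sys/fs/cgroup/memory, and an arbitrary group name demo: it creates a cgroup, caps its memory and moves the current process into it.

    import os

    CG = "/sys/fs/cgroup/memory/demo"       # "demo" is an arbitrary group name
    os.makedirs(CG, exist_ok=True)          # creating the directory creates the cgroup

    with open(os.path.join(CG, "memory.limit_in_bytes"), "w") as f:
        f.write(str(256 * 1024 * 1024))     # hard limit: 256 MiB

    with open(os.path.join(CG, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))           # move the current process into the group

    with open(os.path.join(CG, "memory.usage_in_bytes")) as f:
        print("current memory usage (bytes):", f.read().strip())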

Table of contents:

1 Introduction 
2 Resource Virtualization 
2.1 Virtual Machines: Hardware-level virtualization
2.1.1 Cost of machine virtualization
2.1.2 Improving utilization in a full virtualization context
2.1.3 Improving utilization in a paravirtualization context
2.2 Containers: Operating System-level virtualization
2.2.1 Isolating physical resources with cgroups
2.2.2 Isolating resource visibility with namespaces
2.2.3 Restraining attack surface with security features
2.3 Containers and VMs comparison
2.3.1 Comparing stand-alone overheads
2.3.2 Comparing performance isolation and overcommitment
2.3.3 Should VMs mimic Containers?
2.4 Consolidation and Isolation, the best of both worlds
2.4.1 Resource Consolidation
2.4.2 Performance Isolation
2.4.3 Illustrating Consolidation and Isolation with the CPU cgroups
2.4.4 Block I/O, a time-based resource similar to CPU
2.5 Memory, a spatial but not time-based resource
2.5.1 Conclusion
3 Memory and cgroup 
3.1 Storing data in main memory
3.1.1 Memory Hierarchy
3.1.2 Spatial multiplexing
3.1.3 Temporal multiplexing
3.1.4 The need for memory cgroup
3.2 Accounting and limiting memory with cgroup
3.2.1 Event, Stat and Page counters
3.2.2 min, max, soft and hard limits
3.3 Isolating cgroup memory reclaims
3.3.1 Linux memory pool
3.3.2 Splitting memory pools
3.4 Resizing dynamic memory pools
3.4.1 Resizing anon and file memory pools
3.4.2 Resizing cgroup memory pools
3.5 Conclusion
4 Isolation flaws at consolidation 
4.1 Modeling Consolidation
4.1.1 Model assumptions
4.1.2 Countermeasures
4.1.3 Industrial Application at Magency
4.2 Consolidation: once a solution, now a problem
4.2.1 Consolidation with containers
4.2.2 Consolidation without containers
4.2.3 Measuring consolidation errors
4.3 Lesson learned
5 Capturing activity shifts 
5.1 Rotate ratio: a lru dependent metric
5.1.1 Detecting I/O patterns that waste memory with RR
5.1.2 Balancing anon and file memory with RR
5.1.3 RR can produce false negatives
5.1.4 Additional force_scans cost CPU time and impact isolation
5.1.5 Conclusion
5.2 Idle ratio: a lru independent metric
5.2.1 IR accurately monitors the set of idle pages
5.2.2 Trade-offs between CPU time cost and IR’s accuracy
5.2.3 Conclusion
5.3 Conclusion
6 Sustaining isolation of cgroups 
6.1 Refreshing the lrus with force_scan
6.1.1 Conclusion
6.2 Building opt: a relaxed optimal solution
6.2.1 Applying soft_limits at all levels of the hierarchy
6.2.2 Order cgroups by activity levels with reclaim_order
6.2.3 Stacking generic policies
6.3 Guessing the activity levels
6.3.1 A Metric-driven approach to predict activities
6.3.2 An Event-driven approach to react to activity changes
6.4 Conclusion
7 Evaluation of the metric and the event-driven approaches 
7.1 Experimental setup
7.1.1 Workload’s types and inactivitymodels
7.1.2 Schedule of activity shifts and configuration of resources
7.1.3 Throttling issues with Blkio cgroup
7.1.4 Experimental configurations
7.2 Performance analysis
7.2.1 Control Experiments
7.2.2 Event-based solutions
7.2.3 Metric-based solutions
7.3 Page transfer analysis
7.3.1 Rotate ratio solutions
7.3.2 Idle ratio solutions
7.3.3 Event-based solutions
7.4 Conclusion
8 Conclusion and Future works 
8.1 Short-term challenges
8.1.1 Spreading contention on the most inactive containers
8.1.2 Ensuring properties when all containers are active
8.2 Long-term perspectives
8.2.1 Ensuring isolation and consolidation at the scale of a cluster
8.2.2 Maximizing global performance with limited memory
Bibliography 

