Current hypervisor NUMA management


The Factored Operating System

The previously presented works make communication explicit by avoiding the use of shared memory. Instead, they maintain private state that can be synchronized with message passing. This message passing is itself implemented on top of cache coherency.
Wentzlaff et al. [69] argue that implementing a cache coherency protocol may not even be possible in a computer with thousands of cores. Indeed, even if the software rarely uses cache coherency, a local cache controller holding a cache line in the Invalid state still needs to probe the other cache controllers when its core requires the cache line. Thus, the cache controllers still exchange messages, even if the cores only use private data structures.
The Factored Operating System (FOS) is an operating system designed to execute in a computer without global cache coherency. The FOS is based on an idea similar to the Multikernel [9]: execute an autonomous part of the system on each core. Unlike the Multikernel, the FOS does not maintain any global state. Instead, each core works as a server that performs one task: either execute application threads, one at a time, or execute a kernel service.
Additionally, each core maintains a name cache. The name cache is a routing table that indicates which core to contact for a given service. The cores communicate through an architecture-dependent message-passing mechanism.
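The following minimal sketch (ours, not FOS code) illustrates this idea: each core keeps a routing table from service identifiers to the core hosting the corresponding server, and send_to_core() stands for the hypothetical, architecture-dependent message-passing primitive.

```c
#include <stdint.h>

enum service { SVC_FS, SVC_NET, SVC_SCHED, SVC_COUNT };

/* One name cache per core: which core answers requests for each service. */
static _Thread_local uint16_t name_cache[SVC_COUNT];

/* Hypothetical, architecture-dependent message-passing primitive. */
void send_to_core(uint16_t core, const void *msg, unsigned len);

void request_service(enum service svc, const void *msg, unsigned len)
{
    /* Look up the server core locally, then ship the request to it:
     * no shared state is touched, only a message is sent. */
    send_to_core(name_cache[svc], msg, len);
}
```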

The Scalable Commutativity Rule

The works presented in this section address the effects of communication on performance. The first work, namely Remix [32], removes useless cache coherency messages: false sharing messages do not actually maintain coherency, because the cores work on distinct variables. The next works, namely Scalloc [3] and the sloppy counter [14], reduce the number of exchanged messages by batching them. The cores make most of their updates on private data and commit these changes to a global state only when required. The last two works, namely the Multikernel [9] and the FOS [69], reduce the number of required messages. Because they are designed like distributed systems, they maintain private states, one per core, that rarely or never have to be synchronized.

All of these works, however, require that the cores synchronize with each other at some point. Clements et al. describe a general method to design software that does not require the cores to synchronize. To do so, the software must provide an interface that follows the Scalable Commutativity Rule [22]. This rule states that, given a set S of operations executed concurrently, the result of an operation o executed after S must not depend on the ordering of the operations inside S. If a software interface follows this rule, then there exists an implementation of this interface that executes the operations concurrently without any synchronization.
By using this rule, a programmer can design a system that does not require any expensive communication. This actually solves contention issues by making the multicore nature of the underlying hardware part of the software design.
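As a hedged illustration of the rule (ours, not taken from [22]), consider an identifier allocator. An interface that must return the lowest unused identifier does not commute: the value returned by a later call depends on the order of the earlier concurrent calls, so every implementation must synchronize on shared state. An interface that may return any unused identifier commutes, and a synchronization-free implementation, for instance one that carves the identifier space per core, becomes possible.

```c
/* Non-commutative interface: must return the lowest unused id, hence
 * every call has to consult a single shared structure (hypothetical). */
int alloc_lowest_id(void);

/* Commutative interface: return any unused id.  One possible scalable
 * implementation partitions the id space per core, so two concurrent
 * calls never touch the same data. */
#define IDS_PER_CORE (1 << 20)
static _Thread_local int next_local_id;

int alloc_any_id(int core)
{
    return core * IDS_PER_CORE + next_local_id++;
}
```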

Non Uniform Memory Access architectures

As we show in Section 2.1, core synchronizations cause the serialization of operations and hence poor performance. We show that a good software design reduces the number of core synchronizations on shared data, which improves application performance. However, in a simple multicore architecture, as illustrated in Figure 2.1, the cores still share hardware components. Indeed, each core accesses the memory through a single memory controller and a single system bus. Thus, even if each core accesses its own private data space, all the cores still share the same memory controller. Since this controller serves the core requests sequentially, it causes the serialization of operations.
The local memory caches of the cores serve a large part of the memory requests. When there are no more than about ten cores in the machine, the number of memory requests sent to the memory controller remains too low to cause serialization. By contrast, in a machine with several tens of cores, the cache mechanism is no longer sufficient. It then becomes necessary that several cores can access the memory concurrently. In a Non Uniform Memory Access (NUMA) architecture, the available cores and memory banks are partitioned into several nodes, each node having a separate bus and dedicated memory controllers. All the nodes are bound together with an internal network called the interconnect. This architecture is illustrated in Figure 2.3. Using several nodes, and thus several buses, spreads the memory load across the nodes.

This section first presents how the cores communicate in a NUMA architecture. In particular, we describe the HT Assist cache coherency protocol: an extension of the MOESI protocol for NUMA architectures. We also describe the NUMA hardware that we use for our evaluations. We then explain how Linux handles the distributed aspect of the memory. To this end, we present two NUMA memory management policies provided by the Linux kernel. These explanations help to understand which aspects of a NUMA hardware impact application performance. Finally, we present some existing works that focus on how to use the specificities of the NUMA architecture to achieve good application performance.
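To make the idea of spreading the memory load concrete, the sketch below (ours, assuming Linux and the libnuma headers, with an arbitrary node mask) interleaves the pages of an allocation across the nodes with mbind(), so that several memory controllers serve the requests instead of one.

```c
#include <numaif.h>    /* mbind(), MPOL_INTERLEAVE; link with -lnuma */
#include <stddef.h>
#include <sys/mman.h>

/* Allocate size bytes whose pages are placed round-robin on the nodes
 * set in nodemask (bit i means node i). */
static void *alloc_interleaved(size_t size, unsigned long nodemask)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mbind(p, size, MPOL_INTERLEAVE, &nodemask,
              8 * sizeof(nodemask), 0) != 0) {
        munmap(p, size);
        return NULL;
    }
    return p;
}
```

For instance, alloc_interleaved(1UL << 30, 0x3) would spread a 1 GiB buffer over nodes 0 and 1.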

Architecture details

In a NUMA architecture, there are several memory controllers that can serve requests concurrently. Each of these controllers serves the requests for its associated memory banks. A cache coherent NUMA (ccNUMA) architecture presents these memory banks as a single address space to the software: the first gigabytes stand for the memory banks of the first node, the following gigabytes for the next node, and so on. Thanks to this unified address space, legacy applications can execute on ccNUMA computers. In the remainder of this thesis, we focus exclusively on ccNUMA architectures. We thus simply use the term NUMA instead of ccNUMA.
As suggested by the architecture name, the memory access time in a NUMA architecture varies depending on the memory bank a core accesses. Intuitively, we can assume that a core accesses a memory bank of its own node faster than a memory bank of a remote node. As a result, accessing the unified memory address space without considering that the underlying memory banks are located in different nodes may lead to poor application performance. In this part, we describe how NUMA hardware exposes a unified address space out of several memory banks accessed through independent memory controllers. We also give some key characteristics of the hardware we use for our evaluations.
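The local/remote asymmetry can be observed with a small micro-benchmark. The sketch below (ours, assuming a machine with at least two nodes and the libnuma API, linked with -lnuma) pins the running thread to node 0, allocates one buffer on node 0 and one on node 1, and times a pass over each; the remote pass is expected to be noticeably slower.

```c
#include <numa.h>      /* numa_run_on_node(), numa_alloc_onnode(); -lnuma */
#include <stddef.h>
#include <stdio.h>
#include <time.h>

static double touch_ms(volatile char *buf, size_t size)
{
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < size; i += 64)    /* one access per cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1)
        return 1;                            /* need at least two nodes */
    size_t size = 256UL << 20;               /* 256 MiB per buffer */
    numa_run_on_node(0);                     /* pin the thread to node 0 */
    char *local  = numa_alloc_onnode(size, 0);
    char *remote = numa_alloc_onnode(size, 1);
    if (!local || !remote)
        return 1;
    touch_ms(local, size); touch_ms(remote, size);   /* fault pages in */
    printf("local:  %.1f ms\n", touch_ms(local, size));
    printf("remote: %.1f ms\n", touch_ms(remote, size));
    numa_free(local, size);
    numa_free(remote, size);
    return 0;
}
```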


Everything about Synchronization

The Lock Cohorting technique is not the only work that tackles locking efficiency on a NUMA architecture. Other designs [27, 55] have been proposed to lock efficiently on this architecture. David et al. do not propose a new lock design but instead give a fairly complete study of the relations between the locking system, the hardware, and the software.
This study, named Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask [26], evaluates and describes the behavior of several state-of-the-art locks over several types of hardware, especially NUMA hardware, with several workloads. As this work does not give any new solution but focuses on understanding how existing solutions behave, we here analyze the observations the study makes.
The first observation is that, as expected, synchronizing cores of different nodes kills performance. More formally, the acquisition of a lock owned by a thread on a remote node can be a dozen times more expensive than if the owner thread were on the local node. Interestingly, this phenomenon worsens as the lock gets (i) more contended and (ii) more complex. This second point does not hold if the complexity is added to make the lock NUMA aware.
The second observation is that considering the NUMA aspect of the hardware at a high level is not always sufficient. The study highlights a surprising consequence of the HT Assist protocol described in Section 2.2. Even if a synchronization, or any concurrent memory modification, is limited to the cores of a single node, it may lead to internode broadcasts.
Indeed, if a cache line is in the Shared or Owned state, a modification of this line forces the home node to broadcast invalidation probes. This occurs even if only the local cores own a copy of the cache line. David et al. propose a fix that consists in forcing the cache line state to Exclusive/Modified with a dedicated instruction.
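A minimal sketch of this kind of fix (ours, assuming an x86 core that supports the prefetchw instruction; whether the broadcast is actually avoided depends on the coherency implementation): issue a write prefetch before the store so that the line is requested directly in an exclusive state.

```c
/* Request the cache line in an exclusive state before writing to it. */
static inline void prefetch_for_write(const void *addr)
{
    __asm__ __volatile__("prefetchw (%0)" : : "r"(addr) : "memory");
}

void bump_counter(unsigned long *counter)
{
    prefetch_for_write(counter);   /* avoid modifying a Shared/Owned line */
    (*counter)++;
}
```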
The third observation is that the lock complexity often adds overhead on NUMA architectures, but can improve performance under specific conditions. As David et al. state in the first observation, using a lock across NUMA nodes is more expensive when this lock is more complex. However, under high contention, these locks perform better than the simple locks.

This study also gives other observations that are not related to the NUMA architecture. We leave them aside and summarize the three observations presented here. First, software aware of the high-level NUMA topology is more efficient because it avoids internode data sharing and increases locality. Second, software aware of the low-level NUMA details is more efficient because it avoids hardware mispredictions or degenerated behaviors. Third, software aware of the end-user application details, like the lock contention, is more efficient because it can choose an appropriate strategy. This third point is not specific to NUMA. However, it is closely related to the necessity for a system to know the needs of the end-user application in order to choose the appropriate strategy, whether a lock implementation or a NUMA policy.

Dynamic Binary Translation for same ISA

A special case of DBT occurs when the guest and the host have the same ISA. This is typically the case for operating system development or debugging. In this case, translating an instruction block is straightforward. The DBT emulator only replaces privileged instructions with calls to the emulator code. The processor hence executes most of the guest code unmodified. In this common scenario, the DBT emulation only causes a small performance overhead.
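The sketch below (a simplified illustration of ours, not an actual DBT engine; decode_one() and emit_call_to_emulator() are hypothetical helpers) shows the shape of such a same-ISA translation pass: unprivileged instructions are copied verbatim, privileged ones are replaced by calls into the emulator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct insn { bool privileged; };

/* Hypothetical decoder: fills *out and returns the instruction length. */
size_t decode_one(const uint8_t *code, struct insn *out);
/* Hypothetical emitter: writes a call to the emulator handler for this
 * instruction and returns the number of bytes emitted. */
size_t emit_call_to_emulator(uint8_t *dst, const uint8_t *insn, size_t len);

size_t translate_block(const uint8_t *guest, size_t guest_len, uint8_t *host)
{
    size_t in = 0, out = 0;
    while (in < guest_len) {
        struct insn i;
        size_t len = decode_one(guest + in, &i);
        if (i.privileged) {
            /* Privileged: divert to the emulator instead of executing it. */
            out += emit_call_to_emulator(host + out, guest + in, len);
        } else {
            /* Unprivileged: copy it so the processor runs it unmodified. */
            for (size_t k = 0; k < len; k++)
                host[out + k] = guest[in + k];
            out += len;
        }
        in += len;
    }
    return out;
}
```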
In the remainder of this document, we call hypervisor, or Virtual Machine Monitor (VMM), an emulator specifically designed to execute virtual machines with the same ISA as the physical machine [60, 36, 35, 17]. We now focus exclusively on hypervisors, since Cloud computing virtualization relies on these programs.

Hardware Assisted Virtualization

The paravirtualization technique reduces the number of interceptions that the hypervisor makes and thus the number of context switches between the guest and the host. It however has some drawbacks. First, the guest operating system has to support paravirtualization. Second, paravirtualization works well only for privileged operations that update stateful devices, like disk controllers. This is because the complete emulation of a stateful device requires several interceptions, and thus several expensive context switches, while paravirtualizing the same device requires only one context switch. For other special but frequent operations, like performing a syscall, paravirtualization does not help, since it results in one context switch anyway.

With the growing number of Cloud services relying on virtualization, making hypervisors efficient became an economic necessity. Processor manufacturers then began to add virtualization support to their chips [51, 54, 33, 68, 30]. With this hardware assistance, modern hypervisors emulate hardware for unmodified guest systems with good performance.
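As a hedged illustration (not code from this thesis), the skeleton below shows how a hypervisor built on hardware-assisted virtualization typically drives a virtual CPU, here with the Linux KVM interface: the guest runs directly on the processor until the hardware forces an exit, and the hypervisor only intervenes to emulate the cause of the exit before resuming.

```c
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* run points to the kvm_run area mmap()ed from the vCPU file descriptor. */
static void run_vcpu(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        /* Enter the guest: the CPU executes guest code directly. */
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            perror("KVM_RUN");
            return;
        }
        /* The hardware exited to the hypervisor: emulate, then resume. */
        switch (run->exit_reason) {
        case KVM_EXIT_IO:      /* guest touched an emulated I/O port */
            /* ... emulate the device access here ... */
            break;
        case KVM_EXIT_MMIO:    /* guest touched emulated device memory */
            break;
        case KVM_EXIT_HLT:     /* guest halted this vCPU */
            return;
        default:
            fprintf(stderr, "unhandled exit %d\n", run->exit_reason);
            return;
        }
    }
}
```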
In this thesis, we focus on this last type of virtualization. For this reason, we describe Hardware Assisted Virtualization in more detail in the next section.

Table of contents:

1 Introduction 
2 State of the Art 
2.1 Multicore architectures
Architecture details
Related challenges
2.2 Non Uniform Memory Access architectures
Architecture details
Linux NUMA policies
Related challenges
2.3 System virtualization
Technical details
Architecture details
Related challenges
2.4 Software settings
2.5 Conclusion
3 The Well-Known Bottlenecks
3.1 The Virtualized I/O Overhead
Hardware Emulation
The I/O Memory Management Unit
Evaluation of IOMMU
3.2 The virtualized IPI overhead
Usage of the IPI
Implementations of IPI
The libactive library
Evaluation of vIPI
3.3 The Xen load balancer
Completely Fair Scheduler
Credit Scheduler
The libpin
Evaluation of pinning scheme
3.4 Conclusion
4 The NUMA Bottleneck 
4.1 NUMA policies under study
Comparison of NUMA policies effects
The NUMA policy selection metric
4.2 Current hypervisor NUMA management
The default round-1G policy
Huge pages and splintering
Evaluation of the default policy
4.3 Conclusion
5 Virtualization of NUMA Architectures 
5.1 Improved hypervisor NUMA management
The Xen implementation of round-4K
The Xen implementation of Carrefour
Limitations of PEBS with virtualization
The Xen implementation of first-touch
Limitations of first-touch with IOMMU
The Xen NUMA policy selection interface
5.2 Evaluation
Evaluation of vNUMA on a single machine
Evaluation of vNUMA on several machines
5.3 Conclusion
6 Conclusion 
6.1 Future works
6.2 Perspectives
Bibliography 

