Communication between hardware threads

Get Complete Project Material File(s) Now! »

Hardware threads

Until the early 2000’s, most consumer CPUs were only able to run a single software thread at any given time, i.e., they only provided a single hardware thread. Multitasking was handled by schedulers, which used time-sharing to enable several software threads to use the CPU concurrently. The first technique that was commonly used to introduce parallelism in CPUs was instruction pipelines: each instruction is broken up into a number of steps, and different steps from consecutive instructions are executed in parallel [93]. With the introduction of multicore CPUs, the parallelism provided by the multiple cores was combined with the parallelism provided by the instruction pipeline, in order to improve performance.
In many multicore architectures, several software threads can be running at a given time if they are placed on different cores. Moreover, in order to increase parallelism even more, some CPU manufacturers do not only bundle several CPU cores into a single CPU, they also replicate some parts of the cores in order to make each one of them execute several software threads simultaneously: from the point of view of the developer, each core provides multiple hardware threads that are completely independent execution units, and one software thread can be running on each of these hardware threads at any given time [106]. In practice, however, two hardware threads running concurrently on the same core may slow down each other more than if they were running on different cores, because they share resources such as, for instance, their Arithmetic and Logic Unit (ALU). The idea of hardware multithreading is to increase parallelism at a lower cost than by adding more cores, because some components that are not usually bottlenecks can be shared by several hardware threads.
Each hardware thread must have access to the following components, some of which possibly being shared: (i) A set of registers that store the data that is currently being used by the hardware thread (generally not shared), (ii) an arithmetic and logic unit, often completed with a Floating Point Unit (FPU) for operations on floating point numbers, (iii) a Transaction Lookaside Buffer (TLB), which is a cache that is used to accelerate the translation of virtual memory addresses into physical addresses, and (iv) elements that are located outside CPU cores and sometimes shared by several cores, as shown in Figure 2.1, such as data and instruction CPU caches, or mem-ory/interconnect controllers. Hardware multithreading can be implemented using three techniques: coarse-grained multithreading, fine-grained multithreading, and simultaneous multithreading.
Coarse-grained multithreading. Also called block multithreading or cooperative multithread-ing, coarse-grained multithreading lets a thread run until it is blocked by an event that causes a long enough stall, such as a cache miss or a page fault. The thread will not be scheduled again until the data or signal it was waiting for has arrived. Each hardware thread must have its own set of data and control registers, so that the CPU can switch between hardware threads in a single CPU cycle. Fine-grained multithreading. Also called interleaved multithreading, with fine-grained multi-threading, the CPU starts executing an instruction from a different hardware thread at each cycle, in a round-robin fashion. Since instructions from several threads are executed in parallel in the pipeline, each stage in the pipeline must track which thread’s instruction it is processing. More-over, since more threads are executed in parallel than with coarse-grained multithreading, shared resources such as the TLB and CPU caches must be larger so that they do not become bottlenecks.
Simultaneous multithreading (SMT). It is the most advanced implementation of hardware multithreading, and it is designed for superscalar CPUs. In a traditional superscalar CPU with one hardware thread, several instructions are issued from a single thread at each CPU cycle. With SMT, CPUs issue several instructions from multiple threads at each CPU cycle. This requires to track which thread’s instruction is being processed for each thread at each stage of the pipeline. However, SMT has the advantage to use issue slots better than traditional superscalar processors, because single threads only have a limited amount of instruction-level parallelism, whereas multiple threads are typically independent from each other. SMT is used in some Intel (HyperThreading) and Sun/Oracle CPUs. In particular, SMT is used by the UltraSPARC T2+, the CPU used by Niagara2-128, one of the machines described in Section 2.5.2 and used in the evaluation in Chapter 5.

Communication between hardware threads

This section describes the means by which hardware threads communicate with each other. Section 2.3.1 describes CPU caches and how they are used for communicating between hardware threads. It also presents cache-coherent and non-cache-coherent architectures. Section 2.3.2 focuses on NUMA architectures.

CPU caches

An overview of CPU caches is given in Section 2.3.1.1. Cache-coherent and non-cache-coherent architectures are described in Sections 2.3.1.2 and 2.3.1.3, respectively.

Cache-coherent architectures

On multicore architectures, each core has its own local caches. If different caches hold different versions of the same data, two hardware threads may not have the same view of the global memory, and it may be hard for hardware threads to ensure that the data they are reading is up-to-date. To prevent this, most CPUs nowadays are cache-coherent, which means that even though cores have their own local caches with possibly different versions of the same data, a cache-coherence protocol is implemented in the hardware in order to make sure that different hardware threads have a consistent view of the global memory.
An architecture is said to be cache-coherent if the following conditions are met [86]:
• If a hardware thread t writes then reads from a location l, with no writes from other hardware threads at the same location between the write and the read operation, then t must read the value it wrote at the location l.
• A read from a hardware thread t1 to a location l that was previously written by t2 must return the value that t2 wrote at that location if enough time passed between the read and the write operation, and if no other hardware thread wrote data at l between the read and the write operation.
• Writes to the same location are serialized, i.e., two writes to the same location by any two hardware threads are seen in the same order by all hardware threads.
Section 2.3.1.2.a describes cache-coherence protocols that ensure that all hardware threads have a consistent view of the global memory. Section 2.3.1.2.b presents instructions that facilitate synchronization between hardware threads. Finally, Section 2.3.1.2.c gives a quick overview of common bottlenecks on cache-coherent architectures.

Cache-coherence protocol

In cache-coherent architectures, hardware threads typically rely on the cache-coherence protocol for communication: when a hardware thread t1 needs to send a value to a hardware thread t2, it writes the value at a known memory address, and t2 reads it later, possibly busy-waiting for the value at that address to be modified by t1. Most cache-coherence protocols that are used in current multicore machines are based on the MESI protocol [85]. The name MESI comes from the four possible states it defines for cache lines: Modified, Exclusive, Shared and Invalid. These states can be described thus:
• Modified. The data in the cache line has been modified locally (i.e., it is dirty), and only resides in this cache. The copy in the main memory is is not up to date, therefore, if the cache line is evicted or changes its state, its data must be flushed to the main memory.
• Exclusive. The data in the cache line is unmodified (i.e., it is clean), and it does not reside in any other cache.
• Shared. The data in the cache line is clean but other copies may reside in other caches.
• Invalid. The cache line does not contain valid data. This typically happens when a shared cache line was modified in one of the caches: other copies of the data got invalidated by the cache-coherence protocol.
While the MESI protocol is functional, its performance is not optimal on architectures with a large number of cores because of its communication overhead. In particular, the MESI protocol may send many high-latency messages that contain redundant data: if a cache requests data that resides in many different caches (Shared state), all caches may send the same data to the requesting cache, which results in wasted bandwidth. Another drawback of the MESI protocol is that the only way for data from a Modified cache line to be accessed by remote hardware threads is to flush that data to the main memory and fetch it again, when fast cache-to-cache communications should be sufficient. Improved versions of the MESI protocol, with more states, have been implemented to solve these issues. Two widely-used variants are the MESIF protocol and the MOESI protocol. MESIF protocol. The MESIF protocol has been used by Intel CPUs since the Nehalem architecture. It adds a Forwarded state to the MESI protocol and modifies its Shared state. The Forwarded (F) state is a variant of the Shared state that expresses the fact that the cache should act as the designated responder for that cache line. At most one cache holds a copy of data in the Forwarded state. If a cache requests data that exists in various caches, and if one of the caches holds a copy of the data that is in the Forwarded state, only that cache will send the data: no redundant messages are sent. If no version of the data is in the Forwarded state, the data will be fetched from the main memory, which may induce a overhead. This can happen if a cache line that was in the Forwarded state was evicted. To avoid this issue, the most recent requester of the data is automatically assigned the Forwarded state, which decreases the risk of Forwarded cache lines getting evicted.
MOESI protocol. The MOESI protocol is used by AMD and Sun/Oracle CPUs. It adds an Owned state and modifies the Shared state of the MESI protocol. The Owned (O) state expresses the fact that the cache holds one of the copies of a cache line (as with the Shared state) and has the exclusive right to modify it. All modifications to that cache line must be broadcast to all other caches that own it (in the Shared state): this allows for direct core-to-core communication without going through the main memory. An Owned cache line may change its state to Modified after invalidating all shared copies, and it may change its state to Shared after flushing the data to memory. The semantics of the Shared state in the MOESI protocol are modified: unlike with the MESI protocol, a Shared cache line may hold invalid data if an Owned cache line holds the correct, most recent version of the data. The Owned cache line is responsible for eventually flushing its data to the main memory. If no Owned cache line holds the data, the Shared cache line holds data that is ensured to be valid. Shared cache lines may change their state to Exclusive or Modified after invalidated all other shared copies.
As shown in Figure 2.3, with the MOESI protocol, when two hardware threads from the same die communicate together, all they need to use is their local cache and the minimum subset of caches that they share. When hardware threads from different dies or CPUs communicate, the data must go through the interconnect, but no access to the main memory is needed.

READ  ramses: A numerical N-body and HD code using adaptive mesh refinement (AMR)

Instructions used for synchronization

While, on cache-coherent architectures, reading and writing data to shared memory locations is sufficient for basic communication between hardware threads, specific instructions are sometimes needed for more complex synchronization schemes. First, because CPUs automatically reorder independent instructions, some specific instructions can be used to ensure system-wide ordering constraints between read and write operations of different hardware threads: these instructions are known as memory barriers. Second, it is sometimes useful to execute several operations in a way that appears atomic to other hardware threads. This can be done thanks to atomic instructions. Memory barriers. Modern CPUs use out-of-order execution, i.e., they may reorder instruc-tions to improve performance: instructions that can be instantly executed are sometimes executed before earlier instructions that would cause a stall waiting for their input data. While out-of-order execution is completely transparent in architectures that provide a single hardware thread, it can cause unpredictable behavior in architectures that provide several hardware threads: the reordering of instructions is designed to be transparent for the hardware thread that executes them, but other hardware threads see the side effects of these instructions in the real order in which they are executed. Memory barriers, also known as memory fences, make it possible to enforce an ordering constraint on memory operations issued before and after the barrier instruction. Most modern architectures (including x86) do not ensure that stores may not be reordered after loads to different addresses. To prevent this, a store fence can be issued between the store and load instructions. Atomic instructions also act as memory barriers: all pending load and store operations must be executed before the atomic instruction.1
Atomic instructions. The instruction set of current CPUs often includes a set of atomic instructions. These instructions combine several operations whose execution appears to be atomic to hardware threads, i.e., no other instruction from any hardware thread can modify the shared data they operate on during their execution. Atomic instructions make it possible for programs to modify shared data without having to acquire locks. Common atomic instructions include: (i) test-and-set, which reads (and returns) the value v1 at a given address and replaces it with a new value v2 if v1 is non-zero2, (ii) fetch-and-store, which reads (and returns) the value v1 at a given memory address and replaces it with a new value v2, (iii) fetch-and-add, which fetches a value v1 at a given address, adds a value v2 to v1 and stores the result at v1’s address, (iv) atomic swap, or atomic exchange, which swaps the values at two given addresses in memory, and (v) Compare-And-Swap (CAS), which compares the value v1 at a memory address to a value v2 and, in case of equality, writes a value v3 at v1’s address. A whole class of algorithms, named lock-free algorithms, rely exclusively on atomic instructions instead of locks for synchronization [56, 77, 52, 40, 62, 63].

Non-cache-coherent architectures

As seen in the previous section, the cache-coherence protocol incurs an overhead, and this overhead may get worse as the number of cores increases. To prevent this, some manufacturers have designed non-cache-coherent architectures, in which each core owns part of the global memory, and hardware threads must use message-passing in order to request data from other cores. Non-cache-coherent CPUs include Intel’s Single Chip Cloud Computer (SCC) and to some extent, Tilera’s TILE-Gx CPUs. In both the SCC and the TILE-Gx CPUs, cores are organized in a grid. Writing code for non-cache-coherent architectures can be very complex since cores cannot simply read and write from known memory addresses to communicate, and must rely on custom message-based protocols instead. Moreover, operating systems and applications have to be rewritten for these architectures. Fortunately, non-cache-coherent multicore architectures have many common points with distributed systems, and research on distributed operating systems has been ongoing for decades [66, 91, 104]. This led to the design of some research operating systems for non-cache-coherent architectures such as Corey [13] and Barrelfish [9] that solely rely on message-passing. Similarly, some software components such as garbage collectors [114] have been written for non-cache-coherent architectures. However, it will take a lot of time for operating systems and other software components to become as feature-rich as currently-used legacy software that has been developed for decades on cache-coherent architectures. Moreover, it has been shown [14] that legacy operating systems such as Linux can be made to scale on current cache-coherent multicore architectures with dozens of cores.

Table of contents :

1 Introduction 
2 Multicore architectures 
2.1 Overview
2.2 Hardware threads
2.3 Communication between hardware threads
2.3.1 CPU caches
2.3.1.1 Overview
2.3.1.2 Cache-coherent architectures
2.3.1.2.a Cache-coherence protocol
2.3.1.2.b Instructions used for synchronization
2.3.1.2.c Bottlenecks
2.3.1.3 Non-cache-coherent architectures
2.3.2 NUMA architectures
2.4 Hetereogeneous architectures
2.5 Machines used in the evaluation
2.5.1 Magnycours-48
2.5.2 Niagara2-128
2.5.3 Performance comparison
2.5.3.1 Cache access latencies
2.5.3.2 Contention overhead
2.5.3.3 Application performance
2.5.3.4 Summary
2.6 Conclusion
3 Lock algorithms 
3.1 Blocking locks
3.2 Basic spinlock
3.3 CLH
3.4 MCS
3.5 Time-published locks
3.6 Oyama
3.7 Flat Combining
3.8 CC-Synch and DSM-Synch
3.9 Comparison of lock algorithms
3.10 Other lock algorithms
3.11 Conclusion
4 Contribution 
4.1 Remote Core Lock
4.1.1 Core algorithm
4.1.2 Implementation of the RCL Runtime
4.1.2.1 Ensuring liveness and responsiveness
4.1.2.2 Algorithm details
4.1.3 Comparison with other locks
4.2 Tools
4.2.1 Profiler
4.2.2 Reengineering legacy applications
4.3 Conclusion
5 Evaluation 
5.1 Liblock
5.2 Microbenchmark
5.3 Applications
5.3.1 Profiling
5.3.2 Performance overview
5.3.3 Performance of SPLASH-2 and Phoenix applications
5.3.4 Performance of Memcached
5.3.5 Performance of Berkeley DB with TpccOverBkDb
5.3.5.1 Experimental setup
5.3.5.2 Performance analysis
5.3.5.3 Yielding the processor in busy-wait loops
5.4 Conclusion
6 Conclusion 
A French summary of the thesis 
A.1 Introduction
A.2 Contribution
A.2.1 Algorithme de RCL
A.2.2 Outils
A.3 Évaluation
A.3.1 Microbenchmark
A.3.2 Applications
A.4 Conclusion

GET THE COMPLETE PROJECT

Related Posts