DESIGN OF A COLLECTIVE COMMUNICATION INFRASTRUCTURE FOR BARRIER SYNCHRONIZATION IN CLUSTER-BASED NANOSCALE MPSOCS

Get Complete Project Material File(s) Now! »

POSIX Threads (PThreads)

PThreads is a standardized C language interface described by the IEEE POSIX 1003.1c standard [IEE16] that specifies a set of thread APIs to do thread synchronization and management. PThreads procedures can be organized into four major groups [Bar19b]: (i) thread management; (ii) mutexes; (iii) condition variables; and (iv) synchronization (rwlocks, barriers). We focus on the last three groups, as they are responsible for dealing with data synchronization. In addition, we assume the default behavior provided by PThreads (i.e., no particular attribute is used). A mutex is useful for protecting shared data from concurrent access. A mutex has two possible states: unlocked (not owned by any thread) and locked (owned by one, and only one, thread). The mutex procedure group contains locking and unlocking. Locking is a blocking procedure that exclusively locks a variable. If the variable is already locked, the calling thread is suspended; otherwise, the operation returns with the variable locked by the calling thread. Unlocking is a non-blocking procedure that changes the variable state and, if there are any waiting threads, wakes up previously blocked procedures. If the developer prefers the thread to spin on the lock instead of suspending it, it may use a spinlock, which has the same behavior as the mutex.
The condition procedure group contains: wait, signal, and broadcast. Wait is an unconditionally blocking procedure that puts the thread on a waiting list for a condition event. It requires that designers previously locked a mutex variable and passed a reference to it. Then, the wait procedure unlocks the mutex once it has finished working. Next, when the thread is woken up, the wait procedure reacquires the mutex. The signal and broadcast are non-blocking procedures that wake up one and all threads respectively waiting for a condition event. Mutex, in these cases, is optional.
The last procedure group comprises barriers and rwlocks. The barrier procedure group contains a single blocking procedure, called wait, which synchronizes participating threads at a user-specified code point. A barrier has a fixed number of threads decided at allocation time. When all participating threads reached the barrier, all threads are woken up. Rwlocks has a similar behavior as a standard lock; however, it differentiates readers from writers. Multiple readers can access the shared data, while only one writer is allowed to modify it. Also, no reader can access the data while there is a writer thread. Both GNU’s Not Unix! (GNU) LibC1 and FreeBSD LibC utilize operating system calls to do sensitive operations such as putting threads to sleep; however, they operate mainly in user space, as shifting to kernel space may incur performance penalties.

Open MultiProcessing (OpenMP)

OpenMP is an API specification for shared-memory parallel applications. The API supports C, C++, and Fortran for multiple architectures. OpenMP uses a fork-join model of parallel execution, as shown in Figure 3.1. All OpenMP programs start as a single thread called the master thread. It executes sequentially until the first parallel region is encountered. Then, the master thread creates multiple threads to handle the parallel work. The master thread waits for all other threads to finish and then continues to execute sequentially; this process can be repeated arbitrarily [Ope15]. Besides procedures, OpenMP relies on compiler directives to control the application behavior use either a mutex lock or a procedure provided by the compiler. GNU Compiler Collection (GCC) and Intel C++ Compiler (ICC) use libgomp and libomp respectively to accomplish the OpenMP specification [GCC19c] [Int19b]. There are two methods for libgomp to handle synchronization: (i) PThreads (Section 3.1.1) if it is available; (ii) own synchronization primitives implementation. Libomp does not use PThreads and implements its synchronization primitives.

Threading Building Blocks (TBB)

Threading Building Blocks (TBB) is an Intel library for parallel applications developed in C++ for the x86 architecture. It offers task-based parallelism that abstracts some of the threading mechanisms. Instead of compiler directives, as done by OpenMP, TBB uses generic programming to fit the object-oriented/template-based programming style of C++ better [Int19a]. Moreover, TBB provides concurrent-friendly data structures for the developer. Hence, the structure handles the synchronization process by itself, either by fine-grained locking or lock-free algorithms [Rei07]. For synchronization, TBB provides the same three basic synchronization primitives as PThreads: mutex, barrier, and condition. Yet, barriers are executed implicitly in template calls and implemented with an additional task with the sole propose of synchronization. TBB implements conditions with the same characteristics and restrictions as described in PThreads (Section 3.1.1). TBB provides several types of primitives for mutexes with contrasting behavior. Table 3.1 shows the traits of some of the mutex types available in TBB, which can be described by the following features [Rei07] [Int19d]:
Scalable2 – A scalable mutex is one that does no worse than forcing single-threaded performance. A mutex can perform worse than serialize execution if it consumes 2As stated by the official manual: « In a strict sense, this is not an accurate name, because a mutex limits execution to one thread at a time » [Int19d].
excessive processor cycles/memory bandwidth. Scalable mutexes are often slower than non-scalable ones under light contention.
Fair – Mutexes can be fair or unfair. A fair mutex lets tasks through in the order they arrive, meaning that fair mutexes avoid starving tasks. On the other hand, unfair mutexes can be faster because they let tasks that are currently running through first, instead of the queued task, which may be sleeping.
Sleeps – Mutexes can cause a task to spin in user space or sleep in kernel space while it is waiting. Spinning is undesirable for long periods, as it consumes multiple processor/cache cycles. For short waits, spinning is faster than sleeping, because putting and waking up tasks takes multiple cycles.

Reordering Constraints for PThread-style Locks

Boehm [Boe07] refines the PThread specification to propose a simpler set of clear and uncontroversial rules, allowing the reordering of memory operations for lock synchroniza-tion primitives. These rules have a significant performance impact since memory barriers typically limit reorder operations. In addition, the author identifies a class of compiler trans-formations that can also increase the performance. For these gains, the author proposes a new subset programming language based on C. The objective to propose such language is to verify the correctness of reordering rules under a simplified version of the C language.
The author justifies the proposition of a new language based on the impact of memory barrier operations. Boehm points out the following scenario: the cost of using lock operations can affect program execution time significantly, even when there is little lock contention. Since the locking cost is often strongly affected by, or even largely determined by, the number of memory barriers, the reduction of memory barrier calls is a worthwhile pursuit. The central question for the Boehm [Boe07] is thus: under what circumstances can load and store operations be moved into a critical section. His findings suggest that the reordering constraints are not symmetric for lock and unlock operations. Additionally, such behavior was not previously recognized as observed by the PThreads implementation examples provided in this paper.
Table 3.2 complements the information previously shown in Table 2.2 for the use of memory barrier on spinlocks. Boehm demonstrates with these tables that there is much confusion in regards to the correct use of memory barrier with the PThreads standard. Discrepancies in its use occur with the glibc and FreeBSD implementations. In addition, since these mistakes require a specific type of architecture, they may go unnoticed for a long time.

Optimization of the GNU OpenMP Synchronization Barrier in MPSoC

France-Pillois et al. [FPMR18] used an instrumented emulation platform to extract precise timing information regarding the use of synchronization barriers of the GNU OpenMP library (i.e., libgomp). They identified that an expansive function was uselessly being called during the barrier waking process. Thus, they propose a software optimization that saves up to 80% of the barrier release phase for a 16-core system. Moreover, as such a change is done at the library level, the optimization is legacy-code compatible.
The evaluation was carried out on the TSAR manycore architecture that supports shared-memory applications. The architecture is organized in four clusters, and each cluster contains four MIPS with a private L1 cache and a shared L2 cache. Figure 3.4 shows the release phase delays by thread arrival. The simulation is a simple for loop executed over 400 times. The X-axis represents the threads in order of release, and the Y-axis represents the instant the thread leaves the barrier to resume its nominal execution flow. Figure 3.4a illustrates that the original version takes up to 13194 cycles to complete the barrier release process and that a single thread is especially delayed compared to others. Such behavior was the motivator behind the study performed by the authors. The release phase for the GNU OpenMP library is comprised of two phases: active and passive. When a thread calls the OpenMP barrier, it spins on a memory value that records the number of threads that have arrived on the barrier. This is named the active phase, as it is occupying the core unit. The barrier is only release when that memory value is equal to the user-specified limit. In the example shown in Figure 3.4, the limit is 16. After a specified time, the comparison is stopped, and the thread is put to sleep on a waiting list. Hence, polling is done, and the thread will be woken up by the last arriving thread. This is called the passive phase. The authors noted that the wake-up procedure was being called even in cases where no threads were sleeping. The for-loop used for this work is an example of an application where the threads will normally not be put to sleep, as the application is well balanced. Even when no thread had to be woken up, the time spent in the wake-up procedure was about 12891 cycles, about 97.7% of the whole release procedure for 16 threads. Hence, an optimization was proposed to decrease the overhead of the release procedure, as shown in Figure 3.4b. Table 3.3 show the gains on the full release procedure for TSAR and Alpha architec-tures. The gains decrease as the number of CPU increase on TSAR, while the gains remain the same on Alpha regardless of the number of CPU, as in the latter case, the cycle latency is also the same.

Table of contents :

1 INTRODUCTION
1.1 CONTEXT
1.2 MOTIVATION
1.3 PROBLEM STATEMENT AND THESIS CONTRIBUTIONS
1.4 DOCUMENT STRUCTURE
2 DATA SYNCHRONIZATION IN PARALLEL APPLICATIONS.
2.1 SYNCHRONIZATION IN UNIPROCESSOR SYSTEMS
2.2 SYNCHRONIZATION IN MULTIPROCESSOR SYSTEMS
2.2.1 MEMORY AND COMPILER BARRIERS
2.2.2 LOCK-BASED APPLICATIONS
2.2.3 LOCK-FREE APPLICATIONS
3 RELATED WORK
3.1 SOFTWARE-ORIENTED SOLUTIONS
3.1.1 POSIX THREADS (PTHREADS)
3.1.2 OPEN MULTIPROCESSING (OPENMP)
3.1.3 THREADING BUILDING BLOCKS (TBB)
3.1.4 READ-COPY-UPDATE (RCU)
3.1.5 REORDERING CONSTRAINTS FOR PTHREAD-STYLE LOCKS
3.1.6 OPTIMIZATION OF THE GNU OPENMP SYNCHRONIZATION BARRIER IN MPSOC
3.2 HARDWARE-ORIENTED SOLUTIONS
3.2.1 HARDWARE TRANSACTIONAL MEMORY (HTM)
3.2.2 A HARDWARE IMPLEMENTATION OF THE MCAS SYNCHRONIZATION PRIMITIVE
3.2.3 CASPAR: BREAKING SERIALIZATION IN LOCK-FREE MULTICORE SYNCHRONIZATION
3.2.4 DESIGN OF A COLLECTIVE COMMUNICATION INFRASTRUCTURE FOR BARRIER SYNCHRONIZATION IN CLUSTER-BASED NANOSCALE MPSOCS
3.2.5 NOTIFYING MEMORIES: A CASE-STUDY ON DATA-FLOW APPLICATION WITH NOC INTERFACES IMPLEMENTATION
3.3 DISCUSSION
3.3.1 THE CHOICE OF PTHREADS
3.3.2 SUBUTAI COMPATIBILITY WITH OTHER LEGACY-CODE COMPATIBLE SOLUTIONS
4 SUBUTAI SOLUTION
4.1 TARGET ARCHITECTURE
4.2 SUBUTAI HARDWARE (SUBUTAI-HW)
4.2.1 SUBUTAI-HW RTL IMPLEMENTATION AND VERIFICATION
4.3 SUBUTAI SOFTWARE
4.3.1 USER SPACE PTHREADS LIBRARY
4.3.2 KERNEL SPACE FUTEX
4.3.3 SUBUTAI IMPLEMENTATION
5 SUBUTAI EXTENSIONS
5.1 CRITICAL-SECTION AWARE (CSA) SCHEDULING POLICY
5.1.1 MOTIVATION
5.1.2 BASELINE SCHEDULER DESIGN
5.1.3 APPLICATION EXAMPLE
5.1.4 DESIGN AND IMPLEMENTATION CHOICES
5.2 NEOCONDITION
5.2.1 MOTIVATION
5.2.2 CONDITION USAGE EXAMPLES FROM THE PARSEC BENCHMARK
5.2.3 DESIGN AND IMPLEMENTATION CHOICES
5.2.4 THE POSITIVE AND NEGATIVE ATTRIBUTES OF NEOCONDITION
6 EXPERIMENTAL RESULTS
6.1 EXPERIMENTAL SETUP
6.2 PARSEC – BENCHMARK SUITE FOR MULTIPROCESSING
6.2.1 BODYTRACK COMMUNICATION MODEL
6.2.2 STREAMCLUSTER COMMUNICATION MODEL
6.2.3 FACESIM COMMUNICATION MODEL
6.2.4 X264 COMMUNICATION MODEL
6.3 EXPERIMENTAL RESULTS
6.3.1 SUBUTAI’S AREA CONSUMPTION AND STATE-OF-THE-ART COMPARISON
6.3.2 PARSEC EXPERIMENTAL RESULTS
6.3.3 MICRO-BENCHMARK
7 CONCLUSIONS.
7.1 CONTRIBUTIONS OF THIS WORK
7.1.1 OTHER CONTRIBUTIONS
7.2 DISCUSSION AND FUTURE WORK
7.2.1 PSY: SYNTHETIC DATA SYNCHRONIZATION COMMUNICATION CREATOR .
7.2.2 ENERGY-AWARE DESIGN EXPLORATION FOR SUBUTAI-HW
7.2.3 BARRIER-AWARE POLICY FOR SCHEDULERS INTENDED FOR PARALLEL
7.2.4 SW-ONLY NEOCONDITION IMPLEMENTATION
7.2.5 QUEUE OPTIMIZER: SCHEDULER-AWARE HARDWARE QUEUE
REFERENCES