NUMA Architectures: Topologies and Management Solutions

NUMA architectures constitute an important class of parallel architectures. These large-scale systems organize the physically shared memory across several nodes connected through cache-coherent high-performance links. With such a design, a thread accesses memory resources located either on the same node as the core executing it (the local node) or on a different node (a remote node) (cf. Figure 4.1). Consequently, uncontrolled data placement can yield traffic contention when all threads access the same memory bank. Furthermore, remote data accesses increase memory latencies.
NUMA systems were introduced as a cure to the scalability issues of symmetric multiprocessors with Uniform Memory Access (UMA). Unfortunately, NUMA-unaware programs running on NUMA platforms tend not to benefit from the additional computational power and memory bandwidth offered by the multiple nodes of the system. Figures 4.2 and 4.3 depict two different 2-node NUMA core distributions: the Pau machine follows a cyclic distribution of its 32 cores (including hyperthreading), whereas the Taurus machine follows a block distribution of its 24 cores (without hyperthreading). Different types of management solutions exist, ranging from operating system features to research contributions in the area of parallel programming languages.

Case Study: Beyond Loop Optimizations for Data Locality

The purpose of this section is to highlight the benefits of integrating NUMA-awareness in optimizing compilers for data locality. As a proof of concept, Pluto [39] is our demonstration framework. Pluto is based on the polyhedral model, intensively used for the optimization of loop nests that fit the model’s constraints (and many numerical applications do). Today, polyhedral tools are capable of producing highly optimized parallel code for multicore CPUs [39], distributed systems [37], GPUs [137] or FPGAs [103]. It is, however, interesting to note that NUMA optimizations are not applied. In the tool flow of our demonstration framework [130], we clearly separate the optimizations handled by Pluto from those we introduce: Pluto is in charge of all control-flow optimizations and parallelism extraction, whereas our post-pass implements data placement on NUMA systems. Furthermore, as we did not emphasize layout transformations in the previous chapter, we take this opportunity to introduce them (more specifically, transpositions) along with NUMA optimizations. Indeed, layout transformations are often not first-class citizens in polyhedral tools. As Pluto outputs may be complex, we handle affordable cases.

Reproducing Loop Optimization Paths

For the tensor kernels from Table 5.1 we have identified the fastest program variants that can be generated with Pluto by manipulating its heuristics for loop fusion, tiling, interchange, vectorization, and thread parallelism. Table 5.3 lists the TeML equivalents of the loop optimization paths that caused Pluto to generate the fastest programs. Note that the TeML transformations vectorize and parallelize have been implemented with compiler-specific pragmas for vectorization and OpenMP pragmas for thread parallelization. The stencil kernel blur is not included in Table 5.3 since Pluto’s best optimization path for this kernel performs loop skewing, which cannot yet be expressed in TeML. Also, as standard matrix multiplication has been thoroughly studied, our analysis focuses on bmm, which presents a less conventional computation pattern.
Since the sequences of TeML loop transformations in Table 5.3 reproduce the effects of Pluto’s optimizations, C programs generated either from TeML or Pluto for the kernels listed in Table 5.3 have equal execution times. For space reasons we have omitted plots of these execution times since they would only show relative speed-ups of 1.0 (within measurement accuracy) between Pluto and TeML.

Table of Contents
1 Introduction 
1.1 Parallel Architectures, Programming Languages and Challenges
1.2 Research Context
1.3 Thesis Contributions and Outline
2 Intermediate Representation for Explicitly Parallel Programs: State-of-the-art 
2.1 Intermediate languages
2.2 Program Representations Using Graphs
2.3 The Static Single Assignment Form
2.4 The Polyhedral Model
2.5 Discussion
2.5.1 Observations
2.5.2 Perspectives
3 The Tensor Challenge 
3.1 Numerical Applications
3.2 Computational Fluid Dynamics Applications: Overview
3.3 CFD-related Optimization Techniques
3.3.1 Algebraic Optimizations
3.3.2 Loop Transformations
3.4 Envisioned Tool Flow
3.5 Existing Optimization Frameworks
3.5.1 Linear and Tensor Algebra Frameworks
3.5.2 Levels of Expressiveness and Optimization Control
3.6 Outcomes
4 The NUMA Challenge 
4.1 NUMA Architectures: Topologies and Management Solutions
4.1.1 Operating Systems
4.1.2 NUMA APIs
4.1.3 Language Extensions
4.2 Case Study: Beyond Loop Optimizations for Data Locality
4.2.1 Experimental Setup
4.2.2 Observations
4.3 Refining NUMA Memory Allocation Policies
4.3.1 Thread array regions and page granularity
4.3.2 Implementation
4.3.3 Limitations
4.4 Data Replications Implementation
4.4.1 Conditional Branching
4.4.2 Replication Storage
4.5 On Run-time Decisions
4.6 Outcomes
5 TeML: the Tensor Optimizations Meta-language 
5.1 Overview
5.2 Tensors
5.2.1 N-dimensional Arrays
5.2.2 Compute Expressions
5.2.3 Tensor Operations
5.2.4 Initializations
5.3 Loop Generation and Transformations
5.4 Memory Allocations
5.5 Implementation and Code Generation
5.6 On Data Dependencies
5.6.1 Dependency Checks using Explicit Construct
5.6.2 Decoupled Management
5.7 Evaluation
5.7.1 Expressing Tensor Computations
5.7.2 Reproducing Loop Optimization Paths
5.7.3 Performance
5.8 Conclusion
6 Formal Specification of TeML 
6.1 Formal specification
6.1.1 Domains of trees
6.1.2 State
6.1.3 Valuation functions
6.2 Compositional definitions
6.2.1 Loop transformations
6.2.2 Tensor operations
6.3 Range Inference
6.4 Towards Type Safety
6.5 Conclusion
7 Conclusion and Perspectives 
Bibliography
