Concepts in Task Parallel Programming Languages 


Perspectives: Bridging Parallel Architectures and Software Parallelism

You talk when you cease to be at peace with your thoughts. Kahlil Gibran
“Anyone can build a fast CPU. The trick is to build a fast system.” Attributed to Seymour Cray, this quote is even more pertinent when looking at multiprocessor systems that contain several fast processing units; parallel system architectures introduce subtle system features that must be handled well to achieve good performance. Real-world applications, which operate on large amounts of data, must also be able to deal with constraints such as memory requirements, code size and processor features. These constraints must likewise be addressed by parallelizing compilers, which, especially in the domains of scientific, signal and image processing applications, strive to translate sequential codes into efficient parallel ones. The growing intricacy of hardware increases the importance of the software that manages it in order to achieve adequate performance.
This thesis was carried out under this perspective. Our goal is to develop a prototype for automatic task parallelization that generates a parallel version of an input sequential code using a scheduling algorithm. We also address parallel code generation and parallel intermediate representation issues. More generally, we study how parallelism can be detected, represented and finally generated.

Parallel Architectures

Until 2003, machine performance was improved mainly by increasing clock frequencies. Yet relying on single-thread performance has become increasingly disappointing, in particular because of the access latency to main memory. Moreover, other aspects of performance have gained importance lately, such as power consumption, energy dissipation and number of cores. For these reasons, the development of parallel processors, with a large number of cores that run at lower frequencies to reduce energy consumption, has been adopted and has had a significant impact on performance.
The market of parallel machines is wide, ranging from vector platforms, accelerators, Graphics Processing Units (GPUs) and dual-core processors to manycore chips with thousands of on-chip cores. However, the gains obtained when parallelizing and executing applications are limited by input-output issues (disk reads/writes), data communications among processors, memory access time, load imbalance, etc. When parallelizing an application, one must detect the available parallelism and map the different parallel tasks onto the parallel machine. In this dissertation, we consider a task as a static notion, i.e., a list of instructions, while processes and threads (see Section 2.1.2) are running instances of tasks.
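To make this distinction concrete, the following minimal C sketch (our own illustration, not code from the thesis prototype; the function and variable names are hypothetical) shows a task as a static list of instructions and two POSIX threads as running instances of that same task.

  #include <pthread.h>
  #include <stdio.h>

  /* The task: a static list of instructions, written once in the program text. */
  static void *task_body(void *arg)
  {
      int id = *(int *) arg;
      printf("instance %d executing the task\n", id);
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      int id1 = 1, id2 = 2;

      /* Two threads: two running instances of the single static task. */
      pthread_create(&t1, NULL, task_body, &id1);
      pthread_create(&t2, NULL, task_body, &id2);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }

The same static task could equally be instantiated as operating-system processes; the point is that the task itself is just program text, while its instances exist only at run time.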
In our parallelization approach, we make trade-offs between the available parallelism and communications/locality; this latter point is related to the memory model of the target machine. Moreover, the extracted parallelism should be correlated to the number of processors/cores available in the target machine. We also take into account the memory size and the number of processors.

Multiprocessors

The generic model of multiprocessors is a collection of cores, each potentially including both a CPU and memory, attached to an interconnection network. Figure 2.1 illustrates an example of a multiprocessor architecture that contains two clusters; each cluster is a quad-core multiprocessor. In a multicore, cores can transfer data between their cache memories (generally the level 2 cache) without having to go through the main memory. Moreover, they may or may not share caches (the CPU-local level 1 cache in Figure 2.1, for example). This is not possible at the cluster level, where processor cache memories are not linked: data can be transferred to and from the main memory only.
Flynn [46] classifies multiprocessors into three main groups: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). Multiprocessors are generally recognized to be MIMD architectures. We present the important branches of current multiprocessors, MPSoCs and multicore processors, since they are widely used in many applications such as signal processing and multimedia.
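As a brief illustration of the SIMD/MIMD distinction (a sketch of ours, assuming an OpenMP compiler; it is not taken from the thesis), the first loop below applies a single instruction stream to multiple data elements, while the parallel sections run independent instruction streams on independent data.

  #include <stdio.h>
  #include <omp.h>

  #define N 8

  int main(void)
  {
      double a[N], b[N], c[N];
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

      /* SIMD style: one instruction stream applied to multiple data elements. */
      #pragma omp simd
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];

      /* MIMD style: distinct instruction streams working on distinct data. */
      double sum = 0.0, maxval = a[0];
      #pragma omp parallel sections
      {
          #pragma omp section
          {
              for (int i = 0; i < N; i++) sum += c[i];
              printf("sum of c = %.1f (thread %d)\n", sum, omp_get_thread_num());
          }
          #pragma omp section
          {
              for (int i = 0; i < N; i++) if (a[i] > maxval) maxval = a[i];
              printf("max of a = %.1f (thread %d)\n", maxval, omp_get_thread_num());
          }
      }
      return 0;
  }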
Field-Programmable Gate Array (FPGA) platforms and the CELL processor [43] are examples of heterogeneous architectures. (The Cell configuration has one Power Processing Element (PPE) on the core, acting as a controller for eight physical Synergistic Processing Elements (SPEs).)

Multicore Processors

This model implements the shared memory model. The programming model in a multicore processor is the same for all its CPUs. The first general-purpose multicore processors were introduced around 2005 (e.g., the Intel Core Duo). Intel Xeon Phi is a recent multicore chip with 60 homogeneous cores. The number of cores on a chip is expected to continue to grow in order to build more powerful computers. This emerging hardware approach is expected to deliver petaflop and exaflop performance, with the capabilities needed to handle emerging high-performance applications efficiently. Table 2.1 illustrates the rapid proliferation of multicores and summarizes the important existing ones; note that IBM Sequoia, the second system in the Top500 list (top500.org) in November 2012, is a petascale Blue Gene/Q supercomputer that contains 98,304 compute nodes, where each compute node is a 16-core PowerPC A2 processor chip (sixth entry in the table). See [25] for an extended survey of multicore processors.
In this thesis, we provide tools that automatically generate parallel programs from sequential ones, targeting homogeneous multiprocessors or multicores. Since increasing the number of cores on a single chip creates challenges with memory and cache coherence, as well as with communication between the cores, our aim in this thesis is to deliver the benefits and efficiency of multicore processors through an efficient scheduling algorithm that maps the parallel tasks of a program onto CPUs, in the presence of resource constraints on the number of processors and their local memory size.
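As a rough illustration of what mapping tasks onto a bounded number of CPUs means (a generic greedy sketch of ours, not the BDSC scheduler developed later in this thesis; the task costs are hypothetical and dependences are ignored), each task below is simply assigned to the currently least-loaded of P processors.

  #include <stdio.h>

  #define P 2            /* bounded number of processors */
  #define NTASKS 5

  int main(void)
  {
      /* Hypothetical execution costs of five independent tasks. */
      double cost[NTASKS] = { 4.0, 2.0, 3.0, 1.0, 5.0 };
      double load[P] = { 0.0 };

      for (int t = 0; t < NTASKS; t++) {
          /* Pick the processor with the smallest accumulated load. */
          int best = 0;
          for (int p = 1; p < P; p++)
              if (load[p] < load[best])
                  best = p;
          load[best] += cost[t];
          printf("task %d -> processor %d\n", t, best);
      }

      for (int p = 0; p < P; p++)
          printf("processor %d total load: %.1f\n", p, load[p]);
      return 0;
  }

The scheduler presented in Chapter 5 differs from this toy sketch in that it also accounts for data dependences, communication costs and the local memory size of each processor.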


Memory Models

The choice of a proper memory model to express parallel programs is an important issue in parallel language design. Indeed, the way processes and threads communicate on the target architecture, and the way this shapes the programmer's specification of computations, affect both performance and ease of programming. There are currently three main approaches (see Figure 2.2, extracted from [64]).
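For instance (a minimal sketch of ours, not taken from the thesis, assuming an OpenMP compiler), in a shared memory model the threads below communicate simply by reading and writing the same array, whereas in a message-passing model the same data would have to be distributed and combined through explicit communication operations such as those provided by MPI (see Section 3.4.5).

  #include <stdio.h>

  #define N 1000

  int main(void)
  {
      double a[N], sum = 0.0;
      for (int i = 0; i < N; i++)
          a[i] = 1.0;

      /* Shared memory: all threads access the same array 'a';
         the reduction clause combines their partial sums. */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++)
          sum += a[i];

      printf("sum = %.1f\n", sum);
      return 0;
  }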

Table of contents :

Remerciements
Abstract
Introduction (en français)
1 Introduction 
1.1 Context
1.2 Motivation
1.3 Contributions
1.4 Thesis Outline
2 Perspectives: Bridging Parallel Architectures and Software Parallelism 
2.1 Parallel Architectures
2.1.1 Multiprocessors
2.1.2 Memory Models
2.2 Parallelism Paradigms
2.3 Mapping Paradigms to Architectures
2.4 Program Dependence Graph (PDG)
2.4.1 Control Dependence Graph (CDG)
2.4.2 Data Dependence Graph (DDG)
2.5 List Scheduling
2.5.1 Background
2.5.2 Algorithm
2.6 PIPS: Automatic Parallelizer and Code Transformation Framework
2.6.1 Transformer Analysis
2.6.2 Precondition Analysis
2.6.3 Array Region Analysis
2.6.4 “Complexity” Analysis
2.6.5 PIPS (Sequential) IR
2.7 Conclusion
3 Concepts in Task Parallel Programming Languages 
3.1 Introduction
3.2 Mandelbrot Set Computation
3.3 Task Parallelism Issues
3.3.1 Task Creation
3.3.2 Synchronization
3.3.3 Atomicity
3.4 Parallel Programming Languages
3.4.1 Cilk
3.4.2 Chapel
3.4.3 X10 and Habanero-Java
3.4.4 OpenMP
3.4.5 MPI
3.4.6 OpenCL
3.5 Discussion and Comparison
3.6 Conclusion
4 SPIRE: A Generic Sequential to Parallel Intermediate Representation Extension Methodology 
4.1 Introduction
4.2 SPIRE, a Sequential to Parallel IR Extension
4.2.1 Design Approach
4.2.2 Execution
4.2.3 Synchronization
4.2.4 Data Distribution
4.3 SPIRE Operational Semantics
4.3.1 Sequential Core Language
4.3.2 SPIRE as a Language Transformer
4.3.3 Rewriting Rules
4.4 Validation: SPIRE Application to LLVM IR
4.5 Related Work: Parallel Intermediate Representations
4.6 Conclusion
5 BDSC: A Memory-Constrained, Number of Processor-Bounded Extension of DSC 
5.1 Introduction
5.2 The DSC List-Scheduling Algorithm
5.2.1 The DSC Algorithm
5.2.2 Dominant Sequence Length Reduction Warranty
5.3 BDSC: A Resource-Constrained Extension of DSC
5.3.1 DSC Weaknesses
5.3.2 Resource Modeling
5.3.3 Resource Constraint Warranty
5.3.4 Efficient Task-to-Cluster Mapping
5.3.5 The BDSC Algorithm
5.4 Related Work: Scheduling Algorithms
5.4.1 Bounded Number of Clusters
5.4.2 Cluster Regrouping on Physical Processors
5.5 Conclusion
6 BDSC-Based Hierarchical Task Parallelization
6.1 Introduction
6.2 Hierarchical Sequence Dependence DAG Mapping
6.2.1 Sequence Dependence DAG (SDG)
6.2.2 Hierarchical SDG Mapping
6.3 Sequential Cost Models Generation
6.3.1 From Convex Polyhedra to Ehrhart Polynomials
6.3.2 From Polynomials to Values
6.4 Reachability Analysis: The Path Transformer
6.4.1 Path Definition
6.4.2 Path Transformer Algorithm
6.4.3 Operations on Regions using Path Transformers
6.5 BDSC-Based Hierarchical Scheduling (HBDSC)
6.5.1 Closure of SDGs
6.5.2 Recursive Top-Down Scheduling
6.5.3 Iterative Scheduling for Resource Optimization
6.5.4 Complexity of HBDSC Algorithm
6.5.5 Parallel Cost Models
6.6 Related Work: Task Parallelization Tools
6.7 Conclusion
7 SPIRE-Based Parallel Code Transformations and Generation 
7.1 Introduction
7.2 Parallel Unstructured to Structured Transformation
7.2.1 Structuring Parallelism
7.2.2 Hierarchical Parallelism
7.3 From Shared Memory to Distributed Memory Transformation
7.3.1 Related Work: Communications Generation
7.3.2 Difficulties
7.3.3 Equilevel and Hierarchical Communications
7.3.4 Equilevel Communications Insertion Algorithm
7.3.5 Hierarchical Communications Insertion Algorithm
7.3.6 Communications Insertion Main Algorithm
7.3.7 Conclusion and Future Work
7.4 Parallel Code Generation
7.4.1 Mapping Approach
7.4.2 OpenMP Generation
7.4.3 SPMDization: MPI Generation
7.5 Conclusion
8 Experimental Evaluation with PIPS 
8.1 Introduction
8.2 Implementation of HBDSC- and SPIRE-Based Parallelization Processes in PIPS
8.2.1 Preliminary Passes
8.2.2 Task Parallelization Passes
8.2.3 OpenMP Related Passes
8.2.4 MPI Related Passes: SPMDization
8.3 Experimental Setting
8.3.1 Benchmarks: ABF, Harris, Equake, IS and FFT
8.3.2 Experimental Platforms: Austin and Cmmcluster
8.3.3 Effective Cost Model Parameters
8.4 Protocol
8.5 BDSC vs. DSC
8.5.1 Experiments on Shared Memory Systems
8.5.2 Experiments on Distributed Memory Systems
8.6 Scheduling Robustness
8.7 Faust Parallel Scheduling vs. BDSC
8.7.1 OpenMP Version in Faust
8.7.2 Faust Programs: Karplus32 and Freeverb
8.7.3 Experimental Comparison
8.8 Conclusion
9 Conclusion 
9.1 Contributions
9.2 Future Work
Conclusion (en français)
