Task Based Programming Models where Tasks run on Distributed Memory

Get Complete Project Material File(s) Now! »

Timeline of Task Based High Performance Computing

HPC or High Performance Computing is a computer science field which consists in aggregating computing power in order to obtain higher performances than a regular desktop computer. This power is currently achieved by connecting several computers to form a cluster for the small cases or a supercomputer for the very large cases. In the current supercomputers, the computing resources are connected to allow mixing parallel computing (by taking advantage of multi-cores processors and large amount of processors) and distributed computing (the processor memory caches are not directly connected and must be accessed by performing communications through the network). Initially, the increase in performance of an application was obtained from the improvement of the processors used to run the application and especially from the increase of the frequency of the processors. Indeed, the processors were becoming faster with each new generation and thus, the applications ran faster on them. Then, the frequency became impossible to increase without the processors producing too much heat in the early 2000s [14]. Therefore, either the processors had to be cooled at the cost of more power and infrastructure or setting the frequency of processors to a reasonable temperature so that the processors do not produce an excessive amount of heat and shut down so that they do not melt. Moreover, the data transfer speed between the memory and the processor lowest cache is slower than the time needed by the cores to use it. Thus, the frequency of processors has stabilized since then. At this point, high performance application developers could not rely on the improvement of individual processors to achieve performance. Instead of using one processing unit to run applications, several of them could be used at the same time to run multiple operations at the same time and create faster applications. This introduced the use of parallelism in order to achieve high performances on the HPC computing resources. Using multiple computing resources at the same time was not new since it was already researched in the 1980s with the Connection Machine [15] for instance. The change from the use of single core processors to the use of several multi-core processors also made a shift in the programming models used to run eﬃcient applications on those architectures. Indeed, using several processors means that data from one processor has to be made available somehow to the other processors if they need this piece of data during their computation. Usually, this is done by sending message containing the data through the network connecting the processors between themselves. This greatly changed the way to implement applications for supercomputers since the developers also have to manage the data mapping on the diﬀerent computing resources and the eventual data migrations to perform in order to obtain the intended results.
Parallel and distributed programming models such as Message Passing Interface (MPI) [1], Parallel Vir-tual Machine (PVM) [16], Linda [17] [18] and P4 [19], which are based on message-passing, made their apparition in the early 1990s to address communications between several processors of supercomputers. In this programming model, the application uses several processes to make multiple computations at the same time on diﬀerent cores and uses MPI or PVM to send data from one process to another one. They pro-vide point-to-point and collective communication operations to help the developers to reorganize, migrate or perform operations on their distributed data. However, developing a parallel high performance application with such a programming model requires the knowledge of parallel and distributed programming as well as the specificities of the targeted hardware. It takes time and may not be portable to other architectures without investing more time to make the structural modifications that may be necessary, for instance, to run on GPUs. An alternative that aims to reduce those costs and eﬃciently use the available hardware is task based programming models. Moreover, MPI may not be a solution eﬃcient enough on exascale machines, especially in terms of fault tolerance and check-pointing [4]. Fault tolerance allows applications to manage hardware or software errors on computing resources either by continuing to run the application without this resource if possible or by cleanly stopping the application. Check pointing consists in saving a snapshot of the application a regular interval, so that applications can restart from the last snapshot in case of failure or stop. They are commonly used in long running applications to avoid re-running the application from the start in case something happens. It can also be used to see the evolution of data during the run of the application. For instance, this could be used to check the evolution of a simulation or a neural network training.
Besides, global operations like reductions, gathers and broadcasts are very expensive due to the high number of resources partaking in the operation and the cost of sending information to distant resources on the network connecting the nodes. Task-based approach can help in managing fault tolerance and check-pointing since the tasks could be restarted on another location and data from tasks saved at any moment. Tasks allow to separate the expression of the parallelism from its parallel implementation by letting the developer express the tasks and their dependencies while the runtime of the programming models tries to run as much tasks as possible at the same time, respects the dependencies and tries to obtain the best performances possible. This means that application experts can express algorithms through graph of tasks without being required to understand the hardware in detail. Furthermore, the task-based approach can help to eliminate large scale collective communications by encapsulating them inside tasks. Then, these tasks can run on a subset of the resources allocated to the application and execute collective communications
on a smaller scale. The graph of tasks can be eﬃciently scheduled so that the execution of tasks optimizes data migrations, IOs tasks and data check-pointing. In task based programming models, data can only be exchanged between tasks as their input and output parameters as opposed to exchanging data during the execution of the task. Therefore the algorithms using collective communications have to be redesigned in order to avoid them or rewrite them as task operations.
Several programming models which support the usage of tasks have appeared over the years. They implemented the task definition and management in diﬀerent ways and have their unique features. Some of them will be introduced in the following section.

Current Task Based Programming Models

A task can be defined as an atomic set of operations, with a defined set of data as input and output, which can be asynchronously executed while enforcing data and/or control dependencies. This section introduces programming models supporting the definition and the scheduled execution of tasks.

Task Based Programming Models on Shared Memory

First, task based programming models that are designed to work with shared memory architectures are introduced. Usually, the programming models are based on multi-threading where threads can be considered as workers which can execute the tasks. The scope of the task is limited to the thread although QUARK [20] supports multi-threaded tasks. Some of the programming models introduced here can also oﬄoad data to accelerators like GPUs and execute tasks on them. TensorFlow [9] is introduced in this dissertation since it based on tasks. However, tasks are limited to algebra and tensor operations implemented by the TensorFlow developer team since TensorFlow is Artificial Intelligence and Deep Learning oriented. PyTorch [21] and MindSpore 1 are alternative frameworks to Ten-sorFlow with similar interfaces and functioning. They will not be covered in this dissertation since they have concepts similar to TensorFlow and are only providing Artificial Intelligence based tasks.

Cpp-Taskflow

Cpp-Taskflow [22] [23] is a C++ parallel programming library based on the task dependency graph model. In Cpp-Taskflow, a tasks is an instance of the C++ Callable object on which the operation std::invoke is applicable and is used to run the tasks. Then, they can be declared into a taskflow object from the class tf::Taskflow which allow to create a task dependency graph and schedule them for execution. The user can express task dependency with the method precede applied on the task handler returned from the task creation. Tasks are executed through tf::Executor which runs the taskflow and executes the tasks on threads through a work-stealing algorithm. With work-stealing algorithms, workers without tasks to execute can steal tasks scheduled in other worker queues which respects dependencies in order to execute as much tasks in parallel as possible. Cpp-Taskflow has the feature of dependency graph composition. It allows the user to create dependency graphs and reuse them to compose larger graphs. It is also possible to make recursive and nested compositions [24]. Taskflow objects can be composed multiple times and the result can also be composed. Therefore, it allows to easily create large and complex parallel workloads.

OpenMP

OpenMP [2] is an API which provides a portable and scalable model to develop shared memory parallel applications. OpenMP is based on a fork-join programming model where the parallel regions and loops are specified by pragmas. Team of threads are spawned during parallel regions. Parallel loops are split and mapped on multiple threads so that several iterations of the loop can be executed at the same time. Tasks were introduced in OpenMP 3 [25] in which tasks can be created with the pragma omp task and synchronized with the pragmas omp taskwait and omp barrier. In OpenMP 4, the task model was extended with data dependencies. It introduced keywords for data dependencies : in for consumed data, out for produced data and inout for data that will be modified during the task. It allows lock-less and more fine-grained synchronizations between tasks.

QUARK

QUeueing And Runtime for Kernels (QUARK) [20] is a runtime environment designed to schedule and execute applications that consist of precedence-constrained kernel routines on shared memory systems. The main principle behind QUARK is the implementation of the dataflow model where the scheduling depends on data dependencies between tasks (routines) in a task graph. The routines have parameters as input and output for computations which are used by QUARK to form an implicit DAG connecting the routines. Each parameter has to be extended by its size and its access mode (in, out, inout) depending if the data are consumed, produced or modified by the routine. This allows QUARK to enforce the data dependencies between the routines while preserving the nature of the original code. QUARK also supports multi-threaded tasks in order to manage large bottleneck tasks which would lock the computations by needing a large portion of the data as parameters.

Block-Based Gaussian Elimination to Solve Linear Systems

The scalar Gaussian elimination algorithm with pivoting is discussed in [76] and [77]. The pivoting increase the numerical stability of the operations. However, in the case of block algorithms with a generalization of the scalar operations into a matrix operation, pivoting the rows of sub-matrices means that we need to find a way to determine which matrix could be used as a pivot. Moreover, the increase in numerical stability of such method is unknown. This topic is not discussed in this work. Furthermore, the algorithm which is used as base for generalization in the LU factorization does not use pivoting on the sub-matrices. Although, pivoting may be used to improve the numerical stability in the inversion of the sub-matrices. Algorithm 5 introduces a scalar Gaussian elimination without pivoting [72] which will be generalized into a block algorithm in Algorithm 6. These two algorithms are used to solve a linear system and include the back substitution which allow to compute the solution of the linear system.

Block-Based Gauss-Jordan Elimination to Solve Linear Systems

The Gauss-Jordan elimination [78] [79] is similar to the Gaussian elimination except that the computations made under the diagonal are also performed above it while directly outputting the solution of the linear system. Algorithm 7 is a scalar Gauss-Jordan elimination algorithm without pivoting [79]. It is indeed very similar to Algorithm 5 which is the Gaussian elimination with the mentioned modifications. This algorithm is also converted into a generalized block Gauss-Jordan elimination [80] in Algorithm 8.
The Gauss-Jordan elimination has more block operations than the block Gaussian elimination and the block LU factorization but the critical path of the block operations is even shorter than the Gaussian Elimination since there is no solution of triangular systems. Although, there are more block operations than the two other block algorithms but the critical path is shorter. This means that there are more block operations that can be performed in parallel during the elimination. This also means that this algorithm may be able to run faster than the two other block algorithms on a parallel machine.
In this section, we introduced the diﬀerent meaning of block algorithm : the algorithms where the scalar operations are generalized into matrix operations or algorithms where the operations are reorganized in order to perform matrix operations (usually matrix product) on subset of the initial matrix. Moreover, we introduced the scalar and block algorithms for the LU factorization, the Gaussian elimination and the Gauss-Jordan elimination.

Table of contents :

Introduction
1.1 Motivations
1.2 Objectives and Contributions
1.3 Outline
2 Task Based High Performance Computing
2.1 Timeline of Task Based High Performance Computing
2.2 Current Task Based Programming Models
2.2.1 Task Based Programming Models on Shared Memory
2.2.2 Task Based Programming Models on Distributed Memory
2.2.3 Task Based Programming Models where Tasks run on Distributed Memory
2.3 Exascale Challenges of Supercomputers
3 Methods
3.1 Dense Linear Algebra
3.1.1 Block-Based LU factorization to Solve Linear Systems
3.1.2 Block-Based Gaussian Elimination to Solve Linear Systems
3.1.3 Block-Based Gauss-Jordan Elimination to Solve Linear Systems
3.2 Sparse Linear Algebra
3.2.1 Sparse Matrix Storage and Sparse Matrix-Vector Multiplication
3.2.2 Parallelization
3.2.3 Optimizations
3.2.4 Test Matrices
3.3 Kirchhoff seismic pre-stack depth migration
3.3.1 Overview
3.3.2 Velocity model
3.3.3 Data gathering
3.3.4 Green functions
3.3.5 Kirchhoff migration
3.3.6 Analysis of the output by the geophysicists
4 Languages
4.1 Message Passing Interface
4.2 Task Based Programming Models
4.2.1 PaRSEC
4.2.2 Legion
4.2.3 Regent
4.2.4 TensorFlow
4.2.5 HPX
4.3 Parallel and Distributed Task Based Programming Models
4.3.1 YML+XMP
4.3.2 Pegasus
4.3.3 Swift
4.4 Analysis and First Evaluation of the Languages
4.5 Application Deployment with Containers
5 Task-Based, Parallel and Distributed Dense Linear Algebra Applications
5.1 Task-Based Graphs of Methods to Solve Dense Linear Systems
5.1.1 Block-Based Gaussian Elimination
5.1.2 Block-Based Gauss-Jordan Elimination
5.1.3 Block-Based LU factorization
5.1.4 Block-Based Resolution of Block Triangular Systems
5.2 Usage of YML+XMP and Experimentations
5.2.1 Experiments on the K computer
5.2.2 Experiments on Poincare, a cluster from La Maison de la Simulation
5.2.3 Prediction of the optimal parameters
5.2.4 Results summary
5.3 Several Graph Based Language to Compute the LU factorization
5.3.1 Experiments details
5.3.2 Performances
5.3.3 Strong scaling
5.3.4 Results summary
5.4 Synthesis and Perspectives
6 Task-Based, Parallel and Distributed Sparse Linear Algebra Applications
6.1 Task-Based Methods for Parallel and Distributed Algorithms
6.1.1 Data Distribution
6.1.2 Tasks Definition
6.1.3 Task-Based Parallel Algorithms
6.2 Numerical Experiments
6.2.1 Application Description
6.2.2 Results and Analyses on Total Petascale Pangea II
6.3 Synthesis and Perspectives
7 Task-Based, Parallel and Distributed Kirchhoff Seismic Pre-Stack Depth Migration Application
7.1 Algorithms
7.1.1 Basic Algorithm
7.1.2 Parallelism
7.1.3 Task algorithm
7.1.4 Use of GPUs
7.1.5 Pre-fetching and spreading the data
7.2 2D Implementation
7.2.1 C Kernel Description
7.2.2 Distributed and Parallel Applications
7.3 2D Numerical Experiments
7.3.1 Strong Scaling
7.3.2 Weak Scaling
7.3.3 Variation of the Number of OpenMP Threads
7.4 Synthesis and Perspectives
8 Taxonomy of Task-Based Programming Models and Recommendations
8.1 Taxonomy
8.1.1 Task Capabilities
8.1.2 Task and Data Management
8.1.3 Programming Model Features
8.2 Taxonomy Summary
8.3 Analyze and Recommendations
8.3.1 Adapted Programming Model to Algorithm Granularity
8.3.2 Data Migrations
8.3.3 Encapsulated Tasks
8.3.4 Dependencies Expression
8.3.5 Dynamical Task Scheduling
8.3.6 High Level Languages
8.3.7 Fault Tolerance
8.3.8 Check-Pointing
8.3.9 Multi-Level Programming
8.3.10 Collective Operations
8.4 Preliminary Methodology for Post-Petascale Programming
8.5 Synthesis and Perspectives
9 Conclusion and Perspectives
9.1 Conclusion
9.2 Future Researches
References