Communication Optimized Implementation with MPI

Get Complete Project Material File(s) Now! »

Timeline of HPC

The terms high-performance computing and supercomputing are sometimes used interchangeably. A supercomputer is a computing platform with a high level of performance, that consists of a large number of computing units connected by a local high-speed computer bus. HPC technology is implemented in a variety of computationally intensive applications in the multidisciplinary fields, including biosciences, geographical simulation, electronic design, climate research, neu-roscience, quantum mechanics, molecular modeling, nuclear fusion, etc. Supercomputers were introduced in the 1960s, and the performance of a supercomputer is measured in FLOPS. Since the first generation of supercomputers, the megascale performance was reached in the 1970s, and the gigascale performance was passed in less than decade. Finally, the terascale performance was achieved in the 1990s, and then the petascale performance was crossed in 2008 with the instal-lation of IBM Roadrunner at Los Alamos National Laboratory in the United States. Fig. 2.1 shows the peak performance of the fastest computer every year since the 1960s.
systems in the world and publishes an updated list of the supercomputers twice a year. The project aims to provide a reliable basis for tracking and detecting trends in HPC and bases rankings on HPL [76], a portable implementation of the high-performance LINPACK benchmark written in Fortran for distributed-memory computers. HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. The algorithm used by HPL can be summarized by the following keywords:
• two-dimensional block-cyclic data distribution;
• right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths;
• recursive panel factorization with pivot search and column broadcast combined;
• various virtual panel broadcast topologies;
• bandwidth reducing swap-broadcast algorithm;
• backward substitution with look-ahead of depth 1.
According to the latest Top500 list in November 2018, the fastest supercomputer is the Summit of the United States, with a LINPACK benchmark score of 122.3 PFLOPS. Sunway TaihuLight, Sierra and Tianhe-2A follow closely the Summit, with the performance respectively 93 PFLOPS, 71.6 PFLOPS, and 61.4 PFLOPS. Fig. 2.2 gives the No. 1 machine of HPL per-formance every year since 1993.

Modern Computing Architectures

Diﬀerent architectures of computing units are designed to build the supercomputers. In this section, the state-of-the-art of modern CPUs and accelerators will be reviewed.

CPU Architectures and Memory Access

The development of modern CPUs is based on the Von Neumann architecture. The proposition of this computer architecture is based on the description by the mathematician and physicist John von Neumann and others in the First Draft of a Report on the EDVAC [214]. All the modern computing units are related to this concept, which consists of five parts:
• A processing unit that contains an arithmetic logic unit and processor registers;
• A control unit that contains an instruction register and program counter;
• Memory that stores data and instructions;
• External mass storage;
• Input/output mechanisms.
As shown in Fig. 2.3a, the Von Neumann architecture uses the shared bus between the program memory and data memory, which leads to its bottleneck. Since the single bus can only access one of the two types of memory at a time, the data transfer rate between the CPU and memory is rather low. With the increase of CPU speed and memory size, the bottleneck has become more of a problem.
The Harvard architecture is another computer architecture that physically separates the storage and bus of the instructions and data. As shown in Fig. 2.3b, the limitation of a pure Harvard architecture is that the mechanisms must be provided to load the programs to be executed into the instruction memory and load any data onto the input memory. Additionally, read-only technology of the instruction memory allows the computer to begin executing a pre-loaded program as soon as it is powered up. The data memory will now be in an unknown state, so it is not possible to provide any kind of pre-defined data values to the program.
Today, most processors implement the separate pathways like the Harvard architecture to support the higher performance concurrent data and instruction access, meanwhile loosen the strictly separated storage between code and data. That is named as the Modified Harvard archi-tecture (shown as Fig. 2.3c). This model can be seen as the combination of the Von Neumann architecture and Harvard architecture.
Modified Harvard architecture provides a hardware pathway and machine language instruc-tions so that the contents of the instruction memory can be read as if they were data. Initial data values can then be copied from the instruction memory into data memory when the program starts. If the data is not to be modified, it can be accessed by the running program directly from instruction memory without taking up space in the data memory. Nowadays, most CPUs have Von Neumann like unified address space and separate instruction and data caches as well as memory protection, making them more Harvard-like, and so they could be classified more as modified Harvard architecture even using unified address space.

Table of contents :

READ THE EXPERIENCE OF LAND LOSS IN THE BOOK OF LAMENTATIONS

List of Figures
List of Tables
List of Algorithms
List of Symbols
1 Introduction
1.1 Motivations
1.2 Objectives and Contributions
1.3 Outline
2 State-of-the-art in High-Performance Computing
2.1 Timeline of HPC
2.2 Modern Computing Architectures
2.2.1 CPU Architectures and Memory Access
2.2.2 Parallel Computer Memory Architectures
2.2.3 Nvidia GPGPU
2.2.4 Intel Many Integrated Cores
2.2.5 RISC-based (Co)processors
2.2.6 FPGA
2.3 Parallel Programming Model
2.3.1 Shared Memory Level Parallelism
2.3.2 Distributed Memory Level Parallelism
2.3.3 Partitioned Global Address Space
2.3.4 Task/Graph Based Parallel Programming
2.4 Exascale Challenges of Supercomputers
2.4.1 Increase of Heterogeneity for Supercomputers
2.4.2 Potential Architecture of Exascale Supercomputer
2.4.3 Parallel Programming Challenges
3 Krylov Subspace Methods
3.1 Linear Systems and Eigenvalue Problems
3.2 Iterative Methods
3.2.1 Stationary and Multigrid Methods
3.2.2 Non-stationary Methods
3.3 Krylov Subspace methods
3.3.1 Krylov Subspace
3.3.2 Basic Arnoldi Reduction
3.3.3 Orthogonalization
3.3.4 Different Krylov Subspace Methods
3.4 GMRES for Non-Hermitian Linear Systems
3.4.1 Basic GMRES Method
3.4.2 Variants of GMRES
3.4.3 GMRES Convergence Description by Spectral Information
3.5 Preconditioners for GMRES
3.5.1 Preconditioning by Selected Matrix
3.5.2 Preconditioning by Deflation
3.5.3 Preconditioning by Polynomials – A Detailed Introduction on Least Squares Polynomial Method
3.6 Arnoldi for Non-Hermitian Eigenvalue Problems
3.6.1 Basic Arnoldi Methods
3.6.2 Variants of Arnoldi Methods
3.7 Parallel Krylov Methods on Large-scale Supercomputers
3.7.1 Core Operations in Krylov Methods
3.7.2 Parallel Krylov Iterative Methods
3.7.3 Parallel Implementation of Preconditioners
3.7.4 Existing Softwares
3.8 Toward Extreme Computing, Some Correlated Goals
4 Sparse Matrix Generator with Given Spectra
4.1 Demand of Large Matrices with Given Spectrum
4.2 The Existing Collections
4.2.1 Test Matrix Providers
4.2.2 Matrix Generators in LAPACK
4.3 Mathematical Framework of SMG2S
4.4 Numerical Algorithm of SMG2S
4.4.1 Matrix Generation Method
4.4.2 Numerical Algorithm
4.5 Our Parallel Impementation
4.5.1 Basic Implementation on CPUs
4.5.2 Implementation on Multi-GPU
4.5.3 Communication Optimized Implementation with MPI
4.6 Parallel Performance Evaluation
4.6.1 Hardware
4.6.2 Strong and Weak Scalability Evaluation
4.6.3 Speedup Evaluation
4.7 Accuracy Evaluation of the Eigenvalues of Generated Matrices with respect to the Given Ones
4.7.1 Verification based on Shift Inverse Power Method
4.7.2 Experimental Results
4.7.3 Arithmetic Precision Analysis
4.8 Package, Interface and Application
4.8.1 Package
4.8.2 Interface to Other Programming Languages
4.8.3 Interface to Scientific Libraries
4.8.4 GUI for Verification
4.9 Krylov Solvers Evaluation using SMG2S
4.9.1 SMG2S workflow to evaluate Krylov Solvers
4.9.2 Experiments
4.10 Conclusion
5 Unite and Conquer GMRES/LS-ERAM Method
5.1 Unite and Conquer Approach
5.2 Iterative Methods based on Unite and Conquer Approach
5.2.1 Preconditioning Techniques
5.2.2 Separation of Components
5.2.3 Benefits of Separating Components
5.3 Proposition of UCGLE
5.3.1 Selection of Components
5.3.2 Workflow
5.4 Distributed and Parallel Implementation
5.4.1 Component Implementation
5.4.2 Parameters Analysis
5.4.3 Distributed and Parallel Manager Engine Implementation
5.5 Experiment and Evaluation
5.5.1 Hardware
5.5.2 Parameters Evaluation
5.5.3 Convergence Acceleration Evaluation
5.5.4 Fault Tolerance Evaluation
5.5.5 Impacts of Spectrum on Convergence
5.5.6 Scalability Evaluation
5.6 Conclusion
6 UCGLE to Solve Linear Systems in Sequence with Different Right-hand Sides
6.1 Demand to Solve Linear Systems in Sequence
6.2 Existing Methods
6.2.1 Seed Methods
6.2.2 Krylov Subspace Recycling Methods
6.3 UCGLE to Solve Linear Systems in Sequence by Recycling Eigenvalues
6.3.1 Relation between Least Squares Polynomial Residual and Eigenvalues
6.3.2 Eigenvalues Recycling to Solve Sequence of Linear Systems
6.3.3 Workflow to Recycle Eigenvalues
6.4 Experiments
6.4.1 Hardware
6.4.2 Results
6.4.3 Analysis
6.5 Conclusion
7 UCGLE to Solve Linear Systems Simultaneously with Multiple Right-hand Sides
7.1 Demand to Solve Linear Systems with Multiple RHSs
7.2 Block GMRES Method
7.2.1 Block Krylov Subspace
7.2.2 Block Arnoldi Reduction
7.2.3 Block GMRES Method
7.2.4 Cost Comparison
7.2.5 Challenges of Exising Methods for Large-scale Platforms
7.3 m-UCGLE to Solve Linear Systems with Multiple Right-hand Sides
7.3.1 Shifted Krylov-Schur Algorithm
7.3.2 Block Least Squares Polynomial for Multiple Right-hand Sides
7.3.3 Analysis
7.4 Implementation of m-UCGLE and Manager Engine
7.4.1 Component Allocation
7.4.2 Asynchronous Communication
7.4.3 Implementation on Hetergeneous Platforms with GPUs
7.4.4 Checkpointing, Fault Tolerance and Reusability
7.4.5 Practical Implementation of m-UCGLE
7.5 Parallel and Numerical Performance Evaluation
7.5.1 Hardware Settings
7.5.2 Software Settings
7.5.3 Test Problems Settings
7.5.4 Specific Experimental Setup
7.5.5 Convergence Evaluation
7.5.6 Time Consumption Evaluation
7.5.7 Strong Scalability Evaluation
7.5.8 Fault Tolerance Evaluation
7.5.9 Analysis
7.6 Conclusion
8 Auto-tuning of Parameters
8.1 Auto-tuning Techniques
8.1.1 Different Levels of Auto-tuning
8.1.2 Selection Modes
8.2 UCGLE Auto-tuning by Automatic Selection of Parameters
8.2.1 Selection of Parameter for Auto-tuning
8.2.2 Multi-level Criteria
8.2.3 Heuristic and Auto-tuning Algorithm
8.2.4 Experiments
8.3 Conlusion
9 YML Programming Paradigm for Unite and Conquer Approach
9.1 YML Framework
9.1.1 Structure of YML
9.1.2 YML Design
9.1.3 Workflow and Dataflow
9.1.4 YvetteML Language
9.1.5 YML Scheduler
9.1.6 Multi-level Programming Paradigm: YML/XMP
9.1.7 YML Example
9.2 Limitations of YML for Unite and Conquer Approach
9.2.1 Possibility Analysis to Implement UCGLE using YML
9.2.2 Asynchronous Communications
9.2.3 Mechanism for Convergence
9.3 Proposition of Solutions
9.3.1 Dynamic Graph Grammar in YML
9.3.2 Exiting Parallel Branch
9.4 Demand for MPI Correction Mechanism
9.5 Conclusion
10 Conclusion and Perspective
10.1 Synthesis
10.2 Future Research
Bibliography