Avoided communication versus memory requirements and redundant flops of the ILU0 matrix powers kernel

Get Complete Project Material File(s) Now! »

Tall and skinny QR (TSQR)

The QR factorization of an n ˆ t matrix P is its decomposition into an n ˆ t orthogonal matrix Q (QtQ “ I) and a t ˆ t upper triangular matrix R. The QR factors can be obtained using Gram Schmidt orthogonalization, Givens rotations, Cholesky decomposition, or Householder reflections.
Tall and Skinny QR (TSQR) introduced in [21], refers to several algorithms that compute a QR factorization of a “tall and skinny” n ˆ t matrix P ( n  » t) based on reduction trees with local Householder QR. The matrix P is partitioned row-wise P “ pP0 P1 … Pppa´1qqt where pa is the number of partitions. The sequential TSQR is based on a flat reduction tree, where pa is chosen so that the row-wise blocks of P fit into cache. Whereas, the parallel TSQR can be based onbinary trees (pa “ number of processors) or general trees (pa ° number of processor). The TSQR version used in CA-GMRES [60] is a hybrid of parallel and sequential QR, where pa is chosen so that the row-wise blocks of P fit into cache. Then each processor is assigned a set of blocks that it factorizes using sequential TSQR to obtain Q and R factors. The obtained t ˆ t R factors are stacked and factorized using LAPACK’s QR.
Assuming that the number of row-wise partitions pa of P is four, then the sequential TSQR starts by performing the Householder QR factorization of the n 4 ˆ t matrix P0 “ Q0R0 where Q0 is an n 4 ˆ t orthonormal matrix, R0 is a t ˆ t upper triangular matrix, and I is the n 4 ˆ n 4 identity matrix.

Modified Gram Schmidt A-orthonormalization

We start by introducing A-orthonormalization of the vectors of Pk`1 against the vectors of all the Pi’s for i † k ` 1, then against each others in section . In section , we discuss versions that save flops and reduce communication. And in section, the parallelization of both kernels is described.

Saving flops in the A-orthonormalization using MGS

Since the A-orthonormalizations are expensive in term of flops, we present another alternative for computing the A-orthonormalizations that reduces the computed flops at the expense of storing more vectors. In Algorithm 7 and Algorithm 6, some matrix vector multiplications are repeatedly computed. For example in Algorithm 7, APk`1p:, 1q is computed t ´ 1 times, APk`1p:, 2q is computed t ´ 2 times, and generally, APk`1p:, iq is computed t ´ i times, which means that the matrix A is accessed pt ´ 1qt 2 times for every call of the algorithm. Thus, it is possible after Aorthogonalizing a vector Pk`1p:, iq to compute and store wi “ APk`1p:, iq. This eliminates the redundant flops and reduces the number of accesses of A to t times, but there is a need to store t extra vectors (wi).
Moreover, it is possible to further reduce the computations and the number of times A is accessed at the expense of storing tk vectors as shown in Algorithm 8, where the multiplication Wk`1 “ APk`1 is first performed by only reading the matrix A once. Then the vectors Wk`1p:, iq are updated and stored.
The A-orthonormalization against previous vectors with flops reduction can be performed as in Algorithm 8. Then, the cost of the A-orthonormalization against previous vectors in Algorithm 8 is p6n ´ 1qt2k ` p4n ` 1qt of the order of 6nt2k flops.

Parallelization of the A-orthonormalization using MGS

In Algorithm 8, at each inner iteration we are A-orthonormalizing the updated vectors Pk`1p:, oq against the vector Pip:, jq, where the vector Pk`1p:, oq is changed at each inner iteration. Thus it is not possible to have a block MGS by eliminating all the for loops. However, it is possible to eliminate one for loop in Algorithm 8 as shown in Algorithm 10, by A-orthonormalizing the whole block Pk`1 against the vector Pip:, jq, where Pk`1p:, oq “ Pk`1p:, oq´pPk`1p:, oqtWip:, jqqPip:, jq for all o “ 1, 2, …, t. Let rPip:, jqst be an n ˆ t block containing t duplicates of the vector Pip:, jq. Then, Pk`1 “ Pk`1 ´ rPip:, jqstdiagpPt k`1Wip:, jqq.

Cholesky QR A-orthonormalization

A-orthonormalizing the n ˆ t full rank matrix Pk`1 is equivalent to a QR factorization Pk`1 “ r Pk`1R where r Pt k`1A r Pk`1 “ I. Thus, Pt k`1APk`1 “ pr Pk`1RqtA r Pk`1R “ RtR and R can be obtained by performing a Cholesky factorization of the SPD matrix Pt k`1APk`1. Then, r Pk`1 “ Pk`1R´1 is obtained by solving the lower triangular system Rt r Pt k`1 “ Pt k`1 with multiple righthand sides. This procedure is called A-CholQR and summarized in Algorithm 20 [64, 59]. Similarly to the other A-orthonormalization methods, we may assume that Wk`1 is already computed, then the obtained A-CholQR is described in Algorithm 21. Note that the only difference between the A-CholQR, Algorithms 20 and 21, for A-orthonormalizing Pk`1, and the CholQR algorithm [72] for orthonormalizing Pk`1 is in the definition of C. In A-CholQR C “ Pt k`1APk`1, whereas in CholQR C “ Pt k`1Pk`1.
The parallelization of Algorithm 21 assumes that we have t processors and each is assigned a rowwise part of Pk`1 and Wk`1 corresponding to the (i subset of indices defined previously, Pk`1p(i, :q and Wk`1p(i, :q. And each processor i should compute r Pk`1p(i, :q and ÄWk`1p(i, :q.


Parallelizable variants of the Krylov subspace methods

The classical Krylov subspace methods, discussed in the previous section, are governed by Blas1 and Blas2 computations like dot products and matrix vector multiplication. Parallelizing dot products is not efficient due to the negligible amount of performed flops with respect to the cost of the data movement. The solution multiply a matrix by a set of vectors and solve small systems instead of matrix vector multiplications and dot products. For example, in block methods the idea is to solve a system with multiple right hand sides. Whereas in s-step methods the idea is to merge  the computations of s iterations of classical Krylov methods in order to compute s matrix-vector multiplications at a time. Communication avoiding methods are based on s-step methods with algorithmic and implementation level improvements for avoiding communication communication. In sections 3.2.1, 3.2.2, and 3.2.3 we discuss block methods, s-step methods and communication avoiding methods respectively. Finally in section 3.2.4 we discuss two CG parallelizable variants, cooperative CG (coop-CG) and multiple search direction CG (MSD-CG). In coop-CG, the idea is having multiple agents or threads that cooperate to solve the system Ax “ b, t times in parallel starting with t initial distinct guesses. In MSD-CG, a domain decomposition method, the idea is to have multiple search directions at each iteration. These search directions are defined on different subdomains.

Table of contents :

1 Introduction 
2 Preliminaries 
2.1 Notation
2.1.1 Communication
2.2 Graphs and partitioning techniques
2.2.1 Graphs
2.2.2 Nested dissection
2.2.3 K-way graph partitioning
2.3 Orthonormalization
2.3.1 Classical Gram Schmidt (CGS)
2.3.2 Modified Gram Schmidt (MGS)
2.3.3 Tall and skinny QR (TSQR)
2.4 The A-orthonormalization
2.4.1 Modified Gram Schmidt A-orthonormalization
2.4.2 Classical Gram Schmidt A-orthonormalization
2.4.3 Cholesky QR A-orthonormalization
2.5 Matrix powers kernel
2.6 Test matrices
3 Krylov Subspace Methods 
3.1 Classical Krylov subspace methods
3.1.1 The Krylov subspaces
3.1.2 The Krylov subspace methods
3.1.3 Krylov projection methods
3.1.4 Conjugate gradient
3.1.5 Generalized minimal residual (GMRES) method
3.2 Parallelizable variants of the Krylov subspace methods
3.2.1 Block Krylov methods
3.2.2 The s-step Krylov methods
3.2.3 Communication avoiding methods
3.2.4 Other CG methods
3.3 Preconditioners
3.3.1 Incomplete LU preconditioner
3.3.2 Block Jacobi preconditioner
3.3.3 Restricted additive Schwarz preconditioner
4 Enlarged Krylov Subspace (EKS) Methods 
4.1 The enlarged Krylov subspace
4.1.1 Krylov projection methods
4.1.2 The minimization property
4.1.3 Convergence analysis
4.2 Multiple search direction with orthogonalization conjugate gradient (MSDO-CG) method
4.2.1 The residual rk
4.2.2 The domain search direction Pk
4.2.3 Finding the expression of ↵k`1 and « k`1
4.2.4 The MSDO-CG algorithm
4.3 Long recurrence enlarged conjugate gradient (LRE-CG) method
4.3.1 The LRE-CG algorithm
4.4 Convergence results
4.5 Parallel model and expected performance
4.5.1 MSDO-CG
4.5.2 LRE-CG
4.6 Preconditioned enlarged Krylov subspace methods
4.6.1 Convergence
4.7 Summary
5 Communication Avoiding Incomplete LU(0) Preconditioner 
5.1 ILU matrix powers kernel
5.1.1 The partitioning problem
5.1.2 ILU preconditioned matrix powers kernel
5.2 Alternating min-max layers (AMML(s)) reordering for ILU(0) matrix powers kernel
5.2.1 Nested dissection + AMML(s) reordering of the matrix A
5.2.2 K-way + AMML(s) reordering of the matrix A
5.2.3 Complexity of AMML(s) Reordering
5.3 CA-ILU0 preconditioner
5.4 Expected numerical efficiency and performance of CA-ILU0 preconditioner
5.4.1 Convergence
5.4.2 Avoided communication versus memory requirements and redundant flops of the ILU0 matrix powers kernel
5.4.3 Comparison between CA-ILU0 preconditioner and block Jacobi preconditioner
5.5 Summary
6 Conclusion and Future work 
Appendix A ILU(0) preconditioned GMRES convergence for different reorderings 


Related Posts