Using Offline Clustering with the Pluto Affine Scheduling Algorithm

Get Complete Project Material File(s) Now! »

Tiling Code With Memory-Based Dependences

Introduction and related work

To guarantee that loop transformations are correct, the compiler needs to preserve data-dependences between statements. Two types of data dependences exist, true (also known as data-flow) and false (also know as memory-based) dependences [55]. False dependences have two types: output-dependences and anti-dependences. They are induced by the reuse of temporary vari-ables across statement instances. In the presence of these false dependences compilers may conservatively avoid the application of some loop transformations such as tiling. Tiling is pos-sible when a band 1 of consecutively nested loops is fully permutable. A band is permutable if it is possible to permute two loop levels with each other in that band. This is possible if all the dependences within that band of consecutive loop nests are forward dependences only, i.e., dependences with non-negative distance for every loop level within the band [18, 107]. The presence of backward false dependences, i.e., dependences with a negative distance in some loop levels, prevents tiling. Backward false dependences are actually very common, especially in code that defines and uses scalar variables or array elements as in the example in Figure 2.1 where backward dependences are induced by the reads and writes to .
In this work, we introduce a new validity criterion for tiling that allows the tiling of loop nests even in the presence of some patterns of false dependences as in the example of Fig-ure 2.1.
Scalar and array expansion [35] is a classical solution that can be used to remove false dependences and enable tiling. Expansion transforms scalars into arrays and transforms arrays into higher dimensional arrays and thus eliminates memory reuse. The terms expansion and privatization are sometimes used by the community interchangeably to refer to expansion. In this work we make a clear distinction between the two terms. In privatization [45, 63, 97], a private copy of the variables is created for each thread executing some iterations of the loop in parallel. In expansion, a private copy is created for each loop iteration [35]. This usage is consistent with the use of the two terms privatization and expansion in [68, 72]. In particular, privatization creates, for each thread cooperating on the execution of the loop, private copies of scalars (or arrays) [45, 63, 97]. The number of private copies is equal to the number of parallel threads. Although privatization is required to enable parallelism, it is not sufficient to enable tiling, because privatization only creates private copies of scalars and arrays for each thread, eliminating false dependences crossing the boundary of data-parallel blocks of iterations. Classical tiling techniques require the elimination of all (false) negative-distance dependences [18, 40, 49, 55], which is the domain of scalar or array expansion.
Although scalar and array expansion eliminates false dependences and enables loop tiling, it incurs high footprint on memory and degrades temporal locality [94]. Scalar expansion is particularly harmful as it converts register operations into memory operations [21]. A family of array contraction techniques attempts to reduce the memory footprint without constraining loop transformations [31, 61, 84]: the compiler performs a maximal expansion, applies loop transformations including tiling, and then attempts to contract the arrays. By performing a maximal expansion, false dependences are eliminated, but when unrestricted loop transforma-tions are applied after expansion, contraction is not always possible as the set of simultane-ously live values may effectively require high-dimensional arrays to store them. Loop distri-bution is an example of a widely applied loop transformation that prevents array contraction and thus disables the effectiveness of this technique as we show in an example in Section 2.7.
We propose and evaluate a technique allowing compilers to tile and parallelize loop nests, even in the presence of false dependences that would violate existing validity criteria for loop tiling. The technique enables the compiler to decide which false dependences can be safely ignored in the context of loop tiling. The proposed technique does not incur any costs of scalar or array expansion.

Sources of False Dependences 21

The main contributions are the following:
• We propose a relaxed permutability criterion that allows loop tiling to be applied with-out the need for any expansion or privatization, the impact on the memory footprint is minimized, and the overheads of array expansion are avoided.
• Our criterion allows the tiling of programs that privatization does not allow.
Section 2.2 presents the possible sources of false dependences. Section 2.3 shows that existing tiling algorithms are too restrictive and miss safe tiling opportunities. Section 2.4 introduces a new tiling test that avoids this problem and allows the tiling of kernels with false dependences. Section 2.5 proves that, by ignoring false dependences, tiling before 3AC transformation is equivalent to tiling after 3AC transformation. Sections 2.6 and 2.7 show an evaluation of the proposed technique.

Sources of False Dependences

One common source of false dependences is found in temporary variables introduced by pro-grammers in the body of a loop.
A second source of false dependences is the compiler itself. Even in source programs that do not initially contain scalar variables, compiler passes may introduce false dependences when upstream transformations introduce scalar variables. This is a practical compiler con-struction issue, and a high priority for the effectiveness of loop optimization frameworks im-plemented in production compilers [96]. Among upstream compiler passes generating false dependences, we will concentrate on the most critical ones, as identified by ongoing devel-opment on optimization frameworks such as the GRAPHITE polyhedral loop optimization framework in GCC [81, 95]:
• Transformation to Three-Address Code (3AC). Three-address code is an intermediate code used by compilers to aid in the implementation of optimizations. Each 3AC in-struction has at most three operands and is typically a combination of assignment and a binary operator. For example:
In 3AC transformation, each instruction that has more than three operands is trans-formed into a sequence of instructions, and temporary scalar variables are used to hold the values calculated by new instructions. The GRAPHITE polyhedral pass in GCC is an example of a loop optimization framework operating on low level three-address instructions. It is affected by the false dependences that result from a conversion to three-address code.
• Partial Redundancy Elimination (PRE) This is a transformation that can be applied to array operations to remove invariant loads and stores, and transform array accesses into scalar accesses [57, 74]. Figures 2.2 and 2.3 show an example of this optimization.
• Loop-invariant code motion is a common compiler optimization that moves loop-invariant code outside the loop body eliminating redundant calculations. An example is shown in Figures 2.4 and 2.5 where tiling of the outer two loops is inhibited by the extra depen-dences on after the application of loop-invariant code motion.

Motivating Example 23

In the case of GCC and LLVM, performing loop optimizations directly on three-address code provides many benefits. Mainly because at this level also the representation is in Static Single Assignment form (SSA), where each variable is assigned exactly once, and every vari-able is defined before it is used. Many compiler optimizations such as constant propagation, dead code elimination and partial redundancy elimination are performed on the code when it is in SSA form. The main advantage of this is the improved performance of the algorithms [55]. Loop transformations on three-address code also enable a tight integration of loop optimiza-tion techniques with the other compilation passes, including automatic vectorization, paral-lelization and memory optimizations. The ideal case for an optimization framework is to operate on a representation that is low enough so that this optimization framework benefits from all the passes operating on three-address code, and at the same time the optimization framework should be able to ignore spurious false dependences generated by these passes. Note that transforming the code into SSA form removes some false dependences, but does not remove all the false dependences in the program.
Loop tiling is of utmost importance to exploit temporal locality [18]. Unfortunately, this transformation is also highly sensitive to the presence of false dependences. Although it is possible to perform PRE and loop-invariant code motion after tiling in some compiler frame-works such as LLVM. Transformation to 3AC must happen before any loop optimization in order to benefit from all the passes operating on three-address code, and this limits the chances to apply tiling.
Finally, enabling tiling on 3AC automatically extends its applicability to “native 3AC lan-guages” and environments, such as Java bytecode and general-purpose accelerator languages such as PTX (which is a low-level parallel instruction set proposed by Nvidia). This may con-tribute to offering some level of performance portability to GPGPU programming and would have higher impact when integrated in the just-in-time compilers of languages such as PTX and Java bytecode [28].
Our goal is to propose and evaluate a technique allowing compilers to escape the problems described previously, providing a more relaxed criterion for loop tiling in the presence of false dependences. Such a validity criterion for tiling does not require a priori expansion of specific variables.

Motivating Example

The usual criterion to enable tiling is that none of the loops involved should carry a backward dependence (a dependence with a negative distance). The kernel presented in Figure 2.3 has backward dependences. These dependences stem from the fact that the same scalar is over-
24 Tiling Code With Memory-Based Dependences written in each iteration of the loop and enforce a sequential execution. In this example, the forward dependence condition prohibits the application of tiling, although the transformation is clearly valid as we show in Figure 2.6.
Our goal is to enable the compiler to perform tiling on code that contains false depen-dences, by ignoring false dependences when this is possible, and not by eliminating them through array and scalar expansion. In the next section, we introduce the concept of live range non-interference. We use this concept later to justify the correctness of the proposed technique and to decide when is it safe to ignore false dependences.

Live Range Non-Interference

In this section we present the concept of live range non-interference, which is then used to establish a relaxed validity criterion for tiling.

Live ranges

We borrow some terminology from the register allocation literature where live ranges are widely studied [20, 24, 46], but we extend the definition of live ranges to capture the execution instances of statements instead of the statements in the program. In a given sequential program, a value is said to be live between its definition instance and its last use instance. Since tiling may change the order of statement instances, we conservatively consider live ranges between a definition and any use, which includes the last use under any reordering. The definition instance is called the source of the live range, marking its beginning, and the use is called the sink, marking the end of the live range.

Live Range Non-Interference 25

This live range begins in the iteration i = 0, j = 0 and terminates in the same iteration.
Since the execution of a given program generally gives rise to a statically non-determinable number of live range instances, we cannot enumerate them individually, but rather we provide “classes” of live ranges defined with affine constraints. For example, the live range above is an instance of the following live range class where 0 ≤ i ≤ n ∧ 0 ≤ j ≤ n are affine constraints defining the domain of i and j. For each iteration of i and j, there is a live range that begins with the write statement S1 and terminates with the read statement S2. To be able to describe such classes using affine constraints, we only consider loop nests with static-affine control.

READ Access to the first deposited layer and to the crystallized material

Construction of live ranges

Live ranges form a subset of the read-after-write dependences. In the following, we assume that each sink in a dependence has only one source. We construct the live ranges using the array data-flow analysis described in [36] and implemented in the library [101] using parametric integer linear programming. Array data-flow analysis answers the following ques-tion: given a value v that is read from a memory cell m at a statement instance r, compute the statement instance w that is the source for the value v. The result is a (possibly non-convex) affine relation mapping read accesses to their source. The live ranges are then computed by reversing this relation, mapping sources to all their reads.
This systematic construction needs to be completed with the special treatment of live-in and live-out array elements. Array data-flow analysis provides the necessary information for live-in ranges, but inter-procedural analysis of array regions accessed outside the scope of the code being optimized is needed for live-out properties. This is an orthogonal problem, and for 26 Tiling Code With Memory-Based Dependences the moment we conservatively assume that the program is manually annotated with live-out arrays (the live-in and live-out scalars are explicit from the SSA form of the loop).

Live range non-interference

Any loop transformation is correct if it respects dataflow dependences and does not lead to live range interference (no two live ranges overlap) [95, 96, 99]. If we want to guarantee the non-interference of two live ranges, we have to make sure that the first live range is scheduled either before or after the second live range. Figure 2.8 shows two examples where pairs of live ranges do not interfere. Figure 2.9 shows two examples where these pairs of live ranges do interfere. We will now take a closer look at the conditions for preserving non-interference across loop tiling.
Tiling is possible when a band of consecutively nested loops is tilable. Using state-of-the-art tiling algorithms, the compiler looks for a band with forward dependences only for every level within the band [18, 107]. This can be done by calculating dependence directions for all dependences and for each loop level. A tilable loop band is composed of consecutive loop levels with no negative loop dependence. When a loop band is identified, dependences that are strongly satisfied by this band are dropped before starting the construction of a new inner loop band (a dependence is strongly satisfied if the sink of the dependence is scheduled strictly after the source of the dependence in one of the loop levels that are a part of that band). Loop levels that belong to the identified loop bands are tilable.
The main contribution of our work is a relaxation of this tilability criterion, based on a careful examination of which false dependences can be ignored. In particular, our criterion ignores anti-dependences (write-after-read dependences) that are adjacent to iteration-private live ranges. We say that a live range is iteration-private at level k if it begins and ends in the same iteration of a loop at level k.
We say that an anti-dependence and a live range are adjacent if the source of one of them is equal to the sink of the other and if the anti-dependence and live range derive from the same memory location. For example, in Figure 2.10, the dependence δS1→S2 is adjacent to the live ranges {S0 → S1} and {S2 → S3} but is not adjacent to {S4 → S5}. The dependence δS3→S4 is adjacent to {S2 → S3} and to {S4 → S5} but is not adjacent to {S0 → S1}.
Definition of the relaxed permutability criterion A band of loops is fully permutable if the relaxed permutability criterion holds on that band. The relaxed permutability criterion holds on a band if and only if for every dependence in the band one (or several) of the following conditions hold(s):
1. the dependence is forward for each level within the band;
2. it is an output-dependence between statement instances that only define values that are local to the region of code being analyzed, in the sense that the values defined are used inside the region and are not used after the region;
3. it is an anti-dependence, and it is adjacent only to live ranges that are iteration private within the band.
The first condition is inherited from the classical tiling criterion. The writes involved in the output dependences eliminated by the second condition are also involved in live ranges and are therefore guaranteed not to interfere with each other because of the third condition, as explained below. Moreover, the output dependences that are needed to ensure that values assigned by the last write to a given location (live-out values) will not be overwritten are not eliminated by the second condition. The third condition stems from the following observa-tions:
1. Any affine transformation—including tiling—is valid if it preserves data flow depen-dences and if it does not introduce live range interferences.
2. Tiling is composed of two basic transformations: loop strip-mining and loop inter-change.
• Strip-mining is always valid because it does not change the execution order of statement instances and thus it can be applied unconditionally.
• Loop interchange changes the order of iterations, hence may introduce live range interferences. But a closer look shows that it never changes the order of statement instances within one iteration of the loop body.
3. If live ranges are private to one iteration of a given loop (i.e., live ranges begin and terminate in the same iteration of that loop), changing the order of iterations of the present loop or any outer loop preserves the non-interference of live ranges.
4. We can ignore an anti-dependence only when it is adjacent to iteration-private live ranges (i.e., not adjacent to any non-iteration-private live range). This is correct for the following reason: an anti-dependence δ between two live ranges (S1, S2) and (S3, S4) is used to guarantee the non-interference of the two live ranges during loop transforma-tions. If the two live ranges (S1, S2) and (S3, S4) are iteration-private, then they will not interfere when tiling is applied, regardless of the presence of the anti-dependence δ . In other words, δ is useless in this case. And thus ignoring this anti-dependence during.

Live Range Non-Interference 29

tiling is safe as long as all statements in this live range are subject to the same affine transformation. If one of the live ranges is non-iteration-private, then there is no guaran-tee that they will not interfere during tiling, and thus, conservatively, we do not ignore the anti-dependence between the two live ranges in this case.
Comparison with the criterion for privatization Reformulated in our terminology, a variable can be privatized in a loop if all the live ranges of this variable are iteration private within that loop[68]. This condition is the same as the third condition of our relaxed permutability criterion, except that we only impose this condition on live ranges that are adjacent to back-ward anti-dependences. Our relaxed permutability criterion can therefore be considered as a hybrid of the classical permutability criterion and the privatization criterion. In particular, our criterion will allow tiling if either of these two criteria would allow tiling or privatization, but also if some dependences are covered by one criterion and the other dependences are covered by the other criterion.
In particular, the privatization criterion cannot privatize the scalar in the following exam-ple and thus some false dependences cannot be ignored and thus tiling cannot be applied.
With our criterion, the false dependences can be ignored and thus tiling can be applied.
The importance of adjacency Note that it is not safe to ignore an anti-dependence that is adjacent to a tile-private live range but that is also adjacent to a non-tile-private live range. Tiling is not valid in this case.

Illustrative examples

We consider again the mvt kernel of Figure 2.7. The false dependences created by the scalar inhibit the compiler from applying tiling, although tiling is valid for mvt.
generated by these statements. The original execution order is as follows: live range A of is executed first (this is the live range between S1 and S2 in the iteration i = 0, j = 0), then live range B is executed (this is the live range for iteration i = 0, j = 1), then C , D , E , F , G , etc.
We notice that each live range of is iteration-private. Each of these live ranges starts and terminates in the same iteration (no live range spans more than one iteration). The order of execution after applying a 2 × 2 tiling is as follows: ( A , B , E F ), ( C , D , G , H ). Tiling changes the order of execution of live ranges, but it does not break any live range of .
The false dependences generated by the array access do not inhibit loop tiling as all of these false dependences are forward on both levels. The live ranges of satisfy the first condition of the relaxed tilability criterion (they are all forward) and therefore also do not inhibit tiling.
Although tiling is valid in the case of mvt, the classical tiling algorithm would fail to tile this code because of the false dependences induced by . A similar reasoning applies to tiling of the gemm kernel shown in Figure 2.3: tiling is valid because the live ranges generated by the scalar are iteration-private.

An example where tiling is not valid

Consider the example in Figure 2.12. Similarly to the previous examples, the false depen-dences created by the scalar on the j loop prevent the compiler from applying loop tiling. But is it correct to ignore the false dependences?

Table of contents :

1 Introduction and Background
1.1 Introduction
1.2 Background and Notations
1.2.1 Maps and Sets
1.2.2 Additional Definitions
1.2.3 Polyhedral Representation of Programs
1.2.4 Data Locality
1.2.5 Loop Transformations
1.3 Outline
2 Tiling Code With Memory-Based Dependences
2.1 Introduction and related work
2.2 Sources of False Dependences
2.3 Motivating Example
2.4 Live Range Non-Interference
2.4.1 Live ranges
2.4.2 Construction of live ranges
2.4.3 Live range non-interference
2.4.4 Live range non-interference and tiling: a relaxed permutability criterion
2.4.5 Illustrative examples
2.4.6 Parallelism
2.5 Effect of 3AC on the Tilability of a Loop
2.6 Experimental Evaluation
2.7 Experimental Results
2.7.1 Evaluating the Effect of Ignoring False Dependences on Tiling
2.7.2 Performance evaluation for Polybench-3AC
2.7.3 Reasons for slowdowns
2.7.4 Performance evaluation for Polybench-PRE
2.8 Conclusion and Future Work
3 Scheduling Large Programs
3.1 Introduction
3.2 Definitions
3.3 Example of Statement Clustering
3.4 Clustering Algorithm
3.5 Using Offline Clustering with the Pluto Affine Scheduling Algorithm
3.5.1 Correctness of Loop Transformations after Clustering
3.6 Clustering Heuristics
3.6.1 Clustering Decision Validity
3.6.2 SCC Clustering
3.6.3 Basic-block Clustering
3.7 Experiments
3.8 Related Work
3.9 Conclusion and Future Work
4 The PENCIL Language
4.1 Introduction
4.2 PENCIL Language
4.2.1 Overview of PENCIL
4.2.2 PENCIL Definition as a Subset of C99
4.2.3 PENCIL Extensions to C99
4.3 Detailed Description
4.3.2 Description of the Memory Accesses of Functions
4.3.3 onst Function Attribute
4.3.4 Pragma Directives
4.3.5 Builtin Functions
4.4 Examples of PENCIL and non-PENCIL Code
4.4.1 BFS Example
4.4.2 Image Resizing Example
4.4.3 Recursive Data Structures and Recursive Function Calls
4.5 Polyhedral Compilation of PENCIL code
4.6 Related Work
5 Evaluating PENCIL
5.1 Introduction
5.2 Benchmarks
5.2.1 Image Processing Benchmark Suite
5.2.2 Rodinia and SHOC Benchmark Suites
5.2.3 VOBLA DSL for Linear Algebra
5.2.4 SpearDE DSL for Data-Streaming Applications
5.3 Discussion of the Results
5.4 Conclusion and Future Work
6 Conclusions and Perspectives
6.1 Contributions
6.2 Future Directions