Tiling Code With Memory-Based Dependences 


The main contributions are the following:

• We propose a relaxed permutability criterion that allows loop tiling to be applied without any expansion or privatization, so the impact on the memory footprint is minimized and the overheads of array expansion are avoided.
• Our criterion enables the tiling of programs that privatization alone does not enable.
Section 2.2 presents the possible sources of false dependences. Section 2.3 shows that existing tiling algorithms are too restrictive and miss safe tiling opportunities. Section 2.4 introduces a new tiling test that avoids this problem and allows the tiling of kernels with false dependences. Section 2.5 proves that, by ignoring false dependences, tiling before 3AC transformation is equivalent to tiling after 3AC transformation. Sections 2.6 and 2.7 show an evaluation of the proposed technique.

Sources of False Dependences

One common source of false dependences is found in temporary variables introduced by programmers in the body of a loop.
A second source of false dependences is the compiler itself. Even in source programs that do not initially contain scalar variables, compiler passes may introduce false dependences when upstream transformations introduce scalar variables. This is a practical compiler construction issue, and a high priority for the effectiveness of loop optimization frameworks implemented in production compilers [96]. Among upstream compiler passes generating false dependences, we will concentrate on the most critical ones, as identified by ongoing development on optimization frameworks such as the GRAPHITE polyhedral loop optimization framework in GCC [81, 95]:
• Transformation to Three-Address Code (3AC). Three-address code is an intermediate code used by compilers to aid in the implementation of optimizations. Each 3AC instruction has at most three operands and is typically a combination of an assignment and a binary operator.
In the 3AC transformation, each instruction that has more than three operands is transformed into a sequence of instructions, and temporary scalar variables are used to hold the values calculated by the new instructions; an example is sketched after this list. The GRAPHITE polyhedral pass in GCC is an example of a loop optimization framework operating on low-level three-address instructions. It is affected by the false dependences that result from the conversion to three-address code.
• Partial Redundancy Elimination (PRE). This transformation can be applied to array operations to remove invariant loads and stores, turning array accesses into scalar accesses [57, 74]. Figures 2.2 and 2.3 show an example of this optimization; a sketch in the same spirit is given after this list.
• Loop-invariant code motion is a common compiler optimization that moves loop-invariant code outside the loop body, eliminating redundant calculations. An example is shown in Figures 2.4 and 2.5, where tiling of the outer two loops is inhibited by the extra dependences introduced after the application of loop-invariant code motion.
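As an illustration of the first item (3AC), the following is a minimal sketch of three-address lowering (the statement and the temporary names t1, t2 and t3 are placeholders, not reproduced from the thesis figures):

    /* Original statement: more than three operands. */
    A[i][j] = A[i][j] + u1[i] * v1[j] + u2[i] * v2[j];

    /* After conversion to three-address code: each instruction has at most one
       operator, and compiler-generated temporaries hold the partial results.
       Reusing t1, t2 and t3 in every loop iteration creates false (anti- and
       output) dependences across iterations. */
    t1 = u1[i] * v1[j];
    t2 = u2[i] * v2[j];
    t3 = t1 + t2;
    A[i][j] = A[i][j] + t3;

For PRE, the following sketch is in the spirit of Figures 2.2 and 2.3 (which are not reproduced here), assuming a gemm-like kernel:

    /* Before PRE: C[i][j] is loaded and stored in every iteration of the k loop. */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
          C[i][j] = C[i][j] + A[i][k] * B[k][j];

    /* After PRE: the redundant accesses to C[i][j] are replaced by a scalar.
       The scalar s is overwritten in every (i, j) iteration, which creates
       false dependences on the i and j loops. */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        s = C[i][j];
        for (k = 0; k < N; k++)
          s = s + A[i][k] * B[k][j];
        C[i][j] = s;
      }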
In the case of GCC and LLVM, performing loop optimizations directly on three-address code provides many benefits, mainly because at this level the representation is also in Static Single Assignment (SSA) form, where each variable is assigned exactly once and every variable is defined before it is used. Many compiler optimizations, such as constant propagation, dead code elimination and partial redundancy elimination, are performed on the code when it is in SSA form; the main advantage is the improved performance of these algorithms [55]. Loop transformations on three-address code also enable a tight integration of loop optimization techniques with the other compilation passes, including automatic vectorization, parallelization and memory optimizations. The ideal case for an optimization framework is to operate on a representation that is low level enough that the framework benefits from all the passes operating on three-address code, while at the same time being able to ignore the spurious false dependences generated by these passes. Note that transforming the code into SSA form removes some false dependences, but it does not remove all the false dependences in the program.
Loop tiling is of utmost importance to exploit temporal locality [18]. Unfortunately, this transformation is also highly sensitive to the presence of false dependences. Although it is possible to perform PRE and loop-invariant code motion after tiling in some compiler frameworks such as LLVM, the transformation to 3AC must happen before any loop optimization in order to benefit from all the passes operating on three-address code, and this limits the opportunities to apply tiling.
Finally, enabling tiling on 3AC automatically extends its applicability to “native 3AC languages” and environments, such as Java bytecode and general-purpose accelerator languages such as PTX (which is a low-level parallel instruction set proposed by Nvidia). This may contribute to offering some level of performance portability to GPGPU programming and would have higher impact when integrated in the just-in-time compilers of languages such as PTX and Java bytecode [28].
Our goal is to propose and evaluate a technique allowing compilers to escape the problems described previously, providing a more relaxed criterion for loop tiling in the presence of false dependences. Such a validity criterion for tiling does not require a priori expansion of specific variables.

Motivating Example

The usual criterion to enable tiling is that none of the loops involved should carry a backward dependence (a dependence with a negative distance). The kernel presented in Figure 2.3 has backward dependences. These dependences stem from the fact that the same scalar is overwritten in each iteration of the loop, and they enforce sequential execution. In this example, the forward-dependences-only condition prohibits the application of tiling, although the transformation is clearly valid, as we show in Figure 2.6.
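The following sketch shows such a valid tiling applied to the gemm-like kernel after PRE sketched in Section 2.2 (an illustration in the spirit of Figure 2.6, not a reproduction of it; the loop bounds are assumed to be multiples of the tile size 2):

    for (ii = 0; ii < N; ii += 2)
      for (jj = 0; jj < N; jj += 2)
        for (i = ii; i < ii + 2; i++)
          for (j = jj; j < jj + 2; j++) {
            s = C[i][j];               /* each live range of s starts here...    */
            for (k = 0; k < N; k++)
              s = s + A[i][k] * B[k][j];
            C[i][j] = s;               /* ...and ends here, in the same (i, j)
                                          iteration of the tiled band            */
          }

Reordering the (i, j) iterations therefore reorders whole live ranges of s, but never breaks one of them.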
Our goal is to enable the compiler to perform tiling on code that contains false depen-dences, by ignoring false dependences when this is possible, and not by eliminating them through array and scalar expansion. In the next section, we introduce the concept of live range non-interference. We use this concept later to justify the correctness of the proposed technique and to decide when is it safe to ignore false dependences.

Live Range Non-Interference

In this section we present the concept of live range non-interference, which is then used to establish a relaxed validity criterion for tiling.

Live ranges

We borrow some terminology from the register allocation literature where live ranges are widely studied [20, 24, 46], but we extend the definition of live ranges to capture the execution instances of statements instead of the statements in the program. In a given sequential program, a value is said to be live between its definition instance and its last use instance. Since tiling may change the order of statement instances, we conservatively consider live ranges between a definition and any use, which includes the last use under any reordering. The definition instance is called the source of the live range, marking its beginning, and the use is called the sink, marking the end of the live range.
{Sk(I) → Sk′(I′)} defines an individual live range beginning with an instance of the write statement Sk and ending with an instance of the read statement Sk′. I and I′ are iteration vectors identifying the specific instances of the two statements. For example, the scalar in the mvt kernel shown in Figure 2.7 induces several live ranges, including {S1[0,0] → S2[0,0]}.
This live range begins in the iteration i = 0, j = 0 and terminates in the same iteration.
Since the execution of a given program generally gives rise to a statically non-determinable number of live range instances, we cannot enumerate them individually, but rather we provide “classes” of live ranges defined with affine constraints. For example, the live range above is an instance of the following live range class {S1[i, j] → S2[i, j] : 0 ≤ i ≤ n ∧ 0 ≤ j ≤ n}, where 0 ≤ i ≤ n ∧ 0 ≤ j ≤ n are affine constraints defining the domain of i and j. For each iteration of i and j, there is a live range that begins with the write statement S1 and terminates with the read statement S2. To be able to describe such classes using affine constraints, we only consider loop nests with static-affine control.
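The following is a sketch of an mvt-like statement pair consistent with these live ranges (Figure 2.7 is not reproduced here; the scalar name t is a placeholder, and the code assumes that the product has been assigned to a temporary, for instance by three-address lowering):

    for (i = 0; i <= n; i++)
      for (j = 0; j <= n; j++) {
    S1:   t = A[i][j] * y1[j];    /* source of the live range of t */
    S2:   x1[i] = x1[i] + t;      /* sink of the live range of t   */
      }

Each live range of t begins at S1 and ends at S2 within the same (i, j) iteration.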

Construction of live ranges

Live ranges form a subset of the read-after-write dependences. In the following, we assume that each sink in a dependence has only one source. We construct the live ranges using the array data-flow analysis described in [36] and implemented in the library [101] using parametric integer linear programming. Array data-flow analysis answers the following question: given a value v that is read from a memory cell m at a statement instance r, which statement instance w is the source of the value v? The result is a (possibly non-convex) affine relation mapping read accesses to their sources. The live ranges are then computed by reversing this relation, mapping sources to all their reads.
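For instance, for the mvt-like sketch above, array data-flow analysis on the scalar maps each read back to its source, giving a relation of the form {S2[i, j] → S1[i, j] : 0 ≤ i ≤ n ∧ 0 ≤ j ≤ n} (this concrete relation is an illustration, not taken from the thesis); reversing it yields the live range class {S1[i, j] → S2[i, j] : 0 ≤ i ≤ n ∧ 0 ≤ j ≤ n} shown earlier.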
This systematic construction needs to be completed with the special treatment of live-in and live-out array elements. Array data-flow analysis provides the necessary information for live-in ranges, but inter-procedural analysis of array regions accessed outside the scope of the code being optimized is needed for live-out properties. This is an orthogonal problem, and for the moment we conservatively assume that the program is manually annotated with live-out arrays (the live-in and live-out scalars are explicit from the SSA form of the loop).

Live range non-interference

Any loop transformation is correct if it respects dataflow dependences and does not lead to live range interference (no two live ranges overlap) [95, 96, 99]. If we want to guarantee the non-interference of two live ranges, we have to make sure that the first live range is scheduled either before or after the second live range. Figure 2.8 shows two examples where pairs of live ranges do not interfere. Figure 2.9 shows two examples where these pairs of live ranges do interfere. We will now take a closer look at the conditions for preserving non-interference across loop tiling.

Live range non-interference and tiling: a relaxed permutability criterion

Tiling is possible when a band of consecutively nested loops is tilable. Using state-of-the-art tiling algorithms, the compiler looks for a band with forward dependences only for every level within the band [18, 107]. This can be done by calculating dependence directions for all dependences and for each loop level. A tilable loop band is composed of consecutive loop levels that carry no backward (negative-distance) dependence. When a loop band is identified, dependences that are strongly satisfied by this band are dropped before starting the construction of a new inner loop band (a dependence is strongly satisfied if the sink of the dependence is scheduled strictly after the source of the dependence in one of the loop levels that are part of that band). Loop levels that belong to the identified loop bands are tilable.
The main contribution of our work is a relaxation of this tilability criterion, based on a careful examination of which false dependences can be ignored. In particular, our criterion ignores anti-dependences (write-after-read dependences) that are adjacent only to iteration-private live ranges. We say that a live range is iteration-private at level k if it begins and ends in the same iteration of a loop at level k.
We say that an anti-dependence and a live range are adjacent if the source of one of them is equal to the sink of the other and if the anti-dependence and live range derive from the same memory location. For example, in Figure 2.10, the dependence δS1→S2 is adjacent to the live ranges {S0 → S1} and {S2 → S3} but is not adjacent to {S4 → S5}. The dependence δS3→S4 is adjacent to {S2 → S3} and to {S4 → S5} but is not adjacent to {S0 → S1}.
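The following is a hypothetical arrangement of statements consistent with these adjacency relations (Figure 2.10 is not reproduced here; the array names and right-hand sides are placeholders):

    for (i = 0; i < n; i++) {
    S0:   t = A[i];
    S1:   B[i] = t;
    S2:   t = C[i];
    S3:   D[i] = t;
    S4:   t = E[i];
    S5:   F[i] = t;
    }

The anti-dependence δS1→S2 (read of t at S1, write of t at S2) shares its source with the sink of {S0 → S1} and its sink with the source of {S2 → S3}, hence the adjacency; it involves neither S4 nor S5, so it is not adjacent to {S4 → S5}.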
Definition of the relaxed permutability criterion. A band of loops is fully permutable if the relaxed permutability criterion holds on that band. The relaxed permutability criterion holds on a band if and only if, for every dependence in the band, one (or several) of the following conditions hold(s) (a sketch of this test as code is given after the list):
1. the dependence is forward for each level within the band;
2. it is an output-dependence between statement instances that only define values that are local to the region of code being analyzed, in the sense that the values defined are used inside the region and are not used after the region;
3. it is an anti-dependence, and it is adjacent only to live ranges that are iteration private within the band.
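The following is a minimal sketch of this test in C (it is not the implementation used in the thesis; the dependence representation and its classification into the fields below are assumed to have been computed beforehand by the dependence and live range analyses):

    #include <stddef.h>

    typedef enum { DEP_FLOW, DEP_ANTI, DEP_OUTPUT } dep_kind;

    typedef struct {
        dep_kind kind;
        int forward_on_all_levels;     /* non-negative distance at every level of the band */
        int defines_only_local_values; /* output dep.: the values written are not live-out */
        int adjacent_only_to_private;  /* anti dep.: every adjacent live range is
                                          iteration-private within the band               */
    } dependence;

    /* Returns 1 if the band satisfies the relaxed permutability criterion. */
    static int band_is_tilable(const dependence *deps, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            const dependence *d = &deps[i];
            if (d->forward_on_all_levels)                              /* condition 1 */
                continue;
            if (d->kind == DEP_OUTPUT && d->defines_only_local_values) /* condition 2 */
                continue;
            if (d->kind == DEP_ANTI && d->adjacent_only_to_private)    /* condition 3 */
                continue;
            return 0;  /* no condition applies: this dependence prevents tiling the band */
        }
        return 1;
    }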
The first condition is inherited from the classical tiling criterion. The writes involved in the output dependences eliminated by the second condition are also involved in live ranges and are therefore guaranteed not to interfere with each other because of the third condition, as explained below. Moreover, the output dependences that are needed to ensure that values assigned by the last write to a given location (live-out values) will not be overwritten are not eliminated by the second condition. The third condition stems from the following observations:
1. Any affine transformation—including tiling—is valid if it preserves data flow depen-dences and if it does not introduce live range interferences.
2. Tiling is composed of two basic transformations: loop strip-mining and loop interchange (both are sketched in code after this list of observations).
• Strip-mining is always valid because it does not change the execution order of statement instances and thus it can be applied unconditionally.
• Loop interchange changes the order of iterations, hence may introduce live range interferences. But a closer look shows that it never changes the order of statement instances within one iteration of the loop body.
3. If live ranges are private to one iteration of a given loop (i.e., live ranges begin and terminate in the same iteration of that loop), changing the order of iterations of the present loop or any outer loop preserves the non-interference of live ranges.
4. We can ignore an anti-dependence only when it is adjacent to iteration-private live ranges (i.e., not adjacent to any non-iteration-private live range). This is correct for the following reason: an anti-dependence δ between two live ranges (S1, S2) and (S3, S4)
is used to guarantee the non-interference of the two live ranges during loop transformations. If the two live ranges (S1, S2) and (S3, S4) are iteration-private, then they will not interfere when tiling is applied, regardless of the presence of the anti-dependence δ. In other words, δ is useless in this case, and thus ignoring this anti-dependence during tiling is safe as long as all statements in these live ranges are subject to the same affine transformation. If one of the live ranges is non-iteration-private, then there is no guarantee that they will not interfere during tiling, and thus, conservatively, we do not ignore the anti-dependence between the two live ranges in this case.
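To make the second observation concrete, the following sketch decomposes the tiling of a generic 2-deep loop nest into strip-mining followed by interchange (tile size 2; the bounds are assumed to be multiples of 2, and body(i, j) stands for an arbitrary loop body):

    /* Strip-mining only: the execution order of body(i, j) is unchanged. */
    for (ii = 0; ii < n; ii += 2)
      for (i = ii; i < ii + 2; i++)
        for (jj = 0; jj < n; jj += 2)
          for (j = jj; j < jj + 2; j++)
            body(i, j);

    /* Interchanging the i and jj loops yields the tiled nest: iterations are
       reordered, but the statement instances within one (i, j) iteration keep
       their relative order. */
    for (ii = 0; ii < n; ii += 2)
      for (jj = 0; jj < n; jj += 2)
        for (i = ii; i < ii + 2; i++)
          for (j = jj; j < jj + 2; j++)
            body(i, j);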
Comparison with the criterion for privatization. Reformulated in our terminology, a variable can be privatized in a loop if all the live ranges of this variable are iteration-private within that loop [68]. This condition is the same as the third condition of our relaxed permutability criterion, except that we only impose this condition on live ranges that are adjacent to backward anti-dependences. Our relaxed permutability criterion can therefore be considered a hybrid of the classical permutability criterion and the privatization criterion. In particular, our criterion allows tiling if either of these two criteria would allow tiling or privatization, but also if some dependences are covered by one criterion and the remaining dependences are covered by the other.
In particular, there are cases where the privatization criterion cannot privatize the scalar, so the false dependences it induces cannot be ignored and tiling cannot be applied, whereas with our criterion those false dependences can be ignored and tiling can be applied.
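The following hypothetical sketch illustrates such a case (it is not the thesis's original example): the scalar t is live after the loop nest, so its final live range is not iteration-private and the privatization criterion rejects it; yet every backward false dependence on t is adjacent only to iteration-private live ranges, so the relaxed permutability criterion still allows tiling the (i, j) band.

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
    S1:   t = A[i][j] + B[j][i];
    S2:   C[i][j] = t;
      }
    S3: D[0] = t;   /* t is live-out: the value written in iteration (N-1, N-1) is used here */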
The importance of adjacency. Note that it is not safe to ignore an anti-dependence that is adjacent to a tile-private live range but is also adjacent to a non-tile-private live range. Tiling is not valid in this case.


Illustrative examples

i. The mvt and gemm kernels
We consider again the mvt kernel of Figure 2.7. The false dependences created by the scalar inhibit the compiler from applying tiling, although tiling is valid for mvt. The original execution order of the live ranges generated by these statements is as follows: live range A of the scalar is executed first (this is the live range between S1 and S2 in the iteration i = 0, j = 0), then live range B is executed (this is the live range for the iteration i = 0, j = 1), then C, D, E, F, G, etc.
We notice that each live range of the scalar is iteration-private. Each of these live ranges starts and terminates in the same iteration (no live range spans more than one iteration). The order of execution after applying a 2 × 2 tiling is as follows: (A, B, E, F), (C, D, G, H). Tiling changes the order of execution of the live ranges, but it does not break any live range of the scalar.
The false dependences generated by the array access do not inhibit loop tiling either, as all of these false dependences are forward on both levels. The live ranges of the array satisfy the first condition of the relaxed permutability criterion (the corresponding dependences are all forward) and therefore do not inhibit tiling.
Although tiling is valid in the case of mvt, the classical tiling algorithm would fail to tile this code because of the false dependences induced by the scalar. A similar reasoning applies to the tiling of the gemm kernel shown in Figure 2.3: tiling is valid because the live ranges generated by the scalar are iteration-private within the tiled band.
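The following sketch shows a corresponding 2 × 2 tiling of the mvt-like loop given earlier (an illustration; n + 1 is assumed to be a multiple of 2), which produces exactly the tile-by-tile order (A, B, E, F), (C, D, G, H), ... described above:

    for (ii = 0; ii <= n; ii += 2)
      for (jj = 0; jj <= n; jj += 2)
        for (i = ii; i < ii + 2; i++)
          for (j = jj; j < jj + 2; j++) {
    S1:     t = A[i][j] * y1[j];
    S2:     x1[i] = x1[i] + t;
          }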
ii. An example where tiling is not valid
Consider the example in Figure 2.12. Similarly to the previous examples, the false dependences created by the scalar on the j loop prevent the compiler from applying loop tiling. But is it correct to ignore the false dependences?
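The following is a hypothetical kernel of the kind described (Figure 2.12 itself is not reproduced; the statements and names are placeholders), in which the scalar carries a value from one iteration of the j loop to the next:

    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
    S1:   B[i][j] = t;        /* reads the value written in the previous iteration */
    S2:   t = t + A[i][j];
      }

Each live range of t starts at the write in S2 in one iteration and extends to the reads in S1 and S2 of the next iteration, so the live ranges are not iteration-private.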
The original execution order of the live ranges is A, B, C, D, E, F, G, H. After a 2 × 2 tiling, the new execution order is A, B, E, F, etc. This means that the value written to the scalar in B will be overwritten by the write statement in E before it is read. This tiling breaks the live range {S2(I) → S2(I′)}.
Tiling in this case is not valid because the live ranges are not iteration-private, and thus we should not ignore the false dependences on the scalar.

Parallelism

Besides tiling, we are also interested in the extraction of data parallelism and in generating OpenMP directives for parallelism. In order to parallelize a loop, the variables inducing false dependences in this loop need to be privatized.
The privatization of variables for parallelism is much less intrusive than actual scalar or array expansion. The former only makes variables thread-private, not impacting address computation and code generation inside the loop body, while the latter involves generating new array data declarations, rewriting array subscripts, and recovering the data flow across renamed privatized structures [36]. In addition, marking a scalar variable as thread-private does not result in the conversion of register operands into memory accesses, unlike the expansion of a scalar variable along an enclosing loop. In practice, support for privatization is implemented as a slight extension to the state-of-the-art violated dependence analysis [96, 100].
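As an illustration, the following sketch shows what privatization for parallelism looks like at the OpenMP level, assuming the mvt-like loop used earlier (the directive shown is illustrative, not a verbatim excerpt of the framework's output):

    double t;
    #pragma omp parallel for private(t)
    for (int i = 0; i <= n; i++)
      for (int j = 0; j <= n; j++) {
        t = A[i][j] * y1[j];   /* t is thread-private, so the false dependences */
        x1[i] = x1[i] + t;     /* it induces no longer cross thread boundaries  */
      }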
Figure 2.14 compares the number of private copies that have to be created for a given scalar (or array) in order to enable parallelization, tiling, or both. N is the number of loop iterations and T is the number of parallel threads that execute the loop. In general, T is significantly smaller than N.
Interestingly, the transformation of source code to three-address code does not impact parallelism detection as it only induces intra-block live ranges. Section 2.5 proves a similar result about the impact of our technique on the applicability of loop tiling.

Effect of 3AC on the Tilability of a Loop

This section investigates the effect of the 3AC transformation on the tilability of a loop (i.e., on whether we can tile it or not). Can the transformation of code into 3AC form result in dependences that prevent tiling?
We use our relaxed permutability criterion to show that, in fact, the transformation of code into 3AC form does not have any effect on the tilability of a loop. In other words, whether tiling is applied before or after three-address lowering, we always get the same result. Consider the following example:
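The following is a representative sketch of such an example (the concrete statement and the temporary name t are illustrative placeholders; only the structure, one statement S1 lowered into two statements S2 and S3 through a new temporary, is taken from the text):

    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
    S1:   A[i][j] = B[i][j] + C[i][j] + D[i][j];

After three-address lowering, S1 is replaced by the statements S2 and S3:

    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
    S2:   t = B[i][j] + C[i][j];
    S3:   A[i][j] = t + D[i][j];
      }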
Let Δ be the set of all dependences induced by S1. After transforming this statement to 3AC, we get two new statements: S2 and S3.
The variable introduced by the three-address transformation is a new variable, and thus it does not have any dependences with the other variables of the program (the variables that were already in the program before the three-address transformation), but it induces three new dependences between the two newly introduced statements. These dependences are:
• A flow dependence: δS2→S3 ;
• A write-after-write dependence: δS2→S2 ;
• A write-after-read dependence: δS3→S2 ;
Before the three-address transformation, the set of dependences was Δ. After the three-address transformation, the set of dependences is Δ ∪ {δS2→S3, δS2→S2, δS3→S2}.
The flow dependence δS2→S3 constitutes a set of live ranges that are iteration-private (they begin and terminate in the same iteration), and thus, by applying our relaxed permutability criterion, we can ignore the dependences δS2→S2 and δS3→S2 during tiling. The dependences that remain are Δ ∪ {δS2→S3}. Since δS2→S3 is iteration-private (its dependence distance is zero), it has no effect on tiling, and thus applying tiling on the transformed code is exactly equivalent to applying tiling on the original source-level code (in both cases, the ability to apply tiling depends only on the set of dependences Δ).

Table of contents:

1 Introduction and Background 
1.1 Introduction
1.2 Background and Notations
1.2.1 Maps and Sets
1.2.2 Additional Definitions
1.2.3 Polyhedral Representation of Programs
1.2.4 Data Locality
1.2.5 Loop Transformations
1.3 Outline
2 Tiling Code With Memory-Based Dependences 
2.1 Introduction and related work
2.2 Sources of False Dependences
2.3 Motivating Example
2.4 Live Range Non-Interference
2.4.1 Live ranges
2.4.2 Construction of live ranges
2.4.3 Live range non-interference
2.4.4 Live range non-interference and tiling: a relaxed permutability criterion
2.4.5 Illustrative examples
2.4.6 Parallelism
2.5 Effect of 3AC on the Tilability of a Loop
2.6 Experimental Evaluation
2.7 Experimental Results
2.7.1 Evaluating the Effect of Ignoring False Dependences on Tiling
2.7.2 Performance evaluation for Polybench-3AC
2.7.3 Reasons for slowdowns
2.7.4 Performance evaluation for Polybench-PRE
2.8 Conclusion and Future Work
3 Scheduling Large Programs 
3.1 Introduction
3.2 Definitions
3.3 Example of Statement Clustering
3.4 Clustering Algorithm
3.5 Using Offline Clustering with the Pluto Affine Scheduling Algorithm
3.5.1 Correctness of Loop Transformations after Clustering
3.6 Clustering Heuristics
3.6.1 Clustering Decision Validity
3.6.2 SCC Clustering
3.6.3 Basic-block Clustering
3.7 Experiments
3.8 Related Work
3.9 Conclusion and Future Work
4 The PENCIL Language 
4.1 Introduction
4.2 PENCIL Language
4.2.1 Overview of PENCIL
4.2.2 PENCIL Definition as a Subset of C99
4.2.3 PENCIL Extensions to C99
4.3 Detailed Description
4.3.1 static
4.3.2 Description of the Memory Accesses of Functions
4.3.3 const Function Attribute
4.3.4 Pragma Directives
4.3.5 Builtin Functions
4.4 Examples of PENCIL and non-PENCIL Code
4.4.1 BFS Example
4.4.2 Image Resizing Example
4.4.3 Recursive Data Structures and Recursive Function Calls
4.5 Polyhedral Compilation of PENCIL code
4.6 Related Work
5 Evaluating PENCIL 
5.1 Introduction
5.2 Benchmarks
5.2.1 Image Processing Benchmark Suite
5.2.2 Rodinia and SHOC Benchmark Suites
5.2.3 VOBLA DSL for Linear Algebra
5.2.4 SpearDE DSL for Data-Streaming Applications
5.3 Discussion of the Results
5.4 Conclusion and Future Work
6 Conclusions and Perspectives 
6.1 Contributions
6.2 Future Directions
Bibliography 
A EBNF Grammar for PENCIL
B Personal Publications
Index
