Compilation flow for the Integrated Programmable Array Architecture

Get Complete Project Material File(s) Now! »

Data Level Parallelism (DLP)

The computational resources in this approach operate on regular data structures such as one and two-dimensional arrays, where the computational resources operate on each element of the data structure in parallel. The compilers target accelerating DLP loops, vector processing or SIMD mode of operation. CGRAs like Morphosys, Remarc, PADDI leverage SIMD architecture. The compilers for these architectures target to exploit DLP in the applications. However, DLP-only accelerators face performance issues while the accelerating region does not have any DLP, i.e. there are inter iteration data dependency.

Instruction Level Parallelism (ILP)

As for the compute intensive applications, nested loops perform computations on arrays of data, that can provide a lot of ILP. For this reason, most of the compilers tend to exploit ILP for the underlying CGRA architecture.
State of the art compilers which tend to exploit the ILP, like RegiMap [44], DRESC [72], Edge Centric Modulo Scheduling (EMS) [79] mostly rely on software pipelining. This approach can manage to map the innermost loop body in a pipelined manner. On the other hand, for the outer loops, CPU must initiate each iteration in the CGRA, which causes significant overhead in the synchronization between the CGRA and CPU execution. Liu et al in [66] pinpointed this issue and proposed to map maximum of two levels of loops using polyhedral transformation on the loops. However, this approach is not generic as it cannot scale to an arbitrary number of loops. Some approaches [65] [61] use loop unrolling for the kernels. The basic assumption for these implementations is that the innermost loop’s trip count is not large. Hence, the solutions end up doing partial unroll of the innermost loops. The outer loops remain to be executed by the host processor. As most of the proposed compilers handle innermost loop of the kernels, they mostly bank upon the partial predication [47] [13] and full predication [4] techniques to map the conditionals inside the loop body.
Partial predication maps instructions of both if-part and else-part on different PEs. If both the if-part and the else-part update the same variable, the result is computed by selecting the output from the path that must have been executed based on the evaluation of the branch condition. This technique increases the utilization of the PEs, at the cost of higher energy consumption due to execution of both paths in a conditional. Unlike partial predication, in full predication all instructions are predicated. Instructions on each path of a control flow, which are sequentially configured onto PEs, will be executed if the predicate value of the instruction is similar with the flag in the PEs. Hence, the instructions in the false path do not get executed. The sequential arrangement of the paths degrades the latency and energy efficiency of this technique.
Full predication is upgraded in state based full predication [46]. This scheme prevents the wasted instruction issues from false conditional path by introducing sleep and awake mechanisms but fails to improve performance. Dual issue scheme [45] targets energy efficiency by issuing two instructions to a PE simultaneously, one from the if-path, another from the else-path. In this mechanism, the latency remains similar to that of the partial predication with improved energy efficiency. However, this approach is too restrictive, as far as imbalanced and nested conditionals are concerned. To map nested, imbalanced conditionals and single loop onto CGRA, the triggered long instruction set architecture (TLIA) is presented in [67]. This approach merges all the conditions present in kernels into triggered instructions and creates instruction pool for each triggered instruction. As the depth of the nested conditionals increases the performance of this approach decreases. As far as the loop nests are concerned, the TLIA approach reaches bottleneck to accommodate the large set of triggered instructions into the limited set of PEs.

Homomorphism

The basic blocks in the CDFG, presented in Figure 3.1(c), are composed of data nodes, operation nodes, and data dependencies. Three equivalences between the basic block DFGs and PEA model nodes are defined: (1) data and registers; (2) computation and computing operators; (3) data dependences and connection between the time extended PE components. As the two models are homomorphic, the mapping of a DFG onto the PEA is therefore a problem equivalent to finding a DFG in the PEA graph.

Supporting Control Flow

One of the major challenges associated with all accelerators is to effectively handle control flow in the applications. Since the goal of the compiler presented in this chapter is to execute a complete program efficiently onto a CGRA, by control flow, we do not only mean the conditionals which are present inside a loop body, but any conditional or unconditional branch in general. For better understanding, we classify the control flow into three categories as presented in Figure 3.3. The unconditional branches can be optimized by merging basic blocks or straightening which is applicable to pairs of basic blocks such that the first has no successors other than the second and the second has no predecessors other than the first. If there exists more than one basic block in a program after optimization, which is often the case, the underlying accelerator must support unconditional branch to avoid host interference.
The fundamental problem for the conditionals are outcome of the branch at runtime. Hence, effective resource allocation is a problem. Hardware accelerators and FPGAs executes both the paths of a conditional branch in parallel, and then choose the results of the true path. This results in waste of resources and power. GP-GPUs also schedule the instructions and allocated resources for both the paths of the conditionals, but at the runtime, instructions from the false path are not issued. This saves power, but the cycles and resources allocated for the not-taken path are still wasted. In the graphics processing community, this is referred to as the problem of branch divergence. CGRAs widely use predication techniques to deal with the conditionals. Fundamentally partial, and full predication are adapted by the compilers, which are now discussed along with some other notable schemes.

READ Further Development of a Rescue Helmet

CDFG mapping

First, we formulate the problem of CDFG mapping and propose a register allocation based solution accordingly. Subsequently, we discuss the steps involving the mapping of CDFG.
Data in an application is separated into two categories.
1. The standard input and output data (mostly the array inputs and outputs) are mapped as memory operands. The inputs and outputs are allotted by load-store operations. In our sample program in Figure 3.2, m, n are the input arrays and p are the output array, which are managed by load and store operations.
2. The internal variables of a program are mapped onto the registers of the processing elements, and managed by the register allocation based approach [23].
Following, we introduce several definitions concerning register allocation approach:
Definition 3.2.1. Symbol Variables and location constraints: In compilation, the recurring variables (repeatedly written and read) are managed in local register files of the PEs to avoid multiple access of local memory. The recurring variables which have occurrences in multiple basic blocks need special attention since the integrity of these variables must be kept intact throughout the mapping process for different basic blocks. These variables are defined as Symbol variables. The register locations for the symbol variables are referred to as location constraints. For instance, variable c in the CDFG (Fig. 3.2) is written in BB_3, and read in BB_4, BB_5 and BB_6. In mapping all these basic blocks the register location for c must be same. Similarly, X1, X2, X3, X4, X5, i, a and b must be location constrained. The locations for such symbol variables are denoted with an overline, as variable_name.

Table of contents :

List of figures
List of tables
Introduction
1 Background and Related Work
1.1 Design Space
1.1.1 Computational Resources
1.1.2 Interconnection Network
1.1.3 Reconfigurability
1.1.4 Register Files
1.1.5 Memory Management
1.2 Compiler Support
1.2.1 Data Level Parallelism (DLP)
1.2.2 Instruction Level Parallelism (ILP)
1.2.3 Thread Level Parallelism
1.3 Mapping .
1.4 Representative CGRAs
1.4.1 MorphoSys
1.4.2 ADRES .
1.4.3 RAW .
1.4.4 TCPA .
1.4.5 PACT XPP .
1.5 Conclusion .
2 Design of The Reconfigurable Accelerator
2.1 Design Choices
2.2 Integrated Programmable Array Architecture
2.2.1 IPA components
2.2.2 Computation Model
2.3 Conclusion .
3 Compilation flow for the Integrated Programmable Array Architecture
3.1 Background .
3.1.1 Architecture model
3.1.2 Application model
3.1.3 Homomorphism
3.1.4 Supporting Control Flow
3.2 Compilation flow
3.2.1 DFG mapping
3.2.2 CDFG mapping
3.2.3 Assembler
3.3 Conclusion .
4 IPA performance evaluation
4.1 Implementation of the IPA
4.1.1 Area Results
4.1.2 Memory Access Optimization
4.1.3 Comparison with low-power CGRA architectures
4.2 Compilation
4.2.1 Performance evaluation of the compilation flow
4.2.2 Comparison of the register allocation approach with state of the art predication techniques
4.2.3 Compiling smart visual trigger application
4.3 Conclusion .
5 The Heterogeneous Parallel Ultra-Low-Power Processing-Platform (PULP) Cluster
5.1 PULP heterogeneous architecture
5.1.1 PULP SoC overview
5.1.2 Heterogeneous Cluster
5.2 Software infrastructure
5.3 Implementation and Benchmarking
5.3.1 Implementation Results
5.3.2 Performance and Energy Consumption Results
5.4 Conclusion .
Summary and Future work
References