Compilation flow for the Integrated Programmable Array Architecture

Get Complete Project Material File(s) Now! »

Data Level Parallelism (DLP)

The computational resources in this approach operate on regular data structures such as one and two-dimensional arrays, where the computational resources operate on each element of the data structure in parallel. The compilers target accelerating DLP loops, vector processing or SIMD mode of operation. CGRAs like Morphosys, Remarc, PADDI leverage SIMD architecture. The compilers for these architectures target to exploit DLP in the applications. However, DLP-only accelerators face performance issues while the accelerating region does not have any DLP, i.e. there are inter iteration data dependency.

Instruction Level Parallelism (ILP)

As for the compute intensive applications, nested loops perform computations on arrays of data, that can provide a lot of ILP. For this reason, most of the compilers tend to exploit ILP for the underlying CGRA architecture.
State of the art compilers which tend to exploit the ILP, like RegiMap [45], DRESC [74], Edge Centric Modulo Scheduling (EMS) [81] mostly rely on software pipelining. This approach can manage to map the innermost loop body in a pipelined manner. On the other hand, for the outer loops, CPU must initiate each iteration in the CGRA, which causes significant overhead in the synchronization between the CGRA and CPU execution. Liu et al in [67] pinpointed this issue and proposed to map maximum of two levels of loops using polyhedral transformation on the loops. However, this approach is not generic as it cannot scale to an arbitrary number of loops. Some approaches [66] [62] use loop unrolling for the kernels. The basic assumption for these implementations is that the innermost loop’s trip count is not large. Hence, the solutions end up doing partial unroll of the innermost loops. The outer loops remain to be executed by the host processor. As most of the proposed compilers handle innermost loop of the kernels, they mostly bank upon the partial predication [48] [13] and full predication [4] techniques to map the conditionals inside the loop body.
Partial predication maps instructions of both if-part and else-part on different PEs. If both the if-part and the else-part update the same variable, the result is computed by selecting the output from the path that must have been executed based on the evaluation of the branch condition. This technique increases the utilization of the PEs, at the cost of higher energy consumption due to execution of both paths in a conditional. Unlike partial predication, in full predication all instructions are predicated. Instructions on each path of a control flow, which are sequentially configured onto PEs, will be executed if the predicate value of the instruction is similar with the flag in the PEs. Hence, the instructions in the false path do not get executed. The sequential arrangement of the paths degrades the latency and energy efficiency of this technique.
Full predication is upgraded in state based full predication [47]. This scheme prevents the wasted instruction issues from false conditional path by introducing sleep and awake mechanisms but fails to improve performance. Dual issue scheme [46] targets energy efficiency by issuing two instructions to a PE simultaneously, one from the if-path, another from the else-path. In this mechanism, the latency remains similar to that of the partial predication with improved energy efficiency. However, this approach is too restrictive, as far as imbalanced and nested conditionals are concerned. To map nested, imbalanced conditionals and single loop onto CGRA, the triggered long instruction set architecture (TLIA) is presented in [68]. This approach merges all the conditions present in kernels into triggered instructions and creates instruction pool for each triggered instruction. As the depth of the nested conditionals increases the performance of this approach decreases. As far as the loop nests are concerned, the TLIA approach reaches bottleneck to accommodate the large set of triggered instructions into the limited set of PEs.

Thread Level Parallelism

To exploit TLP, compilers partition the program into multiple parallel threads, each of which is then mapped onto a set of PEs. Compilers for RAW, PACT, KressArray leverage on TLP. To support parallel execution modes, the controller must be extended for supporting the call stack and synchronizing the threads. As a result, power consumption is increased.
The TRIPS controller supports four operation modes of operation to support all the types of parallelism [96]. The first mode is configured to execute single thread in all the PEs, exploiting ILP. In the second mode, the four rows execute four independent threads exploiting TLP. In the third mode, fine-grained multi-threading is supported by time-multiplexing all PEs over multiple threads. In the fourth mode each PE of a row executes the same operation, thus implementing SIMD, exploiting DLP. Thus, the TRIPS compiler can exploit the most suited form of parallelism. The compiler for REDEFINE exploits TLP and DLP to accelerate a set of HPC applications. Table 1.2 presents an overview of several architectural and compilation aspects of the state of the art CGRA designs.

Supporting Control Flow

One of the major challenges associated with all accelerators is to effectively handle control flow in the applications. Since the goal of the compiler presented in this chapter is to execute a complete program efficiently onto a CGRA, by control flow, we do not only mean the conditionals which are present inside a loop body, but any conditional or unconditional branch in general. For better understanding, we classify the control flow into three categories as presented in Figure 3.3. The unconditional branches can be optimized by merging basic blocks or straightening which is applicable to pairs of basic blocks such that the first has no successors other than the second and the second has no predecessors other than the first. If there exists more than one basic block in a program after optimization, which is often the case, the underlying accelerator must support unconditional branch to avoid host interference.
The fundamental problem for the conditionals are outcome of the branch at runtime. Hence, effective resource allocation is a problem. Hardware accelerators and FPGAs executes both the paths of a conditional branch in parallel, and then choose the results of the true path. This results in waste of resources and power. GP-GPUs also schedule the instructions and allocated resources for both the paths of the conditionals, but at the runtime, instructions from the false path are not issued. This saves power, but the cycles and resources allocated for the not-taken path are still wasted. In the graphics processing community, this is referred to as the problem of branch divergence. CGRAs widely use predication techniques to deal with the conditionals. Fundamentally partial, and full predication are adapted by the compilers, which are now discussed along with some other notable schemes.

Table of contents :

List of figures
List of tables
Introduction
1 Background and Related Work
1.1 Design Space
1.1.1 Computational Resources
1.1.2 Interconnection Network
1.1.3 Reconfigurability
1.1.4 Register Files
1.1.5 Memory Management
1.2 Compiler Support
1.2.1 Data Level Parallelism (DLP)
1.2.2 Instruction Level Parallelism (ILP)
1.2.3 Thread Level Parallelism
1.3 Mapping
1.4 Representative CGRAs
1.4.1 MorphoSys
1.4.2 ADRES
1.4.3 RAW
1.4.4 TCPA
1.4.5 PACT XPP
1.5 Conclusion
2 Design of The Reconfigurable Accelerator
2.1 Design Choices
2.2 Integrated Programmable Array Architecture
2.2.1 IPA components
2.2.2 Computation Model
2.3 Conclusion
3 Compilation flow for the Integrated Programmable Array Architecture
3.1 Background
3.1.1 Architecture model
3.1.2 Application model
3.1.3 Homomorphism
3.1.4 Supporting Control Flow
3.2 Compilation flow
3.2.1 DFG mapping
3.2.2 CDFG mapping
3.2.3 Assembler
3.3 Conclusion
4 IPA performance evaluation
4.1 Implementation of the IPA
4.1.1 Area Results
4.1.2 Memory Access Optimization
4.1.3 Comparison with low-power CGRA architectures
4.2 Compilation
4.2.1 Performance evaluation of the compilation flow
4.2.2 Comparison of the register allocation approach with state of the art predication techniques
4.2.3 Compiling smart visual trigger application
4.3 Conclusion
5 The Heterogeneous Parallel Ultra-Low-Power Processing-Platform (PULP) Cluster
5.1 PULP heterogeneous architecture
5.1.1 PULP SoC overview
5.1.2 Heterogeneous Cluster
5.2 Software infrastructure
5.3 Implementation and Benchmarking
5.3.1 Implementation Results
5.3.2 Performance and Energy Consumption Results
5.4 Conclusion
Summary and Future work
References