Compilation flow for the Integrated Programmable Array Architecture

Get Complete Project Material File(s) Now! »

Data Level Parallelism (DLP)

The computational resources in this approach operate on regular data structures such as one and two-dimensional arrays, where the computational resources operate on each element of the data structure in parallel. The compilers target accelerating DLP loops, vector processing or SIMD mode of operation. CGRAs like Morphosys, Remarc, PADDI leverage SIMD architecture. The compilers for these architectures target to exploit DLP in the applications. However, DLP-only accelerators face performance issues while the accelerating region does not have any DLP, i.e. there are inter iteration data dependency.

Instruction Level Parallelism (ILP)

As for the compute intensive applications, nested loops perform computations on arrays of data, that can provide a lot of ILP. For this reason, most of the compilers tend to exploit ILP for the underlying CGRA architecture. State of the art compilers which tend to exploit the ILP, like RegiMap [45], DRESC [74], Edge Centric Modulo Scheduling (EMS) [81] mostly rely on software pipelining. This approach can manage to map the innermost loop body in a pipelined manner. On the other hand, for the outer loops, CPU must initiate each iteration in the CGRA, which causes significant overhead in the synchronization between the CGRA and CPU execution. Liu et al in [67] pinpointed this issue and proposed to map maximum of two levels of loops using polyhedral transformation on the loops. However, this approach is not generic as it cannot scale to an arbitrary number of loops. Some approaches [66] [62] use loop unrolling for the kernels. The basic assumption for these implementations is that the innermost loop’s trip count is not large. Hence, the solutions end up doing partial unroll of the innermost loops. The outer loops remain to be executed by the host processor. As most of the proposed compilers handle innermost loop of the kernels, they mostly bank upon the partial predication [48] [13] and full predication [4] techniques to map the conditionals inside the loop body.
Partial predication maps instructions of both if-part and else-part on different PEs. If both the if-part and the else-part update the same variable, the result is computed by selecting the output from the path that must have been executed based on the evaluation of the branch condition. This technique increases the utilization of the PEs, at the cost of higher energy consumption due to execution of both paths in a conditional. Unlike partial predication, in full predication all instructions are predicated. Instructions on each path of a control flow, which are sequentially configured onto PEs, will be executed if the predicate value of the instruction is similar with the flag in the PEs. Hence, the instructions in the false path do not get executed. The sequential arrangement of the paths degrades the latency and energy efficiency of this technique. Full predication is upgraded in state based full predication [47]. This scheme prevents the wasted instruction issues from false conditional path by introducing sleep and awake mechanisms but fails to improve performance. Dual issue scheme [46] targets energy efficiency by issuing two instructions to a PE simultaneously, one from the if-path, another from the else-path. In this mechanism, the latency remains similar to that of the partial predication with improved energy efficiency. However, this approach is too restrictive, as far as imbalanced and nested conditionals are concerned. To map nested, imbalanced conditionals and single loop onto CGRA, the triggered long instruction set architecture (TLIA) is presented in [68]. This approach merges all the conditions present in kernels into triggered instructions and creates instruction pool for each triggered instruction. As the depth of the nested conditionals increases the performance of this approach decreases. As far as the loop nests are concerned, the TLIA approach reaches bottleneck to accommodate the large set of triggered instructions into the limited set of PEs.

Thread Level Parallelism

To exploit TLP, compilers partition the program into multiple parallel threads, each of which is then mapped onto a set of PEs. Compilers for RAW, PACT, KressArray leverage on TLP. To support parallel execution modes, the controller must be extended for supporting the call stack and synchronizing the threads. As a result, power consumption is increased.
The TRIPS controller supports four operation modes of operation to support all the types of parallelism [96]. The first mode is configured to execute single thread in all the PEs, exploiting ILP. In the second mode, the four rows execute four independent threads exploiting TLP. In the third mode, fine-grained multi-threading is supported by time-multiplexing all PEs over multiple threads. In the fourth mode each PE of a row executes the same operation, thus implementing SIMD, exploiting DLP. Thus, the TRIPS compiler can exploit the most
suited form of parallelism. The compiler for REDEFINE exploits TLP and DLP to accelerate a set of HPC applications. Table 1.2 presents an overview of several architectural and compilation aspects of the state of the art CGRA designs.

Table of contents :

List of figures
List of tables
Introduction
1 Background and Related Work
1.1 Design Space
1.1.1 Computational Resources
1.1.2 Interconnection Network
1.1.3 Reconfigurability
1.1.4 Register Files
1.1.5 Memory Management
1.2 Compiler Support
1.2.1 Data Level Parallelism (DLP)
1.2.2 Instruction Level Parallelism (ILP)
1.2.3 Thread Level Parallelism
1.3 Mapping
1.4 Representative CGRAs
1.4.1 MorphoSys
1.4.2 ADRES
1.4.3 RAW
1.4.4 TCPA
1.4.5 PACT XPP
1.5 Conclusion
2 Design of The Reconfigurable Accelerator
2.1 Design Choices
2.2 Integrated Programmable Array Architecture
2.2.1 IPA components
2.2.2 Computation Model
2.3 Conclusion
3 Compilation flow for the Integrated Programmable Array Architecture
3.1 Background
3.1.1 Architecture model
3.1.2 Application model
3.1.3 Homomorphism
3.1.4 Supporting Control Flow
3.2 Compilation flow
3.2.1 DFG mapping
3.2.2 CDFG mapping
3.2.3 Assembler
3.3 Conclusion
4 IPA performance evaluation
4.1 Implementation of the IPA
4.1.1 Area Results
4.1.2 Memory Access Optimization
4.1.3 Comparison with low-power CGRA architectures
4.2 Compilation
4.2.1 Performance evaluation of the compilation flow
4.2.2 Comparison of the register allocation approach with state of the art predication techniques
4.2.3 Compiling smart visual trigger application
4.3 Conclusion
5 The Heterogeneous Parallel Ultra-Low-Power Processing-Platform (PULP) Cluster
5.1 PULP heterogeneous architecture
5.1.1 PULP SoC overview
5.1.2 Heterogeneous Cluster
5.2 Software infrastructure
5.3 Implementation and Benchmarking
5.3.1 Implementation Results
5.3.2 Performance and Energy Consumption Results
5.4 Conclusion
Summary and Future work