Mapping applications onto NoC-based many-cores
The introduction of NoC-based architectures was accompanied by the definition of novel mapping techniques targeting NoC-based MPSoCs. Indeed, if paralelism is recognized as the only way of providing scalable performance, this scalability comes at the price of increased complexity of both the software and the software mapping (allocation and scheduling) process.
Part of this complexity can be attributed to the steady increase in the quantity of soft-ware that is run by a single system. But there are also significant qualitative changes con-cerning both the software and the hardware. In software, more and more applications in-clude parallel versions of classical signal or image processing algorithms [Aubry et al., 2013, Gerdes et al., 2012, Villalpando et al., 2010], involving potentially complex synchroniza-tions between the sequential programs executed on the various cores. Such applications are best modeled using data-flow models (as opposed to so-called independent tasks that are common in classical real-time).
Designing parallel software is difficult in itself, relying on notoriously hard disciplines such as parallel programming [Kwok and Ahmad, 1999] and multi-processor scheduling [Ramamritham et al., 1993]. The picture is further complicated when considering real-time aspects. Providing functional and real-time correctness guarantees requires an accu-rate control of the functional and temporal interferences due to concurrent use of shared resources. Depending on the hardware and software architecture, this can be very diffi-cult [Wilhelm and Reineke, 2012]. In our case, there are two main reasons to this: The first one concerns the NoCs: as the tasks are more tightly coupled and the number of resources in the system increases, the on-chip networks become critical resources, which need to be explicitly considered and managed during real-time scheduling. Recent work [Shi and Burns, 2010, Kashif et al., 2013, Nikolic et al., 2013] has determined that NoCs have distinctive traits requiring significant changes to classical multi-processor scheduling theory [Goossens et al., 2003]. The second reason concerns automation: the complexity of many-cores and of the (parallel) applications mapped on them is such that the alloca-tion and scheduling must be largely automated.
Efficient and real-time implementation of applications onto NoC-based systems re-mains largely an open problem, with the issue of best mapping of computation parts (threads, tasks,…) onto processing resources amply recognized, while the issue of best use of the interconnect NoC to route and transfer data still less commonly tackled.
In the most general case, dynamic allocation of applications and channel virtualiza-tion can be guided by user-provided information under various forms as in OpenMP for Open Multi-Processing [OpenMP, 2008], CUDA for Compute Unified Device Architec-ture [Nvidia CUDA, 2006], and OpenCL for Open Computing Language [Khronos, 2011]. But there is no clear guarantee of optimality. Conversely there are consistent efforts, in the domains of embedded and HPC computing, aiming at automatic parallelization, compile-time mapping and scheduling optimization. They rely on the fact that applications are often known in advance, and deployed without disturbance from foreign applications, and without uncontrolled dynamic creation of tasks.
The results of this thesis fit in this “static application mapping” case. We focus on mapping techniques where all allocation and scheduling decisions are taken off-line. In theory, off-line algorithms allow the computation of scheduling tables specifying an opti-mal allocation and real-time scheduling of the various computations and communications onto the resources of the MPPA. In practice, this ability is severely limited by 3 factors:
1. The application may exhibit a high degree of dynamicity due to either environment constraints or to execution time variability resulting from data-dependent condi-tional control.4
2. The hardware may not allow the implementation of optimal scheduling tables. For instance, most MPPA architectures provide only limited control over the scheduling of communications inside the NoC.
3. The mapping problems we consider here are NP-hard. In practice, this means that optimality cannot be attained, and that efficient heuristics are needed.
The DSPINpro programmable Network-on-Chip
Our purpose is to investigate how the underlying architecture should offer the proper in-frastructures to implement optimal computation and communication mappings and sched-ules. We concretely support our proposed approach by extending the DSPIN 2D mesh network-on-chip (NoC) [Panades, 2008] developed at UPMC-LIP6. In the DSPIN NoC, we replace the fair arbitration modules of the NoC routers with static, micro-programmable arbiters that can enforce a given packet routing sequence, as specified by small programs. We advocate the desired level of expressiveness/complexity for such simple configuration programs. The result is named Programmable DSPIN, or DSPINpro.
To improve the efficiency and predictability of our MPPA architecture, and thus facil-itate real-time mapping, we also constrained and standardized the structure of the com-puting tiles connected to the DSPINpro NoC, as well as the software architecture of our implementations. On the hardware side, we constrain the type and number of tile com-ponents (CPUs, RAM banks, DMA units …).5 On the software architecture side, we constrain the memory organization, we impose that all computations are performed on lo-cal tile data, with specific “send” operations being in charge of all inter-tile data transfers (along with a software lock mecanism), and we require the explicit placement of input and output data of the tile on memory banks.
The Automatic real-time mapping and code generation
We propose a technique and tool for the automated mapping of real-time applications onto MPPA architectures based on 2D mesh NoCs. Our tool is named LoPhT, for Logical to Physical Time Compiler.
The global flow of our mapping technique, pictured in Fig. 1.3, is similar to that of the AAA methodology and the SynDEx tool [Grandpierre and Sorel, 2003]. It takes as input functional specifications provided under the form of data-flow synchronous specifications à la Scade/Lustre [Caspi et al., 2003]. Our input formalism allows the specification of conditional execution and execution modes, which are common features in complex em-bedded control specifications. To map such specifications onto MPPA architectures and provide hard real-time guar-antees, we rely on off-line allocation and scheduling algorithms. These algorithms also take as input a description of the MPPA platform, and non-functional constraints including allocation constraints and conservative upper estimates for the:
5We have also developed a tool allowing the automatic synthesis of the corresponding SystemC models and memory maps from simple architecture specifications defining the NoC size and the type and number of tile components.
Worst-case execution times (WCETs) of the data-flow blocks (seen as atomic com-putations).
The worst-case size of data transmitted through the data-flow arcs, needed to de-termine the worst-case communication time (WCCT) of basic communication/syn-chronization operations.
Starting from these inputs, our algorithms build reservation tables (also called schedul-ing tables or simply time tables) specifying for each resource of the platform its use by various computations or communications. Reservation tables are then converted into se-quential code ensuring the correct ordering of operations on each resource and the respect of the real-time guarantees.
Our main contribution was to provide new mapping algorithms that explicitly take into account MPPA-specific features. The first problem here is that NoCs are very different from the communication buses used in classical real-time scheduling work. NoCs are composed of multiple communication resources that must be considered separately during mapping. However, reservation of a communication path along the NoC requires the synchronized reservation of resources along the path, due to the limited amount of buffer memory inside the NoC. The second problem is that the large number of computation and communication resources requires the use of scalable, yet precise mapping algorithms.
To provide tight real-time guarantees, our mapping heuristics seek to achieve a good parallelization of the application while ensuring that concurrent computations and com-munications do not interfere (functionally or temporally) with each other outside of functionally-needed synchronization points. The absence of interferences reduces the pessimism of the worst-case timing analysis, and limits resource over-allocation. Achieving such functional and temporal isolation can be done with low overhead in MPPA architectures that provide the programmer with good control over the memory hierarchy and the interconnect.
An environment for virtual prototyping of MPPA applications
Together, our extension of the DSPIN NoC and the development of a novel mapping tech-nique and tool define a new environment for the virtual prototyping of real-time MPPA applications. The expression “virtual prototyping” is used here to mark the fact that hard-ware execution is simulated in software, as opposed to “emulated” on an FPGA target after full hardware synthesis. The SoCLib library [LIP6, 2011] allows both. We preferred the simulation-based approach because it facilitated both hardware design and the pre-cise timing measures needed to evaluate our real-time mapping technique on the resulting platform. Note that no difference exists between simulation and emulation from the point of view of software or the real-time properties, given that simulation is of cycle-accurate, bit-accurate (CABA) type.
In our environment, described in Fig. 1.4, objects with thick borders are artefacts or transformations that were created or significantly modified as part of this thesis. In our environment, we start by choosing the high-level configuration parameters of the MPPA platform, which include the size (number of tiles) of the MPPA, the structure of the stan-dard tile, and the positioning of I/O devices. Once configuration is fixed, we can instan-tiate the MPPA SystemC model. This model uses standard components from the SoCLib library, but also the DSPINpro network-on-chip and other hardware components devel-oped on top of SoCLib as part of our effort to build an MPPA platform reconciling timing predictability and efficiency. The model is then compiled to obtain the hardware simula-tor.
The configuration parameters are also used to build the architecture model taken as in-put by our mapping tool LoPhT. LoPhT also takes as input a functional specification pro-vided under the form of a data-flow synchronous specification and a non-functional spec-ification defining the worst-case durations of all data-flow blocks and communications, and possible allocation constraints. Worst-case durations are obtained through WCET analysis of the C code associated with the data-flow blocks and other library functions (but this aspect will not be covered in this thesis, the interested reader is invited to read [Puaut and Potop-Butucaru, 2013]). The LoPhT tool takes this input specification and transforms it into statically scheduled code for the CPU cores and the NoC routers. This code is compiled, separately for each sequential resources, using either gcc (the C code of the CPUs) or with nocc (designed and implemented by us) for the NoC programs. Resulting code is executed on the hardware simulator, which allows us to verify that the real-time guarantees computed by LoPhT are respected.
Of course, this environment is the result of highly collaborative work started between my team (INRIA AOSTE) and the SoC team of the UPMC/Lip6 laboratory (led at the time by A.Munier). From the SoC team, the main participants were François Pêcheux, Franck Wajsburt and Zhen Zhang, which have carried out most of the hardware design. From the AOSTE team, in addition to myself, Thomas Carle helped in the implementation of the scheduler, and Robert de Simone provided major insights on the high-level architectural modeling. My personal contributions are the following: I participated in the definition of the DSPINpro NoC and the MPPA platform by specifying what services the hardware should provide in order to allow efficient and predictable real-time implementation. Actual definition of the hardware was carried out by the Lip6 team and D. Potop.
Switching method and buffering policy
All NoC switching methods belong to one of two basic switching paradigms: circuit switching or packet switching. In circuit switching [Hilton and Nelson, 2006], commu-nications are performed through dedicated communication channels (called circuits) that connect the source and destination PEs. A circuit consists in a sequence of point-to-point physical links going all the way from the source PE to the destination PE. Two channels cannot share a physical link. This is achieved by statically fixing the output direction of each demultiplexer and the data source of each multiplexer along the channel path. Timing interferences between circuits are imposible. Thus, throughput is guaranteed and latency is predictible, but NoC use is usually low.
In packet switching, data is divided into small packets that are transmitted indepen-dently. Each packet is formed of a sequence of flits, where a flit is the amount of data that a link can transmit at the same time (in one clock cycle, for a synchronous NoC). Each packet must contain, in its header flits all the information needed to perform its commu-nication, such as destination, priority, etc. The packetization of data allows a link to be used by multiple data transitions at the same time, by interleaving the transmission of the packets coming from different sources. This is why NoC use figures are usually improved when comparing to circuit switching approaches.
Kalray MPPA NoC
The MPPA256 architecture of Kalray [Harrand and Durand, 2011, MPPA, 2012] is a tiled many-core with a 2D torus NoC. The topology of the interconnect is detailed in Fig. 2.10. This figure depicts the routers of the 16 computing tiles (the 4×4 central square) and the 16 routers connected to the various I/O devices. To facilitate description, our figure presents only the (unidirectional) links along one vertical line of tiles. The full NoC is obtained by repeating this pattern along each horizontal and vertical line of tiles.
NoC arbitration is done under a fair policy. Communications are performed along virtual channels, each (unidirectional) channel connecting one source tile with one desti-nation tile. Setting up a channel amounts to assigning it a route and a latency/throughput budget. It is required that for each physical link of the NoC, the sum of the transmission requirements of all the virtual channels using this link is less than the transmission ca-pacity of the link. Under this condition, all channels will provide the latency/throughput guarantees assigned to them. Ensuring that a virtual channel does not attempt to take more resources than its budget allows is realized through bandwidth management mechanisms. More precisely, each tile and I/O device in the chip contains a configurable bandwith limiter device. For each virtual channel, the respect of its latency/throughput budget is enforced by the bandwidth limiter of the source tile or I/O device.
ST Microelectronics STHORM
STHORM [Melpignano et al., 2012], initially called “Platform 2012” or P2012, was not designed as a many-core chip, but rather as a platform for the design of embedded many-cores for multi-media applications. It features a tiled organization, but the contents and size of each tile can be varied depending on the design needs. The platform was designed by ST Microelectronics and the CEA. To improve power efficiency and facilitate power management, tiles are connected via a fully asynchronous NoC called ANoC, designed by the CEA-Léti [Thonnart et al., 2010]. The NoC uses a source routing algorithm and provides QoS control under the form of 2 packet priority levels. A typical computing tile, as used in the test chip produced by ST Microelectronics, consists in 17 processing cores (of which 1 is dedicated to resource allocation/schedul-ing), several memory banks, power management, DMA, and synchronization units, and a special unit with support for data-flow programming.
Table of contents :
1.1 Thesis motivation
1.1.1 The advent of many-cores
1.1.2 The advent of Networks-on-Chips
1.1.3 Many-cores for hard real-time applications
1.1.4 Mapping applications onto NoC-based many-cores
1.2 Thesis contributions
1.2.1 The DSPINpro programmable Network-on-Chip
1.2.2 The Automatic real-time mapping and code generation
1.2.3 An environment for virtual prototyping of MPPA applications
2 State of the art
2.1 Network-on-Chip design
2.1.1 NoC building blocks
2.1.2 NoC topology
2.1.3 NoC switching
22.214.171.124 Switching method and buffering policy
2.1.4 Existing Network-on-Chip architectures
126.96.36.199 Kalray MPPA NoC
188.8.131.52 The scalar interconnect of MIT RAW
184.108.40.206 Other NoC architectures
220.127.116.11 Comparison with our work
2.2 Massively parallel processor arrays
2.2.1 Tilera TILEPro64
2.2.2 Kalray MPPA-256
2.2.3 Adapteva Epiphany
2.2.4 Intel SCC
2.2.5 ST Microelectronics STHORM
2.2.7 Academic MPSoC architectures with TDM-based NoC arbitration
2.3 Static application mapping
2.3.1 Off-line real-time multi-processor scheduling
18.104.22.168 The AAA/SynDEx methodology
2.3.2 The StreamIt compiler for the MIT RAW architecture
2.3.3 Compilation of the SC language for the Kalray MPPA256 platform
2.3.4 Other mapping approaches
3 Tiled MPPA architectures in SoCLib
3.1 MPPA structure
3.2 Memory organization
3.2.1 Distributed shared memory
3.2.2 Address structure
3.2.3 Global memory organization
3.2.4 Tile memory organization
3.2.5 Hardware/software interface
3.3 Improving the timing predictability of the SoCLib tile
3.4 SystemC simulation and compilation support
4 Programmable NoC arbitration
4.1 The case for programmed arbitration
4.1.1 The principle
4.1.2 Target application classes
4.1.3 The cost of programmability
4.2 Programmable DSPIN
4.2.1 NoC router extensions
4.2.2 Area overhead
4.3 A simple example in depth
4.4 Case study: the FFT
4.4.1 FFT algorithm description
22.214.171.124 Mapping onto the MPPA architecture
126.96.36.199 Traffic injection configuration
4.4.2 Evaluation of the slow-down due to traffic injection
4.4.3 Removing the slow-down through NoC programming
5 Off-line mapping of real-time applications using LoPhT
5.1 Background: AAA using the Clocked Graphs formalism
5.1.1 The Clocked Graph formalism
188.8.131.52 Functional specification
Support of a clock
184.108.40.206 Non-functional specification
5.1.2 Off-line scheduling of CG specifications
220.127.116.11 Scheduled clocked graphs
18.104.22.168 Real-time scheduling problem
22.214.171.124 Consistency of a scheduled clocked graph
126.96.36.199 Makespan-optimizing scheduling algorithm
5.2 Static (off-line) mapping onto MPPA architectures
5.2.1 AAA for NoC-based MPPA: The problem
5.2.2 Extension of the CG format
188.8.131.52 Modeling of MPPA resources
184.108.40.206 Memory footprint specification
220.127.116.11 Non-functional properties
Worst-case computation durations
Worst-case communication durations
5.2.3 Makespan-optimizing scheduling
18.104.22.168 Mapping NoC communications
22.214.171.124 Multiple reservations
5.3 Automatic code generation
5.3.1 Tile code generation
5.4 Experimental results
List of Publications