allocating CPM tasks to heterogeneous platforms

Get Complete Project Material File(s) Now! »

Multiprocessors & Parallel Systems

Introduction

Gordon E. Moore, the co-founder of Intel and Fair-child Semiconductor observed, in 1965, that the number of transistors in a dense integrated circuit is doubling every year in the number of components per integrated circuit, and projected this rate of growth would continue for at least another decade. In 1975, looking forward to the next decade, he revised the forecast to doubling every two years. This observation is called the moore’s law (please refer to Aspray, 2004, Moore, 2006, Moore, 2005). In the last decade, Intel stated that the pace of advancement has slowed. Starting from 2004, the market leaders have difficulties of satisfying Moores law greedy de-mand for computing power using classic single processor architectures because increasing the operating clock speed improvements slowed due to the serious heating problems and consider-able power consumption (see Figure 1.1) and also due to the increasing gap between processor and memory speeds. This, in effect, pushes cache sizes to be larger in order to mask the latency of memory. This helps only to the extent that memory bandwidth is not the bottleneck in per-formance, the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
In order to continue performance improvements, processor manufacturers such as Intel and AMD have turned to multicore designs. the idea is to put several computing elements on the same die and operate these cores on lower frequencies.
A multicore processor is a single computing component with more than one computing element (referred as “cores”). Manufacturers integrate cores onto a single integrated circuit. In contrast to multicore systems, the term multiprocessors refers to multiple physically separate processing-units.

Classification of multicore systems

Multicore processors have been pervading and several architectures of multicore systems had been proposed. For example, cores may or may not share caches, and they may implement mes-sage passing (MPI itemplementations 2016) or shared-memory inter-core communication (such as OpenMP Architecture Review Board, 2008). At the same time, cores may run same instruc-tions or different ones. Cores may be interconnected by a single bus, ring, two-dimensional mesh, and crossbar, . According to a criteria (Instruction set, memory, micro-architectures), multicore (multiprocessor) systems can be classified to distinguish between them.

According to Flynn

Flynn, 1972 classification of architectures is based upon the number of instructions that could be run at the same time (single, multiple), and on the data streams on which instructions are applied (also single, multiple). Hence, 4 classes can be distinguished:
Single Instruction, Single Data (SISD): Also known as the Von Neumann machine, or sequential computer. In this class no parallelism is allowed, only one instruction is run at the time. A single control unit fetches a single instruction from memory. However, SISD machines can have concurrent processing characteristics. Pipelined processors and superscalar processors are common examples found in most modern SISD computers (Michael, 2003).
Single Instruction, Multiple Data (SIMD): In such architectures, the same instruction can be applied to different data at the same time. In January 8, 1997, Intel proposed the 1st processor with MMX technology, The Pentium MMX (P166MX) operating at frequency 166 Mhz (P166MX) is the first SIMD machine.
Multiple Instruction, Single Data (MISD) Multiple instructions operate on the same data stream. It is very uncommon architecture and only few cases can use this kind of machines.
Multiple Instruction, Multiple Data (MIMD) Multiple processing elements simultane-ously executing different instructions on different data. This class is the more general than all previous classes. All architectures, we use in this thesis, are from this class of multicore architectures. An example of MIMD system is Intel Xeon Phi, descended from Larrabee microarchitecture. These processors have multiple processing cores (up to 61 as of 2015) that can execute different instructions on different data (Pfister, 2008).

According to architecture and microarchitecture

According (as state in Davis and Burns, 2011) to the difference (architecture and micro-architecture)
between the computing elements, the multiprocessors (multicores) can be classified to:
Identical: The processors are identical; hence the execution time of a processing is the same on all processors. The odroid C2 contains an identical core platform compound of 4 ARM cortex A53 processors (HardKernel, 2016).
Uniform: The rate of execution of a processing depends only on the speed of the pro-cessor. Thus, a processor of speed 2, will execute a processing at twice of the rate of a processor of speed 1. The third generation of Intel i5 can be considered as a unifrom architecture (Intel, 2016).
Heterogeneous: The processors are different; hence the rate of execution of a processing depends on both the processor and the task. Indeed, not all tasks may be able to execute on all processors and a processing may have different execution rates on two different processors operating at the same speed. The single ISA platforms such ARM big.LITTLE allow to have the same instruction set architecture in all cores, but with different microar-chitecture and architecture, thus allowing a task to be executed on any core. SAMSUNG EXYNOS 5422 is a Single ISA platform compound of 8 cores: 4 “big” cores ARM A15, and 4 little cores ARM A7 (Samsung, 2016).
In Chapter 4, we will be interested in uniform architectures, and in single ISA architec-tures, especially SAMSUNG EXYNOS 5422 processor in Chapter 5. In last chapter, we will be restricted to only identical core platforms.

According to memory architecture

Shared memory A shared memory multicore (multiprocessor) offers a single memory space used by all processors. Any processors can access physically to data at any location in the memory (see Figure 1.2).
Distributed memory refers to a multicore (multiprocessor) in which each processor has its own private memory. Processings can only operate physically on local data, and if remote data is required, the processing must communicate with one or more remote pro-cessors (see Figure 1.3). hybrid memory refers to a multiprocessor in which each core has its own private memory, and all cores share a global memory. The largest and fastest computers in the world today employ both shared and distributed memory architectures (see Figure 1.4).
In the experiments that we present in chapter 5, we will be restricted to a platform with a shared memory.

Programming parallel architecture

Parallel computing is a type of computation in which many calculations are carried out si-multaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved at the same time. There are several different forms of parallel com-puting: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to cyber-physical systems. Parallel computing is closely related to concurrent computing, they are frequently used together, and often conflated, though the two are distinct: it is possible to have parallelism without concurrency (such as bit-level parallelism), and concurrency without parallelism (such as multitasking by time-sharing on a single-core CPU). In some cases par-allelism is transparent to the programmer, such as in bit-level or instruction-level parallelism, but explicitly parallel algorithms, particularly those that use concurrency, are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchro-nization between the different subtasks are typically some of the greatest obstacles to getting good parallel program performance.
Before talking in more details about programming parallel architectures, we give a quick overview on two fundamental concepts in parallel computing: Threads and Processes.

Thread & Process

First, we will give basic definitions of process, thread and how a real-time task is implemented a like thread. A process is an instance of a computer program that is being executed. Each process has a Process Control Block (PCB) which contains information about that process. Depending on the implementation of PCB in OS1, PCB may hold different pieces of information, commonly it contains:

PID: Process Identifier
PPID the Parent Process Identifier; UID: User Identifier;
The values of registers: The current state of the process even if it is ready, running, of blocked;
Instruction Pointer (IP) or Program counter (PC): it contains the address of the next instruction to execute;
Process addressing space;
PCB may contains other information like the processor time, register values, .

A thread is the smallest sequence of instructions that can be managed independently by a scheduler. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. If a process has only one thread, the thread is executed in sequential (single thread). If the process contains multiple threads, they execute concurrently and share resources such as memory. On uniprocessor systems, the CPU is switched between multiple threads/processes. When a thread/process execution is inter-rupted from a higher priority task, all information about the interrupted task are saved, and the information of the new one is loaded, this operation is called Context Switch.

Table of contents :

Acknowledgements
Abstract
Introduction
I Context, Motivations & Related work
1 Multiprocessors & Parallel Systems
1.1 Introduction
1.1.1 Classification of multicore systems
1.2 Programming parallel architecture
1.2.1 Thread & Process
1.2.2 Sources of parallelism
1.2.3 Communication models
1.2.4 Decomposition & granularity of a parallel task
1.2.5 Limits and costs of parallel programming
1.3 Parallel models
1.3.1 Fork-Join model
1.3.2 Gang Model
1.4 Designing a parallel code
1.5 Power consumption in multiprocessor systems
1.5.1 DVFS: Dynamic voltage and frequency scaling
1.5.2 DPM: Dynamic Power Management
1.6 Conclusion
2 Introduction to real-time systems
2.1 Introduction
2.2 Task Model
2.3 Priority Driven Scheduling
2.3.1 Scheduling characteristics
2.4 Uniprocessor Scheduling
2.4.1 Rate Monotonic RM
2.4.2 Deadline Monotonic DM
2.4.3 Earliest Deadline First
2.5 Multiprocessor Scheduling
2.5.1 Partitioned Scheduling
2.5.2 Global Scheduling
2.5.3 Semi-Partitioned
2.6 Programming Real-time systems
2.6.1 Real-time Scheduling policies In LINUX Kernel
2.6.2 POSIX Threads
2.7 Conclusion
3 Parallel Real-time: Related work
3.1 CPS Systems Needs
3.1.1 Real-time needs
3.1.2 Parallel computing needs
3.1.3 CPS and energy consumption
3.2 This work
3.2.1 Global or Partitionned?
3.2.2 What kind of parallelism to do and where?
3.2.3 What energy saving techniques are we going to use?
3.3 Related work
3.3.1 Taxonomy on parallel real-time tasks
3.3.2 OpenMP
3.4 Parallel Real-time tasks & scheduling
3.4.1 Gang Model
3.4.2 Multithread Model
3.4.3 Federated scheduling
3.5 Related work to energy consumption
3.6 Conclusion
II Contributions
4 FTC on Uniform cores
4.1 Introduction
4.2 System overview
4.3 Architecture Model
4.4 Task model
4.5 Power & Energy Models
4.5.1 Power Model
4.5.2 Energy Model
4.6 Allocation and Scheduling
4.6.1 Exact Scheduler
4.6.2 FTC Heuristic
4.7 Experimentation
4.7.1 Task Generation
4.7.2 Simulations
4.7.3 Results & discussions
4.8 Conclusion
5 allocating CPM tasks to heterogeneous platforms
5.1 Introduction
5.2 System Model
5.2.1 Experimental platform
5.2.2 Architecture Model
5.2.3 Model of the execution time
5.2.4 Parallel moldable tasks
5.2.5 Power model
5.2.6 Energy model
5.3 Allocation & Scheduling
5.3.1 Optimal schedulers
5.3.2 Scheduling heuristics
5.3.3 Frequency selection
5.3.4 CP partitioning
5.4 Results and discussions
5.4.1 Task Generation
5.4.2 Simulations
5.4.3 Scenario 1
5.4.4 Scenario 2
5.5 Conclusion
6 Parallel Di-graph model
6.1 Introduction
6.2 Some related work
6.3 System Model
6.3.1 Architecture model
6.3.2 Task Model
6.4 Parallel applications
6.4.1 MPEG encoding/decoding
6.4.2 Array-OL
6.5 Schedulability analysis
6.5.1 Decomposition
6.5.2 Analysis
6.6 Heuristics
6.6.1 Task decomposition & thread allocation
6.7 Results and Discussions
6.7.1 Task Generation
6.7.2 Simulations
6.7.3 Scenario 1
6.7.4 Scenario 2
6.8 Conclusion
Conclusion & Perspactives
Personal publications
Bibliography