Dynamic Load Balancing for Distributed Link-Heterogeneous Computing Environments 


Decentralized computing architectures

This model aims to alleviate the bottleneck issue of the centralized architecture: it is fully distributed and does not rely on any centralized master. There is no distinction between master and worker in this model. Each PU can play both roles, so that communications are evenly distributed among them. This model can be further classified according to the logical overlay, or topology, induced by the PUs' communications:
• Unstructured Overlay: no specific overlay topology is imposed globally upon the PUs and the topology is often random. Each PU broadcasts messages to its neighbors when sending a request, and the broadcast is repeated until the sender receives an answer or a maximum number of flooding hops is reached (a minimal sketch of such hop-bounded flooding is given after this list).
• Structured Overlay: an overlay topology is used to structure the PUs (or peers) in the system. The overlay structure aims at routing traffic effectively through the network, and PUs in a structured overlay have to maintain a list of neighbors that satisfy a given criterion. Stoica et al. [Stoica 2001] first proposed to organize the PUs as a ring and use a Distributed Hash Table (DHT). The DHT approach was later adopted in many works (e.g., YaCy [Yacy], CoralCDN [CoralCDN], Kademlia [Maymounkov 2002]).
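
To make the flooding mechanism of unstructured overlays concrete, the following is a minimal sketch of a hop-bounded request flood; the Node structure, the MAX_HOPS bound and the hand-built topology are illustrative assumptions, not an implementation used in this thesis.

```cpp
// Minimal sketch of hop-bounded flooding in an unstructured overlay.
// All names (Node, MAX_HOPS, the hand-built topology) are illustrative
// assumptions, not the thesis implementation.
#include <cstdio>
#include <queue>
#include <set>
#include <utility>
#include <vector>

constexpr int MAX_HOPS = 3; // flooding stops after this many hops

struct Node {
    int id;
    std::vector<int> neighbors; // random overlay: no global structure
    bool has_work;              // does this PU hold the requested resource?
};

// Flood a request from `origin`; returns the id of a PU that answered,
// or -1 if the hop budget was exhausted without an answer.
int flood_request(const std::vector<Node>& net, int origin) {
    std::queue<std::pair<int, int>> frontier; // (node id, hops used)
    std::set<int> visited{origin};
    frontier.push({origin, 0});
    while (!frontier.empty()) {
        auto [cur, hops] = frontier.front();
        frontier.pop();
        if (net[cur].has_work && cur != origin) return cur; // answer reaches sender
        if (hops == MAX_HOPS) continue;                     // stop re-broadcasting
        for (int nb : net[cur].neighbors)
            if (visited.insert(nb).second) frontier.push({nb, hops + 1});
    }
    return -1;
}

int main() {
    // Tiny hand-built overlay: ring 0-1-2-3 plus a shortcut 0-3; PU 2 holds the work.
    std::vector<Node> net = {
        {0, {1, 3}, false}, {1, {0, 2}, false}, {2, {1, 3}, true}, {3, {0, 2}, false}};
    std::printf("request from PU 0 answered by PU %d\n", flood_request(net, 0));
}
```

In a real deployment each PU would forward the request asynchronously to its neighbors; the breadth-first traversal above only simulates the propagation of the request and its hop budget.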
In this thesis, we also study the impact of a structured overlay on the performance of parallel B&B in distributed large-scale systems. In particular, we organize all available PUs as a tree and use this tree overlay to handle all the communications of parallel B&B, such as load balancing, knowledge sharing and distributed termination detection, at minimal cost.
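
As a rough illustration of such a tree overlay, the sketch below maps PU ranks 0..P-1 onto a complete K-ary tree; the arity K and the helper names are illustrative assumptions, and the actual routing of load balancing, knowledge sharing and termination detection messages over the tree is not reproduced here.

```cpp
// Minimal sketch of a tree overlay over PU ranks 0..P-1 (complete K-ary
// tree rooted at rank 0). K and the helper names are illustrative assumptions.
#include <cstdio>
#include <vector>

constexpr int K = 2; // arity of the overlay tree (assumption)

int parent_of(int rank) { return rank == 0 ? -1 : (rank - 1) / K; }

std::vector<int> children_of(int rank, int num_pus) {
    std::vector<int> kids;
    for (int c = K * rank + 1; c <= K * rank + K && c < num_pus; ++c)
        kids.push_back(c);
    return kids;
}

int main() {
    const int P = 7; // number of PUs
    for (int r = 0; r < P; ++r) {
        std::printf("PU %d: parent=%d children=", r, parent_of(r));
        for (int c : children_of(r, P)) std::printf("%d ", c);
        std::printf("\n");
    }
}
```

In such a structure every PU keeps at most K+1 neighbors and any two PUs are connected by O(log P) tree hops, which keeps the per-PU communication state small.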

Parallel B&B for Distributed Computing Environments

In the previous sections, we briefly presented some characteristics of the considered computing environments and of sequential/parallel B&B, as well as some general implementation issues. In this section, we first detail the specific challenges of parallel B&B algorithms, then we present related work on parallel B&B for both shared and distributed memory systems.

Main challenges in parallel distributed B&B

At first glance, parallelizing a B&B algorithm appears straightforward, because each generated subproblem can be solved independently. However, this tree-based parallelization yields highly irregular computations and raises difficult challenges:
• tasks are created dynamically in the course of the algorithm.
• tasks are assigned to PUs at runtime.
• the structure of the B&B tree to explore is not known beforehand.
• workload of PUs varies dynamically during execution.

B&B on shared memory systems

Barreto et al. [Barreto 2010] conducted a comparison of parallel B&B approaches on shared and distributed memory systems, as well as of their potential improvements over the sequential algorithm. The authors used OpenMP and MPI to implement their approaches on shared and distributed memory, respectively. In the OpenMP approach, a single work pool storing all tasks (i.e., the generated subproblems of B&B) is shared by all worker threads. In order to avoid data races on the pool, a synchronization technique is used: it handles the multiple accesses to the single pool both when idle threads try to retrieve tasks and when working threads push new tasks. In [Barreto 2010], the authors reported a good speedup of the OpenMP and MPI approaches over the sequential one. However, the speedup of MPI was found to be slightly better than that of the OpenMP approach, which may appear counter-intuitive at first sight. It is indeed well understood that inter-process communication is more costly in MPI, since messages have to pass through the network connecting the computers. However, this result also highlights the negative impact of synchronization mechanisms in shared memory systems, which can be explained as follows. First, high parallelism cannot be achieved since all working threads are serialized while accessing the single pool. Second, the overuse of synchronization also introduces an important overhead.
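The following is a minimal sketch of such a single shared pool protected by a lock; the dummy subproblems (integer depths), the branching rule and the termination test are illustrative assumptions, not the code of [Barreto 2010]. Note how every pool access serializes the worker threads on the mutex, which is precisely the overhead discussed above.

```cpp
// Minimal sketch of a single shared work pool protected by a mutex.
// The integer "subproblems" and the branching rule are illustrative
// assumptions, not the code of [Barreto 2010].
#include <cstdio>
#include <mutex>
#include <stack>
#include <thread>
#include <vector>

std::stack<int> pool;   // single shared pool of subproblems
std::mutex pool_mutex;  // serializes every push/pop
int explored = 0;       // only modified while holding the mutex

void worker() {
    while (true) {
        int task;
        {
            std::lock_guard<std::mutex> lock(pool_mutex);
            if (pool.empty()) return; // simplistic termination test (see note below)
            task = pool.top();
            pool.pop();
            ++explored;
        }
        if (task < 12) { // "branch": push two child subproblems
            std::lock_guard<std::mutex> lock(pool_mutex);
            pool.push(task + 1);
            pool.push(task + 1);
        }
    }
}

int main() {
    pool.push(0); // root subproblem
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    std::printf("explored %d subproblems\n", explored);
}
```

The empty-pool test is only a simplistic termination criterion: a thread that finds the pool momentarily empty exits early, which degrades parallelism but does not lose work, since the remaining threads keep processing the children they push.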
Casado et al. [Casado 2008] proposed two schemes for parallelizing B&B algorithms on shared memory multicore systems. In the first scheme, called Global-PAMICO, all threads share a global pool of generated subproblems; a synchronization mechanism is therefore used to coordinate the accesses of all threads to the global pool. In the second scheme, called Local-PAMICO, each thread manages a local pool in order to avoid the significant synchronization overhead caused by the global pool of the first scheme. A thread terminates itself when there are no more subproblems in its local pool. A dynamic load balancing scheme is proposed to deal with the irregularity of B&B: at each iteration, when a certain condition is satisfied, a thread creates a new thread and migrates work from its local pool to the pool of the new thread. The condition is that the number of running threads is less than the total number of available cores and that there is more than one subproblem in the local pool of the thread. This strategy might also introduce a significant overhead when a lot of new threads are created due to the irregularity of parallel B&B.
Similarly, in [Mezmaz 2013], the authors presented an approach to parallelize B&B on multicore systems in which all threads also share a single work pool. They reported a large gap between their implementation and the ideal linear speedup: the larger the number of threads/cores used, the bigger the gap. Several factors contribute to this, but synchronization is the biggest issue.
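To make the Local-PAMICO-style spawning rule concrete, the sketch below creates a new thread and migrates half of the local pool whenever fewer threads than cores are running and the local pool holds more than one subproblem; the dummy subproblems, the migration ratio and all helper names are illustrative assumptions, not the code of [Casado 2008].

```cpp
// Minimal sketch of a Local-PAMICO-style spawning rule: a thread spawns a new
// thread and migrates part of its local pool when (running threads < cores)
// and (local pool holds more than one subproblem). Names and the migration
// ratio are illustrative assumptions, not the code of [Casado 2008].
#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

const unsigned MAX_THREADS =
    std::thread::hardware_concurrency() ? std::thread::hardware_concurrency() : 2;
std::atomic<unsigned> running_threads{0};
std::atomic<int> explored{0};
std::mutex spawn_mutex;             // protects the list of spawned threads
std::vector<std::thread> spawned;

void worker(std::deque<int> local_pool) {
    ++running_threads;
    while (!local_pool.empty()) {
        int task = local_pool.back();
        local_pool.pop_back();
        ++explored;
        if (task < 14) {            // "branch": two child subproblems
            local_pool.push_back(task + 1);
            local_pool.push_back(task + 1);
        }
        // Spawning condition: free cores and surplus work (check is approximate).
        if (running_threads.load() < MAX_THREADS && local_pool.size() > 1) {
            std::deque<int> migrated(local_pool.begin(),
                                     local_pool.begin() + local_pool.size() / 2);
            local_pool.erase(local_pool.begin(),
                             local_pool.begin() + local_pool.size() / 2);
            std::lock_guard<std::mutex> lock(spawn_mutex);
            spawned.emplace_back(worker, std::move(migrated));
        }
    }
    --running_threads;              // thread terminates when its pool is empty
}

int main() {
    {
        std::lock_guard<std::mutex> lock(spawn_mutex);
        spawned.emplace_back(worker, std::deque<int>{0}); // root subproblem
    }
    // Join all threads, including those created while joining.
    for (std::size_t i = 0; ; ) {
        std::thread t;
        {
            std::lock_guard<std::mutex> lock(spawn_mutex);
            if (i == spawned.size()) break;
            t = std::move(spawned[i++]);
        }
        t.join();
    }
    std::printf("explored %d subproblems\n", explored.load());
}
```

The check of the running-thread counter is deliberately approximate: two threads may pass it simultaneously and briefly exceed the core count, which mirrors the thread-creation overhead mentioned above.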
In [Evtushenko 2009], BNB-Solver, a software platform allowing the use of serial, shared memory and distributed memory B&B algorithms, is presented. BNB-Solver is based on a parallel exploration of the tree and assumes that several CPU threads are used, where each thread has its local pool of subproblems and shares a global pool of subproblems with the other threads. In BNB-Solver, each thread executes a fixed number N of iterations of the sequential B&B algorithm. During these N iterations, each thread stores the newly generated subproblems in its local pool and only transfers a part of them to the global pool at the end of the N iterations. Each thread tries to select the next subproblem to be processed from its local pool. If the local pool is empty, the thread selects a subproblem from the global pool. If the global pool is also empty, the thread blocks itself until another thread puts at least one subproblem in the global pool. Once new subproblems are inserted into the global pool, blocked threads are released and retrieve subproblems from the global pool. BNB-Solver ends when the global pool is empty and all threads are blocked.
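The sketch below illustrates this local/global pool scheme: each thread runs N sequential iterations on its local pool, transfers part of its subproblems to the shared global pool, and blocks on the global pool when idle. The dummy subproblems (integer depths), the value of N and the transfer ratio are illustrative assumptions, not the code of [Evtushenko 2009].

```cpp
// Minimal sketch of a BNB-Solver-style pool scheme: per-thread local pools,
// a shared global pool, periodic transfers every N iterations, and blocking
// with termination when the global pool is empty and all threads are blocked.
// Subproblems, N and the transfer ratio are illustrative assumptions.
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <stack>
#include <thread>
#include <vector>

constexpr int N = 8;            // iterations between transfers (assumption)
constexpr int NUM_THREADS = 4;

std::stack<int> global_pool;
std::mutex mtx;
std::condition_variable cv;
int blocked = 0;                // threads currently waiting on the global pool
bool finished = false;
std::atomic<int> explored{0};

void worker() {
    std::stack<int> local_pool;
    while (true) {
        // Run up to N iterations of sequential B&B on the local pool.
        for (int it = 0; it < N && !local_pool.empty(); ++it) {
            int task = local_pool.top();
            local_pool.pop();
            ++explored;
            if (task < 14) {                    // "branch": two children
                local_pool.push(task + 1);
                local_pool.push(task + 1);
            }
        }
        // Transfer part of the local pool to the global pool (here: half).
        if (local_pool.size() > 1) {
            std::lock_guard<std::mutex> lock(mtx);
            for (std::size_t k = local_pool.size() / 2; k > 0; --k) {
                global_pool.push(local_pool.top());
                local_pool.pop();
            }
            cv.notify_all();
        }
        if (!local_pool.empty()) continue;
        // Local pool empty: take from the global pool, or block until work
        // arrives; terminate when the global pool is empty and all are blocked.
        std::unique_lock<std::mutex> lock(mtx);
        ++blocked;
        cv.wait(lock, [] { return !global_pool.empty() || finished ||
                                  blocked == NUM_THREADS; });
        if (global_pool.empty()) {
            finished = true;        // all threads idle: signal termination
            cv.notify_all();
            return;
        }
        --blocked;
        local_pool.push(global_pool.top());
        global_pool.pop();
    }
}

int main() {
    global_pool.push(0);            // root subproblem
    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_THREADS; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    std::printf("explored %d subproblems\n", explored.load());
}
```

The termination condition follows the rule stated above: a thread that finds the global pool empty while every other thread is already blocked declares the run finished and wakes the remaining threads.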


Table of contents:

1 Parallel Branch and Bound on Distributed System 
1.1 Introduction
1.2 Serial Branch and Bound
1.3 Parallel Branch and Bound algorithm
1.3.1 Classification of Parallel B&B
1.4 Implementation considerations
1.4.1 Work pool management
1.4.2 Communication model
1.5 Computing Environments
1.5.1 Shared Memory Systems
1.5.2 Distributed Memory Systems
1.6 Parallel B&B for Distributed Computing Environments
1.6.1 Main challenges in parallel distributed B&B
1.6.2 B&B on shared memory systems
1.6.3 B&B on distributed memory systems
1.7 Conclusions and discussions
2 Dynamic Load Balancing for Homogeneous Computing Environments
2.1 Introduction
2.2 Dynamic Load Balancing
2.2.1 Overview
2.2.2 Work Stealing
2.3 Random Work Stealing and Parallel B&B
2.3.1 Preliminaries
2.3.2 Work sharing
2.3.3 Knowledge sharing and termination detection
2.4 Overlay-based Work Stealing and Parallel B&B
2.4.1 Preliminaries
2.4.2 Tree-based work stealing
2.4.3 Bridge-based work stealing
2.4.4 Cooperative tree-dependent work sharing
2.5 Application Benchmarks
2.5.1 Flow-Shop
2.5.2 Unbalanced Tree Search
2.6 Large Scale Experimental Analysis
2.6.1 Experimental setting
2.6.2 Comparison between parallel and sequential deployment
2.6.3 Comparison between TR and TD
2.6.4 Comparison between TD and BTD
2.6.5 TD and BTD vs. AHMW for B&B
2.6.6 BTD vs. MW vs. RWS for B&B
2.6.7 Scalability of BTD vs. MW for B&B
2.6.8 Scalability of BTD vs. RWS for B&B and UTS
2.7 Conclusion
3 Dynamic Load Balancing for Node-Heterogeneous Computing Environments
3.1 Introduction
3.2 A comprehensive overview of our approach
3.2.1 Mapping B&B Parallelism (Q1)
3.2.2 Workload Irregularity (Q2)
3.2.3 PUs Compute Power (Q3)
3.3 The 2MBB architecture: Multi-CPUs Multi-GPUs Parallel B&B
3.3.1 Host-device parallelism for single CPU-GPU
3.3.2 Adaptive Stealing for Multi-CPUs Multi-GPUs
3.4 The 3MBB architecture: Multi-cores Multi-CPUs Multi-GPUs Parallel B&B
3.4.1 Intra-node parallelism
3.4.2 Inter-node parallelism
3.5 Experiments
3.5.1 Experimental Setting
3.5.2 Performance of 2MBB architecture
3.5.3 Performance of 3MBB architecture
3.6 Conclusion
4 Dynamic Load Balancing for Distributed Link-Heterogeneous Computing Environments 
4.1 Introduction
4.2 Work stealing in distributed link-heterogeneous computational environments
4.2.1 Distributed Link-heterogeneous Computational Environments
4.2.2 Work stealing in link-heterogeneous environments
4.3 Link-heterogeneous work stealing
4.4 Experimental Design and Methodology
4.4.1 Methodology
4.4.2 Network instances
4.4.3 Protocol deployment
4.5 Experimental Results and Analysis
4.5.1 Overall performances
4.5.2 Analysis of Protocol Dynamics
4.6 Conclusion
Conclusions and Perspectives 
Bibliography
