Work stealing for IVM-based B&B on multi-core CPUs

Challenges in parallel Branch-and-Bound

Irregularity. Most of the difficulties that arise when implementing a parallel B&B algorithm are direct consequences of the algorithm's irregularity. In each of the four presented parallelization models, irregularity manifests itself in a different way. Because of the unpredictable pruning of branches, some subproblems require much more computation time than others, which leads to load imbalance. As subproblems are dynamically assigned to processing units at runtime, dynamic load balancing is a requirement for the efficient exploitation of the available computing resources. The parallel evaluation of nodes yields a more fine-grained and often more regular workload. However, the evaluation of nodes may require a variable amount of time, depending on the problem being solved and the depth of each node.
The search space management as well as the problem-dependent node evaluation subroutines are often characterized by highly irregular control flows. In a permutation problem, the memory accesses during node evaluation depend on the partially constructed solution. Many accesses are therefore masked and have irregular and unpredictable strides. Diverging instruction flows and random memory access patterns are well-known to be major obstacles to SIMD processing. This may compromise the efficient usage of GPUs and/or vector processing units, two of the main factors that have increased the performance of HPC systems over the last decade.

Work pool management. We call work pool the data structure used to store generated but not yet evaluated subproblems. A subproblem is usually implemented as a structure containing all information necessary for its evaluation. The role of the work pool is essentially to allow the insertion and retrieval of subproblems. Moreover, the work pool may maintain subproblems in a certain order, facilitating the implementation of a search strategy. For instance, DFS corresponds to processing nodes in last-in first-out (LIFO) order and is therefore naturally implemented by a stack. Best-first search is usually implemented using priority queues. In general, priority queues offer good flexibility, because it is enough to change the sorting criterion to implement another search strategy, as the sketch below illustrates.
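As a minimal illustration (the Node type and the comparator are hypothetical placeholders, not taken from any specific framework), the following C++ sketch shows how the choice of the underlying container and its ordering criterion determines the search strategy:

```cpp
#include <queue>
#include <stack>
#include <vector>

// Hypothetical subproblem type: a partial solution and its lower bound.
struct Node {
    std::vector<int> partial;
    int lower_bound;
};

// Ordering criterion for best-first search: the node with the smallest
// lower bound is retrieved first.
struct ByLowerBound {
    bool operator()(const Node& a, const Node& b) const {
        return a.lower_bound > b.lower_bound;  // min-heap on the bound
    }
};

// Best-first search: a priority queue ordered by the lower bound.
using BestFirstPool = std::priority_queue<Node, std::vector<Node>, ByLowerBound>;

// DFS: LIFO order, naturally implemented by a stack.
using DepthFirstPool = std::stack<Node>;
```

Switching from DepthFirstPool to BestFirstPool, or replacing the comparator, changes the exploration order without touching the rest of the algorithm.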
In parallel B&B there are different strategies for implementing the work pool. One approach is to use a single centralized work pool from which all workers concurrently pick subproblems for branching and evaluation. In shared memory systems, all workers access the same work pool to get subproblems; after branching a subproblem, each worker (thread) inserts the evaluated, non-pruned children back into the pool. In single-pool approaches, synchronization between workers to gain exclusive access to the pool is inevitable (see the sketch below). These concurrent accesses may create a bottleneck and memory contention, limiting the scalability of the approach. In distributed memory configurations the single-pool strategy is implemented using the master-worker paradigm. Again, the scalability of this approach is limited, as the master process becomes a bottleneck.

Multiple-pool approaches aim at solving this issue by using several pools. There are several variants of multiple-pool B&B algorithms; the most popular are collegial, grouped and mixed [GC94]. In the collegial variant each worker has its private pool. The grouped approach uses one shared pool per group of workers, and the mixed variant combines both approaches using a hierarchy of pools. Multiple-pool strategies alleviate the bottleneck problem of single-pool approaches, but they raise the issue of balancing the workload between the pools. Also, sharing knowledge among workers (such as the best known solution) and detecting termination become non-trivial. Generally speaking, multiple-pool approaches require more sophisticated communication models than single-pool approaches.
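The following sketch illustrates the synchronization constraint of the single-pool approach: every retrieval and insertion must acquire the same lock, which is precisely what limits scalability. All names (Node, branch, prune) are placeholders, not from any particular implementation.

```cpp
#include <mutex>
#include <optional>
#include <stack>
#include <vector>

struct Node { int lower_bound = 0; };                 // placeholder subproblem
std::vector<Node> branch(const Node&) { return {}; }  // placeholder decomposition
bool prune(const Node&) { return false; }             // placeholder bound test

class SharedPool {
    std::stack<Node> pool_;
    std::mutex m_;
public:
    // Every retrieval requires exclusive access to the single pool.
    std::optional<Node> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (pool_.empty()) return std::nullopt;
        Node n = pool_.top();
        pool_.pop();
        return n;
    }
    // Each worker inserts the evaluated, non-pruned children back into
    // the same pool, again under the lock.
    void push_children(const Node& parent) {
        std::vector<Node> children = branch(parent);  // branching done outside the lock
        std::lock_guard<std::mutex> lock(m_);
        for (Node& c : children)
            if (!prune(c)) pool_.push(std::move(c));
    }
};
```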
Load balancing. While the single-pool approach implicitly balances the workload among workers, multiple-pool approaches require explicit dynamic load balancing. Over the last decade work stealing [BL99] has been widely adopted as a standard way to distribute tasks among workers. In the context of B&B, tasks are subproblems (nodes of the B&B tree). In a work stealing algorithm, each thread uses a double-ended queue (deque) for storing tasks. Locally, a thread uses the tail of its deque as a stack, popping tasks to execute and pushing newly generated tasks onto the stack. When a thread's deque is empty it becomes a thief and steals tasks from the head of another thread's deque (the victim). Expressed in terms of a multiple-pool B&B algorithm, when a work pool is empty, the thread associated with the pool steals nodes from another work pool. There are two main reasons for using deques in work stealing approaches (see the sketch below). The first is that stealing tasks from the head of the deque allows the victim thread to continue working on the deque's tail without being slowed down by steal operations (work-first principle [FLR98]). A first non-blocking work-stealing deque that prevents contention during concurrent operations was proposed by Arora, Blumofe, and Plaxton [ABP01]. Dinan et al. [DLS+09] propose a work-stealing deque partitioned into a local and a public portion by a periodically updated split pointer. This data structure, called a split queue, allows lock-free access to both portions of the deque and requires locking only for updates of the split pointer. The scalability of work stealing using split queues has been demonstrated on up to 8192 distributed processing cores. In [ACR13] it is reported that concurrent deque operations require expensive memory fences, which has led to recent interest in implementations of work stealing with non-concurrent (private) data structures [ACR13; vDvdP14]. The second reason concerns the granularity of the work stealing mechanism. In B&B, as in many task-parallel applications, nodes at the bottom of the task stack (i.e. older tasks) represent a larger amount of work than recently spawned tasks at the top of the stack. Granularity (i.e. the number of nodes that are stolen) is one of the characteristic features of a work stealing strategy, together with the policy for selecting the work stealing victim. A recent survey of work stealing methods for scheduling parallel computations can be found in [YH17].

Data structures. The two previous paragraphs illustrate the central role of the data structure used for storing the huge number of subproblems. As mentioned, work pools are usually implemented as stacks, deques or priority queues. Operations on the B&B tree, like node selection, insertion of branched nodes and work transfers between multiple pools, are implemented as push and pop operations on dynamically sized data structures. We generically refer to this type of dynamic data structure as linked lists (LL).
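A minimal, lock-based sketch of the deque discipline described above (real implementations such as the ABP deque [ABP01] are non-blocking; this simplified version serializes all operations with a single mutex, and Task is a placeholder for a B&B subproblem):

```cpp
#include <deque>
#include <mutex>
#include <optional>

struct Task { int dummy; };  // placeholder for a B&B subproblem

class WSDeque {
    std::deque<Task> d_;
    std::mutex m_;
public:
    // Owner side: push/pop at the tail, LIFO (work-first principle).
    void push(Task t) {
        std::lock_guard<std::mutex> l(m_);
        d_.push_back(std::move(t));
    }
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> l(m_);
        if (d_.empty()) return std::nullopt;
        Task t = std::move(d_.back());
        d_.pop_back();
        return t;
    }
    // Thief side: steal the oldest task from the head; older tasks tend
    // to represent larger subtrees, giving coarser steal granularity.
    std::optional<Task> steal() {
        std::lock_guard<std::mutex> l(m_);
        if (d_.empty()) return std::nullopt;
        Task t = std::move(d_.front());
        d_.pop_front();
        return t;
    }
};
```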
Using such data structures has many advantages. For instance, it is relatively easy to adapt B&B to different problems by changing the definition of a node. Similarly, the search strategy can be modified easily, for instance by using a different sorting criterion for a priority queue. However, departing from the general case of B&B, the particular structure of a class of problems can be exploited, making it possible to use other types of data structures.
For example, a DFS-B&B applied to a 0-1 integer COP could use a single bit array of length n (the problem size) and an integer d indicating the current depth of the search. The values of the bit array up to depth d represent the current partial solution, and on each level the algorithm successively tries the alternatives 0/1. Branching consists in incrementing the current depth d. Backtracking is performed by decrementing d and incrementing the bit array. The basic idea is that it is not necessary to explicitly store all frontier nodes, because they can be deduced from the current active node. It should be emphasized that this is possible because the maximal B&B tree (the tree that would be explored if no pruning were used) is known in advance, and the exploration order (DFS) is deterministic.
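A minimal sketch of this scheme, with hypothetical prune and evaluate placeholders for the problem-specific parts:

```cpp
#include <vector>

// Placeholders for the problem-specific parts.
bool prune(const std::vector<int>& x, int d) { (void)x; (void)d; return false; }
void evaluate(const std::vector<int>& x) { (void)x; }  // complete solution x

void dfs01(int n) {
    std::vector<int> x(n, 0);     // x[0..d-1] is the current partial solution
    int d = 0;                    // current depth
    while (true) {
        if (d < n && !prune(x, d)) {
            d++;                  // branch: descend, trying x[d] = 0 first
            continue;
        }
        if (d == n) evaluate(x);  // leaf: a complete assignment
        // Backtrack: pop exhausted levels (bit already 1), then flip the
        // deepest remaining 0 to 1 ("incrementing the bit array").
        while (d > 0 && x[d - 1] == 1) x[--d] = 0;
        if (d == 0) return;       // maximal tree fully explored
        x[d - 1] = 1;
    }
}
```

The entire search state is the pair (x, d); the DFS frontier is never materialized.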

B&B for multi-core CPUs

Because of the simple generic formulation of B&B it is interesting to have frameworks that allow users to easily customize B&B to solve their problems. Many software frameworks have been proposed, including Bobpp [Men17; Bob], PEBBL [EHP15] and PICO [EPH01], parts of the ACRO project [ACR], ALPS/BiCePS [RLS04], and BCP and SYMPHONY, which are parts of the COIN-OR project [COI]. This list includes only those frameworks which appear to be maintained at the time of writing. B&B frameworks establish an interface between the user and the parallel machine by defining abstract types for search tree nodes and solutions. The user provides concrete implementations of these types as well as branching and bounding procedures, while the framework handles the more generic parts of parallel B&B. The mentioned frameworks differ in the variant(s) of B&B they provide, the parallelization model they use and the parallel programming environment they target. These frameworks are usually designed as multilayered class libraries, integrating additional features by building on top of existing layers. For example, BiCePS is built on top of ALPS to provide the data-handling capabilities required for implementing relaxation-based B&B, and PEBBL began its existence as the "core" layer of the parallel mixed integer programming (MIP) solver PICO.

The older versions of these frameworks are often based on master-worker approaches. To prevent the master process from becoming a bottleneck, hierarchical organizations have proved more efficient than pure master-worker implementations [EHP15; Men17; BMT12b]. In these approaches groups of workers form "clusters", cooperating locally and interacting with the master through a form of middle management. The idea is to improve locality and relieve the master process by introducing hubs, each handling several workers (master-hub-worker approach). In general, the role of hubs consists in providing work to a set of workers and coordinating the search process locally, while limiting interaction with the master and worker processes. For the PEBBL framework, near-linear speedups on over 6000 CPU cores have been obtained for large B&B trees with node evaluation costs of about 10 seconds [EHP15].
Recently, Herrera et al. [HSH+17] compared three implementations of a global optimization (GO) B&B algorithm using different levels of abstraction: the Bobpp framework, Intel Threading Building Blocks and a custom Pthreads implementation. While they find the Bobpp implementation the easiest to code, the authors show that the two other solutions offer better scalability for the test case used. For the optimized test functions, the authors report node processing rates of about 1 million nodes per second (Mn/s) for the sequential version of their custom implementation on a 2 GHz Sandy Bridge CPU.
Evtushenko, Posypkin, and Sigal [EPS09] present a software platform called BNB-Solver, allowing the use of serial, shared memory and distributed memory B&B. The proposed approach uses a global work pool and local work pools for each thread. Each thread stores generated subproblems in its local pool during N B&B iterations. After N iterations a part of the local nodes is transferred to the global pool. When the local pool of a thread is empty, the thread attempts to take nodes from the global pool and blocks if the global pool is empty. The algorithm terminates when the global pool is empty and all threads are blocked (a sketch of this interplay is given below). The authors compare the performance of BNB-Solver with the ALPS and PEBBL frameworks and obtain results similar to Herrera et al. [HSH+17], in the sense that, for a knapsack problem (with a reported sequential node processing rate on the order of 1 Mn/s), BNB-Solver outperforms both frameworks.
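The following sketch (our simplification, not BNB-Solver's actual code) illustrates the blocking global pool and the termination condition, i.e. the search ends when the global pool is empty and every thread is blocked:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <stack>

struct Node { int dummy; };  // placeholder subproblem

// Threads deposit part of their local pool here every N iterations and
// call take() when their local pool runs empty. take() blocks on an
// empty pool and returns nullopt only when the pool is empty AND all
// threads are blocked, i.e. on global termination.
class GlobalPool {
    std::stack<Node> pool_;
    std::mutex m_;
    std::condition_variable cv_;
    int blocked_ = 0;
    const int nthreads_;
public:
    explicit GlobalPool(int nthreads) : nthreads_(nthreads) {}

    void deposit(std::stack<Node>& local, int count) {
        std::lock_guard<std::mutex> l(m_);
        while (count-- > 0 && !local.empty()) {  // move part of the local pool
            pool_.push(std::move(local.top()));
            local.pop();
        }
        cv_.notify_all();
    }

    std::optional<Node> take() {
        std::unique_lock<std::mutex> l(m_);
        ++blocked_;
        while (pool_.empty() && blocked_ < nthreads_) cv_.wait(l);
        if (pool_.empty()) {      // every thread is blocked: terminate
            cv_.notify_all();
            return std::nullopt;
        }
        --blocked_;
        Node n = std::move(pool_.top());
        pool_.pop();
        return n;
    }
};
```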
Casado et al. [CMGH08] propose two schemes for parallelizing B&B algorithms for global optimization on shared memory multi-core systems, Global- and Local-PAMIGO (Parallel advanced multidimensional interval analysis global optimization). Both algorithms are parallelized using POSIX threads. In Global-PAMIGO, threads share a global work pool, so a synchronization mechanism is used for mutually exclusive access to the pool. In Local-PAMIGO, where each thread has its own pool of subproblems, a dynamic load balancing mechanism is implemented (sketched below). A thread stops when its local pool of subproblems is empty. When the number of running threads is less than the number of available cores and a thread has more than one subproblem in its local pool, it creates a new thread and transfers a portion of its pool to the new thread. Local-PAMIGO ends when no running threads remain, and Global-PAMIGO ends when the global pool is empty. The authors report profiling results for PAMIGO which show that memory management represents a large percentage of the computational burden: as a very large number of subproblems is created in a relatively short amount of time, the kernel must satisfy memory allocation and deallocation requests from all threads, creating memory contention.
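A rough sketch of Local-PAMIGO's load balancing idea, under the simplifying assumptions that pools are plain std::stacks and that the benign check-then-spawn race on the thread counter is tolerated (it may transiently oversubscribe by one thread):

```cpp
#include <atomic>
#include <cstddef>
#include <stack>
#include <thread>

struct Node { int dummy; };  // placeholder subproblem

std::atomic<int> running{1};
const unsigned ncores = std::thread::hardware_concurrency();

void worker(std::stack<Node> pool) {
    while (!pool.empty()) {
        // If a core is idle and we hold more than one subproblem,
        // spawn a new thread and hand over half of our pool.
        if (pool.size() > 1 && running.load() < static_cast<int>(ncores)) {
            std::stack<Node> half;
            for (std::size_t i = pool.size() / 2; i > 0; --i) {
                half.push(std::move(pool.top()));
                pool.pop();
            }
            running.fetch_add(1);
            std::thread(worker, std::move(half)).detach();
        }
        Node n = std::move(pool.top());
        pool.pop();
        (void)n;  // branch/bound n here, pushing non-pruned children onto pool
    }
    running.fetch_sub(1);  // a thread stops when its local pool is empty;
                           // the algorithm ends when running reaches 0
}
```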

B&B for Graphics Processing Units

The study of Jenkins et al. [JAO+11] provides a good overview of the challenges faced when implementing parallel backtracking on GPUs. Most of their conclusions from the investigation of the GPU-based backtracking paradigm remain valid for B&B algorithms using a depth-first search strategy. A fine-grained parallelization of the search space exploration and/or the node evaluation is necessary in order to exploit the GPU's massive parallel processing capabilities. This strongly depends on the nature of the problem being solved and on the choice of the parallelization model. Other critical factors include latency hiding through coalescence, saturation, and shared memory utilization [JAO+11]. Generally speaking, the algorithmic properties of B&B (irregularity of the search space, irregular control flow and memory access patterns) are at odds with the GPU programming model. Also, the memory requirements of backtracking and B&B algorithms are often difficult to estimate and may exceed the amount of memory available on GPUs.

Several approaches for GPU-accelerated B&B algorithms have been proposed. These approaches correspond to different parallelization models and their design is often motivated by the nature of the problem being solved. According to the characteristics of the bounding function one may distinguish between approaches for fine-, medium- and coarse-grained problems. The GPU-B&B and backtracking algorithms for fine-grained problems proposed in [CMNL11; CNNdC12; FRvLP10; LLW+15b; RS10; ZSW11] perform massively parallel searches on the GPU, based on the parallel tree exploration model. The evaluation of a node for the n-Queens problem in [FRvLP10; LLW+15b; ZSW11] requires only a few registers of memory and a couple of bit operations. The lower bound for the Asymmetric Traveling Salesman Problem (ATSP) used in [CMNL11; CNNdC12] is obtained incrementally by adding the cost of the last visited edge to the current cost and therefore has a complexity of O(1). It requires an access to the distance matrix, which can be stored in constant or texture memory. The size of the problems being solved is n < 20 for both the ATSP and the n-Queens problem. These algorithms for fine-grained problems share a common approach: the search is split into two parts, an initial CPU search and a final GPU search. The upper tree is processed down to a cutoff depth d in a sequential or weakly parallel manner on the CPU, generating a set of active nodes at depth d. This active set is sent to the GPU, where the lower part of the tree is processed in parallel. Each node of the active set is used as the root node of an independent search, which is mapped either to a thread or to a warp. This approach requires very careful tuning of the cutoff depth, which strongly influences granularity and load balance; a sketch of the scheme is given below.
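A host-side sketch of this two-phase scheme, with hypothetical Node, branch and prune placeholders; the GPU part is indicated only as a comment, since its exact shape depends on the problem and on the thread/warp mapping:

```cpp
#include <vector>

struct Node { std::vector<int> partial; };             // placeholder subproblem
std::vector<Node> branch(const Node&) { return {}; }   // placeholder decomposition
bool prune(const Node&) { return false; }              // placeholder bound test

// Sequential CPU search of the upper tree: expand nodes down to the
// cutoff depth and collect the frontier as the active set.
void expand_upper_tree(const Node& node, int depth, int cutoff,
                       std::vector<Node>& active_set) {
    if (depth == cutoff) {        // frontier reached: node joins the active set
        active_set.push_back(node);
        return;
    }
    for (const Node& child : branch(node))
        if (!prune(child))
            expand_upper_tree(child, depth + 1, cutoff, active_set);
}

// The active set would then be copied to the device and each node used
// as the root of an independent GPU search mapped to a thread or warp,
// e.g. (pseudo-signature): search_kernel<<<blocks, threads>>>(d_nodes, n);
```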

Hybrid and distributed parallel B&B

There are very few works on the parallelization of B&B using multiple GPUs and CPUs in distributed heterogeneous systems. In [VDM13] a linked-list-based, fully distributed hybrid B&B algorithm combining multiple GPUs and CPUs is proposed. As test cases, 20-jobs-on-20-machines FSP instances are used on a platform composed of 20 GPUs and 128 CPUs. For load balancing, a random work stealing mechanism is used. The authors propose an adaptive granularity policy that adapts the quantity of stolen nodes at runtime to the normalized computing power of thief and victim. The algorithm is based on a 2-level parallel model, using GPUs for the parallel evaluation of lower bounds. In order to reduce CPU-GPU communication overhead, an asynchronous implementation with overlapping host and device computations is proposed. Experimentally, near-linear mixed scalability is shown for up to 20 GPUs and 128 CPUs. In [CMMT13] the combined usage of multi-core and GPU processing is investigated. An experimental comparison of concurrent and cooperative approaches shows that the cooperative approach improves performance with respect to a GPU-only approach, while the concurrent approach is not beneficial. Among other issues, the authors identify the reduction of CPU-GPU communication overhead as a major challenge and propose overlapping communication schemes and auto-tuning of the offloaded pool sizes to address it.

Table of contents

Introduction
1 Parallel Branch-and-Bound algorithms 
1.1 Introduction
1.2 Solving permutation combinatorial optimization problems
1.3 Branch-and-Bound algorithms
1.3.1 Terminology and general description
1.3.2 Models for parallel Branch-and-Bound
1.3.3 Challenges in parallel Branch-and-Bound
1.4 Computing Environments
1.5 Related work
1.5.1 B&B for multi-core CPUs
1.5.2 B&B for Graphics Processing Units
1.5.3 Hybrid and distributed parallel B&B
1.6 Test-cases: Permutation-based COPs
1.6.1 Flowshop Scheduling Problem (FSP)
1.6.2 Quadratic Assignment Problem (QAP)
1.6.3 𝑛-Queens Problem
1.6.4 B&B tree analysis of the test problems
2 IVM-based B&B for multi-/many-core systems 
2.1 Introduction
2.2 IVM-based parallel Branch-and-Bound
2.2.1 IVM-based serial B&B
2.2.2 Position vector: factoradic numbers
2.2.3 Work units: intervals of factoradics
2.2.4 Work unit communication
2.3 Work stealing for IVM-based B&B on multi-core CPUs
2.3.1 Work stealing using factoradic intervals
2.3.2 Victim selection policies
2.3.3 Granularity policies
2.4 Acceleration of bounding operator
2.4.1 GPU acceleration
2.4.2 Vectorization of the FSP bounding procedure
2.5 Experiments
2.5.1 Evaluation of data structures for B&B
2.5.2 GPU-acceleration of the bounding operator
2.5.3 Evaluation of Work Stealing Strategies
2.5.4 Performance evaluation on Intel Xeon Phi
2.5.5 MC-B&B: performance on different multi-core CPUs
2.6 Conclusions
3 GPU-centric Branch-and-Bound 
3.1 Introduction
3.2 Discussion of design choices
3.3 GPU-B&B and GPU-backtracking
3.3.1 GPU-B&B: 2-level parallelization
3.3.2 Thread-data mapping and branch divergence reduction
3.3.3 GPU-BT: 1-level parallelization
3.4 Work stealing strategies for GPU-B&B
3.4.1 Victim Selection policies
3.4.2 Work stealing for multi-GPU-B&B
3.5 Experiments
3.5.1 Evaluation of Mapping approaches
3.5.2 Evaluation of Work Stealing strategies
3.5.3 Scalability analysis
3.5.4 Multi-GPU-B&B performance evaluation
3.5.5 Hybrid CPU-multi-GPU-B&B
3.6 Conclusions
4 Branch-and-Bound for hybrid HPC clusters 
4.1 Introduction
4.2 B&B for hybrid clusters
4.2.1 B&B@Grid
4.2.2 Design of hybrid distributed B&B
4.2.3 Redundant exploration
4.2.4 Implementation of worker process
4.3 Experiments
4.3.1 Experimental protocol
4.3.2 Resolution of very large problem instances
4.3.3 Scalability: Ouessant
4.3.4 Hybrid CPU/GPU scalability
4.3.5 Solving other 50×20 FSP instances
4.4 Conclusion
5 Conclusions and Perspectives 
List of Figures
