Landscape-Aware Selection of Metaheuristics 


Automated Algorithm Selection

Metaheuristics are classically introduced as frameworks within which a user can gather some modules to instantiate an algorithm. For instance, the design of an evolutionary algorithm requires choosing the population size, the variation and selection operators in use, the encoding structure, fitness function penalization weights, etc. This highly flexible design of metaheuristics allows for efficient abstractions but comes at the cost of having to solve an additional (meta-)optimization problem.
The impressive advances of machine learning (ML) techniques are currently shaking up literally every single scientific discipline, often by supporting decisions that previously required substantial expert knowledge with recommendations derived from automated data-processing techniques. Computer science is no exception, and an important application of ML is the selection and configuration of metaheuristics [HKV19, KHNT19, SM09], where automated techniques have proven to yield tremendous efficiency gains in several classic optimization tasks, including SAT solving [XHHL11], AI planning [VHCM15], and Traveling Salesperson Problem solving [KKB+18].
In the context of numerical optimization, supervised learning approaches are particularly common for the automated selection of algorithms [KT19a, BDSS17, MSKH15]. These methods often build on features developed in the context of fitness landscape analysis [ME13, PA12], which aims at quantifying the characteristics of an optimization problem through a set of features. More precisely, a feature maps a function (the optimization problem) to a real number. Such a feature could measure, for example, the skewness of f, its multi-modality, or its similarity to a quadratic function. In black-box optimization, the feature values need to be approximated from a set of (x, f(x)) pairs. The approximation of feature values through such samples is studied under the notion of exploratory landscape analysis (ELA [MBT+11]). ELA has been successfully applied, for example, in per-instance hyperparameter optimization [BDSS17] and in algorithm selection [KT19a].

Our Key Findings

In this thesis, we define six properties that features should satisfy to be efficient:
• stability, i.e., ability of a feature to keep the same values across independent samples of the same size;
• robustness to noise, i.e., ability of a feature to keep the same values when uniform noise is applied to the original samples;
• robustness to the sampling strategy, i.e., ability of a feature to keep the same values with different sampling strategies;
• expressiveness, i.e., ability of a feature to distinguish between different optimization problems;
• robustness, i.e., ability of a feature to keep the same values with different numbers of samples;
• transformation invariance, i.e., ability of a feature to keep the same values when transformations are applied to the fitness function or to the samples.
We found that none of the features analyzed fully satisfies the six properties. One of the six properties is the invariance to the sampling strategy. We found that, contrary to what was recommended in the literature, the sampling strategy actually matters. We found important discrepancies in the feature values computed from different sampling strategies. In particular, distributions of nearest better clustering feature values computed with different sampling strategies are non-overlapping. Overall, feature values based on Sobol sampling have better expressiveness.
We analyze the performance of 13 algorithms on two radar use-cases. Among the 13 algorithms, we consider five variants of the CMA-ES [vRWvLB16], Particle Swarm Optimization (PSO) [KE95], and the Nelder-Mead algorithm [NM65]. The use-cases are solved for two budgets of function evaluations: a small budget of 500 function evaluations and a large budget of 2,500 function evaluations. In our analysis, we found a complementarity of algorithms depending on the budget, i.e., different algorithms perform best depending on the number of function evaluations. Considering CMA-ES variants, we found that elitist variants are more suitable for low-budget settings while non-elitist variants perform better when the number of function evaluations available is larger.
On the two use-cases, we found that the difference between the single best solver (SBS) and the virtual best solver (VBS) is actually quite small, and thus, it is difficult to obtain better performance than that of the SBS. Yet, we found that landscape-aware algorithm selection achieves performance similar to that of the SBS.

Thesis Outline

This thesis is composed of five parts. Part I introduces the context of the thesis.
Part II surveys the background on black-box optimization and landscape-aware selection of metaheuristics.
Part III presents and analyzes the properties of landscape features. These landscape features and their properties constitute the central part of a landscape-aware selection of metaheuristics.
In Part IV, we describe the optimization of radar networks. We define here the main use-cases of the thesis, describe the results of different metaheuristics, and compare them to the performance of a landscape-aware selection of algorithms.
We conclude this thesis in Part V with an overview of results from this thesis and an outlook for promising research directions.
Within the five parts, this thesis is composed of ten chapters. Chapter 2 introduces black-box optimization and presents a list of state-of-the-art algorithms for numerical black-box optimization problems. The algorithms in this chapter are those that will later be used in this thesis.
In Chapter 3, the algorithm selection problem is introduced. We also introduce a method to tackle this problem: landscape-aware algorithm selection. Landscape-aware algorithm selection links performance data obtained by the algorithms presented in Chapter 2 with the feature data presented in Chapter 4, with the objective of finding the best-suited algorithm.
Chapter 4 gives an introduction to the properties of continuous optimization problems and ways to describe them. This chapter introduces both high-level properties that define optimization problems and low-level, cheap methods to gain insight into an optimization problem at hand. We focus particularly in this chapter on the exploratory landscape analysis technique. This technique will be used in the thesis in order to characterize problem properties. In this chapter, we also introduce new features based on existing ones in the Principal Component Analysis feature set (see Section 4.2.6).
Chapter 5 focuses on the computation of cheap problem features in order to efficiently characterize the optimization problems. This chapter defines six properties of features that are important to take into account when features are computed in order to maximize their efficiency. We also analyze in this chapter to what degree landscape features satisfy these properties.
Chapter 6 describes the basic principles of radar operations. This chapter contains some key radar equations and processing steps that will influence the objective function.
Chapter 7 introduces the components of the radar network configuration problem such as radar parameters and the target model. Chapter 7 also introduces the use-cases and the objective function that will be used in this thesis. We also introduce here the key functionalities of our gis framework: aggregation of radars into networks, constraint handling, handling of geographical data, etc.
Chapter 8 presents the results of a portfolio of optimization algorithms on the use-cases. This chapter also compares the performance of optimization algorithms to manually designed solutions. We show in this chapter that human-designed solutions are worse than the best performing algorithms and are close to the performance of random search.
Chapter 9 empirically investigates the application of landscape-aware algorithm selection on the radar network configuration problem and compares it to classical state-of-the-art solvers. This approach has performance similar to the single best solver on our use-cases.
We conclude this thesis with Chapter 10, which provides a summary of our results. We also share some promising research directions such as dynamic algorithm selection or artificially created benchmark instances for real-world applications.

Black-Box Optimization

The general context of this work is that of black-box optimization, where the goal is to minimize an objective function f : R^d → R defined on a continuous search space.
In black-box optimization, no information is known about the characteristics of the objective function f (e.g., separability, differentiability, smoothness, ...). The only information available is the value f(x) of a given solution x = (x1, ..., xd) ∈ R^d. Throughout the optimization process, the goal is to find one or several solutions that are as good as possible using as few evaluations as possible.
Many complex or real-world problems do not have any known analytic expression of their objective function. The values of the objective function are only reachable through experimentation or simulation. Black-box optimization is used to find sufficiently good solutions within a reasonable number of evaluations.
In order to find such solutions, metaheuristics have been developed throughout the years. Among metaheuristics, evolutionary computation techniques are widely used and have proven effective. In the next sections, the paradigm of evolutionary computation is described along with all the algorithms that will be used in this thesis.

Evolutionary Computation

 History

The idea of Evolutionary Computation (EC) first came up in the 1940s, as Turing proposed a "genetical or evolutionary search". In the 1960s, three different interpretations of this idea were simultaneously developed: Fogel in the US introduced Evolutionary Programming [Fog98], while Holland presented the Genetic Algorithm (GA) [Hol73]. Meanwhile, in Germany, Rechenberg and Schwefel introduced Evolution Strategies (ES) [RTE65, Sch65]. In the 1990s, these three interpretations were merged into one technology called Evolutionary Computation, while a fourth inspiration had emerged: Genetic Programming (GP) [Koz94].

Metaphor from Biology

As the name Evolutionary Computation suggests, these bio-inspired algorithms are developed with the main inspiration coming from natural evolution as described by Darwin. Solution candidates of a population are generated in an environment and endeavor to survive. Their ability to survive, i.e., their fitness, is determined by the environment. The process of solving a problem becomes a stochastic trial-and-error process where a population is evolved in order to fit the environment best. This phenomenon is coined survival of the fittest. Fitter solution candidates survive and reproduce while the others die. As in the Darwinian theory, some modifications can occur through reproduction.
As the evolution process goes by, the constitution of the population shifts towards the fittest solution candidates. This process is captured by the analogy of a landscape. On this landscape, d dimensions represent the biological traits of the solution candidates while an additional dimension represents the fitness of each candidate solution. Hence, each peak in the landscape represents an area of fit trait combinations.
These analogies correspond to classical optimization concepts. Each individual is a candidate solution and the population represents a set of these candidate solutions. The environment of the solution candidates is represented by the search space Ω ⊆ R^d. The quality of a solution candidate x ∈ Ω is assessed by the fitness function f : Ω → R, which is named the objective function in classical problem solving.

Canvas of an Evolutionary Algorithm (EA)

There are multiple branches in EC, but it is possible to define a general scheme that corresponds to every evolutionary algorithm. A general diagram of an EA can be found in Figure 2.1. By definition and construction, EAs are stochastic and population-based, i.e., they process a collection of candidate solutions concurrently. The following sections present the main building blocks of an EA.

Initialization

In most EAs, the initialization is kept simple, as a population is often generated uniformly at random. Quasi-random initialization procedures are also possible, such as Latin hypercube sampling [MBC79] or low-discrepancy sequences such as Sobol [Sob67] or Halton [Hal64] sequences. If one has knowledge of the underlying problem to solve, a heuristic could be applied in order to start with a population of greater fitness.
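The initialization schemes above can be sketched with the quasi-Monte Carlo generators of scipy; the function name and interface here are illustrative, not taken from the thesis:

```python
import numpy as np
from scipy.stats import qmc

def init_population(n, dim, lower, upper, method="uniform", seed=0):
    """Generate an initial population of n points in the box [lower, upper]^dim."""
    rng = np.random.default_rng(seed)
    if method == "uniform":
        unit = rng.random((n, dim))  # i.i.d. uniform points in the unit cube
    elif method == "sobol":
        unit = qmc.Sobol(d=dim, scramble=True, seed=seed).random(n)
    elif method == "lhs":
        unit = qmc.LatinHypercube(d=dim, seed=seed).random(n)
    else:
        raise ValueError(f"unknown method: {method}")
    return lower + unit * (upper - lower)  # rescale unit cube to the search box

# 16 = power of two, the recommended sample size for Sobol sequences
pop = init_population(16, 3, -5.0, 5.0, method="sobol")
```

A low-discrepancy sequence such as Sobol covers the box more evenly than i.i.d. uniform sampling, which matters later when landscape features are estimated from the same samples.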

Evaluation

The evaluation part is where the fitness function comes into play: the candidate solutions are evaluated. This part forms the basis for all the following blocks, as they will all need the results of the evaluation part.

Replacement

The replacement operator updates the current population. At this stage, quite often, the number of solution candidates may exceed the authorized population size. Hence, the solution candidates that will form the next generation have to be chosen among the offspring and/or parent solution candidates. The ones not selected die while the others become the parents of the next generation. Different replacement strategies have been introduced, such as generational replacement (i.e., no parents survive), steady-state replacement (e.g., one offspring replaces the worst parent in the population), and all possibilities between these two strategies.

Stopping Criterion

Choosing a stopping criterion may depend on the knowledge available about the problem at hand. If a global optimum x* is known, a stopping condition may be obtaining a solution x with fitness value (to be minimized) f(x) ≤ f(x*) + ε for some user-defined ε > 0. This stopping criterion is coined fixed-target. If no global optimum is known, which is often the case for real-world problems, there exist two main categories of stopping criteria: one based on time, the other based on fitness. The classical criteria used are:
• cap the CPU computation time;
• cap the number of fitness evaluations;
• fitness improvement stays under a threshold for too many generations;
• diversity of the population drops under a threshold;
• the population distribution is ill-formed.
Capping the CPU computation time or the maximum number of fitness evaluations is called fixed-budget.
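The fixed-budget, fixed-target, and stagnation criteria above can be combined in a single test; this sketch is ours, and the parameter names are illustrative:

```python
def should_stop(evals, best_f, *, budget=None, target=None,
                stagnation=None, gens_since_improvement=0):
    """Illustrative stopping test for a minimization run.

    budget     -- fixed-budget: stop after this many fitness evaluations
    target     -- fixed-target: stop once best_f <= target (= f(x*) + eps)
    stagnation -- stop after this many generations without improvement
    """
    if budget is not None and evals >= budget:
        return True
    if target is not None and best_f <= target:
        return True
    if stagnation is not None and gens_since_improvement >= stagnation:
        return True
    return False
```

An optimizer's main loop would call this once per generation, e.g. `should_stop(evals, best_f, budget=500)` for the small-budget setting used later in the thesis.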

Selection

Selection is a mechanism that determines the solution candidates of a population that will become the genitors of the next generation, and thereby the distribution from which the next solution candidates are sampled. This selection is often based on quality, i.e., fitter solution candidates (fitness to be minimized) stand better chances of getting selected than low-quality solution candidates. Allowing the selection of low-quality solution candidates prevents the algorithm from getting stuck in a local optimum. Different selection mechanisms can be used, such as elitism (i.e., only the fittest candidate solutions are selected), deterministic or stochastic tournaments, random selection, rank-based selection (i.e., solution candidates are selected using their rank, not directly their fitness values), or roulette wheel (i.e., the selection probability of a solution candidate is proportional to its relative fitness).
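As an illustration of one of these mechanisms, a stochastic tournament can be sketched in a few lines (our own minimal version, for minimization):

```python
import random

def tournament_select(pop, fitness, k=2, rng=random):
    """Stochastic tournament: draw k candidates at random and return the
    fittest of them. Small k keeps selection pressure low, so low-quality
    candidates retain a chance of being picked."""
    contestants = rng.sample(range(len(pop)), k)
    return pop[min(contestants, key=lambda i: fitness[i])]

parents = [[0.0], [1.0], [2.0]]
fitness = [5.0, 1.0, 3.0]
winner = tournament_select(parents, fitness, k=3)  # k = |pop|: always the fittest
```

With `k = 1` the tournament degenerates into random selection; with `k` equal to the population size it becomes deterministic, which connects the mechanisms listed above.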

Variation

Variation operators aim at creating new solution candidates, i.e., offspring, from parents. These operators are of two types: mutations, which are unary operators, and recombinations, which are n-ary operators, where n represents the number of solution candidates used for recombination.
Mutation. A mutation is a stochastic and unbiased transformation of a candidate solution. This operator also has a theoretical role, as mutation can guarantee the ergodicity of the algorithm, i.e., that the algorithm can visit the whole search space. However, it is important to note that this property holds only when the mutation operator allows jumping anywhere in the search space with non-zero probability, which is not the case for all mutation implementations. Mutation for continuous variables is often achieved by randomly modifying some candidate solutions in the population. Any probability distribution can be chosen to perform the modifications, such as the uniform distribution or the normal distribution.
Recombination. An n-ary operator is called recombination or crossover and merges two or more parents to create offspring. Like mutation, this operator can be deterministic or stochastic, and the parts inherited from one or another parent are random. Different recombination operators are available, such as one- or two-point recombination. For one-point recombination, a cutting point k is defined uniformly at random and applied to two parents x, y, i.e., the parents are divided in two: x = (x1, ..., xk, xk+1, ..., xd) and y = (y1, ..., yk, yk+1, ..., yd), d being the dimension of the problem. The new offspring z is created by taking elements from one or the other parent, i.e., z = (x1, ..., xk, yk+1, ..., yd). Another possibility is the uniform recombination, i.e., the offspring is created by taking each coordinate from one or the other parent uniformly at random.
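The variation operators described above can be sketched for real-valued vectors as follows (a minimal illustration with our own function names):

```python
import random

def gaussian_mutation(x, sigma=0.1, rng=random):
    """Unary operator: perturb every coordinate with N(0, sigma^2) noise."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def one_point_crossover(x, y, rng=random):
    """Binary operator: cut both parents at a random point k and splice,
    z = (x1, ..., xk, y(k+1), ..., yd)."""
    k = rng.randrange(1, len(x))  # cut point in 1..d-1
    return x[:k] + y[k:]

def uniform_crossover(x, y, rng=random):
    """Each coordinate is inherited from either parent with probability 1/2."""
    return [xi if rng.random() < 0.5 else yi for xi, yi in zip(x, y)]
```

Note that `gaussian_mutation` as written can reach any point of R^d with non-zero probability, which is exactly the ergodicity property discussed above.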


Evolution Strategies

Evolution Strategies (ES) are probably the simplest form of EA. They are mainly used in continuous optimization and have one variation operator: a mutation operated by a multivariate Gaussian. As an example, Algorithm 1 shows the pseudo-code of one of the simplest evolutionary algorithms: the (1 + 1) ES, where there is one parent and one offspring.
Some parameters are involved in the multivariate Gaussian, such as the standard deviation, or step-size, σ, and the covariance matrix C ∈ M_{d×d}(R), d being the dimension of the problem. Algorithm 1 has a constant population of one candidate solution, but more generic algorithms can have more parents, noted μ, and/or more offspring, noted λ. In ES, there are two types of selection that can be described by a particular notation:
• (μ/ρ, λ) ES, where the new parents are selected among the best offspring only. In this case, the fitness of the population is allowed to decrease;
• (μ/ρ + λ) ES, where the next generation of parents is selected among the best solution candidates contained in the full population, i.e., parents and offspring. This configuration is called an elitist selection, as the best fitness value in the population can only improve. The (1 + 1) ES is the simplest example of an elitist ES, where the best solution is kept for the next iteration.
The parameter ρ represents the number of parents used for recombination.
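Since Algorithm 1 is not reproduced here, a minimal sketch of a (1 + 1) ES may help; the step-size control below uses the classical 1/5th-success rule, which is one common choice rather than necessarily the exact rule of Algorithm 1:

```python
import random

def one_plus_one_es(f, x0, sigma=1.0, budget=2000, seed=0):
    """Sketch of an elitist (1+1) ES with 1/5th-success-rule step-size control."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    for _ in range(budget):
        # one offspring: Gaussian mutation of the single parent
        y = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        fy = f(y)
        if fy <= fx:                  # elitist: replace parent only if not worse
            x, fx = y, fy
            sigma *= 1.5              # success -> enlarge the step-size
        else:
            sigma *= 1.5 ** (-0.25)   # failure -> shrink (1/5th-success rule)
    return x, fx

# Minimize the sphere function from a poor starting point
best, fbest = one_plus_one_es(lambda v: sum(t * t for t in v), [3.0, -2.0])
```

The elitist acceptance `fy <= fx` is precisely the (1 + 1) = plus-selection described above: the fitness of the retained solution can only improve.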

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

In this section, we present the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [HO01] algorithm and many of its variants. Section 2.3.1 introduces the algorithm while Section 2.3.2 presents the different variants.

Vanilla CMA-ES

The (μ/μ_w, λ) ES, more commonly called CMA-ES [HO01], is an evolutionary algorithm that was proposed by Hansen et al. in 2001. The parameter μ represents the number of parents, the subscript w indicates that the μ parents are recombined using weights, and λ is the number of offspring. CMA-ES uses a multivariate normal distribution N(m, C) to sample candidate solutions, with m the mean of the distribution and C the covariance matrix. In the algorithm, the solution candidates are ranked by fitness value, hence x_{i:λ} is the i-th best solution, such that f(x_{1:λ}) ≤ ... ≤ f(x_{λ:λ}). The new candidate solutions, the offspring, are generated as x_{i:λ} ~ m + σ N(0, C), where
• m ∈ R^d, the mean of the distribution;
• σ ∈ R+, the standard deviation or step-size;
• C ∈ M_{d×d}(R), the covariance matrix, with d the dimension of the problem.
The goal of CMA-ES is to adapt the mean, the step-size, and the covariance matrix in order to guide the search, which takes the form of a sequence of sampling distributions that evolves toward the best solutions.
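The sampling step x_{i:λ} ~ m + σ N(0, C) can be written directly with numpy; this is a sketch of the sampling and ranking only, not of the full CMA-ES update:

```python
import numpy as np

def sample_offspring(m, sigma, C, lam, seed=0):
    """One CMA-ES sampling step: x_i = m + sigma * N(0, C), i = 1..lambda."""
    rng = np.random.default_rng(seed)
    return m + sigma * rng.multivariate_normal(np.zeros(len(m)), C, size=lam)

m = np.array([0.0, 0.0])
X = sample_offspring(m, sigma=0.5, C=np.eye(2), lam=6)
# rank the offspring by fitness (sphere function as a stand-in for f)
order = np.argsort([float(x @ x) for x in X])
```

The ranking `order` gives the indices x_{1:λ}, ..., x_{λ:λ}; in the real algorithm the best μ of them are then recombined with weights to produce the new mean.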

Evolution Paths

In general, the information about the evolution of parameters is lost over several generations. The evolution path is a way to keep track of the evolution of a parameter over multiple generations. An evolution path is a sequence of the successive steps that gives information about the search.
The length of the evolution path can be measured: a long path results from consecutive steps in the same direction, whereas a short path results from steps that cancel each other out. This can be used to update parameters: consecutive steps in the same direction could be replaced by a longer step, whereas steps that cancel out could be replaced by a shorter step [HO01]. The evolution path can also be expressed as the sum of its consecutive steps. This process is called cumulation [OGH94].
The adaptation of the covariance matrix has several properties [AH21]:
• the adaptation of the covariance matrix learns pairwise dependencies between variables;
• the adaptation of the covariance matrix performs an iterative Principal Component Analysis (PCA) where the principal components are the eigenvectors of the covariance matrix;
• the updates of the mean and covariance matrix can be interpreted as gradient descent;
• the rank-μ update can increase the learning rate for large populations.
The main difference between rank-one and rank-μ updates is the number of vectors used to update the covariance matrix. In the algorithm, both rank-one and rank-μ updates can be combined, as in Algorithm 2.

Step-Size Adaptation

On top of the adaptation of the covariance matrix, the step-size σ is also adapted during the search. The adaptation uses an evolution path to decide whether the step-size should increase or decrease based on the path length. In order to decide if the evolution path is either too long or too short, the path length is compared to its expected length. The expected length is computed as if every selection of solution candidates were made uniformly at random.
When the path is longer than expected, it means that most of the steps are in the same direction, and so σ will be increased. Conversely, if the path is shorter than expected, σ will be decreased.
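This comparison is the core of cumulative step-size adaptation (CSA) [HO01]; the update below follows the standard formulation, with the learning-rate and damping constants chosen for illustration only:

```python
import math

def csa_update(sigma, p_norm, dim, c_sigma=0.3, d_sigma=1.0):
    """Cumulative step-size adaptation: compare the evolution-path length
    p_norm to its expectation under random selection and rescale sigma."""
    # classical approximation of E||N(0, I)|| in dimension dim
    expected = math.sqrt(dim) * (1 - 1 / (4 * dim) + 1 / (21 * dim ** 2))
    # path longer than expected -> exponent > 0 -> sigma increases, and vice versa
    return sigma * math.exp((c_sigma / d_sigma) * (p_norm / expected - 1))
```

When `p_norm` equals the expected length, the exponent is zero and σ is left unchanged, which is exactly the "selection made uniformly at random" baseline described above.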

Modular CMA-ES Framework

Over the years, many variants of the CMA-ES algorithm were introduced. The Modular CMA-ES framework [vRWvLB16] (ModCMA) comes with the idea of unifying in the same framework most of the CMA-ES variations introduced in the literature. In the framework, these variations are called modules and compose the different variants of the CMA-ES algorithm. The general scheme of the algorithm presented in Section 2.3 is the same, but the way some components behave might be modified, i.e., a different way to perform selection, to initialize parameters, to mutate or recombine solution candidates, etc.
The authors described 11 modules with two or three options available for each, which results in 4,608 possible variants. These variants are labeled using a string of length 11, each position representing one module. Hence, the representation 00000000000 labels the vanilla CMA-ES (Algorithm 2) while 11111111122 labels the variant with every module activated and the second choice for the two last modules. The 11 modules are the following:
1. Active Update: [JA06] instead of updating the covariance matrix C with only the most successful mutations, this module updates the matrix by also considering the least successful solution candidates;
2. Elitism: the common selection strategies, comma-selection (μ, λ) and plus-selection (μ + λ), are available. The default option is the non-elitist comma-selection;
3. Mirror Sampling: [BAH+10] originally, all new samples are drawn from the normal distribution. When this module is activated, a number of vectors corresponding to half of the population is drawn from a Gaussian distribution. These vectors are then added to or subtracted from the parents to create symmetric offspring;
4. Orthogonal Sampling: [WEB14] this module was later added to mirror sampling. First, the samples are drawn from a normal distribution (or using mirror sampling). Then, the Gram-Schmidt process, a method for orthonormalizing a set of vectors, is applied to the samples;
5. Sequential Selection: [BAH+10] instead of evaluating all offspring, this module evaluates the new solution candidates sequentially and stops as soon as an improvement has been found;
6. Threshold Convergence: [PER+15] in order to prevent the search from ending in a local optimum, this module forces the algorithm to stay in the exploration phase by defining a length threshold that new vectors need to reach. This threshold is then decreased after every generation to transition to an exploitation phase;
7. Two-Point Step-Size Adaptation (TPA): [Han08] in the two-point adaptation variant, two solution candidates are evaluated right after the mean has been updated using a test width parameter. Usually these solution candidates are chosen to be symmetrical to the mean. Depending on their relation, the step-size is either increased or decreased;
8. Pairwise Selection: [ABH11] in mirror sampling, there can be a bias, as two mirrored vectors can cancel each other out in recombination. To avoid this bias, pairwise selection was introduced; it selects the best offspring out of each mirrored pair;
9. Recombination Weights: [vRWvLB16] instead of the traditional weight vector w, the arithmetic mean is used, i.e., w_i = 1/μ;
10. Quasi-Gaussian Sampling: [AJT05] instead of drawing samples from a Gaussian distribution, new solutions are generated by a quasi-random uniform sequence and then transformed into a normal distribution. The two alternatives proposed are a Sobol sequence or a Halton sequence;
11. Increasing Population Size: during a restart, using the remaining budget more effectively can be achieved by increasing the population size (IPOP) [AH05]. Later, a bi-population (BIPOP) [Han09] alternative was proposed, which alternates between larger and smaller population sizes.
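The 11-character labeling scheme above is easy to decode programmatically; the module identifiers in this sketch are our own short names, not ModCMA's actual option keys:

```python
# Illustrative short names for the 11 ModCMA modules, in label order
MODULES = [
    "active_update", "elitism", "mirror_sampling", "orthogonal_sampling",
    "sequential_selection", "threshold_convergence", "tpa",
    "pairwise_selection", "recombination_weights",
    "quasi_gaussian_sampling", "population_size",
]

def decode_variant(label):
    """Map a length-11 option string (e.g. '00000000000') to module options."""
    assert len(label) == len(MODULES) == 11
    return {name: int(c) for name, c in zip(MODULES, label)}

vanilla = decode_variant("00000000000")  # every module at its default option
```

Decoding "11111111122" this way shows the last two modules (quasi-Gaussian sampling and population size) taking their second non-default option, matching the description above.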

Di erential Evolution (DE)

Differential Evolution was first proposed by Storn and Price [SP97] in the late 1990s, especially for continuous optimization. DE evolves a population of solution candidates whose size is denoted NP. The population is represented by d-dimensional vectors, d being the dimension of the problem. The main idea in DE is to use a new mutation operator based on vector differences in order to perturb the population.

Initialization and Mutation

An initial population p0 of k ≥ 4 solutions is generated uniformly at random. Its i-th candidate solution is denoted xi = (x1, ..., xd). A mutant solution x' is then generated by adding a perturbation vector δ = F (y − z), with F ∈ [0, 2] the scaling factor, and y, z two solutions in the population different from x. Different variants have been proposed for this perturbation: some variants choose the best candidate solution found so far as the base of the mutant instead of a random one, while other variants use five solution candidates instead of three. Put differently, four candidate solutions are used to create the mutant instead of two. In DE algorithms, these choices are coined a strategy, which is denoted by DE/x/y/z where:
• x is the way the mutant vector is chosen, i.e., randomly or using the best found so far;
• y is the number of candidate solutions used in the mutation, i.e., 1 stands for three candidate solutions and 2 for five candidate solutions;
• z denotes the crossover scheme, e.g., binomial (bin) or exponential (exp). For the binomial crossover, each coordinate of the offspring may come from one or the other parent. For the exponential crossover, blocks of consecutive coordinates of the offspring belong to the first or second parent. See below for further explanations.

Crossover

In DE algorithms, the crossover probability is noted CR ∈ [0, 1]. To avoid that the result of the crossover is just a copy of the second parent, at least one coordinate, chosen at random, is modified with probability 1. The other coordinates are modified with probability CR. This crossover scheme is named binomial crossover in the DE framework and corresponds to the classical uniform crossover in evolutionary algorithms [Zah06]. Algorithm 3 presents the pseudo-code of this crossover type.
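Putting the differential mutation and the binomial crossover together gives the classical DE/rand/1/bin step; this is a minimal sketch under our own naming, not the pseudo-code of Algorithm 3:

```python
import random

def de_rand_1_bin(pop, i, F=0.8, CR=0.9, rng=random):
    """One DE/rand/1/bin step producing a trial vector for target pop[i]."""
    d = len(pop[i])
    # pick three distinct candidates, all different from the target (needs NP >= 4)
    a, b, c = rng.sample([x for j, x in enumerate(pop) if j != i], 3)
    # differential mutation: mutant = a + F * (b - c)
    mutant = [a[k] + F * (b[k] - c[k]) for k in range(d)]
    # binomial crossover: jrand guarantees at least one mutant coordinate
    jrand = rng.randrange(d)
    return [mutant[k] if (rng.random() < CR or k == jrand) else pop[i][k]
            for k in range(d)]
```

The trial vector would then replace the target `pop[i]` if its fitness is at least as good, which is DE's one-to-one greedy replacement.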

Table of contents :

I Introduction 
1 Introduction 
1.1 Motivation
1.2 Automated Algorithm Selection
1.3 Our Key Findings
1.4 Thesis Outline
II Landscape-Aware Selection of Metaheuristics 
2 Continuous Black-Box Optimization Algorithms 
2.1 Black-Box Optimization
2.2 Evolutionary Computation
2.3 The Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
2.4 Differential Evolution (DE)
2.5 Particle Swarm Optimization (PSO)
2.6 Direct Search Methods
2.7 L-BFGS-B Algorithm
2.8 Automated Algorithm Configuration
3 Landscape-Aware Algorithm Selection 
3.1 Motivation
3.2 Landscape-Aware Algorithm Selection
3.3 Landscape-Aware Algorithm Selection Pipeline
4 Characterizing Problem Instances via Landscape Features
4.1 Fitness Landscape Analysis
4.2 Exploratory Landscape Analysis
III Analysis of Landscape Features 
5 Exploratory Landscape Features Properties 
5.1 Design of Experiments
5.2 Stability
5.3 Influence of Sampling Strategy
5.4 Expressiveness
5.5 Robustness
5.6 Invariance to Transformations
5.7 Sensitivity to Noise
5.8 Discussion
IV Optimization of Radar Networks 
6 Background on Radar Operation 
6.1 Introduction
6.2 History of Radar Development
6.3 Basic Principle
6.4 Radar Equation
6.5 Radar Cross Section
6.6 Swerling Models
6.7 Probability of Detection
7 Radar Network Modeling 
7.1 gis: Radar Network Modeling Framework
7.2 Target Characteristics
7.3 Radar Models and Parameters
7.4 Radar Network Use-Cases
7.5 Thesis Use-Cases
7.6 Geographical Data
8 Solving the Radar Network Configuration Problem
8.1 Problem Instances
8.2 Algorithm Portfolio
8.3 Experimental Setup
8.4 Results for the Unconstrained Use-Case
8.5 Results for the Constrained Use-Case
8.6 Comparison with Manual Optimization
9 Landscape-Aware Algorithm Selection on the Unconstrained Use-Case 
9.1 Design of the Selector
9.2 Problem Characteristics
9.3 Definition of the SBS
9.4 Selector Performances
9.5 Discussion
V Conclusions 
Conclusions 
10.1 Summary of Contributions
10.2 Perspectives
A Summary of Papers and Industrial Achievements
A.1 Academic Papers and Presentations
A.2 Industrial Achievements
B Best Performing Algorithm for each Instance
B.1 Best Performing Algorithm in Median
B.2 Best Performing Algorithm for the 2% Quantile
C Radar Network Configuration Contest
C.1 Radar Network Configuration Contest Results
C.2 Radar Network Configuration Contest Poster
