
## The Multi-Armed Bandits Model

The MAB game in probability theory is the problem of sequentially allocating resources to a defined set of actions (arms) based on successive, partially observable feedback (see Definition 2.1 and Remark 2.1 below). The term bandit is coined, by analogy, after slot machines (or one-armed bandits) in a casino: a sequential decision-making problem arises when facing several slot machines at once (multi-armed bandits).

The study of MAB problems dates back to as early as 1933 [Thompson, 1933], where it was originally proposed to model sequential clinical trials. For example, researchers testing the efficacy of potential vaccines for a new coronavirus have to choose a vaccine (arm) among the 4 options shown in Fig. 2.1 for each patient from an experimental group of N persons. For each patient $n \in [N]$, researchers receive a reward signal $r_n \in \{0, 1\}$: $r_n = 1$ indicates that the vaccine is effective, otherwise the vaccine fails. We thus assume that the efficacy of each vaccine follows some Bernoulli distribution that is unknown to the researchers.
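This setup can be simulated directly; the efficacy probabilities below are made-up values for illustration (in the real problem they are unknown to the learner):

```python
import random

# Hypothetical efficacy probabilities of the 4 vaccines (unknown to the learner).
efficacies = [0.3, 0.55, 0.45, 0.6]

def pull(arm: int, rng: random.Random) -> int:
    """Administer vaccine `arm` to one patient and observe a 0/1 reward."""
    return 1 if rng.random() < efficacies[arm] else 0

rng = random.Random(0)
# Allocate N = 100 patients uniformly at random among the arms.
rewards = [pull(rng.randrange(4), rng) for _ in range(100)]
print(sum(rewards), "effective outcomes out of", len(rewards))
```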

### Common assumptions on the rewards

One important thing to consider before starting any bandit game is the set of assumptions placed on the rewards. Intuitively, the learner should have some minimum prior knowledge of the ‘shape’ of the rewards; obviously, the less the learner knows about that shape, the more difficult the problem is. In this thesis, several different assumptions on the reward distributions are considered depending on the problem settings. In what follows, we offer a brief overview of the assumptions commonly used in the literature.

Bounded rewards. The most common assumption concerns whether the supports of the reward distributions are bounded, and if so, whether the bounds are known to the learner. In the latter case, we can assume without loss of generality that the rewards are supported on [0, 1]. Indeed, if the rewards are contained in an arbitrary bounded interval [a, b], then we can simply apply a normalization trick to recover the [0, 1] case. The previous vaccine example of Fig. 2.1 with Bernoulli bandits is a typical example of known bounded rewards. Bounded rewards are widely used in MAB research; this is also the case, for instance, in Chapter 5 of this thesis.

One-dimensional exponential family. Unbounded reward distributions are also considered in the literature. A usual example of an infinitely-supported reward distribution is the Gaussian distribution. Therefore, the literature sometimes considers a more general parametric framework, namely the exponential family. The exponential family contains a large set of natural distributions, such as Bernoulli and Gaussian distributions, and hence covers a wide range of both bounded and unbounded rewards.
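The normalization trick mentioned above is just an affine rescaling; a minimal sketch:

```python
def normalize(reward: float, a: float, b: float) -> float:
    """Rescale a reward supported on [a, b] to [0, 1] via x -> (x - a) / (b - a).
    This affine map preserves the ordering of the arms' mean rewards."""
    if not a < b:
        raise ValueError("need a < b")
    return (reward - a) / (b - a)

# Rewards in [-2, 3] become rewards in [0, 1].
print(normalize(-2.0, -2.0, 3.0))  # -> 0.0
print(normalize(0.5, -2.0, 3.0))   # -> 0.5
print(normalize(3.0, -2.0, 3.0))   # -> 1.0
```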

In practice, we often further restrict ourselves to a specific sub-family of distributions, the one-dimensional exponential family (or single-parameter exponential family). Typical members of the one-dimensional exponential family include Bernoulli distributions, Gaussian distributions with known variance, etc. Formally, given a random variable X whose probability distribution belongs to the single-parameter exponential family, its probability density function (or probability mass function if X is discrete), depending on one single parameter $\theta$, can be written as

$$p_X(x \mid \theta) = b(x) \exp\big[\eta(\theta)\, T(x) + A(\theta)\big], \qquad (2.3)$$

where $T(X)$ is the natural sufficient statistic and $b, \eta, A$ are known functions. A more formal reminder on the one-dimensional exponential family is given in Appendix A.
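As a sanity check on form (2.3), the Bernoulli distribution with mean µ fits it with b(x) = 1, T(x) = x, η(µ) = log(µ/(1−µ)) (the log-odds) and A(µ) = log(1−µ); the short script below verifies that the two parameterizations agree:

```python
import math

def bernoulli_pmf(x: int, mu: float) -> float:
    """Direct Bernoulli pmf: mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu ** x * (1 - mu) ** (1 - x)

def bernoulli_expfam(x: int, mu: float) -> float:
    """Bernoulli pmf in the form (2.3): b(x) exp(eta(mu) * T(x) + A(mu)),
    with b(x) = 1, T(x) = x, eta(mu) = log(mu / (1 - mu)), A(mu) = log(1 - mu)."""
    eta = math.log(mu / (1 - mu))
    A = math.log(1 - mu)
    return math.exp(eta * x + A)

# The two parameterizations coincide on the whole support.
for mu in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bernoulli_pmf(x, mu) - bernoulli_expfam(x, mu)) < 1e-12
```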

The MAB community is interested in the exponential family not only because it covers a large set of the most common distributions, but also because it enjoys some nice properties for statistical analysis. For example, exponential-family distributions admit sufficient statistics that can summarize arbitrary amounts of independent and identically distributed (iid) data with a statistic of fixed dimension, which is a great property in bandit analysis. Another important fact is that exponential-family distributions have conjugate priors, which is extremely useful in Bayesian statistics. The latter property is used, for example, in Chapter 3.
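To illustrate the conjugacy just mentioned: for Bernoulli rewards, a Beta prior on the unknown mean stays Beta after each observation, so the posterior can be tracked with two counters. A minimal sketch (the uniform Beta(1, 1) prior is just an example choice):

```python
def beta_update(alpha: int, beta: int, reward: int) -> tuple:
    """Conjugate posterior update for a Bernoulli arm with a Beta(alpha, beta)
    prior: observing reward 1 increments alpha, reward 0 increments beta."""
    return (alpha + reward, beta + 1 - reward)

# Start from a uniform Beta(1, 1) prior and observe rewards 1, 1, 0.
alpha, beta = 1, 1
for r in (1, 1, 0):
    alpha, beta = beta_update(alpha, beta, r)
print(alpha, beta)             # -> 3 2
print(alpha / (alpha + beta))  # posterior mean -> 0.6
```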

Beyond… More general reward distributions are also considered in the literature. To name a few, we can think of sub-Gaussian distributions, whose tails decay at least as fast as those of Gaussian distributions (see Appendix A.1.2 for a reminder of the definition), and also heavy-tailed distributions, whose tails are not exponentially bounded (see e.g. Bubeck et al. 2013; Yu et al. 2018). These reward distributions also raise interesting theoretical questions as well as applications, but are out of the scope of this manuscript.

#### Regret minimization

Once we have imposed some assumptions on the reward distributions, the next step is to fix a learning goal and set an evaluation measure accordingly.

As previously stated, the classical learning objective of a MAB learner is to maximize the total return in the long run, hence trading off between exploration and exploitation. To achieve that goal, the learner needs to design a clever (in a precise sense) way of pulling arms based on past observations; we call this design an allocation strategy or policy. To evaluate a strategy under this reward-maximization setting, one can use the metric often referred to as regret, defined below. Suppose that each unknown distribution $\nu_k$ is associated with a mean $\mu_k$, and that $\mu^\star$ is the mean of the optimal arm. One natural way to assess the quality of a given policy is to minimize the total loss w.r.t. the optimal arm during the whole process, which leads to the notion of cumulative regret (sometimes simply called regret if there is no ambiguity).
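In its most common formalization (the pseudo-regret), the cumulative regret after N rounds is the sum over rounds of the gap $\mu^\star - \mu_{I_n}$ between the optimal mean and the mean of the pulled arm. A small illustration with made-up means:

```python
def cumulative_regret(means: list, pulls: list) -> float:
    """Expected cumulative (pseudo-)regret of a sequence of pulled arms:
    the sum over rounds of mu_star - mu_{I_n}."""
    mu_star = max(means)
    return sum(mu_star - means[i] for i in pulls)

means = [0.3, 0.55, 0.45, 0.6]          # hypothetical arm means; best arm is 3
pulls = [0, 1, 3, 3, 2, 3]              # arms chosen over 6 rounds
print(cumulative_regret(means, pulls))  # 0.3 + 0.05 + 0 + 0 + 0.15 + 0, i.e. ~0.5
```

Note that only the rounds spent on suboptimal arms contribute to the regret, which is exactly the exploration/exploitation trade-off: pulling arm 3 forever is optimal, but the learner cannot know that without exploring.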

**Two frameworks of best-arm identification**

Recall that in the vanilla problem setup of BAI, we consider a finitely-armed bandit model, which is a collection of K probability distributions, called arms, $\mathcal{X} \triangleq \{x_1, \dots, x_K\}$, parameterized by their means $\mu_1, \dots, \mu_K$. When clear from the context, we simply denote the arms by $\{1, 2, \dots, K\}$. We assume that the (unknown) best arm is unique and denote it by $I^\star_\mu \triangleq \arg\max_i \mu_i$.

A BAI strategy or algorithm can be characterized by a triple $(I_n, J_n, \tau)$, and hence consists of three components:

• The first is a sampling rule, which selects an arm $I_n \in [K]$. Recall that in a MAB problem, at each round a vector of rewards $(r_{n,1}, \dots, r_{n,K})$ is generated for all arms independently from past observations, but only $r_n = r_{n,I_n}$ is revealed to the learner. Note that $I_n$ is $\mathcal{F}_{n-1}$-measurable, i.e., it can only depend on the past $n - 1$ observations, and on some exogenous randomness, materialized into $U_{n-1} \sim \mathcal{U}([0, 1])$;

• The second component is an $\mathcal{F}_n$-measurable decision rule $J_n$, which returns a guess for the best arm;

• And thirdly, the stopping rule $\tau$, a stopping time with respect to $(\mathcal{F}_n)_{n \in \mathbb{N}}$, which decides when the exploration is over.
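As a minimal sketch, the three components above can be arranged into one strategy object; the round-robin sampling rule, empirical-best decision rule, and fixed-budget stopping rule below are placeholder choices for illustration, not algorithms studied in this thesis.

```python
import random

class RoundRobinBAI:
    """A minimal BAI strategy (I_n, J_n, tau): round-robin sampling rule,
    empirical-best decision rule, and fixed-budget stopping rule tau = N."""

    def __init__(self, n_arms: int, budget: int):
        self.n_arms, self.budget = n_arms, budget
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def sampling_rule(self) -> int:      # I_n: which arm to pull next
        return self.t % self.n_arms

    def update(self, arm: int, reward: float) -> None:  # observe r_{n, I_n}
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.t += 1

    def stopping_rule(self) -> bool:     # tau: here simply tau = N
        return self.t >= self.budget

    def decision_rule(self) -> int:      # J_n: guess the empirical best arm
        means = [s / c if c else 0.0 for s, c in zip(self.sums, self.counts)]
        return max(range(self.n_arms), key=means.__getitem__)

# Run on three Bernoulli arms with budget 12 (means are made up).
rng = random.Random(0)
means = [0.2, 0.5, 0.8]
strat = RoundRobinBAI(n_arms=3, budget=12)
while not strat.stopping_rule():
    arm = strat.sampling_rule()
    strat.update(arm, 1.0 if rng.random() < means[arm] else 0.0)
print("guess:", strat.decision_rule())
```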

In general, there are two learning frameworks for BAI: (1) the fixed-confidence setting, first studied by [Even-Dar et al., 2003], and (2) the fixed-budget setting, first proposed by [Audibert and Bubeck, 2010].

Fixed-budget setting. In the fixed-budget setting, the learner tries to maximize the probability of returning the best (or $\varepsilon$-best) arm within a fixed horizon N. The stopping rule in this case can therefore simply be written as $\tau = N$. The protocol of fixed-budget BAI can be summarized as below.
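A minimal simulation of this protocol, assuming Bernoulli arms with made-up means and a uniform (round-robin) sampling rule as a simple baseline (again, not one of the thesis's algorithms):

```python
import random

def fixed_budget_bai(means: list, N: int, rng: random.Random) -> int:
    """One run of the fixed-budget protocol: pull arms round-robin for N
    rounds (assumes N >= len(means)), then recommend the empirical best."""
    K = len(means)
    sums, counts = [0.0] * K, [0] * K
    for n in range(N):
        arm = n % K                                          # sampling rule I_n
        reward = 1.0 if rng.random() < means[arm] else 0.0   # Bernoulli arm
        sums[arm] += reward
        counts[arm] += 1
    return max(range(K), key=lambda i: sums[i] / counts[i])  # decision J_N

# Estimate the probability of identifying the best arm (arm 3 here).
means = [0.3, 0.55, 0.45, 0.6]   # hypothetical Bernoulli means
rng = random.Random(42)
hits = sum(fixed_budget_bai(means, 400, rng) == 3 for _ in range(200))
print("empirical success rate:", hits / 200)
```

The success rate grows with the budget N; uniform allocation is the natural baseline against which adaptive fixed-budget algorithms are compared.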

**Table of contents:**

0.1 Contexte de la thèse

0.2 Bandit manchot multi-bras et optimisation

List of Algorithms

List of Figures

List of Tables

**1 Introduction**

1.1 Context of the Thesis

1.2 Multi-Armed Bandits and Optimization

1.3 Publications of the PhD

**2 Stochastic Multi-Armed Bandits**

2.1 The Multi-Armed Bandits Model

2.2 Best-Arm Identification

2.3 Extensions of Best-Arm Identification

2.4 Many-armed bandits

2.5 Performance Measure

**3 A Bayesian Study of Best-Arm Identification**

3.1 Introduction

3.2 Bayesian BAI Strategies

3.3 Two Related Optimality Notions

3.4 Fixed-Confidence Analysis

3.5 Optimal Posterior Convergence

3.6 Numerical Illustrations

3.7 Discussion

**4 Optimal Algorithms for Linear Best-Arm Identification**

4.1 Introduction

4.2 Problem Setting and Assumptions

4.3 Fixed-Confidence Optimality and Complexities

4.4 Related Work

4.5 Bayesian Algorithms for the Linear Case

4.6 A Gamified Algorithm

4.7 Other Saddle-Point Approaches

4.8 Discussion

**5 Hierarchical Bandits for Black-Box Optimization**

5.1 Introduction

5.2 Required Assumptions

5.3 General Parallel Optimization

5.4 HCT under Local Smoothness w.r.t. P

5.5 Experimental Illustrations

5.6 Discussion

**6 Bandits and Hyper-Parameter Optimization**

6.1 Introduction

6.2 A Brief Survey of Automated Machine Learning

6.3 Hyper-Parameter Optimization Framework

6.4 Best-Arm Identification for Hyper-Parameter Tuning

6.5 Active TTTS for Hyper-Parameter Optimization

6.6 Experiments

6.7 Adaptivity to µ*

6.8 Discussion

**7 General Conclusion and Perspectives**

7.1 General Discussion

7.2 Future Perspectives

**A Mathematical Tools**

A.1 Some Reminders on Probability

A.2 Concentration Inequalities

A.3 Information Theory

**B Additional Proofs of Chapter 3**

B.1 Notation

B.2 Technical Lemmas

B.3 Fixed-Confidence Analysis for TTTS

B.4 Fixed-Confidence Analysis for T3C

B.5 Proof of Lemma 3.1

B.6 Proof of Posterior Convergence for Gaussian Bandits

B.7 Proof of Posterior Convergence for Bernoulli Bandits

**C Additional Proofs of Chapter 4**

C.1 Notation

C.2 Technical Lemmas

C.3 Sample Complexity of LinGame

C.4 A Fair Comparison of Stopping Rules

**D Additional Proofs of Chapter 5**

D.1 Notation

D.2 Detailed regret analysis for HCT under Assumption 5.2

D.3 General analysis of POO

**E Acronyms**

**F Glossary**