Environment of Multi-Armed Bandit


What are Bandit problems?

Bandit problems are classic game problems. They can be pictured through the following scenario.

In a casino, a gambler faces many different slot machines. After pulling one arm, he earns a reward with a certain probability. Obviously, he does not know the rewarding probability of each arm. To gather information about these slot machines, the gambler must pull each arm several times. This process is called “Exploration”. Of course, the more he explores, the more reliable the information he obtains. However, the gambler’s goal is to earn the largest reward, not to gather information. Normally, he should pull the arm with the highest estimated rewarding probability according to his current information. This is called “Exploitation”. Bandit problems ask how the gambler can identify the optimal arm and earn the largest reward within a limited time (or a limited budget).

Why study Bandit problems?

Actually, Bandit problems are not just a game-theoretic curiosity; they also play a major role in other fields. Solutions to Bandit problems transfer to many similar problems, such as the following examples.
In clinical trials, doctors often face the following problem: for the same disease, there may be two or more treatments whose effectiveness is unknown. This is the prototype of Bandit problems. In 1933, Dr. Thompson proposed “Thompson Sampling” (detailed in Chapter 2), a strategy to identify the most effective treatment, which preliminarily solved this problem.
With the rise of the Internet, Bandit problems have become extremely popular in applications such as personalized search and Internet advertising. For these applications, the traditional approach is to provide a certain number of recommendations by “Collaborative Filtering”. Compared with Bandit strategies (refer to Chapter 3), “Collaborative Filtering” cannot work without references in the initial stage, whereas Bandit strategies are not affected by this problem.
Moreover, Google proposed an analytics service [56] based on the Multi-Armed Bandit method (Chapter 2). The purpose of this service is to compare and manage online experiments. For each experiment, there may be a variety of alternative solutions. In the past, “A/B testing” was used to solve this problem, but it performs pure exploration during the exploratory phase and wastes resources on inferior options [81]. By combining Bayes’ Theorem with the Multi-Armed Bandit, Google could reduce the computation time and obtain higher accuracy than before.
Beyond the Internet, Bandit problems also have broad prospects in other domains. During World War II, many countries tried to solve multi-objective multi-tasking attacks, which can be considered a Multi-Objective Multi-Armed Bandit problem (see Chapter 4). Queueing and scheduling [45, 101] is a problem of index policies (refer to Chapter 2), and the stock market is a Bandit problem in a non-stationary environment [50], etc.

How to solve Bandit problems?

The core of Bandit problems is the trade-off between Exploration and Exploitation. Related strategies and methods are presented in Chapters 2, 3, 4, 5, and 6.


From different points of view, Bandit problems can be divided into many categories; they cannot be captured by a single uniform model. According to the number of trials, Bandit problems are either “finite horizon”, with a fixed number of trials, or “infinite horizon”, where the number of trials is usually unknown in advance. By the number of arms, they can be divided into “two arms”, “K arms” and “many arms”. From the view of system feedback, a “Stationary Bandit” has arms whose reward distributions are fixed and independent; in a “Non-Stationary Bandit”, the distributions can change during the sampling period; in a “Linear Bandit”, all arms are related to each other through a certain linear function; in a “Contextual Bandit”, the feedback of each arm is associated with side information. Every Bandit problem has its own model and strategies. In the next chapters, we describe each kind of bandit, along with its performance, properties and characteristics. In particular, in Chapters 5 and 6, we propose some novel solutions and algorithms.


Here, we address the “Stationary Bandit problem” in order to explain the mechanism of Bandit problems.
Consider Robbins’s sequential decision experiment [86]: there are several possible choices to be sampled. At each time step, the operator makes a choice from a set of options. After sampling, the operator receives a Boolean feedback, 1 or 0, as reward. By observing each option’s feedback, its fixed but unknown probability distribution can be estimated empirically (e.g., by computing the expected reward). The goal of Bandit problems is to maximize the rewards received over all steps, minimize the cumulative regret, and identify the optimal option (or a set of optimal options).
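Robbins’s setting above can be sketched in a few lines of code. This is a minimal illustration, not part of the original experiment: the three reward probabilities are arbitrary assumed values, hidden from the operator, who estimates each option’s expectation empirically from its Boolean feedback.

```python
import random

# Hypothetical reward probabilities for three options; the operator
# does not know these values and must estimate them by sampling.
TRUE_PROBS = [0.3, 0.5, 0.7]

def pull(option):
    """Sample one Boolean reward (1 or 0) from a fixed Bernoulli option."""
    return 1 if random.random() < TRUE_PROBS[option] else 0

random.seed(0)
counts = [0] * len(TRUE_PROBS)   # times each option was sampled
rewards = [0] * len(TRUE_PROBS)  # total reward observed per option

# Sample options uniformly and estimate each expectation empirically.
for _ in range(1000):
    option = random.randrange(len(TRUE_PROBS))
    counts[option] += 1
    rewards[option] += pull(option)

estimates = [rewards[k] / counts[k] for k in range(len(TRUE_PROBS))]
best = max(range(len(TRUE_PROBS)), key=lambda k: estimates[k])
print(estimates, best)  # empirical means should approach the true probabilities
```

Uniform sampling like this is pure exploration; it identifies the best option, but at the cost of repeatedly playing inferior ones, which is exactly the regret the strategies of this chapter try to minimize.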
In Section 1.1, we introduced “Exploration” and “Exploitation” as the two core elements for solving Bandit problems. Through exploration, the operator tries different options to learn which one rewards more frequently. Through exploitation, the operator repeatedly chooses the option with the best observed performance. To pursue the goal of Bandit problems, the operator must know when to explore and when to exploit; these decisions always depend on the current knowledge of the optimal arms. Performing well on Bandit problems requires trading off Exploration and Exploitation across all decisions. In the early steps, it makes sense to explore, searching for the arms with the highest reward rates. Later, it makes sense to exploit the arms considered to be good, maximizing the reward of each decision.
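One simple way to realize this trade-off is the ε-greedy rule (covered in Chapter 2): with a small probability ε the operator explores a random arm, otherwise it exploits the empirically best one. The sketch below is illustrative only; the arm probabilities and ε = 0.1 are assumed values.

```python
import random

random.seed(1)

# Hypothetical Bernoulli arms; their probabilities are unknown to the player.
TRUE_PROBS = [0.2, 0.5, 0.8]
K = len(TRUE_PROBS)

counts = [0] * K
means = [0.0] * K  # running empirical mean reward of each arm

def eps_greedy_choice(epsilon):
    """Explore a uniformly random arm with probability epsilon,
    otherwise exploit the arm with the highest empirical mean."""
    if random.random() < epsilon:
        return random.randrange(K)
    return max(range(K), key=lambda k: means[k])

total_reward = 0
for t in range(2000):
    arm = eps_greedy_choice(epsilon=0.1)
    reward = 1 if random.random() < TRUE_PROBS[arm] else 0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
    total_reward += reward

# After enough rounds, the best arm should dominate the pull counts.
print(counts, total_reward)
```

The forced exploration keeps every empirical mean converging, while exploitation concentrates most pulls on the arm believed to be best, which is the balance described above.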
From the theoretical perspective, we also care about finding optimal solutions to Bandit problems that make model evaluation and model comparison more efficient. We dedicate the next chapters of this dissertation to this problem.


Thesis Outline

This dissertation consists of 7 chapters discussing Bandit problems.
Chapters 1, 2, 3 and 4 give a general introduction and review the state of the art of Bandit problems.
Chapter 1 introduces the background of Bandit problems.
Chapter 2 covers the Multi-Armed Bandit (MAB) problem and introduces some effective strategies for trading off Exploration and Exploitation.
Chapter 3 describes the classification problem under Bandit Feedback. Here, we introduce several state-of-the-art algorithms for multi-class/multi-label classification.
Chapter 4 extends the Multi-Armed Bandit to the situation where arms are Multi-Objective (denoted MOMAB). In this chapter, we introduce some traditional methods to solve MOMAB problems.
Chapters 5 and 6 present our contributions to Bandit problems.
Chapter 5 proposes some novel algorithms for classification with bandit feedback, e.g. multi-class, multi-label and in a Reproducing Kernel Hilbert Space (RKHS). Furthermore, we compare their effectiveness and regret bounds through empirical experiments on some common datasets.
Chapter 6 proposes some optimization methods to find the Pareto front of MOMAB.
Chapter 7 is the last part. It draws conclusions from all chapters and proposes some new directions for future research.
The Bandit problem, as expounded in Chapter 1, was formally proposed in 1952 by H. Robbins. In this chapter, we focus on one classical Bandit problem, the “Multi-Armed Bandit” (MAB). The purpose of this chapter is to introduce some necessary notions, present some effective strategies, and define and analyze some specific issues.

Environment of Multi-Armed Bandit

The complexity of Multi-Armed Bandit problems stems not only from the trade-off between Exploration and Exploitation, but also from the wide range of its environments. Each environment of the Multi-Armed Bandit provides a different framework. In this section, we address the characteristics of the Multi-Armed Bandit in each environment, taking the K-Armed Bandit as the typical case.

Stationary Bandit

The Stationary Bandit [86] is the most classical and common bandit problem. The gambler picks an arm from the arm set and receives a reward as the response of the picked arm. The reward obeys a fixed and independent probability distribution.
The K-Armed Bandit in a stationary environment can be defined as follows. The gambler faces a slot machine with K arms. Each arm k from the arm set K is characterized by a distribution ν_k with expected reward μ_k and variance σ_k². None of these parameters change over time. In each round t ≥ 1, the gambler selects an arm k_t and receives a sample drawn from ν_{k_t}, independently of the past and of the other arms. Here, the goal of the gambler is to maximize rewards (minimize regret) or to identify the best arm among all arms by estimating the empirical expected rewards after T pulls.
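To make this setting concrete, here is a minimal sketch under assumed parameters (the three (μ_k, σ_k) pairs are illustrative, not from the text): each pull of arm k returns an independent sample from ν_k, taken here to be Gaussian, and the regret of a uniform round-robin strategy is the sum over pulls of the gaps μ* − μ_k.

```python
import random

random.seed(2)

# Hypothetical arm parameters (mu_k, sigma_k); fixed and unknown to the gambler.
ARMS = [(0.1, 1.0), (0.4, 1.0), (0.6, 1.0)]
MU_STAR = max(mu for mu, _ in ARMS)  # expected reward of the optimal arm

T = 3000
counts = [0] * len(ARMS)
sums = [0.0] * len(ARMS)

for t in range(T):
    k = t % len(ARMS)            # uniform round-robin pulling (pure exploration)
    mu, sigma = ARMS[k]
    r = random.gauss(mu, sigma)  # sample drawn independently from nu_k
    counts[k] += 1
    sums[k] += r

# Empirical expected reward of each arm after T pulls.
empirical = [sums[k] / counts[k] for k in range(len(ARMS))]
best_arm = max(range(len(ARMS)), key=lambda k: empirical[k])

# Expected cumulative regret of round-robin: sum over pulls of (mu* - mu_k).
expected_regret = sum(counts[k] * (MU_STAR - ARMS[k][0]) for k in range(len(ARMS)))
print(empirical, best_arm, expected_regret)
```

The regret of round-robin grows linearly in T because suboptimal arms keep being pulled; the strategies presented later in this chapter aim to reduce this growth by shifting pulls toward the best arm.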

Table of contents :

I General Introduction and State of the art 
1 Introduction 
1.1 Motivation
1.2 Modeling
1.3 Thesis Outline
2 Multi-Armed Bandit
2.1 Environment of Multi-Armed Bandit
2.1.1 Stationary Bandit
2.1.2 Adversary Bandit
2.1.3 Contextual Bandit
2.1.4 Linear Bandit
2.2 Gittins index
2.3 The strategy of trade-off
2.3.1 Thompson Sampling
2.3.2 Boltzmann Exploration
2.3.3 Upper Confidence Bound
2.3.4 Epsilon-Greedy
2.4 Regret Lower bound
2.5 Pure Exploration and Best Armed Identification
3 Bandit with side information 
3.1 Multi-class Classification with Bandit feedback
3.1.1 Multiclass Classification
3.1.2 Algorithms for Multi-class Classification with Bandit Feedback
3.2 Multi-label Classification with Bandit feedback
3.2.1 Multilabel Classification
3.2.2 Algorithm for Multi-label Classification with Bandit feedback
4 Multi-Objective Multi-Armed Bandit 
4.1 Multi-Objective Optimization
4.1.1 Pareto front setting
4.1.2 Dominance method
4.1.3 Aggregation method
4.2 Multi-Objective Optimization in Bandit environment
4.2.1 Algorithms for MOMAB
II Contributions 
5 Passive-Aggressive Classification with Bandit Feedback 
5.1 Multi-class PA with Bandit feedback
5.1.1 Simple PAB
5.1.2 Full PAB
5.1.3 Experiments
5.1.4 Conclusion
5.2 Bandit feedback in Passive-Aggressive bound
5.2.1 Analysis
5.2.2 Experiments
5.2.3 Conclusion
5.3 Bandit feedback with kernel
5.3.1 BPA Online Kernel
5.3.2 Kernel Stochastic Gradient Descent with BPA loss
5.3.3 Experiments
5.4 Bandit PA algorithm for Multi-label Classification
5.4.1 Preliminaries
5.4.2 Analysis
5.4.3 Experiments
6 Optimized Identification algorithm for ε-Pareto Front
6.1 Identification of the ε-Pareto front
6.2 Experiments
III Conclusion and Perspectives
7 Conclusions 
7.1 Summary of contributions
7.2 Research in the future

