
## Stochastic dueling bandits

The stochastic dueling bandit problem is characterized by stationarity, like its analogue in the classical bandits described in Section 1.2.1. However, since in dueling bandits the learner sees only the relative feedback of the two selected arms and not their rewards, the stationarity is exhibited by the preference of one arm over another arm.

### Adversarial dueling bandits

In this formulation, the preference of one arm over another is non-stationary. At each time period, the adversary (or the environment) can affect the preference of every arm over every other arm. In both the above categories, the preference of an arm over itself is assumed to be equal to 0.5. This comes into use when the arms are selected with replacement and an arm is selected to duel with itself. The dueling bandit problem can be formulated either using utility values for individual arms or preference values for every pair of arms. Both of these formulations are described below.

#### Utility-based formulation

In this formulation, every arm is assigned a value. At each time period t, a real-valued bounded utility x_a(t) is associated with each arm a ∈ A. When arms a and b are selected, their utility values determine which arm wins the duel as follows:

x_a(t) > x_b(t) : a wins the duel

x_a(t) = x_b(t) : tie

x_a(t) < x_b(t) : b wins the duel

A tie can be provided as a separate feedback or it can be broken randomly. We now describe how both the stochastic and the adversarial dueling bandit problems can be encoded in the utility-based formulation.

**Stochastic utility-based dueling bandits.** Every arm a is associated with a probability distribution ν_a with mean μ_a. When an arm a is selected by the learner, its utility is drawn from ν_a. The arm with the highest mean reward is called the (or an, in case of more than one) optimal arm a*:

a* = argmax_{a ∈ A} μ_a

The corresponding highest mean reward is denoted μ*. This formulation is depicted below: at each time period t, the utilities x_1(t), …, x_a(t), …, x_b(t), …, x_K(t) are drawn from ν_1, …, ν_K respectively.
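The stochastic utility-based formulation above can be sketched as a small simulator. The Gaussian utility distributions, their means, and the class name below are illustrative assumptions, not taken from the text:

```python
import random

class UtilityDuelingBandit:
    """A minimal sketch of a stochastic utility-based dueling bandit
    environment: each arm a has a distribution nu_a with mean mu_a,
    and a duel compares two freshly drawn utilities."""

    def __init__(self, means, sigma=1.0, seed=0):
        self.means = means            # mean utility mu_a of each arm
        self.sigma = sigma            # common std. deviation (an assumption)
        self.rng = random.Random(seed)

    def duel(self, a, b):
        """Draw utilities x_a, x_b and return the winner (None on a tie)."""
        x_a = self.rng.gauss(self.means[a], self.sigma)
        x_b = self.rng.gauss(self.means[b], self.sigma)
        if x_a > x_b:
            return a
        if x_b > x_a:
            return b
        return None                   # tie: could also be broken randomly

env = UtilityDuelingBandit([0.2, 0.8, 0.5])
# The optimal arm a* is the argmax of the mean utilities mu_a.
optimal_arm = max(range(3), key=lambda a: env.means[a])
```

Note that the learner never observes x_a(t) or x_b(t) directly; it only sees which arm `duel` declares the winner.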

The feedback seen by the learner provides it with relative information about the selected arms. One way to provide such relative feedback is a binary indicator which equals 1 if x_a > x_b and 0 otherwise. On pulling arms a and b, the learner receives a reward given by ψ(x_a, x_b). Depending upon the intent behind formulating the given dueling bandit problem, ψ can take various forms. One way could be that the learner receives the highest of the two rewards, i.e. ψ(x_a, x_b) = max(x_a, x_b).
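The two choices just described can be written down directly; the function names here are hypothetical, chosen only to mirror the text:

```python
def relative_feedback(x_a, x_b):
    """Binary relative feedback: 1 if arm a wins the duel, 0 otherwise."""
    return 1 if x_a > x_b else 0

def psi_max(x_a, x_b):
    """One possible reward psi: the learner receives the highest of the
    two utilities, psi(x_a, x_b) = max(x_a, x_b)."""
    return max(x_a, x_b)
```

The point of separating the two is that the feedback (what the learner observes) and the reward (what the regret is measured against) need not coincide.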

**Sensitivity Analysis of VAriables for Generic Exploration (SAVAGE)**

Urvoy et al. [2013a] propose a generic algorithm called SAVAGE (for Sensitivity Analysis of VAriables for Generic Exploration) for stochastic preference-based dueling bandits. Their setting does away with several assumptions made in the previous algorithms, e.g. the existence of utility values or of a linear order over the arms. In this general setting, the SAVAGE algorithm obtains a cumulative Condorcet regret bound of order O(K² log T). The key notions they introduce for dueling bandits are the Copeland, Borda and Condorcet scores (Charon and Hudry [2010]). The Borda score of an arm a on a preference matrix P is ∑_{b=1}^{K} P_{a,b} and its Copeland score is ∑_{b=1}^{K} 1(P_{a,b} > 1/2) (we use 1 to denote the indicator function). If an arm has a Copeland score of K − 1, it defeats all the other arms in the long run and is a Condorcet winner. There exist however some datasets, like MSLR30K [2012], where this Condorcet condition is not satisfied. It is however possible to define a robust Copeland regret which applies to any preference matrix.
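The Borda and Copeland scores and the Condorcet-winner condition can be sketched directly from a preference matrix P, where P[a][b] is the probability that arm a beats arm b. The 3-arm matrix below is an illustrative example, not one from the text:

```python
# Illustrative preference matrix: P[a][b] = probability that a beats b,
# with P[a][a] = 0.5 by convention.
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]
K = len(P)

def borda_score(P, a):
    """Borda score: sum of a's preference probabilities over all arms."""
    return sum(P[a][b] for b in range(len(P)))

def copeland_score(P, a):
    """Copeland score: number of arms that a beats in the long run
    (P[a][b] > 1/2)."""
    return sum(1 for b in range(len(P)) if P[a][b] > 0.5)

def condorcet_winner(P):
    """Return the arm with Copeland score K - 1 (beats every other arm),
    or None if no Condorcet winner exists."""
    k = len(P)
    for a in range(k):
        if copeland_score(P, a) == k - 1:
            return a
    return None
```

Here arm 0 beats both other arms (P[0][1] and P[0][2] exceed 1/2), so its Copeland score is K − 1 = 2 and it is the Condorcet winner; datasets like MSLR30K show that `condorcet_winner` can legitimately return None.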

**Table of contents :**

Acknowledgements/Remerciements

**I Introduction and Preliminaries **

**1 Introduction and Preliminaries **

1.1 Sequential decision making

1.2 Multi-armed bandits

1.2.1 Formalization of the problem

Stochastic rewards

Adversarial rewards

1.2.2 Performance measure: regret

1.2.3 Pure exploration setting

1.2.4 Exploration-exploitation setting

1.2.5 Best arm identification

Fixed-confidence setting (PAC setting)

Fixed-budget setting

1.3 Applications of the MAB problem

1.3.1 Clinical trials

1.3.2 Internet advertising

1.3.3 Online recommendation

1.4 Algorithms for the MAB problem

1.4.1 Algorithms for best arm identification with fixed confidence

1.4.2 Algorithms for exploration-exploitation

1.5 Partial monitoring

1.5.1 Examples of partial monitoring game

1.5.2 Hierarchy of finite stochastic partial monitoring games

1.5.3 Expressing MAB problems as partial monitoring game

**II Dueling Bandits **

**2 The Dueling Bandits problem **

2.1 Motivation

2.2 Formalization

2.2.1 Stochastic dueling bandits

2.2.2 Adversarial dueling bandits

2.2.3 Utility-based formulation

Stochastic utility-based dueling bandits

Adversarial utility-based dueling bandits

2.2.4 Preference-based formulation

Stochastic preference-based formulation

Adversarial preference-based formulation

2.2.5 Relation between utilities and preferences

2.3 Our contributions

2.4 Related work

2.4.1 Dueling Bandit Gradient Descent

2.4.2 Interleaved Filtering

2.4.3 Beat The Mean

2.4.4 Sensitivity Analysis of VAriables for Generic Exploration (SAVAGE)

2.4.5 Relative upper confidence bound (RUCB)

2.4.6 Relative confidence sampling (RCS)

2.4.7 Merge relative upper confidence bound (MERGERUCB)

2.4.8 Copeland confidence bound (CCB)

2.4.9 Contextual dueling bandits

2.4.10 Double Thompson sampling for dueling bandits

2.4.11 SPARRING

**3 The Lower Bound **

**4 The Algorithm and its Analysis**

4.1 Relative Exponential-weight Algorithm for Exploration and Exploitation (REX3)

4.2 Illustration of REX3 on a toy problem

4.3 Upper bound on the regret of REX3

**5 Empirical Evaluation **

5.1 Empirical validation of Corollary 4.1

5.2 Interleave filtering experiments

**6 Dueling Bandits as Partial Monitoring games **

6.1 Formalization of dueling bandits as Partial Monitoring game

6.2 Dueling bandits in the partial monitoring hierarchy

6.3 Partial monitoring algorithms and their use for dueling bandits

**Appendix A Detailed Proof of Theorem 4.1 **

A.1 Proof of eq. (A.4)

A.2 Proof of Lemma 4.1

A.3 Proof of eq. (4.5) and (A.8)

**III Corrupt Bandits **

**7 The Corrupt Bandits problem **

7.1 Motivation

7.2 Formalization

7.2.1 Stochastic setting

7.2.2 Adversarial setting

7.2.3 Randomized response as a corrupt bandit problem

7.3 Our contributions

7.4 Related work

7.4.1 Positive and unlabeled learning

7.4.2 Learning in the presence of feature noise

7.4.3 Learning in the presence of label noise

7.4.4 Noise addition for data privacy

**8 The Lower Bounds **

8.1 Lower bound on the sample complexity for best arm identification

8.2 Lower bound on the cumulative regret for exploration-exploitation setting

**9 The Algorithms and their Analyses **

9.1 Algorithms for best arm identification

9.1.1 Median elimination for corrupt bandits

9.1.2 Exponential-gap elimination for corrupt bandits

9.2 Algorithms for exploration-exploitation setting

9.2.1 kl-UCB for MAB with corrupted feedback (kl-UCB-CF)

9.2.2 UCB1 for MAB with corrupted feedback (UCB-CF)

9.2.3 Thompson sampling for MAB with corrupted feedback (TSCF)

9.2.4 kl-UCB-CF and TS-CF for MAB with randomized response

**10 Corrupt Bandits to enforce Differential Privacy **

10.1 Introduction to differential privacy

10.1.1 Why randomization?

10.1.2 Why not simply use cryptosystems?

10.2 Differential privacy in bandits

10.2.1 Motivation for differential privacy in bandits

10.2.2 Previous work

Differentially private UCB Sampling

DP-UCB-BOUND and DP-UCB-INT

10.3 Corrupt bandits for differential privacy

**11 Empirical Evaluation **

11.1 Comparison of dedicated corrupt bandit algorithms with WRAPPER

11.2 Comparison between kl-UCB-CF, UCB-CF and TS-CF

11.2.1 Comparison over a period of time

11.2.2 Comparison with varying corruption parameters

11.2.3 Comparison with varying level of differential privacy

**12 Corrupt Bandits as Partial Monitoring game **

Appendix B Proof for Theorem 9.1

Appendix C Proof for Theorem 9.2

Appendix D Proof for Theorem 9.3

Appendix E Proof for Theorem 9.4

**IV Final Remarks **

**13 Summary and future work**