Adversarial rewards with full information
In the adversarial setting with full information, we now assume that there is a set of (unknown) reward functions (r_t)_{1≤t≤T} defined before the game. At the end of each round t, we observe the full function r_t. The following Hedge algorithm achieves strong performance bounds in the worst-case scenario. Note that here the set A does not need to be finite.
When A is finite, the reference measure ν0 is simply the counting measure, while in the general case it is chosen so that ν0(A) < ∞. In the latter case, however, the following algorithm remains theoretical, as it may not be possible to sample exactly from the proposed distributions p_t. This is discussed, for instance, in Narayanan and Rakhlin (2010) and in Maillard and Munos (2010b).
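To make the exponential-weighting idea concrete, here is a minimal sketch of Hedge for a finite action set, where sampling from p_t is exact. The function name, the fixed learning rate eta, and the reward representation are illustrative choices, not the formulation analyzed in the text (which covers general, possibly infinite A and tuned η).

```python
import numpy as np

def hedge(reward_fns, n_actions, eta=0.5, rng=None):
    """Exponentially weighted forecaster (Hedge), full information.

    reward_fns: sequence of length-n_actions arrays; reward_fns[t][a]
    is the reward of action a at round t, fully revealed after round t.
    Returns the sampled actions and the total reward collected.
    """
    rng = np.random.default_rng() if rng is None else rng
    cum_rewards = np.zeros(n_actions)  # cumulative reward of each action
    actions, total = [], 0.0
    for r_t in reward_fns:
        # p_t proportional to exp(eta * past cumulative rewards);
        # subtracting the max avoids overflow without changing p_t.
        w = np.exp(eta * (cum_rewards - cum_rewards.max()))
        p_t = w / w.sum()
        a = int(rng.choice(n_actions, p=p_t))
        actions.append(a)
        total += r_t[a]
        cum_rewards += r_t  # full information: all rewards observed
    return actions, total
```

Against rewards that consistently favor one action, the distribution p_t concentrates on that action exponentially fast in the cumulative reward gap.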
A regret analysis similar to that of the partial information setting can be carried out for this algorithm, and, using standard tools from convex analysis, both the algorithm and the analysis can be extended to yield very powerful results for online prediction that we do not report here (see for instance Bartlett et al. (2007), Stoltz (2005)). Note, however, that the optimal value for the parameter η = η_t is still an open question, as explained for instance in chapter 3 of Stoltz (2011), where it is shown that a data-dependent value for η exhibits much better behavior in practice than the standard data-independent optimal value (yet with no theoretical analysis).
The following theorem shows another nice property of this algorithm, which was mentioned in Narayanan and Rakhlin (2010). Here, we derive this result using exponential families.
Finite-time analysis for Bernoulli distributions
In this section, we start with Bernoulli distributions. Although this is a special case of the general results of Section 4, we provide here a complete and self-contained analysis, in which, in addition, we are able to give closed forms for all the terms in the regret bound. Note, however, that the resulting bound is slightly worse than what could be derived from the general case (for which more sophisticated tools are used). This result is mainly provided as a warm-up.
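The Bernoulli case is convenient precisely because the Kullback-Leibler divergence between two Bernoulli distributions has the well-known closed form kl(p, q) = p log(p/q) + (1−p) log((1−p)/(1−q)), the quantity driving the regret bounds of this chapter. A small self-contained helper (the function name and the clipping parameter eps are illustrative):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Closed-form KL divergence between Bernoulli(p) and Bernoulli(q):
    kl(p, q) = p*log(p/q) + (1-p)*log((1-p)/(1-q)).
    eps clips both arguments away from {0, 1} to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```

For example, kl(p, p) = 0 and kl(p, q) > 0 whenever p ≠ q, in line with the divergence properties used in the analysis.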
In this section, we present tools from Probably Approximately Correct (PAC) analysis, which can be seen as an alternative to analyses based on concentration inequalities. This approach has been popularized over the last few years, as it leads to tighter results than concentration-inequality-based approaches, as well as to simpler proofs. A small disadvantage of these methods is that we need to sample according to a specific (Gibbs) distribution, which may be hard to do in general; however, important progress has also been made in this direction, see for instance Narayanan and Rakhlin (2010) for the case where the potential function is convex. Although we do not explicitly make use of these tools in this Ph.D. thesis (note, however, that they are related to the exponentially weighted forecaster presented in chapter 1), we believe they are very important for the development of future Bandit and Reinforcement Learning theory.
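When the hypothesis set is finite, sampling from the Gibbs distribution is easy: the weight of each hypothesis is proportional to exp(−β · loss), a softmax over losses. The sketch below illustrates only this easy finite case (function names and the inverse-temperature parameter beta are illustrative); the hard case discussed in the text is sampling over continuous sets.

```python
import numpy as np

def gibbs_weights(losses, beta=1.0):
    """Gibbs distribution over a finite hypothesis set: the weight of
    hypothesis h is proportional to exp(-beta * loss(h)).
    Shifting by the minimum loss gives numerical stability."""
    losses = np.asarray(losses, dtype=float)
    w = np.exp(-beta * (losses - losses.min()))
    return w / w.sum()

def sample_gibbs(losses, beta=1.0, rng=None):
    """Draw the index of one hypothesis from the Gibbs distribution."""
    rng = np.random.default_rng() if rng is None else rng
    p = gibbs_weights(losses, beta)
    return int(rng.choice(len(p), p=p))
```

As β → 0 the distribution becomes uniform, while large β concentrates mass on empirical-loss minimizers, mirroring the exponentially weighted forecaster of chapter 1 with losses in place of rewards.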
PAC-analysis of regression
To motivate the benefits of PAC analysis, we now give a short application to regression, re-deriving one of the results proved in Audibert and Catoni (2010b), for which we hope to provide additional intuition. It is striking to note the simplicity of the proof, as well as the generality and tightness of the bound derived with this technique.
Setting. We consider a regression model of the form Y = f⋆(X) + η, where Z = (X, Y) follows a law P, f⋆ ∈ F, and η is a centered noise independent of X. For some positive real-valued loss function l, we write l(f, Z) ∈ R+ for the prediction loss of a function f ∈ F on the random variable Z, and then define, for f, f′ ∈ F, the prediction gap between f and f′ as l(f, f′, Z) def= l(f, Z) − l(f′, Z).
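As a small illustration of the definitions above, the sketch below computes the empirical average of the prediction gap l(f, f′, Z) over a sample, instantiating l as the squared loss. The function names and the squared-loss choice are illustrative, not part of the general setting.

```python
import numpy as np

def squared_loss(f, x, y):
    """Squared prediction loss l(f, (x, y)) = (f(x) - y)^2."""
    return (f(x) - y) ** 2

def empirical_gap(f, f_prime, xs, ys):
    """Empirical average of the prediction gap
    l(f, f', Z) = l(f, Z) - l(f', Z) over the sample (xs, ys)."""
    gaps = [squared_loss(f, x, y) - squared_loss(f_prime, x, y)
            for x, y in zip(xs, ys)]
    return float(np.mean(gaps))
```

A negative empirical gap indicates that f incurs smaller loss than f′ on the sample; the PAC analysis controls how this empirical quantity relates to its expectation under P.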
Table of contents:
Chapter 1 Multi-armed Bandit Games
1 The standard stochastic multi-armed Bandit setting
2 Many extensions to the Bandit setting
3 Exponentially-weighted decision-makers
4 Limitations of the bandit setting
Chapter 2 Bandits with Kullback-Leibler Divergences
2 Definitions and tools
3 Finite-time analysis for Bernoulli distributions
4 A finite-time analysis in the case of distributions with finite support
5 Technical details
Chapter 3 Bandit Algorithms for Online Learning in Adversarial Lipschitz Environments
1 Adversarial learning with full information
2 Applications to learning problems
4 Proof of Theorem 3.1 (ALF strategy)
Chapter 4 Adaptive Bandits
2 Preliminary results
3 Playing against an opponent using a pool of models
5 Technical details
Chapter 5 Statistical Learning
1 The concentration of measure phenomenon
2 Probably-Approximately-Correct analysis
Chapter 6 Linear Regression with Random Projections
2 Summary of the method
3 Gaussian objects
4 Regression with random subspaces
6 Technical details
Chapter 7 Brownian Sensing for the Recovery of a Sparse Function
2 Relation to existing results
3 The “Brownian sensing” approach
5 Numerical Experiments
6 Technical details
Chapter 8 Multiview Learning: Complexity versus Agreement
2 Setup for multiview semi-supervised learning
3 Empirical complexity bound
5 Stability-based parameter selection
Chapter 9 Finite-Sample Analysis of the Bellman Residual Minimization Algorithm
3 Bellman Residual Minimization for Policy Evaluation
4 Bellman Residual Minimization for Policy Iteration
5 Conclusion and comparison with LSTD
6 Technical details
Chapter 10 Least-Squares TD with Random Projections
3 LSTD with Random Projections
4 Finite-Sample Analysis of LSTD with Random Projections
5 LSPI with Random Projections
7 Technical details
Chapter 11 State-Representation in RL
2 Notation and definitions
3 Main results
4 Discussion and outlook
5 Proof of Theorem
Chapter 12 Perspectives and Future Work