Approximate Dynamic Programming in Zero-Sum Two-Player Markov Games 


Least Squares Policy Iteration and Bellman Residual Minimizing Policy Iteration

The Least Squares Policy Iteration (LSPI) algorithm (Lagoudakis and Parr, 2003) and the Bellman Residual Minimizing Policy Iteration (BRMPI) algorithm are approximate PI algorithms. They perform the same two steps as PI, and both approximate the policy evaluation step of PI. Again, as we work with batch data, LSPI and BRMPI perform their iterations on the Q-function instead of the value function. At iteration $k$, the evaluation step of PI computes the state-action value function $Q_k = Q_{\pi_k}$. In the case of LSPI and BRMPI, the Q-functions lie in a $d$-dimensional linear function space $\mathcal{F}_\Phi = \{\Phi\omega,\ \omega \in \mathbb{R}^d\}$. The features $\Phi = [\phi_1, \dots, \phi_d]$ are linearly independent functions from $S \times A$ to $\mathbb{R}$ and can be thought of as vectors of size $|S| \times |A|$ in the vector space with canonical basis $\{\mathbf{1}_{s,a}\}_{(s,a) \in S \times A}$. The feature map $\Phi$ can thus be seen as a matrix of size $(|S| \times |A|) \times d$ and $\omega$ as a vector of size $d$. Furthermore, let $\rho$ be a probability distribution over the state-action space $S \times A$, and write $\Delta_\rho$ for the diagonal operator that maps $\mathbf{1}_{s,a}$ to $\rho(s,a)\mathbf{1}_{s,a}$. Finally, the orthogonal projection onto $\mathcal{F}_\Phi$ with respect to the $L_{\rho,2}$-norm is the operator $\Pi_{\rho,\Phi} = \Phi(\Phi^\top \Delta_\rho \Phi)^{-1}\Phi^\top \Delta_\rho$. The $L_{\rho,p}$-norm of a function $f$ is defined here as $\|f\|_{\rho,p}^p = \sum_{x \in X} \rho(x) |f(x)|^p$.

Least Squares Policy Iteration: At each iteration, for a policy $\pi$, the PI algorithm finds $Q$ satisfying $Q = T_\pi Q$. Instead, LSPI finds $\omega$ satisfying the projected fixed-point equation:
$$\Phi\omega = \Pi_{\rho,\Phi} T_\pi \Phi\omega. \qquad (2.25)$$
The solution of this equation for $\omega$ is given by:
$$\omega = A_{\rho,\Phi,\pi}^{-1}\, b_{\rho,\Phi}, \qquad (2.26)$$
where
$$A_{\rho,\Phi,\pi} = \Phi^\top \Delta_\rho (\Phi - \gamma P_\pi \Phi), \qquad (2.27)$$
and
$$b_{\rho,\Phi} = \Phi^\top \Delta_\rho r. \qquad (2.28)$$
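To make the batch setting concrete, the following is a minimal Python sketch of one LSPI evaluation step, where $A_{\rho,\Phi,\pi}$ and $b_{\rho,\Phi}$ are replaced by their empirical counterparts built from transition samples drawn from $\rho$. The names `lspi_evaluate`, `phi`, `policy` and `samples` are illustrative assumptions, not the thesis's notation.

```python
import numpy as np

def lspi_evaluate(samples, phi, policy, gamma, d):
    """One LSPI evaluation step: solve Phi w = Proj T_pi Phi w from batch data.

    samples: iterable of (s, a, r, s_next) transitions drawn from rho
    phi:     feature map, phi(s, a) -> np.ndarray of shape (d,)
    policy:  evaluated policy, policy(s) -> action
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s_next in samples:
        f = phi(s, a)                          # phi(s, a)
        f_next = phi(s_next, policy(s_next))   # phi(s', pi(s'))
        A += np.outer(f, f - gamma * f_next)   # empirical Phi^T Delta_rho (Phi - gamma P_pi Phi)
        b += f * r                             # empirical Phi^T Delta_rho r
    # A least-squares solve is preferred to an explicit inverse for numerical stability.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w
```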

Approximate Modified Policy Iteration

In the two previous sections (Sections 1.2.1 and 1.2.2) we introduced approximations of VI and PI. These approximations are made in the evaluation step (i.e. $v_k = T_{\pi_k} v_{k-1}$ for VI and $v_k = (T_{\pi_k})^{+\infty} v_{k-1}$ for PI). In this section, we present results that shed light on the robustness of these algorithms to errors. Imagine that, instead of performing the iteration $v_k = T_{\pi_k} v_{k-1}$ (where $\pi_k \in \mathcal{G}(v_{k-1})$), the update of $v_k$ is made up to an error $\epsilon_k$, so that $v_k = T_{\pi_k} v_{k-1} + \epsilon_k$. Approximations can also occur in the greedy step (Gabillon et al., 2011, Scherrer et al., 2012), as discussed in Section 1.2.1: instead of taking a greedy action, some algorithms select suboptimal actions in some states. We shall write $\pi \in \hat{\mathcal{G}}_{\epsilon'}(v)$ if $T_* v \leq T_\pi v + \epsilon'$. In this section we provide the state-of-the-art sensitivity analysis of Approximate Modified Policy Iteration (AMPI)-like algorithms. The generic scheme studied is described in Algorithm 7 and applies to all algorithms described in Section 1.2.1 and in Section 1.2.2.

The following analysis is standard for approximate dynamic programming algorithms. As this kind of analysis will be carried out for several algorithms in Part II, we detail here the case of AMPI as analyzed by Scherrer et al. (2012). The bounds decompose into a sum of three terms: the value update error, the greedy error and a concentration term. Each of these terms is itself a product of three quantities. All of these terms can be controlled: the first by the accuracy of the value function update, the second by keeping the approximation in the greedy step small, and the last by the number of iterations of the algorithm. These analyses were first performed in $L_\infty$-norm and later in $L_p$-norm (Farahmand et al., 2010, Antos et al., 2008a, Munos, 2007).

The performance of AMPI after $k$ iterations measures, in $L_{\rho,p}$-norm, the distance between $v_{\pi_k}$ and $v_*$. The measure $\rho$ specifies the distribution over the state space on which the performance needs to be accurate. Thus, we will bound the norm $\|v_* - v_{\pi_k}\|_{\rho,p}$. Since the method performs approximations at each iteration, the bound depends on $\epsilon_1, \dots, \epsilon_{k-1}$ and $\epsilon'_1, \dots, \epsilon'_k$. These errors are assumed to be controlled in $L_{\sigma,pq'}$-norm, where $\sigma$ is the distribution on which the estimation is supposed to be accurate. For instance, in Fitted-Q iteration, $\epsilon_j$ (for $j \leq k$) is the error of the regression at iteration $j$. In that case, $\sigma$ is the distribution of the data and $pq' = 2$ (since the $L_2$ loss is often used for trees). Theorem 2.3 shows the impact of the approximation on the final solution.
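As a rough tabular illustration of this generic scheme (Algorithm 7 in the text), the sketch below runs an AMPI-style loop in which an explicit evaluation error is injected at every iteration: a greedy step $\pi_{k+1} \in \mathcal{G}(v_k)$ followed by $m$ applications of $T_{\pi_{k+1}}$ plus a perturbation $\epsilon_{k+1}$. The tabular representation, the Gaussian noise model and the exact greedy step are assumptions made for the example, not the algorithm as stated in the thesis.

```python
import numpy as np

def ampi(P, R, gamma, m=3, iters=50, noise=0.01, seed=0):
    """Tabular AMPI sketch: pi_{k+1} greedy w.r.t. v_k, then m applications
    of T_{pi_{k+1}} to v_k, perturbed by an error eps_{k+1}.

    P: transitions, shape (S, A, S); R: rewards, shape (S, A).
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = R + gamma * P @ v                  # Q(s, a) = r(s, a) + gamma * E[v(s')]
        pi = q.argmax(axis=1)                  # (exact) greedy step
        for _ in range(m):                     # partial evaluation: (T_pi)^m v
            v = R[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ v
        v += noise * rng.standard_normal(S)    # evaluation error eps_k
    return v, pi
```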

General-Sum Markov Games

General-sum Markov Games (MGs) are a generalization of MDPs to the multi-player scenario and a generalization of NFGs to environments where the interaction also depends on a state. In an MG, $N$ players evolve in a discrete state space $S$. As in MDPs, the interaction is sequential and, at time $t$, the players are in state $s_t$. In that state, they all simultaneously choose an action $a^i_t$ in the action space $A^i$ ($a^i_t$ is the action of player $i$ at time $t$) and receive a reward $r^i(s_t, a^1_t, \dots, a^N_t)$. Again, we will write $a = (a^1, \dots, a^N) = (a^i, a^{-i})$, where $a^{-i}$ is the joint action of every player except player $i$. Then, the state changes to $s_{t+1} \sim p(.|s_t, a^1_t, \dots, a^N_t)$, where $p(.|s, a^1, \dots, a^N)$ is the transition kernel of the MG. The constant $\gamma \in [0, 1]$ is the discount factor of the MG. Finally, an MG is a tuple $\langle S, \{A^i\}_{i \in \{1,\dots,N\}}, \{r^i\}_{i \in \{1,\dots,N\}}, p, \gamma \rangle$.
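As an illustration of this tuple, a minimal finite-MG environment could be represented as follows. The class name, the tensor layout and the `step` interface are assumptions made for the sketch, not constructs from the thesis.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MarkovGame:
    """Finite N-player general-sum MG <S, {A^i}, {r^i}, p, gamma>."""
    rewards: np.ndarray      # shape (N, S, A^1, ..., A^N): r^i(s, a^1, ..., a^N)
    transitions: np.ndarray  # shape (S, A^1, ..., A^N, S): p(s' | s, a^1, ..., a^N)
    gamma: float

    def step(self, s, joint_action, rng):
        """All players act simultaneously; returns one reward per player and the next state."""
        r = self.rewards[(slice(None), s) + tuple(joint_action)]   # r^1, ..., r^N
        p_next = self.transitions[(s,) + tuple(joint_action)]      # distribution over s'
        s_next = rng.choice(len(p_next), p=p_next)
        return r, s_next
```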
In an MG, each player's goal is to find a strategy $\{\pi^i_t\}_{t \in \mathbb{N}}$. The strategy $\pi^i_t$ at time $t$ is a mapping from the state space to a distribution over the action space (i.e. $\pi^i_t \in \Delta_{A^i}^S$). These strategies are independent, and actions are supposed to be chosen simultaneously. Again, we will define $\{\pi_t\}_{t \in \mathbb{N}}$ as the joint strategy of all players and $\{\pi^{-i}_t\}_{t \in \mathbb{N}}$ as the joint strategy of every player except $i$. If the strategy is stationary, it will be written $\pi^i$, $\pi$ and $\pi^{-i}$. Furthermore, we will often use short notations such as $\pi(b|s) = \pi^1(b^1|s) \times \dots \times \pi^N(b^N|s)$ or $\pi^{-i}(b^{-i}|s) = \pi^1(b^1|s) \times \dots \times \pi^{i-1}(b^{i-1}|s) \times \pi^{i+1}(b^{i+1}|s) \times \dots \times \pi^N(b^N|s)$.
The value for player $i$ in state $s$ is his expected $\gamma$-discounted sum of rewards starting from state $s$ when all players play their part of the joint strategy $\{\pi_t\}_{t \in \mathbb{N}}$:
$$v^i_{\{\pi_t\}_{t \in \mathbb{N}}}(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r^i(s_t, a_t) \,\middle|\, s_0 = s,\ a_t \sim \pi_t(.|s_t),\ s_{t+1} \sim p(.|s_t, a_t)\right]. \qquad (2.53)$$
When the strategy is stationary, the value function of player $i$ is defined as:
$$v^i_{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r^i(s_t, a_t) \,\middle|\, s_0 = s,\ a_t \sim \pi(.|s_t),\ s_{t+1} \sim p(.|s_t, a_t)\right]. \qquad (2.54)$$
In the case of stationary strategies, if every player except player $i$ sticks to his strategy $\pi^{-i}$, the value of the best response player $i$ can achieve is (with $\pi = (\pi^i, \pi^{-i})$):
$$v^i_{*\pi^{-i}}(s) = \max_{\pi^i} \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r^i(s_t, a_t) \,\middle|\, s_0 = s,\ a_t \sim \pi(.|s_t),\ s_{t+1} \sim p(.|s_t, a_t)\right]. \qquad (2.55)$$
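To connect these definitions to the batch setting used throughout, the sketch below estimates $v^i_\pi(s)$ for stationary strategies by Monte Carlo rollouts of the `MarkovGame` object sketched above. The truncation horizon, the episode count and the array representation of the strategies are arbitrary choices made for the illustration.

```python
import numpy as np

def estimate_value(game, policies, player, s0, episodes=1000, horizon=200, seed=0):
    """Monte Carlo estimate of v^i_pi(s0) under stationary strategies.

    policies[j][s] is the distribution over player j's actions in state s.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):  # truncate the infinite discounted sum
            joint = [rng.choice(len(pi[s]), p=pi[s]) for pi in policies]
            r, s = game.step(s, joint, rng)
            ret += discount * r[player]
            discount *= game.gamma
        total += ret
    return total / episodes
```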


Zero-Sum Two-Player Markov Games

As for NFGs, we introduce specific notations and highlight the properties specific to zero-sum two-player MGs. In a zero-sum two-player MG, the reward of player 1 is the loss of player 2. Thus, for all $(s, a^1, a^2) \in S \times A^1 \times A^2$ we have $r^1(s, a^1, a^2) = -r^2(s, a^1, a^2) = r(s, a^1, a^2)$. In this setting, player 1 attempts to maximize his rewards while player 2 attempts to minimize his losses. We will write $\mu$ for the strategy of player 1 (instead of $\pi^1$) and $\nu$ for the strategy of player 2 (instead of $\pi^2$). The value of the joint strategy $(\mu, \nu)$ is $v_{\mu,\nu} = v^1_{\mu,\nu} = v^1_{\pi^1,\pi^2}$. The value of the best response of player 2 to the strategy $\mu$ of player 1 is:
$$v_\mu = \min_\nu v_{\mu,\nu} = -v^2_{*\mu}.$$
In a zero-sum two-player MG, all Nash equilibria share the same unique value:
$$v_* = \min_\nu \max_\mu v_{\mu,\nu} = \max_\mu \min_\nu v_{\mu,\nu}.$$
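A standard way to compute this minimax value is a Shapley-style value iteration that backs the value up through a zero-sum matrix game solved at every state. The sketch below solves each stage game as a linear program with `scipy.optimize.linprog`; it is only an illustration of the idea, not the value iteration algorithm as stated in the thesis.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Value of the zero-sum matrix game Q (row player maximizes)."""
    n_rows, n_cols = Q.shape
    # Variables z = (v, x_1, ..., x_n): maximize v s.t. Q^T x >= v, x in the simplex.
    c = np.concatenate(([-1.0], np.zeros(n_rows)))          # minimize -v
    A_ub = np.hstack((np.ones((n_cols, 1)), -Q.T))           # v - (Q^T x)_j <= 0
    b_ub = np.zeros(n_cols)
    A_eq = np.concatenate(([0.0], np.ones(n_rows))).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, None)] * n_rows
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0]

def shapley_value_iteration(R, P, gamma, iters=200):
    """R: (S, A1, A2) rewards for player 1; P: (S, A1, A2, S) transitions."""
    S = R.shape[0]
    v = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ v                               # Q(s, a1, a2) = r + gamma E[v(s')]
        v = np.array([solve_matrix_game(Q[s]) for s in range(S)])
    return v
```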

Table of contents:

1 Summary of the Notations
Part I. Introduction, Background, and Related Work 
Chapter 1 Introduction 
1 Structure of the dissertation
2 Contributions
Chapter 2 Background and Related Work 
1 Markov Decision Processes
1.1 Exact Algorithms
1.1.1 Value Iteration
1.1.2 Policy Iteration
1.1.3 Modified Policy Iteration
1.1.4 Minimization of the Bellman Residual
1.2 Batch Algorithms
1.2.1 Fitted-Q iteration and Neural Fitted-Q
1.2.2 Least Squares Policy Iteration and Bellman Residual Minimizing Policy Iteration
1.2.3 Approximate Modified Policy Iteration
1.2.4 Bellman Residual Minimization
1.3 Online Learning in MDPs
1.3.1 Q-learning
1.3.2 SARSA
2 Normal-Form Games
2.1 Nash Equilibrium
2.2 Zero-Sum Normal-Form Games
3 General-Sum Markov Games
3.1 Nash Equilibrium and ε-Nash Equilibrium
3.2 Bellman Operator in General-Sum Games
3.3 Zero-Sum Two-Player Markov Games
3.4 Exact Algorithms for Zero-Sum Two-Player Markov-Games
3.4.1 Value Iteration
3.4.2 Policy Iteration by Hoffman and Karp
3.4.3 The Algorithm of Pollatschek and Avi-Itzhak
3.4.4 Generalized Policy Iteration
3.4.5 Minimizing the Bellman Residual: Filar and Tolwinski’s Algorithm
3.5 Batch Algorithms for Zero-Sum Two-Player Markov-Games
3.5.1 Least-Squares Policy Iteration
3.6 Exact Algorithms for General-Sum Markov-Games
3.7 Independent Reinforcement Learning for Markov Games
3.7.1 Q-Learning Like Algorithms
3.7.2 Independent Policy Gradient
3.7.3 Fictitious Play
Part II. Approximate Dynamic Programming in Zero-Sum Two-Player Markov Games 
Chapter 3 Approximate Dynamic Programming in Games 
1 Approximate Dynamic Programming: A Unified Scheme
1.1 Approximate Generalized Policy Iteration
1.2 Error Propagation
2 Empirical Evaluation
2.1 Algorithm
2.2 Analysis
2.3 Complexity analysis
2.4 The Game of Alesia
3 Conclusion and Perspectives
4 Appendix: Demonstration of Lemma 3.2
4.1 Demonstration of Theorem 3.1
5 Appendix: Bound for AGPI-Q
6 Appendix: Experiments
6.1 Mean square error between approximate value and exact value
6.2 Exact value function, approximate value function and error for N = 10000
Chapter 4 Improved Approximate Dynamic Programming Algorithms using non-stationary Strategies 
1 Non-Stationary Strategy in Zero-Sum Two-Player Markov Games
2 Algorithms
2.1 Value Iteration and Non-Stationary Value Iteration
2.2 Policy Search by Dynamic Programming (PSDP)
2.3 Non Stationary Policy Iteration (NSPI(m))
2.4 Summary
3 Empirical Evaluation
4 A Comparison
5 Conclusion
6 Appendix: NSVI
7 Appendix: PSDP
7.1 Appendix: PSDP2
7.2 Appendix: PSDP1
8 Appendix: NSPI
9 Appendix: Figures
Chapter 5 On the use of non-stationary strategies 
1 Background on Cyclic Strategies and Nash Equilibrium
2 The Value Iteration Algorithm for General-Sum MGs
3 Illustrating example of lower bound
4 Approximate Value Iteration
5 Experiments
6 Conclusion
7 Proof of Theorem 5.1
8 Proof of Theorem 5.2
Part III. Learning in Games: A Bellman Residual Approach
Chapter 6 Bellman Residual Minimization in Zero-Sum Games
1 Background
2 Newton’s Method on the OBR with Linear Function Approximation
2.1 Newton’s Method on the POBR
2.2 Newton’s Method on the OBR
2.3 Comparison of BRMPI and Newton-LSPI
3 Batch Learning in Games
3.1 Newton-LSPI with Batch Data
3.2 BRMPI with Batch Data
4 Quasi-Newton’s Method on the OBR and on the POBR
5 Experiments
5.1 Experiments on Markov Decision Processes
5.2 Experiments on Markov Games
6 Conclusion
7 Appendix
7.1 Remark
7.2 Proof of Theorem 6.1
7.3 Proof of Theorem 6.2
8 Computation of the Gradient and of the Hessian
Chapter 7 Bellman Residual Minimization in General-Sum Games 
1 Nash, ε-Nash and Weak ε-Nash Equilibrium
2 Bellman Residual Minimization in MGs
3 The Batch Scenario
4 Neural Network Architecture
5 Experiments
6 Conclusion
7 Proof of the Equivalence of Definition 2.4 and 7.1
8 Proof of Theorem 7.1
9 Additional curves
Part IV. Independent Learning in Games 
Chapter 8 Actor-Critic Fictitious Play 
1 Specific Background
2 Actor-Critic Fictitious Play
3 Fictitious play in Markov Games
4 Stochastic Approximation with Two-Timescale
5 Experiment on Alesia
6 Conclusion
7 Proof of Lemma 8.2
8 Proof of Theorem 8.1
9 Convergence in Cooperative Multistage Games
10 Proof of Proposition 1
11 Proof of Proposition 2
12 Proof of Proposition 3
13 Analysis of Actor-Critic Fictitious Play
14 On the Guarantees of Convergence of OFF-SGSP and ON-SGSP
Part V. Conclusions and Future Work 
1 Conclusion
2 Future Work
Bibliography

