Get Complete Project Material File(s) Now! »

## Reinforcement Learning

Reinforcement learning [94] is learning from interaction with an environment to achieve a goal. It is a powerful framework to solve sequential decision-making prob-lems. The agent discovers which actions produce the greatest reward by experiencing actions and learns how good it is for the agent to be in a state over the long run, called the value of state, or how good it is to take a certain action in a given state over the long-term, quanti ed by the value of action. Reinforcement learning aims to maximize the total reward in the long run. Rewards are given immediately by se-lecting an action but values of states (or actions) must be estimated from an agent’s experience. Since states (or actions) with the highest values can bring about the greatest amount of reward over the long run, we are most concerned with the value of state (or action) when making decisions.

In this chapter, we start with Markov decision processes (MDPs) which are a key formalism for reinforcement learning. Then, we describe the fundamental meth-ods for solving MDP problems, such as dynamic programming (DP) and temporal-di erence (TD) methods. These methods based on tabular forms can be extended into function approximation methods that can be applied to much larger problems. The remaining sections present some important topics or improvements from the literature.

**Markov Decision Processes**

The notion of Markov Decision Process (MDP) underlies much of the work on re-inforcement learning. An MDP is de ned as a 4-tuple M = hS; A; R; T i where S is a set of states, A is a set of actions, R : S A ! R is a reward function, and T : S A S ! [0; 1] is a transition function. The reward function returns a single number, a reward, for an action selected in a given state. The transition function speci es the probability of transition from state s to state s0 on taking action a (denoted T (s; a; s0) or, simply, Pr(s0 j s; a)). A nite MDP is an MDP in which the sets of states S, actions A, and rewards R have a nite number of elements.

An entity that learns and makes decisions is called the agent and everything outside the agent is called the environment. The agent learns from continual inter-action with an environment to achieve a goal. The agent selects actions and the environment responds to these actions by giving a reward and presenting new state.

The objective of the agent is to maximize the total amount of reward it receives in the long run.

We usually describe the interaction between the agent and the environment with a sequence of discrete time steps. At time step t, the agent is given a state st and selects an action at on the basis of the current state. At time step t + 1, the agent receives a reward rt+1 and new state st+1 as a result of taking action.

In the MDP environment, a state should retain all relevant information, though we are not concerned with the complete history of states that led to it. We say that the state has the Markov property: if the state is a Markov state, then the environ-ment’s response at time t + 1 depends only on the state and action representations at time t. This Markov property enables to predict the next state and reward given the current state and action. Relevant information about states and actions are typically summarized in a compact form.

A policy : S A ! [0; 1] maps states to probabilities of selecting an action. The policy represents the agent’s action selection in a certain state s. In any MDP, there is a policy that is better than or equal to all other policies for all states. This is called an optimal policy, denoted . The goal of the agent is to nd an optimal policy that maximizes the total reward in the long run.

Rewards and Values. At each time step, the agent receives a reward from the environment as a consequence of its behavior. The objective of the agents is to maximize the total amount of reward it receives in the long run. To achieve the objective, the agent has to estimate the expected total amount of reward starting from a state, called the value of state. While a reward characterizes how good the action is in an immediate sense, a value function of state measures how good it is for the agent to be in a given state over the long run. A reward is given directly from the environment but a value function must be learned from the agent’s experience. In reinforcement learning, when making decisions, the agent does not focus on immediate rewards but on values, i.e., cumulative reward in the long run. A state might yield a low immediate reward but it does not mean that the following state also brings about a low reward. If a state with a low reward is followed by some states that yield high rewards, it has a high value. Thus, the agent has to follow states with the highest values not the highest immediate rewards because those states bring about the greatest amount of reward over the long run.

Value Functions. For each time step, the agent selects an action and then learns from its experience. By repeated action-selection behavior, the agent learns the value of being in a state and taking an action in a state. The value functions are de ned with respect to policies. The value of a state s under a policy , denoted v (s), is de ned as the expected future rewards when starting from state s and following policy , using a discount factor 0 1

**Dynamic Programming**

Dynamic programming (DP) is a collection of algorithms that assume a complete and perfect model of the environment’s dynamics as an MDP and compute optimal policies using the model. The model of the environment’s dynamics means the transition function and the reward function. Since DP requires a prior knowledge of a complete model and a great computational expense, DP is of limited applicability but it is an essential foundation of reinforcement learning. DP algorithms use value functions to obtain optimal policies. We present two fundamental DP methods, policy iteration and value iteration, in the following subsections.

**Policy Iteration**

Policy iteration consists of two processes, policy evaluation and policy improvement. Policy evaluation computes the value functions consistent with a given policy and policy improvement makes the policy greedy with respect to the value function obtained in policy evaluation.

Policy Evaluation In policy evaluation, the state-value function v is computed for an arbitrary policy . We assume that the environment’s dynamics are com-pletely known. The state-value function v can be obtained iteratively by using the Bellman equation.

vk(st+1) j st = s] (2.9) vk+1(s) = E [rt+1 = X (a j s) X Pr(s0 j s; a) [R(s; a) + vk(s0)] (2.10) a s0 where (a j s) is the probability of taking action a in state s under policy . The sequence (vk) converges to v as k ! 1 if < 1 or eventual termination is guaranteed from all states under the policy [94]. The iteration is stopped when the changes of value functions are lower than a threshold . The iterative policy evaluation algorithm is shown as follows:

Algorithm 1 Iterative policy evaluation

1: Input: , the policy to be evaluated

2: Initialize an array V (s) = 0, for all s 2 S

3: repeat

4: 0

5: for each s 2 S do

6: v V (s)

7: V (s) P a (a j s) P s0 Pr(s0 j s; a) [R(s; a) + V (s0)]

8: max( ; v V (s) )

9: end for

10: until < (a small positive number)

11: Output V v

Policy Improvement We obtained the value function v under an arbitrary de-terministic policy from the policy evaluation step. In the policy improvement method, we consider whether it is better to change the current policy to the new policy 0. For example, for some state s, we keep following the current policy (s) or select an action a 6= (s). If it is better to select a in s and thereafter follow the existing policy than it would be to follow all the time, we have to change the current policy to new policy 0 with 0(s) 6= (s). This is generalized in the policy improvement theorem.

**Theorem 1** (policy improvement theorem). Let and 0 be any pair of determin-istic policies such that, for all s 2 S, q (s; 0(s)) v (s). Then the policy 0 must be as good as, or better than, . That is, it must obtain greater or equal expected return from all states s 2 S: v0 (s) v (s).

According to the policy improvement theorem, we can de ne the new greedy policy 0 by 0 : (2.11) (s) = arg max q (s; a) = arg max E [r t+1 + v (s t+1 ) j s t = s; a t = a ] (2.12) = arg max X Pr(s0 j s; a) [R(s; a) + v (s0)] (2.13) a s0

The new policy 0 improves on an original policy by greedily taking actions ac-cording to value function v . The deterministic policy we have seen above can be extended to the general form, a stochastic policy that is speci ed by probabilities (a j s) for taking each action a in each state s. Policy Iteration In the policy iteration method, we repeat policy evaluation and policy improvement until convergence to an optimal policy and optimal value func-tion. Given an arbitrary policy , we compute value function v and improve the policy with respect to value function v to yield a better policy 0. Then we compute again v0 and improve 0 to get a better policy 00 and so on. A complete algorithm is as follows:

Algorithm 2 Policy Iteration (using iterative policy evaluation)

1: V (s) 2 R and (s) 2 A(s) arbitrarily for all s 2 S

2:

3: // 1. Policy Evaluation

4: repeat

5: 0

6: for each s 2 S do

7: v V (s)

8: V (s) s0 Pr(s0 j s; (s)) [R(s; a) + V (s0)]

P j v V (s) j )

9: max( ;

10: end for

11: until < (a small positive number)

12:

13: // 2. Policy Improvement

14: policy-stable true

15: for each s 2 S do

16: old-action(s)

17: (s) arg max Ps0 Pr(s0 j s; a) [R(s; a) + V (s0)]

18: If old-action 6= (s), then policy-stable f alse

19: end for

20: If policy-stable, then stop and return V v and; else go to 4

### Value Iteration

In policy iteration, each iteration involves policy evaluation that repeatedly sweeps through the state space. The policy evaluation step can be truncated without losing the convergence guarantees of policy iteration [94]. When policy evaluation is used just once, we call this value iteration. Then we combine the policy evaluation and the policy improvement steps in a simple update operation:

v s : max E [r + v (s ) j s = s; a = a] (2.14)

k+1( ) = a t+1 j k t+1 t t )] (2.15)

= a Xs0 Pr( s0 ) [ ( ) + k( s0

max s; a R s; a v

In Eq. (2.15), an action is greedily selected with respect to the current value function and the selected greedy action is used to update the value function. A complete algorithm of value iteration is as follows:

Algorithm 3 Value Iteration

1: Initialize array V arbitrarily (e.g., V (s) = 0 for all s 2 S)

2: repeat

3: 0

4: for each s 2 S do

5: v V (s)

6: V (s) maxa Ps0 Pr(s0 j s; a) [R(s; a) + V (s0)]

7: max( ; j v V (s) j)

8: end for

9: until < (a small positive number)

10: Output a deterministic policy,, such that

11: (s) arg max Ps0 Pr(s0 j s; a) [R(s; a) + V (s0)]

**Interaction between Policy Evaluation and Policy Im-provement.**

In policy iteration, two processes, policy evaluation and policy improvement, alter-nate. When one process completes then the other process begins. Policy evaluation makes the value function consistent with the current policy. Then the updated value function is used to improve the policy. In policy improvement, the policy is changed with respect to the current value function. As the interaction of two pro-cesses continues, the policy and the value function move toward the optimal. When the policy and the value function are not changed by either process then they are optimal. The interaction between the policy evaluation and policy improvement processes underlies almost all reinforcement learning methods [94]. Whereas policy iteration separates them as two di erent processes, value iteration merges them in one process.

**Temporal-Di erence Methods**

While dynamic programming (DP) in section 2.2 needs a complete and accurate model of the environment, temporal-di erence (TD) methods do not require prior knowledge about the environment’s dynamics. They compute value functions using raw experience in an on-line, fully incremental manner. For example, at time t, if the agent takes action at in state st under policy , the action causes a transition to st+1 with reward rt+1. With this experience, the TD method updates the value function of st by: V (st) V (st) + [rt+1 + V (st+1) V (st)]

The quantity in brackets in the update is called the TD error. : + V (st+1) V (st) t = rt+1

It is the di erence between the current estimated value of st and the better estimated value based on the actual observed reward and the estimated value of the next state, st+1, i.e., rt+1 + V (st+1). As value functions are repeatedly updated, the errors are reduced. Here, is a positive fraction such that 0 < 1, the step-size parameter that in uences the rate of learning. When = 1, the agent considers only the most recent information for learning. If is properly reduced over time, the function converges [94].

TD methods are the most widely used methods due to their several advantages such as computational simplicity, on-line learning approach, and learning directly from experience generated from interaction with an environment.

In TD methods, each iteration of value updates is based on an episode, a sequence of state transitions from a start state to the terminal state (or, in some cases, till some other condition has been reached, for example, the limited number of time steps, etc.). For example, at time t, in state s, the agent takes an action a according to its policy, which results in a transition to state s0. At time t + 1 in the successor state of s, state s0, the agent takes its best action a0 followed by a transition to state s00 and so on until the terminal state. Each episode starts in a starting state or any randomly selected state and ends in the terminal state.

A complete algorithm for the TD method is shown in Algorithm 4.

Algorithm 4 Tabular TD for estimating v

1: Input: the policy to be evaluated

2: Initialize V (s) arbitrarily (e.g., V (s) = 0, for all s 2 S)

3: repeat for each episode

4: Initialize s

5: repeat for each step of episode

6: a action given by for s

7: Take action a, observe r; s0

8: V (s) V (s) + [r + V (s0) V (s)]

9: s s0

10: until s is terminal

11: until

We now turn to action-value functions rather than state-value functions. There are two main approaches for learning q in TD methods: on-policy and o -policy.

On-policy TD method: Sarsa First we consider an on-policy TD method called Sarsa [94]. In on-policy method, we estimate q (s; a) for the current behavior pol-icy and change toward greediness with respect to q . This method learns action-values based on transitions from a state-action pair to a state-action pair. Since it uses a tuple of transition experience (st; at; rt+1; st+1; at+1)for each update, it is named Sarsa and an action value is updated by:

Q(st; at) Q(st; at) + [rt+1 + Q(st+1; at+1) Q(st; at)] (2.18)

The pseudocode of Sarsa is given in Algorithm 5. An action can be greedily selected all the time, called greedy policy. Alternatively, most of the time, the agent selects an action with the highest estimated value, but with small probability selects an action uniformly at random, called -greedy policy that is one of the most commonly used methods (see Section 2.5 for other action selection methods).

Algorithm 5 Sarsa (on-policy TD) for estimating Q q

1: Initialize Q(s; a) for all s 2 S; a 2 A(s), arbitrarily, and Q(terminal-state, ) = 0

2: repeat for each episode

3: Initialize s

4: Choose a from s using policy derived from Q (e.g., -greedy)

5: repeat for each step of episode

6: Take action a, observe r; s0

7: Choose a0 from s0 using policy derived from Q (e.g., -greedy)

8: Q(s; a) Q(s; a) + [r + Q(s0; a0) Q(s; a)]

9: s s0

10: a a0

11: until S is terminal

12: until

If all state-action pairs are visited in nitely often, Sarsa converges with proba-bility 1 to an optimal policy and action-value function.

O -policy TD method: Q-learning Now we consider an o -policy TD method called Q-learning [97]. This method learns the optimal action-value, regardless of the policy being followed. This method is de ned by: Q(st; at) Q(st; at) + rt+1 + max Q(st+1; a)

A minimal requirement of convergence to the optimal policy is that all state-action pairs are visited an in nite number of times. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parame-ters, Q has been shown to converge with probability 1 to the optimal action-values q [94].

The pseudocode of Q-learning is given in Algorithm 6.

On-policy and O -policy. An important challenge of reinforcement learning is the exploration{exploitation dilemma. The agent has to learn the optimal policy

Algorithm 6 Q-learning (o -policy TD) for estimating

1: Initialize Q(s; a) for all s 2 S; a 2 A(s), arbitrarily, and Q(terminal-state, ) = 0

2: repeat for each episode

3: Initialize s

4: repeat for each step of episode

5: Choose a from s using policy derived from Q (e.g., -greedy)

6: Take action a, observe r; s0

7: Q(s; a) Q(s; a) + [r + maxa Q(s0; a) Q(s; a)]

8: s s0

9: until S is terminal

10: until

while behaving non-optimally, i.e., by exploring all actions. This dilemma brings about two main approaches for learning action values: on-policy and o -policy.

In on-policy methods, the agent learns the best policy while using it to make decisions. On the other hand, o -policy methods separate it into two policies. That is, the agent learns a policy di erent from what currently generates behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. Since learning is from experience \o » the target policy, these methods are called o -policy learning. The on-policy methods are generally simpler than o -policy methods but they learn action values not for the optimal policy, but for a near-optimal policy that still explores [94]. The o – policy methods learn the optimal policy and they are considered more powerful and general but they are often of greater variance and are slower to converge [94].

While on-policy methods learn policies depending on actual behavior, o -policy methods learn the optimal policy independent of agent’s actual behavior, i.e., the policy actually used during exploration. In Sarsa, it updates action values using a value of the current policy’s action a0 in next state s0. In Q-learning, it updates its action-value using the greedy (or optimal) action a0 of next state s0 but the agent selects an action by -greedy policy. In Q-learning, the target policy is the greedy policy and the behavior policy is -greedy policy.

#### Function Approximation Methods

Many classical reinforcement-learning algorithms have been applied to small nite and discrete state spaces and value functions are represented using a tabular form that stores the state(-action) values in a table. In such small and discrete problems, a lookup table represents all state-action values of a learning space. However, in many realistic problems with large and continuous state spaces, there are many more states than could possibly be entries in a table. In such problems, a major challenge is to represent and store value functions. Thus, the tabular methods typically used in reinforcement learning have to be extended to apply to such large problems, for example, using function approximation methods. The approximate value function is represented as a parameterized functional form with weight vector w 2 Rd. v^(s; w) v (s) denotes the approximate value of state s given weight vector w.

First we consider the squared di erence between the approximate value v^(s; w) and the true value v (s) over the state space, i.e., the Mean Squared Value Er-ror. Then, we study gradient-descent methods to minimize the error. Finally, we introduce linear function approximation based on the gradient-descent method.

Mean Squared Value Error. In tabular methods, learning at a certain state yields an update to the state’s value function, though the values of all other states are left unchanged. That is, an update is applied only to the current state and it does not a ect value functions of the other states. Each state-action value from a lookup table represents the true value function of one state-action pair. However, in approximation methods, the number of states is larger than the number of weights, the dimensionality of weight vector w. Thus, an update at one state a ects the estimated values of many other states and it is not possible that all state values are correctly estimated [94]. Updating at one state makes its estimated value more accurate, but it may make values of other states less correct because the estimated values of other states are changed as well. Hence, in the function approximation, rather than trying to make zero error of value functions for all states, we aim to balance the errors in di erent states [94]. For that, it is necessary to specify a state weighting or distribution (s) 0; Ps (s) = 1 in order to represent how much we care about the error in each state s [94]. For example, the fraction of time spent in s may be used as (s). The squared di erence between the approximate value v^(s; w) and the true value v (s)

**Table of contents :**

**1 Introduction **

1.1 Motivation and Objective

1.2 Reinforcement Learning

1.3 Overview and Contributions

**I Literature Review **

**2 Reinforcement Learning **

2.1 Markov Decision Processes

2.2 Dynamic Programming

2.2.1 Policy Iteration

2.2.2 Value Iteration

2.2.3 Interaction between Policy Evaluation and Policy Improvement

2.3 Temporal-Dierence Methods

2.4 Function Approximation Methods

2.5 Exploration

2.6 Model-Based and Model-Free Methods

2.7 Priority-Based Value Iteration

2.8 Non-Stationary Environment

**II Applications **

**3 Model-Free and Model-Based Methods **

3.1 Introduction

3.2 Learning without Models

3.2.1 Background and Related Work

3.2.2 Q-learning for Taxi Routing

3.2.3 Performance Evaluation

3.2.4 Demonstration Scenario

3.3 Learning Models

3.3.1 Background: Factored MDP

3.3.2 Related work

3.3.3 Algorithm for Structure Learning

3.3.4 Experiments

3.4 Discussion and Future Research

3.5 Conclusion

**4 Focused Crawling **

4.1 Introduction

4.2 Background

4.3 Focused Crawling and Reinforcement Learning

4.3.1 Markov Decision Processes (MDPs) in Crawling

4.3.2 MDPs with Prioritizing Updates

4.3.3 Linear Function Approximation with Prioritizing Updates

4.4 Experimental Results

4.5 Related Work

4.6 Future Work

4.7 Conclusion

**5 Inuence Maximization **

5.1 Introduction

5.2 Background

5.3 Topic-Based Inuence Maximization Algorithm for Unknown Graphs

5.3.1 Problem Statement and our Method

5.3.2 Modeling and Algorithm

5.4 Related Work

5.5 Future Work

5.6 Conclusion

**6 Conclusion **

6.1 Future Work

6.2 Conclusion

**Bibliography **

**Appendices **