MARKOV DECISION PROCESSES AND REINFORCEMENT LEARNING IN ROBOTICS


Introduction

In recent years, machine learning techniques have been applied to robot control problems at a rapidly growing pace. Machine learning helps an agent learn from example data or past experience to solve a given problem. In supervised learning, the learner is provided an explicit target for every single input; that is, the environment tells the learner what its response should be. In contrast, in reinforcement learning (RL), the learner receives only partial feedback about its decisions. Under the RL framework, the learner is therefore a decision-making agent that takes actions in an environment and receives a reward (or penalty) for its actions while trying to solve a problem. After a set of trial-and-error runs, it should learn the best policy, i.e. the sequence of actions that maximizes the total reward (Sutton & Barto, 1998).
Reinforcement learning typically operates in the interaction setting shown in Figure 2.1: the learning agent interacts with an initially unknown environment and receives, as feedback, a representation of the state together with an immediate reward. It then computes an action and executes it. This action causes the environment to transition to a new state; the agent receives the new state representation and the corresponding reward, and the whole process repeats.
The environment in RL is typically formulated as a Markov Decision Process (MDP), and the goal is to learn a control strategy that maximizes the total reward, which represents a long-term objective. In this chapter, we introduce the structural background of Markov Decision Processes and of reinforcement learning in robotics.

Markov Decision Processes

A Markov Decision Process describes a sequential decision-making problem in which an agent must choose the sequence of actions that maximizes some reward-based optimization criterion (Puterman, 1994; Sutton & Barto, 1998). Formally, an MDP is a tuple M = {S, A, T, r, γ} (a minimal code sketch of this tuple follows the list below), where
• S = {s1, . . . , sN } is a finite set of N states that represents the dynamic environment,
• A = {a1, . . . , ak} is a set of k actions that could be executed by an agent,
• T : S × A × S −→ [0, 1] is a transition probability function, or transition model, where T (s, a, s′) stands for the probability that applying action a ∈ A in state s ∈ S leads to state s′ ∈ S, i.e. T (s, a, s′) = P (s′ | s, a),
• r : S × A −→ R is a reward function with absolute value bounded by Rmax; r(s, a) denotes the immediate reward incurred when action a ∈ A is executed in state s ∈ S,
• γ ∈ [0, 1) is a discount factor.
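To make the tuple concrete, the following is a minimal sketch (not taken from the thesis) of how M = {S, A, T, r, γ} could be represented in Python; the container names and the toy two-state example are illustrative assumptions only.

from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]                          # S
    actions: List[Action]                        # A
    T: Dict[Tuple[State, Action, State], float]  # T(s, a, s') = P(s' | s, a)
    r: Dict[Tuple[State, Action], float]         # r(s, a), bounded by Rmax
    gamma: float                                 # discount factor, 0 <= gamma < 1

# Toy two-state, two-action example (illustrative values only);
# missing (s, a, s') entries are treated as probability 0.
toy = MDP(
    states=["s1", "s2"],
    actions=["a1", "a2"],
    T={("s1", "a1", "s1"): 0.8, ("s1", "a1", "s2"): 0.2,
       ("s1", "a2", "s1"): 0.1, ("s1", "a2", "s2"): 0.9,
       ("s2", "a1", "s1"): 1.0, ("s2", "a2", "s2"): 1.0},
    r={("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
       ("s2", "a1"): 0.0, ("s2", "a2"): 5.0},
    gamma=0.95,
)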
Given an MDP M, the agent-environment interaction in Figure 2.1 happens as follows: let t ∈ N denote the current time, let St ∈ S and At ∈ A denote the random state of the environment and the action chosen by the agent at time t, respectively. Once the action is selected, it is sent to the system, which makes a transition:
(St+1, Rt+1) ∼ P (• | St, At). (2.1)
In particular, St+1 is random and P (St+1 = s′ | St = s, At = a) = T (s, a, s′) holds for any s, s′ ∈ S, a ∈ A. Furthermore, E [Rt+1 | St, At] = r(St, At). The agent then observes the next state St+1 and the reward Rt+1, chooses a new action At+1 ∈ A, and the process is repeated.
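As an illustration of Eq. (2.1), the sketch below samples one environment transition, reusing the hypothetical MDP container defined after the tuple definition: the next state is drawn from T(s, a, ·), and the returned reward is r(s, a), consistent with E [Rt+1 | St, At] = r(St, At).

import random

def step(mdp: MDP, s: State, a: Action) -> Tuple[State, float]:
    # Sample S_{t+1} ~ T(s, a, .) and return the reward r(s, a), mirroring Eq. (2.1).
    probs = [mdp.T.get((s, a, s_next), 0.0) for s_next in mdp.states]
    s_next = random.choices(mdp.states, weights=probs, k=1)[0]
    return s_next, mdp.r[(s, a)]

# Example: one interaction step from state "s1" under action "a2" of the toy MDP.
print(step(toy, "s1", "a2"))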
The Markovian assumption (Sutton & Barto, 1998) states that the next state depends only on the current state and action, not on the earlier history of state-action pairs:
P (St+1 | St, At, • • • , S0, A0) = P (St+1 | St, At). (2.2)
State transitions can be deterministic or stochastic. In the deterministic case, taking a given action in a given state always results in the same next state; while in the stochastic case, the next state is a random variable.
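For illustration (the values are hypothetical), a deterministic model is simply the special case in which T(s, a, ·) puts all probability mass on a single successor state:

# Deterministic: taking a1 in s1 always leads to s2.
deterministic_T = {("s1", "a1", "s2"): 1.0,
                   ("s2", "a1", "s1"): 1.0}

# Stochastic: the next state is a random variable over several successors.
stochastic_T = {("s1", "a1", "s1"): 0.3,
                ("s1", "a1", "s2"): 0.7}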
The goal of the learning agent is to choose its actions so as to maximize the expected total discounted reward:
R = ∑_{t=0}^{∞} γ^t R_{t+1}. (2.3)
If γ < 1, rewards received far in the future are worth exponentially less than those received in the first stages.
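A short sketch of Eq. (2.3) for a finite trajectory of rewards (the function name is ours, not the thesis's):

def discounted_return(rewards, gamma):
    # R = sum over t of gamma^t * R_{t+1}, for a finite list [R_1, R_2, ...].
    return sum((gamma ** t) * r_t for t, r_t in enumerate(rewards))

# With gamma = 0.9, three unit rewards give 1 + 0.9 + 0.81 = 2.71, while a unit
# reward arriving 20 steps later would contribute only 0.9**20 ≈ 0.12.
print(discounted_return([1.0, 1.0, 1.0], 0.9))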

 Policies and Value Functions

The agent selects its actions according to a special function called a policy. A policy is defined as a mapping π : S × A −→ [0, 1] that assigns to each s ∈ S a distribution π(· | s) over A, satisfying ∑_{a ∈ A} π(a | s) = 1, ∀s ∈ S.
A deterministic stationary policy is one for which, for every s ∈ S, π(· | s) is concentrated on a single action, i.e. at any time t ∈ N, At = π(St). A stochastic stationary policy maps each state into a probability distribution over the possible actions, i.e. At ∼ π(· | St). The class of all stochastic stationary policies is denoted by Π.
A policy is applied in the following way. First, a start state S0 is generated. Then, the policy π suggests the action A0 = π(S0) and this action is performed. Based on the transition function T and the reward function r, a transition is made to state S1 with probability T (S0, A0, S1), and a reward R1 = r(S0, A0) is received. This process continues, producing a sequence S0, A0, R1, S1, A1, R2, S2, A2, …, as shown in Figure 2.2.
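The sketch below generates such a sequence S0, A0, R1, S1, A1, R2, … by repeatedly sampling At ∼ π(· | St) and St+1 ∼ T(St, At, ·). It reuses the hypothetical MDP container and step() function sketched earlier, and assumes that 'policy' maps a state to a dictionary of action probabilities.

def rollout(mdp: MDP, policy, s0: State, horizon: int = 20):
    # Returns a list of (S_t, A_t, R_{t+1}) triples of length 'horizon'.
    trajectory, s = [], s0
    for _ in range(horizon):
        actions, probs = zip(*policy(s).items())   # A_t ~ pi(. | S_t)
        a = random.choices(actions, weights=probs, k=1)[0]
        s_next, r = step(mdp, s, a)                # S_{t+1} ~ T(S_t, A_t, .), R_{t+1}
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

# A uniformly random stochastic stationary policy over the toy MDP.
uniform = lambda s: {a: 1.0 / len(toy.actions) for a in toy.actions}
print(rollout(toy, uniform, "s1", horizon=5))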
Value functions are functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of the future rewards that can be expected or, to be precise, in terms of expected return. Of course, the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies (Sutton & Barto, 1998).
Given a policy π, the value function is defined as a function V^π : S −→ R that associates to each state the expected sum of discounted rewards that the agent will receive if it starts executing policy π from that state:
V^π(s) = E_π [ ∑_{t=0}^{∞} γ^t r(St, At) | S0 = s ], ∀s ∈ S.
Here, St is the random variable representing the state at time t, and At is the random variable corresponding to the action taken at that time instant, which satisfies P (At = a | St = s) = π(a | s); (St, At)_{t≥0} is the sequence of random state-action pairs generated by executing the policy π.
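Since V^π(s) is an expectation over trajectories, it can be approximated by averaging truncated discounted returns over sampled rollouts. The sketch below is a plain Monte Carlo estimate, not an algorithm claimed by the thesis, and reuses the hypothetical rollout() and toy MDP sketched above.

def mc_value_estimate(mdp: MDP, policy, s: State,
                      n_episodes: int = 1000, horizon: int = 100) -> float:
    # Approximates V^pi(s) = E_pi[ sum_t gamma^t r(S_t, A_t) | S_0 = s ]
    # by averaging the discounted returns of 'n_episodes' truncated rollouts.
    total = 0.0
    for _ in range(n_episodes):
        traj = rollout(mdp, policy, s, horizon)
        total += sum((mdp.gamma ** t) * r for t, (_, _, r) in enumerate(traj))
    return total / n_episodes

print(mc_value_estimate(toy, uniform, "s1"))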

Partially Observable Markov Decision Processes

In the setting of an MDP, the agent is assumed to have access to precise state information about the environment at every moment. Unfortunately, this assumption may not hold in practice, since the agent observes the world through limited and imperfect sensors, and the information contained in its perceptions is not sufficient to determine the state. Partially Observable Markov Decision Processes (POMDPs) (Smallwood & Sondik, 1973) were introduced to handle such cases.
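In a POMDP the agent therefore maintains a belief, i.e. a probability distribution over states, and updates it after each action and observation. The sketch below is the standard Bayesian belief update (not a construction specific to this thesis); it reuses the hypothetical MDP container from above, and O is a hypothetical observation model with O[(s′, a, o)] = P(o | s′, a).

def belief_update(mdp: MDP, O, belief, a: Action, o) -> dict:
    # b'(s') is proportional to P(o | s', a) * sum_s T(s, a, s') * b(s), then normalized.
    new_belief = {}
    for s_next in mdp.states:
        predicted = sum(mdp.T.get((s, a, s_next), 0.0) * belief[s] for s in mdp.states)
        new_belief[s_next] = O.get((s_next, a, o), 0.0) * predicted
    norm = sum(new_belief.values())
    return {s: (p / norm if norm > 0 else 0.0) for s, p in new_belief.items()}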

Table of contents:

1 Introduction 
1.1 Background and Motivation
1.1.1 Mobile Robotics
1.1.2 Autonomous Navigation
1.1.3 Robot Learning
1.1.4 Motivation
1.2 Contributions of the Dissertation
1.3 Organization of the Dissertation
2 Markov Decision Processes and Reinforcement Learning in Robotics 
2.1 Introduction
2.2 Markov Decision Processes
2.2.1 Policies and Value Functions
2.2.2 Partially Observable Markov Decision Processes
2.3 Dynamic Programming: Model-Based Algorithms
2.3.1 Policy Iteration
2.3.2 Value Iteration
2.4 Reinforcement Learning: Model-Free Algorithms
2.4.1 Goals of Reinforcement Learning
2.4.2 Monte Carlo Methods
2.4.3 Temporal Difference Methods
2.5 Mobile Robot Model
2.5.1 Uncertainty in Mobile Robots
2.6 Conclusion
3 Policy Learning from Multiple Demonstrations
3.1 Introduction
3.2 Related Work
3.3 Neural Network Model
3.3.1 Backpropagation Algorithm
3.4 Policy Learning from Demonstrations
3.4.1 Dataset Extraction from Demonstrations
3.4.2 Architecture of Neural Network
3.4.3 Policy Learning Process
3.4.3.1 Neural Network Training
3.4.4 Algorithm
3.5 Demonstration by Modified A* Algorithm
3.5.1 Node Representation
3.5.2 Algorithm
3.5.3 Result
3.6 Experimental Results
3.6.1 Policy Learning Process
3.6.2 Robot navigation in unknown environments
3.6.3 Paths Comparison
3.6.4 Autonomous Navigation in Dynamic Environments
3.6.5 Discussions
3.7 Conclusion
4 Reinforcement Learning under Stochastic Policies
4.1 Introduction
4.2 Related Work
4.3 Model-Free Reinforcement Learning Methods
4.3.1 Q-Learning
4.3.2 SARSA
4.3.3 Function Approximation using Feature-Based Representations
4.4 Neural Network based Q-Learning
4.4.1 State and Action Spaces
4.4.2 Reward Function
4.4.3 The Stochastic Control Policy
4.4.4 State-Action Value Iteration
4.4.5 Algorithm
4.4.5.1 Training Process of NNQL
4.4.5.2 Robot Navigation Using NNQL
4.5 Experimental Results
4.5.1 Self-learning Results
4.5.2 Autonomous Navigation Results
4.5.3 Comparison and Analysis
4.5.4 Autonomous Navigation in Dynamic Environments
4.5.5 Discussions
4.6 Conclusion
5 Learning Reward Functions with Nonlinear Neural Policy Representations
5.1 Introduction
5.2 Related Work
5.3 Inverse Reinforcement Learning
5.3.1 Preliminaries
5.3.2 Inverse Reinforcement Learning
5.4 Nonlinear Neural Policy Representations
5.4.1 State and Action Spaces
5.4.2 Neural Policy Representation
5.4.3 Stochasticity of Policy
5.5 Neural Inverse Reinforcement Learning
5.5.1 Suboptimal Demonstration Refinement via Maximum a Posteriori Estimation
5.5.2 Model-free Maximum Margin Planning
5.5.3 Neural Policy Iteration
5.5.4 The Algorithm for Neural Inverse Reinforcement Learning
5.5.5 Expert Demonstrations
5.5.5.1 Computer-based Expert Demonstrations
5.5.5.2 Human Demonstrations
5.6 Experimental Results
5.6.1 Comparison of Two Types of Demonstrations
5.6.2 Neural Inverse Reinforcement Learning
5.6.3 Robot navigation in new unknown environments
5.6.4 Analysis of the weight changes
5.6.5 Autonomous Navigation in Dynamic Environments
5.6.6 Discussions
5.7 Conclusion
Conclusions and Perspectives 
Résumé Étendu en Français
References
