
## Deterministic Policy Gradient Method

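Silver et al. [2014] introduced the Deterministic Policy Gradient (DPG) theorem, which extended the actor-critic stochastic policy gradient methods to deterministic policies:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \int p^{\pi}(\tau)\, \nabla_\theta Q^{\pi}\big(s, a = \pi(s|\theta)\big)\, ds && (2.14) \\
&= \mathbb{E}\left[ \sum_{t=0}^{H-1} \nabla_\theta \pi(a_t|s_t, \theta)\, \nabla_{a_t} Q^{\pi}(s_t, a_t)\big|_{a_t = \pi(s_t|\theta)} \right] && (2.15)
\end{aligned}
$$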
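As a rough illustration of the chain rule in equation 2.15, the sketch below runs gradient ascent with a sample estimate of the deterministic policy gradient. Everything specific here is an assumption made for the example and not part of the original text: a one-step problem, a linear policy `pi(s) = theta * s`, and a known critic `Q(s, a) = -(a - 2s)^2`, whose optimal policy is `theta = 2`.

```python
import numpy as np

def dpg_gradient(theta, states):
    """Sample estimate of grad_theta pi(s) * grad_a Q(s, a) at a = pi(s)."""
    actions = theta * states                   # a = pi(s | theta), hypothetical linear policy
    dpi_dtheta = states                        # d pi / d theta
    dq_da = -2.0 * (actions - 2.0 * states)    # d Q / d a for the assumed critic Q = -(a - 2 s)^2
    return np.mean(dpi_dtheta * dq_da)

rng = np.random.default_rng(0)
states = rng.normal(size=256)                  # states sampled from an assumed state distribution

theta = 0.0
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states) # gradient ascent on J(theta)

print(theta)                                   # approaches the optimum theta = 2
```

In practice the critic is not known and has to be learned, which is where the stability issues discussed next arise.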

Although DPG algorithms provide a better theoretical basis for continuous control using the policy gradient method, naïve application of this actor-critic method with neural function approximators is unstable for challenging problems. In fact, Silver et al. [2014] showed that the function approximator for the action-value (the critic) in DPG should be linear to avoid instability in the optimization process. However, following the breakthrough in deep Q-learning [Mnih et al., 2013, 2015b], where a deep neural network was used to approximate the action-value function, Lillicrap et al. [2016] proposed an algorithm called Deep Deterministic Policy Gradient (DDPG) that uses a deep neural network for the critic in the DPG approach. Similar to deep Q-learning, DDPG was able to learn the critic in a stable and robust manner primarily due to two ideas: (1) the network was trained off-policy with samples from a replay buffer to minimize correlations between samples; (2) the network was trained with a target Q network to give consistent targets during temporal difference backups.
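The two stabilizing ideas can be sketched in a few lines. This is a minimal illustration, not the DDPG implementation: the class and function names are hypothetical, the actor and critic are passed in as plain callables, and the slow "soft" target update with rate `tau` is an assumption of this sketch about how the target network is kept consistent.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size store of transitions; uniform sampling breaks temporal correlation."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)     # oldest transitions are dropped first

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng):
        batch = rng.sample(list(self.data), batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def td_targets(r, s_next, done, target_actor, target_critic, gamma=0.99):
    """TD targets computed with the (slowly changing) target networks."""
    not_done = 1.0 - np.asarray(done, dtype=float)
    a_next = target_actor(s_next)
    return r + gamma * not_done * target_critic(s_next, a_next)

def soft_update(target_params, params, tau=0.005):
    """Move each target parameter a small step towards the learned parameter."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]
```

The critic would then be regressed onto `td_targets(...)` for each sampled batch, with `soft_update` applied after every gradient step.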

**Natural Policy Gradient Method**

The policy gradient methods discussed in the previous sections assume that all the dimensions of the policy parameter $\theta$ have similar effects on the resulting distribution $p_\theta(\tau)$, and thus take a small step $\Delta\theta$ along the policy gradient $\nabla_\theta J(\theta)$:

$$
\Delta\theta = \alpha \nabla_\theta J(\theta), \quad \text{where } \alpha \text{ is the learning rate} \qquad (2.16)
$$

However, a small change in $\theta$ might cause a large change in the resulting distribution $p_{\theta+\Delta\theta}(\tau)$ compared to the old distribution $p_\theta(\tau)$. This kind of update can be catastrophic and lead to unstable behavior in the learning process, since the policy gradient $\nabla_\theta J(\theta)$ is computed from samples drawn according to the distribution $p_\theta(\tau)$. To have a stable learning process, it is desirable that the new distribution $p_{\theta+\Delta\theta}(\tau)$ stays close to the old distribution $p_\theta(\tau)$, i.e.,

$$
D_{KL}\big[p_\theta(\tau)\,\|\,p_{\theta+\Delta\theta}(\tau)\big] \approx \tfrac{1}{2}\,\Delta\theta^T F(\theta)\,\Delta\theta \le \epsilon, \quad \text{for some small } \epsilon \qquad (2.17)
$$

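where $D_{KL}[\cdot\|\cdot]$ is the Kullback–Leibler divergence, and $F(\theta)$ captures how each parameter influences the distribution $p_\theta(\tau)$. $F(\theta)$ is called the Fisher information matrix (FIM) and is given by:

$$
\begin{aligned}
F(\theta) &= \mathbb{E}_{p_\theta(\tau)}\left[ \nabla_\theta \log p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)^T \right] \\
&= \mathbb{E}_{p_\theta(\tau)}\left[ \sum_{t=0}^{H-1} \nabla_\theta \log \pi(a_t|s_t, \theta)\, \nabla_\theta \log \pi(a_t|s_t, \theta)^T \right] && (2.18)
\end{aligned}
$$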
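To make the point about parameter distance versus distribution distance concrete, the following sketch uses a deliberately simple stand-in for the trajectory distribution: a one-dimensional Gaussian whose mean is the only parameter (an assumption made purely for illustration). The same parameter step of 0.1 gives a KL divergence of 0.005 when the standard deviation is 1 and of 50 when it is 0.01. The Fisher information is estimated from samples as the expected squared score, and the quadratic form $\tfrac{1}{2}\Delta\theta^T F \Delta\theta$ (the second-order Taylor expansion of the KL) is printed next to the exact KL.

```python
import numpy as np

def gaussian_kl(mu_p, mu_q, sigma):
    """Exact KL[N(mu_p, sigma^2) || N(mu_q, sigma^2)] for a shared standard deviation."""
    return (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)

def fisher_estimate(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of E[(d/dmu log p(x))^2] for x ~ N(mu, sigma^2)."""
    x = rng.normal(mu, sigma, size=n_samples)
    score = (x - mu) / sigma ** 2              # derivative of the log-density w.r.t. the mean
    return np.mean(score ** 2)

rng = np.random.default_rng(0)
delta = 0.1                                    # identical parameter step in both cases
for sigma in (1.0, 0.01):
    fisher = fisher_estimate(0.0, sigma, 100_000, rng)
    exact_kl = gaussian_kl(0.0, delta, sigma)
    quadratic = 0.5 * delta * fisher * delta   # 1/2 * dtheta^T F dtheta
    print(f"sigma={sigma}: exact KL={exact_kl:.4g}, quadratic approx={quadratic:.4g}")
```

For this family the quadratic form is exact up to the sampling error in the Fisher estimate; for general policies it is only a local approximation.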

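The Natural Policy Gradient (NPG) [Kakade, 2002; Peters and Schaal, 2008a; Bhatnagar et al., 2008] formulates this idea as an optimization problem:

$$
\begin{aligned}
&\arg\max_{\theta}\; \mathbb{E}_{p_{\theta_{old}}(\tau)}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, Q^{old}(s, a) \right] && (2.19) \\
&\text{subject to } D_{KL}\big[p_{\theta_{old}}(\tau)\,\|\,p_\theta(\tau)\big] \le \epsilon && (2.20)
\end{aligned}
$$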
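A common way to solve this problem approximately is to linearize the objective around the old parameters and replace the KL constraint by its quadratic approximation $\tfrac{1}{2}\Delta\theta^T F \Delta\theta \le \epsilon$, which gives a step along $F^{-1}\nabla_\theta J$. The sketch below assumes the gradient and the FIM are already available as arrays; the function name and the numbers are hypothetical.

```python
import numpy as np

def natural_gradient_step(grad, fisher, epsilon):
    """Largest step along F^{-1} g whose quadratic KL approximation equals epsilon."""
    nat_grad = np.linalg.solve(fisher, grad)                  # F^{-1} g without forming the inverse
    step_size = np.sqrt(2.0 * epsilon / (grad @ nat_grad))    # from 1/2 * d^T F d = epsilon
    return step_size * nat_grad

grad = np.array([1.0, 1.0])              # same raw gradient in both dimensions
fisher = np.diag([100.0, 1.0])           # the first parameter changes the distribution far more
delta = natural_gradient_step(grad, fisher, epsilon=0.01)
print(delta)                             # the step is much smaller along the sensitive parameter
print(0.5 * delta @ fisher @ delta)      # equals epsilon up to floating-point error
```

Compared with the plain gradient step of equation 2.16, the update is rescaled per direction according to how strongly that direction changes the distribution.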

**Weighted Maximum Likelihood Methods**

The basic idea behind this class of methods is to change the current policy $\pi(a|s, \theta_{old})$ such that, in the new policy $\pi(a|s, \theta_{new})$, the probability of highly rewarding actions is higher [Dayan and Hinton, 1997]:

$$
\pi(a|s, \theta_{new}) \propto f(r(s, a))\, \pi(a|s, \theta_{old}) \qquad (2.25)
$$

where $f(r(s, a))$ is the success probability. Several algorithms have been proposed based on this idea using an expectation maximization (EM) formulation. Peters and Schaal [2007] proposed Reward Weighted Regression (RWR), where $f(r(s, a))$ is taken as a parametric function of the reward whose parameters are updated along with the policy parameters. Kober and Peters [2009] then improved RWR by proposing a new algorithm called PoWER, which has better exploration compared to RWR. Similarly, MCEM [Vlassis et al., 2009] generalized the PoWER algorithm and used a Monte-Carlo approach for expectation maximization. Although these approaches alleviate the need to specify the learning rate, the EM-based approach still does not guarantee a stable update of the policy, i.e., staying close to the data. The information-theoretic approach Relative Entropy Policy Search (REPS) [Peters et al., 2010] combined the advantages of both the natural policy gradient method and the weighted maximum likelihood method to allow stable updates of the policy parameters without the need to specify the learning rate.
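The reweighting in equation 2.25 can be sketched as a weighted maximum-likelihood fit. The example below is an illustration under stated assumptions and is not RWR, PoWER, or REPS: the policy is a state-independent one-dimensional Gaussian, the weighting function is taken to be `f(r) = exp(beta * r)` with a fixed `beta`, and the reward is a hypothetical quadratic with its maximum at `a = 1`.

```python
import numpy as np

def weighted_ml_update(actions, rewards, beta=1.0):
    """Fit a Gaussian to the sampled actions, each weighted by f(r) = exp(beta * r)."""
    weights = np.exp(beta * (rewards - rewards.max()))   # shifting by the max only rescales the weights
    weights /= weights.sum()
    mean = np.sum(weights * actions)
    std = np.sqrt(np.sum(weights * (actions - mean) ** 2))
    return mean, std

rng = np.random.default_rng(0)
mean, std = 0.0, 1.0                          # parameters of the old policy
for _ in range(20):
    actions = rng.normal(mean, std, size=500) # sample from the current policy
    rewards = -(actions - 1.0) ** 2           # hypothetical reward, best action is a = 1
    mean, std = weighted_ml_update(actions, rewards)

print(mean, std)                              # the mean moves towards 1 while the spread shrinks
```

No learning rate appears in the update, but nothing bounds how far a single update can move the policy either; with a sharper weighting or fewer samples the fit can collapse onto a handful of samples.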

### Trust Region Optimization Method

In the previous section we have seen that for a stable learning process, it is desirable to keep the updated policy close to the old policy so that the data distribution of the new policy is not very different from that of the old policy. This objective was achieved using a KL constraint on the policy in the NPG method (section 2.3.3). However, the KL constraint was approximated in this method by using a second-order Taylor series expansion of the KL. Due to this approximation, it is still possible that the new policy violates the KL constraint. To guarantee that the KL constraint is always satisfied, Schulman et al. [2015] proposed an algorithm called Trust Region Policy Optimization (TRPO), which is very similar to NPG. The main difference is that TRPO performs a simple line search on the step-size parameter to guarantee that the KL constraint is always satisfied after the policy update. Thus, TRPO can be thought of as “NPG + Line-search on the step size”. To be more precise, if $\Delta\theta$ is the proposed policy update given by NPG, then the TRPO update is given by:

$$
\begin{aligned}
\Delta\theta &= \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^T F(\theta)^{-1} \nabla_\theta J(\theta)}}\; F(\theta)^{-1} \nabla_\theta J(\theta) \\
\theta_{new} &= \theta_{old} + \alpha^{n} \Delta\theta\,; \quad n \in \{0, 1, 2, 3, \ldots, L\} && (2.26)
\end{aligned}
$$

where $n$ is optimized using a simple line search, i.e., by iteratively trying values from $\{0, 1, 2, 3, \ldots, L\}$ so that the KL constraint (equation 2.20) is satisfied.

With this extension, TRPO seemed to perform much better (faster convergence, higher asymptotic performance) than its predecessors. In the domain of robotic locomotion, TRPO successfully learned controllers for swimming, walking and hopping in a physics simulator, using general-purpose neural networks and minimally informative rewards. Before TRPO, no prior work had learned controllers from scratch for all of these tasks using a general gradient-based policy search method and non-engineered, general-purpose policy representations [Schulman et al., 2015].
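The line search described above can be sketched on a toy problem where the exact KL is available in closed form. The setup is an assumption made for illustration: a one-dimensional Gaussian policy with parameters `(mean, log_std)`, an expected reward `J = -((mean - 1)^2 + std^2)`, the analytic FIM `diag(1/std^2, 2)` for this parameterization, a backtracking coefficient `alpha = 0.5`, and acceptance of the first `n` that satisfies the constraint. The check on the improvement of the objective that practical implementations also perform is left out.

```python
import numpy as np

def kl_gauss(theta_old, theta_new):
    """Exact KL[N(mu0, s0^2) || N(mu1, s1^2)] with theta = (mean, log_std)."""
    mu0, s0 = theta_old[0], np.exp(theta_old[1])
    mu1, s1 = theta_new[0], np.exp(theta_new[1])
    return np.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2.0 * s1 ** 2) - 0.5

def trpo_update(theta_old, grad, fisher, epsilon, alpha=0.5, max_backtracks=10):
    """NPG step followed by backtracking until the exact KL constraint holds."""
    nat_grad = np.linalg.solve(fisher, grad)
    delta = np.sqrt(2.0 * epsilon / (grad @ nat_grad)) * nat_grad   # proposed NPG step
    for n in range(max_backtracks + 1):
        theta_new = theta_old + alpha ** n * delta
        if kl_gauss(theta_old, theta_new) <= epsilon:
            return theta_new, n
    return theta_old, None                    # no admissible step found: keep the old policy

theta_old = np.array([0.0, 0.0])              # mean 0, std 1
std = np.exp(theta_old[1])
grad = np.array([-2.0 * (theta_old[0] - 1.0), -2.0 * std ** 2])    # gradient of the assumed J
fisher = np.diag([1.0 / std ** 2, 2.0])
epsilon = 0.05

theta_new, n = trpo_update(theta_old, grad, fisher, epsilon)
print(n, kl_gauss(theta_old, theta_new))
```

On this toy problem the full NPG step satisfies the quadratic approximation of the constraint exactly but violates the true KL bound, so the line search has to shrink it once before accepting.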

#### Information-theoretic

Information-theoretic approaches for policy optimization utilize the concept of entropy for stable policy updates using the last observed trajectories. More specifically, these approaches try to set an upper bound on the relative entropy between the old policy and the new policy so that policy update does not change the old policy aggressively. In this context, when we say policy, we refer to the data or trajectory distribution generated by the policy. Information-theoretic approaches try to keep the new trajectories closer to the last trajectory distribution since the gradient for the policy parameters is estimated based on the last trajectories and it is expected to be valid only around that trajectory distribution. It is to be noted here that constraining the policy update in the data distribution is completely different from constraining the policy update directly on its parameters. The reason is that a small change in the parameters of the policy might produce a significant change in the data or trajectory distribution under the new policy.

ME-TRPO: Kurutach et al. [2018] proposed Model Ensemble Trust Region Policy Optimization (ME-TRPO), which extended the model-free policy search algorithm TRPO [Schulman et al., 2015] to the model-based policy search problem. ME-TRPO uses an ensemble of neural networks to model the dynamics of the system. The ensemble is trained using a mean-squared-error loss on the data collected from the trials on the system. In the policy improvement step, the policy is updated using TRPO, based on the imaginary or simulated experience generated on the learned dynamics model. During policy evaluation, at every step, ME-TRPO randomly chooses a model from the ensemble to predict the next state given the current state and action. This prevents the policy from overfitting to any single model during an episode, leading to more stable learning. ME-TRPO matched the asymptotic performance of state-of-the-art model-free algorithms such as PPO, DDPG, and TRPO on many simulated high-dimensional locomotion tasks while using approximately 100 times less data than the model-free baselines.
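The per-step model selection can be sketched as follows. This is only an illustration of the sampling scheme: the ensemble members are hypothetical scalar linear models standing in for trained neural networks, and the policy is a fixed linear feedback law.

```python
import numpy as np

def rollout_with_ensemble(models, policy, s0, horizon, rng):
    """Simulate one imaginary trajectory, drawing a random ensemble member at every step."""
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        model = models[rng.integers(len(models))]   # a different model may be used at each step
        s = model(s, a)                             # predicted next state
        states.append(s)
        actions.append(a)
    return np.array(states), np.array(actions)

# Hypothetical ensemble: three slightly different linear dynamics models s' = k * s + a.
models = [lambda s, a, k=k: k * s + a for k in (0.9, 1.0, 1.1)]
policy = lambda s: -0.5 * s

rng = np.random.default_rng(0)
states, actions = rollout_with_ensemble(models, policy, s0=1.0, horizon=10, rng=rng)
print(states)
```

Because no single model generates a whole trajectory, a policy that exploits the errors of one particular model is penalized by the predictions of the others.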

GPREPS: Kupcsik et al. [2017] proposed another information-theoretic model-based policy search algorithm called Gaussian Process Relative Entropy Policy Search (GPREPS), which essentially utilizes the Relative Entropy Policy Search (REPS) [Peters et al., 2010] algorithm for contextual policy optimization. GPREPS learns the dynamics model from the observed data using Gaussian Process regression [Rasmussen and Williams, 2006], and it does not require any assumption on the parametrization of the policy or the reward function. GPREPS generates artificial trajectories on the GP model to compute the expected return and updates the policy using REPS so that the new policy stays close to the old policy according to the model. The authors were able to train a robotic arm to return a table tennis ball in simulation after 150 evaluations on the real system using the GPREPS algorithm, whereas model-free REPS took 4000 evaluations to attain the same performance.
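For readers unfamiliar with Gaussian Process regression, the following is a minimal sketch of the posterior mean and variance with a squared-exponential kernel. It is not the GPREPS model: the input is one-dimensional, the kernel hyperparameters are fixed by hand instead of being optimized, and the target function `sin(x)` is a hypothetical stand-in for a dynamics function.

```python
import numpy as np

def rbf(x1, x2, length=1.0, var=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (diff / length) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_test, x_train)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test) - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

x_train = np.linspace(-2.0, 2.0, 9)
y_train = np.sin(x_train)                      # hypothetical observations
x_test = np.array([0.25, 6.0])                 # one point inside the data, one far outside

mean, var = gp_posterior(x_train, y_train, x_test)
print(mean, var)
```

The predictive variance is small near the observed data and returns to the prior variance far away from it, which is the uncertainty information a model-based method can use when generating artificial trajectories.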

**Data-efficient Robot Learning with Priors**

So far, we have discussed various policy search approaches, especially in the context of robotics, to solve the reinforcement learning problem. We have seen that model-based reinforcement learning algorithms, more specifically, model-based policy search algorithms are more data-efficient and hence optimize the policy with significantly less interaction time with the real robot. However, for complex robots, such as a hexapod robot or a humanoid robot, learning an effective forward dynamical model from the data will still require a lot of trials on the real robot. In fact, the dynamical model is even more difficult to learn if the robot has discontinuous dynamics due to contacts. Similar to the dynamics model, it is often hard to learn a policy for complex tasks in just a few trials. Learning becomes even more difficult when the policy is represented by function approximators such as a neural network with a lot of parameters.

Slow learning of the dynamics model and the policy restricts the application of reinforcement learning in robotics, where the robot has to adapt to a new situation (e.g., new terrain conditions, broken joints) online and accomplish the mission. Interestingly, unlike robots, animals can learn new skills and adapt to a new situation (such as uneven terrain or broken limbs) within minutes, if not seconds, by effectively utilizing past experiences or priors. Likewise, reinforcement learning in robotics can be made more data-efficient using priors (Figure 2.3). In Bayesian statistics, the word “prior” refers to the prior probability distribution that is multiplied by the likelihood of the data and then normalized to compute the posterior. In this way, priors represent knowledge, before taking into account the data, as probability distributions. However, here we use the word prior more broadly. In our context, the term “prior” represents the prior knowledge about the reinforcement learning problem that is available before taking into account the data from a specific problem instance. We do not restrict ourselves to representing these priors in the form of probability distributions. Priors can be utilized in many different ways in reinforcement learning, depending upon where the prior is inserted in the learning framework. We classify these methods mainly into 4 types (Fig. 2.3):

1. Priors on the dynamical model.

2. Priors on the policy.

3. Priors on the objective function model.

4. Hybrid approaches.

**Table of contents:**

**1 Introduction**

**2 Background**

2.1 Reinforcement Learning Problem

2.2 Value-Function Approach

2.2.1 Monte-Carlo Methods

2.2.2 Temporal-difference (TD) methods

2.3 Policy Search Approach

2.3.1 Stochastic Policy Gradient Method

2.3.2 Deterministic Policy Gradient Method

2.3.3 Natural Policy Gradient Method

2.3.4 Weighted Maximum Likelihood Methods

2.3.5 Trust Region Optimization Method

2.3.6 Direct policy search

2.4 Model-Based Policy Search

2.4.1 Learning the Model

2.4.2 Policy optimization

Back-propagation through time

Direct policy search

Information-theoretic

Sampling-based

2.5 Data-efficient Robot Learning with Priors

2.5.1 Priors on the dynamical model

Generic Robotic Priors

Gaussian Process Model with Non-Constant Prior

Meta-Learning of Dynamical Model

2.5.2 Priors on the policy

2.5.3 Priors on the objective function model

2.5.4 Repertoire-based prior

2.6 Conclusion

**3 Data-efficient Robot Policy Search in Sparse Reward Scenarios**

3.1 Introduction

3.2 Problem Formulation

3.3 Approach

3.3.1 Learning system dynamics and reward model

Learning system dynamics with sparse transitions

3.3.2 Exploration-Exploitation Objectives

3.3.3 Multi-Objective Optimization

3.4 Experiments

3.4.1 Sequential goal reaching with a robotic arm

3.4.2 Drawer opening task with a robotic arm

3.4.3 Deceptive pendulum swing-up task

3.4.4 Additional Experiments

Drawer opening task with 4-DOF arm

Multi-DEX on non-sparse reward task

3.4.5 Pareto optimality vs. weighted sum objectives

3.5 Discussion and Conclusion

3.6 Details on the Experimental Setup

3.6.1 Simulator and source code

3.6.2 General and Exploration Parameters

3.6.3 Policy and Parameter Bounds

3.6.4 NSGA-II Parameters

3.6.5 Gaussian Process Model learning

**4 Adaptive Prior Selection for Data-efficient Online Learning**

4.1 Introduction

4.2 Problem Formulation

4.3 Approach

4.3.1 Overview

4.3.2 Generating Repertoire-Based Priors

4.3.3 Learning the Transformation Models with Repertoires as Priors

4.3.4 Model-based Planning in the Presence of Multiple Priors

4.4 Experimental Results

4.4.1 Object Pushing with a Robotic Arm

4.4.2 Goal Reaching Task with a Damaged Hexapod

4.5 Discussion and Conclusion

**5 Fast Online Adaptation through Meta-Learning Embeddings of Simulated Priors**

5.1 Introduction

5.2 Approach

5.2.1 Meta-learning the situation-embeddings and the dynamical model

5.2.2 Online adaptation to unseen situation

5.3 Experimental Results

5.3.1 Goal reaching with a 5-DoF planar robotic arm

5.3.2 Ant locomotion task

5.3.3 Quadruped damage recovery

5.3.4 Minitaur learning to walk

5.4 Discussion and Conclusion

**6 Discussion**

6.1 Learning the Dynamical Model from the Observations

6.2 Model-Learning in Open-Ended Scenarios

6.3 Repertoire-based Learning

6.4 Using Priors from the Simulator

6.5 What is Next?

**7 Conclusion**

**Bibliography**