Pros and Cons of End-to-End Deep Reinforcement Learning 

Get Complete Project Material File(s) Now! »


Understanding the current state of controlled systems is part of the requirement to learn tasks, such as those of RL. In a context where the state is replaced by a sensory measurement, RL can often be made more efficient by solving first a state estimation problem [Tesauro, 1992, Bertsekas and Tsitsiklis, 1996]. One of the first approaches was proposed by Kalman [1960b] that assumes knowledge of the transition model which must be linear with respect to the state. However, it has been adapted to the nonlinear context with the extended Kalman filter (EKF) [Ljung, 1979]. A more recent type of technique dedicated to navigation tasks is SLAM (Simultaneous Localization and Mapping [Bailey and Durrant-Whyte, 2006a,b]) which provides representations that contain the location and orientation of an agent.
These traditional state estimation techniques have significantly increased the autonomy and intelligence of robots. However, in their usual formulations, they assume the a priori knowledge of the transition model, and the ground truth state in order to define an observation model. Moreover, their state representations focus on self-localization and ego-motion estimation, which may be insufficient for complex tasks that require understanding more abstract concepts of the environment. Moreover, these traditional methods require experts to tune their parameters, which is difficult due to the “curse of manual tuning” [Cadena et al., 2016]. As a collateral effect, they tend to be non-generalizable across tasks, as cross-validation is not an option in an unsupervised learning context.
Feature engineering methods have relaxed the assumption of the a priori knowledge of the transition model and the ground truth state by being solely task specific. Among the most popular techniques are SIFT [Lowe, 1999] and SURF [Bay et al., 2006]. However, it remains challenging to devise ad hoc methods that extract higher-level and more abstract features, even when the task is known. Moreover, in a context of ever-increasing automation, it is necessary to remove the assumption of the a priori knowledge of the task and the environment. This is where SRL comes in, automating the feature engineering process by leveraging experiences discovered by agents to pretrain state estimators [Lesort et al., 2018]. Such pretrained state embeddings can represent numerically in vectors all the concepts of the input, from the least to the most abstract. Thanks to the data- driven principle of machine learning, the designer remains outside the learning loop, and thus the required expert knowledge only concerns the hyperparameters of the SRL algorithms. This thesis seeks to extend the capabilities of SRL to help scale DRL algorithms to continuous control tasks with high-dimensional sensory observations. Al-though end-to-end DRL has recently performed well on such tasks, it relies on many computational resources [Barth-Maron et al., 2018]. This makes it impracti-cal to apply DRL to such tasks with computational restrictions, as in real-world robotics. SRL allows to improve the performance of DRL by providing it with better inputs than the input embeddings learned from scratch with end-to-end strategies. Specifically, this thesis addresses the problem of performing state estimation in the manner of deep unsupervised pretraining of state representa-tions without reward. These representations must verify certain properties to allow for the correct application of bootstrapping and other decision making mechanisms common to supervised learning, such as being low-dimensional and guaranteeing the local consistency and topology (or connectivity) of the environment [Penedones et al., 2018, Morik et al., 2019], which we will seek to achieve through the models pretrained with the two SRL algorithms proposed in this thesis.

Deep Learning Training Problems

Deep learning training problems correspond to optimization problems that are solved by fitting a neural network parameters q (a.k.a. weights) to minimize an objective function. An objective function is defined with a loss function ‘, which in the case of regression can for example be the mean square error (MSE), i.e. based on the L2 norm. It measures the difference between the output of the target function f and that of the neural network fq (a.k.a. prediction), as follows: ‘( fq(x), y) = kyˆ yk22 (1).
where (x, y) corresponds to the input-output pair obtained with evaluation of the unknown target function ( f (x) = y). In practice, the objective function corresponds to the empirical risk, which is computed on a finite training dataset as follows: L 1 N N n=1 (q) = å ‘ fq(xn), yn (2).
where f(xn, yn)g1 n N is a training dataset of cardinality N formed from different input-output pairs. Due to the empirical nature of this objective function, training a neural network is similar to an approximate optimization [Bottou and Bousquet, 2008].
Theoretically, an objective function corresponds to the expected risk on an un-limited training dataset according to a data distribution PX : X ! [0, 1] (where X RD is the data space) from which input-output pairs (x, f (x)) are drawn, as follows: Z LD(q) = PX(x)‘ fq(x), f (x) dx (3) x2X However, the knowledge of PX and the computation of the full integral is impos-sible in practice. This is why minimizing Eq. 3 is impractical in most problems. Following Bottou and Bousquet [2008], we assume that the empirical risk L(q) and the expected risk LD(q) have a global minimum at q and qD respectively. Moreover, Bottou and Bousquet [2008] let ˆ be the practical solution of the q empirical risk given by an optimization algorithm. The quality of this solution depends on the time budget Tmax needed to obtain a tolerance error #Tmax such that: E L(ˆ)L ( ) <# (4) q q Tmax0.

Stochastic Gradient Descent

The second aspect – properties of optimization algorithms – makes it possible to avoid the exponential increase in the number of calculations needed to approx-imate high-dimensional target functions. In particular, the stochastic gradient descent (an instance of stochastic optimization algorithms) can find solutions with low estimation error rates without exploring the whole parameter space of a neural network [Zhang et al., 2017a, Nguyen and Hein, 2018], thus reducing the training time needed for convergence.
Historically, Robbins and Monro [1951] proposed the stochastic optimization, which allows learning function approximators on large discrete training datasets, by mixing statistics and optimization ideas. Robbins-Monro’s theorem shows that the objective function no longer needs to be evaluated directly but only with an unbiased estimate of its gradient with respect to the approximator parameters. It can be seen as an incremental gradient descent algorithm. This eliminates the cardinality dependency that is problematic in large cardinality regimes, as only randomly picked mini-batches of data points are needed instead of the whole dataset in each optimization iteration. One of the first successes brought by this stochastic optimization to train neural networks was obtained with ADA-LINE (Adaptive Linear Neuron [Widrow and Hoff, 1960]) to solve least-squares problems. Years later, Lenet-5 introduced by LeCun et al. [1998] extended this stochastic optimization to a convolutional neural network to solve document recognition problems. Nowadays, the stochastic optimization line of work still benefits from many improvements thanks to the considerable interest of the machine learning com-munity. Several scientific avenues have been taken, including optimization algorithms that are faster than the stochastic gradient descent method, such as the momentum method [Polyak, 1964, Sutskever et al., 2013], or an extended version Adam [Kingma and Ba, 2014] which adapts the first momentum.


Convolutional Neural Networks

Recent successes in machine learning are largely triggered by the emergence of convolutional neural networks (CNNs). CNNs were introduced by LeCun et al. [1989b] as a special case of FNNs better adapted to tensor data, which continue to benefit from numerous extensions since then. They allow to better bound the approximator complexity with respect to the generalization performance with structured inputs such as images [Poggio et al., 2017].
From a high level viewpoint, CNNs like FNNs use compositions of simple nonlinear layers to approximate a target function f . A hidden convolutional layer denoted f (l) performs a convolution between filters and features of the previous layer, followed by an activation. Thus, it looks like a hidden fully connected layer, with the difference that it performs a special case of affine transform. Indeed, the convolution can be considered as an instance of matrix multiplication which requires vectorizing the input and reducing the filter to a second order tensor (i.e. a matrix), making it a circulant matrix (with many null parameters), as explained by Cichocki et al. [2017].
The output of a l-th hidden convolutional layer is a three-dimensional tensor x(l ) 2 Rd1(l) d2(l) d3(l) whose first two dimensions (d1(l), d2(l)) correspond respectively to the spatial height and width of the neurons (a.k.a. feature maps), while the third dimension d3(l) corresponds to the number of channels. It can be defined recursively by letting x(0) = x, where we again volontarily omit the bias terms of the affine transforms for clarity reasons.

Reinforcement Learning Formalism

The general optimal control problem addressed by RL has only incomplete knowledge of the environment, which may also be stochastic. More precisely, the transition probabilities are unknown. Thus, RL algorithms generate training examples with agent-environment interaction loops as shown in Fig. 1. The goal of RL is to maximize the expected future cumulative discounted rewards (i.e. the expected return Gt), that an agent receives by interacting with the environment. To do this, a RL algorithm learns an optimal policy (denoted p ) which predicts actions that maximize the return.

Markov Decision Processes

The RL optimization process can be formally defined with Markov decision processes (MDPs). An MDP is succinctly denoted as M = hS, A, P, R, gi, whose five elements are defined in the following.
The state and action spaces We consider only continuous states and actions in this thesis, this is why we speak only of spaces and not of sets as would be the case for discrete spaces. S RSd corresponds to the state space and A RAd corresponds to the action space. A state st 2 S gathers sufficient information from the environment for an optimal policy to take a best action at 2 A. It includes proprioceptive information of the agent, such as the position of its actuators and their velocities. In classical optimal control problems, the state is available, whereas in the problems we study, it is not directly available. The environment provides instead of states, observations (ot 2 O ROd where O is the observation space) which are most often visual sensory measurements. These are high-dimensional observations (i.e. typically Sd << Od) which generally suffer from partial observability of the environment. This corresponds to a state representation learning formalism which will be described thoroughly in Chapter 3.

Table of contents :

1.1 State Representation Learning
1.2 A Brief History of State Estimation
1.3 Thesis Outline
1.3.1 Summary
2.1 Introduction
2.1.1 The Curse of Dimensionality
2.2 Deep Learning
2.2.1 Deep Learning Training Problems
2.2.2 Deep Learning Approximators
2.3 Reinforcement Learning
2.3.1 Reinforcement Learning Formalism
2.3.2 Temporal-Difference Learning
2.3.3 Policy Gradient
3.1 Introduction
3.1.1 Dimensionality Reduction
3.2 Scaling Reinforcement Learning Towards Robotics
3.2.1 Pros and Cons of End-to-End Deep Reinforcement Learning
3.2.2 Potential Solutions
3.3 SRL Formulations
3.3.1 Solution Criteria
3.3.2 Learning Heuristics
3.3.3 Exploration Strategies
3.4 Conclusion
4.1 Introduction
4.2 Related Work
4.3 State Representation Learning from Demonstration
4.3.1 Demonstrations
4.3.2 Imitation Learning from Demonstrations
4.4 Goal Reaching
4.4.1 Experimental Setup
4.4.2 Results and Discussion
4.5 Ballistic Projectile Tracking
4.5.1 Experimental Setup
4.5.2 Results and Discussion
4.6 Conclusion
5.1 Introduction
5.2 Related Work
5.3 Proposed Method: XSRL
5.3.1 State Transition Estimator
5.3.2 Discovery in the Face of Uncertainty
5.3.3 Optimization Process
5.4 Experimental Setup
5.4.1 Baselines
5.4.2 Environment Details
5.4.3 Implementation Details
5.5 Experimental Results
5.5.1 Evaluations of XSRL Representations and Exploration
5.5.2 XSRL Representations Transfer
5.6 Discussion
5.7 Conclusion
6.1 Conclusion
6.1.1 General Contributions
6.2 Discussion
6.2.1 Generalization and State Representation Properties


Related Posts