Dynamics, offline learning and transfer


Sequential decision-making

Section 1.1.3 presented the Reinforcement Learning framework, which formulates the learning procedure as an optimal control problem. In this section, I start by recalling other design principles that have been considered for obtaining a good driving policy π.

Motion Planning

The development of motion planning techniques for intelligent vehicles dates back to the late 80s, supported by international research projects such as Eureka (1987) of the Prometheus program, followed by the DARPA Grand and Urban Challenges (2004, 2007), and more recently the VIAC (2010), GCDC (2011) and Delphi (2015) challenges. In two surveys (González et al., 2016; Paden et al., 2016) studying the literature of this period, the authors identified three main approaches.

Search-based algorithms. This method is based on a regular discrete partition of the state space S, called a lattice, whose nodes must be connected by feasible trajectories (e.g. Pivtoraiko and Kelly, 2005). This framing reduces motion planning to the problem of finding a shortest path in a known graph. Traditional graph-search algorithms such as Dijkstra's algorithm (Dijkstra, 1959), A⋆ (Hart, Nilsson, and Raphael, 1968) or D⋆ (Stentz, 1994) can then be used to compute the optimal trajectory. This technique was applied by at least five different teams during the DARPA Urban Challenge for driving on structured roads and in unstructured parking areas: Dijkstra by team Ben Franklin (Bohren et al., 2008) and VictorTango (Bacha et al., 2008), A⋆ by Stanford University (Montemerlo et al., 2008) and KIT (Kammel et al., 2008), and D⋆ by the winning team from CMU (Urmson et al., 2008). Kinematic constraints can also be included in the configuration graph (e.g. Latombe, 1991; T. Fraichard, 1993; Laumond et al., 1994).
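To make the reduction to graph search concrete, here is a minimal, self-contained sketch of A⋆ over a state lattice. The toy lattice, edge costs and Manhattan heuristic are purely illustrative assumptions, not taken from any of the cited systems.

```python
import heapq


def a_star(lattice, start, goal, heuristic):
    """A* search over a state lattice.

    `lattice` maps each configuration to a list of (successor, cost) pairs,
    where every edge is assumed to correspond to a feasible motion primitive.
    `heuristic` must be an admissible estimate of the remaining cost to `goal`.
    """
    frontier = [(heuristic(start, goal), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for successor, edge_cost in lattice.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(successor, float("inf")):
                best_cost[successor] = new_cost
                priority = new_cost + heuristic(successor, goal)
                heapq.heappush(frontier, (priority, new_cost, successor, path + [successor]))
    return None, float("inf")


# Hypothetical lattice: nodes are (x, y) cells, edges are feasible transitions.
lattice = {
    (0, 0): [((1, 0), 1.0), ((0, 1), 1.0)],
    (1, 0): [((1, 1), 1.0)],
    (0, 1): [((1, 1), 1.5)],
    (1, 1): [],
}
manhattan = lambda s, g: abs(s[0] - g[0]) + abs(s[1] - g[1])
path, cost = a_star(lattice, (0, 0), (1, 1), manhattan)  # -> [(0,0), (1,0), (1,1)], 2.0
```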


Sampling-based algorithms. The limitation of search-based algorithms lies in the difficulty of formulating a regular lattice structure covering the state space S with feasible transitions, and in the real-time constraint that may not be met by graph-search algorithms. To address these issues, sampling-based motion planners iteratively grow a set of reachable configurations by randomly sampling valid transitions. The most popular ones are Probabilistic Roadmaps (PRM) (Kavraki et al., 1996), Rapidly-exploring Random Trees (RRT) (Lavalle, 1998; Karaman and Frazzoli, 2011) and MCTS algorithms (Coulom, 2007b; Kocsis and Szepesvári, 2006). These methods have been used in the context of Autonomous Driving in (e.g. Lamiraux and Laumond, 2001; Sánchez L., Zapata, and Arenas B., 2002; D. Lenz, Kessler, and Knoll, 2016; Paxton et al., 2017; Faust et al., 2018).

Optimisation-based algorithms. The third approach consists in optimising a parametrised trajectory with respect to a real-valued objective function. The most popular instance is interpolation between the current and goal states, which has been applied to various classes of functions in the context of Autonomous Driving, such as lines and circles (Reeds and Shepp, 1990), clothoids (Funke et al., 2012), polynomials (W. Xu, Wei, et al., 2012), and Bézier curves (González et al., 2016; Artuñedo, Villagra, and Godoy, 2019).
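As an illustration of the interpolation idea, the following sketch fits a quintic polynomial joining a current and a goal longitudinal state (position, velocity, acceleration), which is one common choice for polynomial interpolation; the boundary values and horizon below are hypothetical.

```python
import numpy as np


def quintic_coefficients(x0, v0, a0, xT, vT, aT, T):
    """Coefficients of a quintic polynomial x(t) joining two states.

    Boundary conditions: position, velocity and acceleration at t = 0 and t = T.
    A sketch of interpolation-based planning, not a specific published method.
    """
    A = np.array([
        [1, 0, 0,    0,       0,        0],
        [0, 1, 0,    0,       0,        0],
        [0, 0, 2,    0,       0,        0],
        [1, T, T**2, T**3,    T**4,     T**5],
        [0, 1, 2*T,  3*T**2,  4*T**3,   5*T**4],
        [0, 0, 2,    6*T,     12*T**2,  20*T**3],
    ], dtype=float)
    b = np.array([x0, v0, a0, xT, vT, aT], dtype=float)
    return np.linalg.solve(A, b)


# Example: reach x = 30 m in 5 s, starting at rest and arriving at 10 m/s.
coeffs = quintic_coefficients(x0=0, v0=0, a0=0, xT=30, vT=10, aT=0, T=5.0)
t = np.linspace(0, 5.0, 50)
trajectory = sum(c * t**i for i, c in enumerate(coeffs))  # sampled positions x(t)
```

In practice, several such candidate trajectories are generated and scored against the objective function, with the best-scoring feasible one executed.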
These three approaches to motion planning thus rely on deterministic models of the vehicle dynamics. These models are often required to take a simple form so that the search or optimisation procedure can be solved efficiently, and other objects in the scene are often considered as static. In order to study more complex multi-agent interactions specifically, a collaborative approach to motion planning has been developed.

Cooperative planning. The difficulty of predicting intricate interaction patterns between multiple agents can be bypassed in one particular setting: cooperative motion planning for multiple vehicles. Indeed, instead of predicting how vehicles react to one another, the behaviours of these vehicles are jointly optimised. As a result, prediction outputs are replaced by input variables that can be chosen freely to maximise an objective function. Two main variations have been studied: coordination along fixed paths (Altché, Qian, and La Fortelle, 2016; Altché and La Fortelle, 2016; Altché, Qian, and de La Fortelle, 2017), and general unconstrained motion planning (LaValle and Hutchinson, 1998). However, this framework cannot represent human behaviours, or more generally any behaviour that is not explained by the objective function. In particular, the lack of communication between agents and the resulting uncertainty lead to suboptimal, uncertain and multimodal trajectories that are not handled by cooperative planning approaches.

Imitation Learning

An orthogonal strategy to motion planning techniques is to learn a reactive policy π(a|s) under the supervision of an expert πE that produces a dataset D of demonstration trajectories. To that end, we optimise a parametrised policy πθ to minimise a regression loss L, such as the KL divergence to the distribution of expert actions:

$$\min_\theta \; \mathbb{E}_{s \sim \mathcal{D}} \left[ \mathcal{L}\big(\pi_\theta(a \mid s), \pi_E(a \mid s)\big) \right].$$
This approach is also referred to as behavioural cloning, and is particularly suited when only low-level, high-dimensional inputs are available, such as camera images, which precludes access to the dynamics model required by motion planning approaches. The first application of imitation learning to autonomous driving is the ALVINN (Autonomous Land Vehicle In a Neural Network) project (Pomerleau, 1989), in which a 3-layer neural network was trained for the task of road following, as shown in Figure 2.2.
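A minimal behavioural cloning sketch in PyTorch is given below, assuming a discrete action space and a hypothetical dataset of expert state-action pairs (generated at random here as placeholders); for a deterministic expert, the cross-entropy loss coincides with the KL objective above.

```python
import torch
import torch.nn as nn

# Hypothetical demonstration dataset: scene feature vectors and the discrete
# action chosen by the expert driver in each state (placeholders, not real data).
states = torch.randn(1024, 16)
expert_actions = torch.randint(0, 5, (1024,))

# Parametrised policy pi_theta(a | s) as an MLP over state features.
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    logits = policy(states)                 # unnormalised action scores
    loss = loss_fn(logits, expert_actions)  # regression of expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```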

Rewards and inverse reinforcement learning

Having defined the state and action spaces S and A, we come to specify which state-action pairs are deemed desirable, through the definition of the reward function R. Paradoxically enough, humans know how to drive but not necessarily how to make the reasons for their actions explicit, especially in the form of a fixed, evaluable objective. A common approach to reward specification is colloquially known as reward engineering, in which the reward function is typically parametrised as a linear combination of features $R(s, a) = \sum_i \omega_i \phi_i(s, a)$. For example, such features may include the ego-vehicle speed, its longitudinal distance to a short-term goal, the lateral distance to the lane centerline, or the presence of collisions. As more and more use-cases are handled, the number of features to consider rises quickly, and some of them contradict each other. Then, the issue of how to properly choose the weights ωi remains, and it can be increasingly hard to strike the right trade-off. Besides, this difficulty is further exacerbated by an ambiguity lying in the blurry boundary between the reward function R(s, a) and the value function V(s). For instance, are safety distances desirable per se, or only because respecting them means we are less likely to end up in an accident, which is the actual feature of interest here? Likewise, do road traffic regulation rules describe rewarding states or high-value states? One practical solution to these concerns is to iteratively refine the reward function R until the corresponding optimal policy π⋆ matches the expected behaviour πE of human drivers. The careful or well-versed reader will have noticed that this approach is directly opposed to the Reinforcement Learning problem, where the optimal behaviour π⋆ stems from the reward function R. Accordingly, the aptly named Inverse Reinforcement Learning (IRL) framework aims at finding a reward function that makes the expert behaviour appear uniquely (near-)optimal. At first glance, this problem seems related to the Imitation Learning formulation of Section 2.1.2 in its attempt to reproduce expert behaviour. This intuition is supported by the fact that RL ◦ IRL is the dual problem of state-action occupancy matching with respect to expert trajectories (J. Ho and Ermon, 2016). Example applications of IRL to Autonomous Driving include the work of Kuderer, Gulati, and Burgard (2015), who learn the trade-off between comfort and efficacy in human lane-change trajectories.
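As a concrete, if simplistic, illustration of reward engineering, the sketch below evaluates such a linear combination of hand-crafted features; the particular features, weights and state encoding are hypothetical choices made for this example only.

```python
import numpy as np


def features(state, action):
    """Hand-crafted features phi_i(s, a); `action` is unused in this toy set."""
    return np.array([
        state["speed"] / state["speed_limit"],  # progress towards the speed limit
        -abs(state["lateral_offset"]),          # penalise distance to lane centerline
        -float(state["collision"]),             # penalise collisions
    ])


weights = np.array([1.0, 0.5, 10.0])  # omega_i, to be tuned by hand


def reward(state, action):
    """R(s, a) = sum_i omega_i * phi_i(s, a)."""
    return weights @ features(state, action)


state = {"speed": 25.0, "speed_limit": 30.0, "lateral_offset": 0.3, "collision": False}
print(reward(state, action=None))
```

Tuning the weights by hand quickly becomes the bottleneck, which is precisely the trade-off difficulty discussed above.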
In addition to finding a good candidate reward for the ego-vehicle behaviour, Inverse Reinforcement Learning can also be applied for the purpose of predicting how other agents in the scene are likely to behave, by modelling them as rational agents trying to maximise an unknown reward function to be learned. In that sense, RL ◦ IRL is a form of model-based Reinforcement Learning. For instance, this approach has been used to model the routing preferences of taxi drivers (Ziebart, Maas, et al., 2008), to predict the future trajectory of pedestrians (Ziebart, Ratliff, et al., 2009), as shown in Figure 2.8, and the behaviour of human drivers at an intersection (Sun et al., 2019) or on a highway (Sadigh et al., 2016).


Dynamics, offline learning and transfer

We now come to the last element of the Markov Decision Process tuple: the dynamics P(s′ | s, a). Contrary to the state and actions, which are the input and output interfaces of the agent, and to the reward function, which is generally chosen by the system designer, there is usually no need to specify the system dynamics. Indeed, one of the most significant assets of Reinforcement Learning is its ability to handle unknown dynamics that are only accessed through interaction.

Unfortunately, the assumption that the agent is allowed direct interaction with the true environment is both unacceptable and unrealistic in the context of Autonomous Driving. Indeed, the traditional learning-by-acting pipeline requires exploration, which would imply having autonomous vehicles violate the rules of the road and cause accidents, which is obviously out of the question. Besides, most Reinforcement Learning algorithms require a tremendous amount of interaction data to learn an optimal policy, including for MDPs such as Atari games, which are arguably less diverse and complex than real-world driving scenes. Not to mention that, perhaps evidently but contrary to simulated games, the real world can only run in real time.

This issue can be addressed in two ways. A first solution is to consider the offline Reinforcement Learning problem (Levine, Kumar, et al., 2020), in which the agent can no longer interact with the environment and is instead provided with a static dataset $\mathcal{D} = \{(s_t, a_t, r_t, s_{t+1})\}$ of interaction data, and must learn the best policy it can from this dataset. These interaction data are collected by an exploration policy πE, in our case, human driving. However, for the same reasons mentioned in Section 2.1.2, the limited available data induces a loss of optimality: any attempt to improve the policy may steer the agent into regions of the state space S that are not present in the dataset D, typically accident states, off-road driving, etc. in the case of Autonomous Driving. Still, safe policy improvement guarantees can be derived under some conditions, e.g. for finite MDPs (Laroche, Trichelair, and Combes, 2019; Nadjahi, Laroche, and Tachet des Combes, 2020). This derivation often requires constraining how much the learned policy π can differ from the exploration policy πE, so as to bound the state distribution shift (Kakade and Langford, 2002; Schulman et al., 2015).
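To illustrate the idea of constraining the learned policy to remain close to the exploration policy, here is a minimal sketch of a KL-regularised policy improvement step on a static dataset; the critic, the estimated behaviour policy and the random data are hypothetical placeholders, and this penalty is only one of several possible formulations of such a constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder static dataset of states (in practice, logged human driving).
states = torch.randn(256, 16)

# Assumed pre-trained components: a critic Q(s, a) and an estimate of the
# exploration (behaviour) policy pi_E, both used here without gradients.
q_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))
behaviour = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
kl_weight = 1.0  # trades off improvement against distribution shift

for step in range(100):
    log_pi = F.log_softmax(policy(states), dim=-1)
    with torch.no_grad():
        q = q_net(states)
        log_b = F.log_softmax(behaviour(states), dim=-1)
    expected_q = (log_pi.exp() * q).sum(dim=-1)         # E_{a~pi}[Q(s, a)]
    kl = (log_pi.exp() * (log_pi - log_b)).sum(dim=-1)  # KL(pi || pi_E)
    loss = (-expected_q + kl_weight * kl).mean()        # improve, but stay close
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```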

Table of contents:

List of Acronyms
List of Symbols
1 Introduction 
1.1 Context and scope
1.2 Outline and Contributions
I Case Study: Learning to Drive 
2 Literature Review 
2.1 Sequential decision-making
2.2 States and partial observability
2.3 Actions and temporal abstraction
2.4 Rewards and inverse reinforcement learning
2.5 Dynamics, offline learning and transfer
2.6 Optimality criterion and safety
3 Problem Statement 
3.1 Perceived states
3.2 Behavioural decisions
3.3 Traffic dynamics
3.4 Rewards
3.5 Implementation
II Model-free 
4 Considering Social Interactions 
4.1 Motivation
4.2 A social attention architecture
4.3 Experiments
5 Acting under Adjustable Constraints 
5.1 Motivation
5.2 Budgeted dynamic programming
5.3 Budgeted reinforcement learning
5.4 Experiments
III Model-based 
6 Planning Fast by Hoping for the Best 
6.1 Motivation
6.2 Open-loop optimistic planning
6.3 Graph-based optimistic planning
7 Preparing for the Worst 
7.1 Motivation
7.2 Confident model estimation
7.3 State interval prediction
7.4 Robust stabilisation and constraint satisfaction
7.5 Minimax control with generic costs
7.6 Multi-model selection
7.7 Experiments
8 General Conclusion and Perspectives 
8.1 Conclusion on our contributions
8.2 Outstanding issues and perspectives
A The highway-env software 
A.1 General presentation
A.2 Outreach
B Complements on Chapter 5 
B.1 Proofs
C Complements on Chapter 6 
C.1 Proofs
C.2 Time and memory complexities
D Complements on Chapter 7 
D.1 Proofs
D.2 A tighter enclosing polytope
List of Figures
List of Algorithms
List of Tables
List of References
