Learning to interact with the environment: Markov Decision Process formalism


Discrete vs Continuous Domains

Action spaces fall into two general categories: Discrete Finite Action Spaces (DAS) and Continuous Bounded Action Spaces (CAS).
Discrete Finite Action Spaces (DAS) take two forms. The simplest one is the index form, where each action is an integer between 0 and k − 1, k being the total number of actions. At each timestep, the agent chooses one action from this set. In simple grid worlds, four actions are available (up, down, left, right); the Arcade Learning Environment (Bellemare et al., 2013) uses between 4 and 18 discrete actions (game dependent); dialog systems choose among a fixed number of utterances or words (Chandramohan et al., 2010).
The second type of action space is vectorial. At each timestep, the agent acts by sending a vector of integers. In many applications, each dimension represents an independent component: for example, one dimension steers the horizontal axis and a second one the vertical axis. Traditionally, practitioners convert the vectorial form to an index one by taking the product of each dimension (Mnih et al., 2015; Kempka et al., 2016). However, the original structure is lost, and the combinatorial size might create issues (see Section 2.1.3). The other direction (index → vectorial) is not straightforward: Sharma et al., 2017 and Phelps, 2020 group actions based on expert knowledge. For example, Sharma et al., 2017 transforms the action domain in Atari into three independent axes (horizontal movement, vertical movement, fire).
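As an illustration of the conversion above, here is a minimal sketch (the dimension sizes are hypothetical, not taken from the cited papers) of flattening a vectorial action into a single index and recovering it; it also makes explicit how the flat action set grows multiplicatively with each dimension:

```python
import numpy as np

# Hypothetical example: 3 horizontal moves x 3 vertical moves x 2 fire states.
dims = [3, 3, 2]

def vector_to_index(action_vec, dims):
    """Flatten a vector of per-dimension choices into a single integer."""
    index = 0
    for choice, size in zip(action_vec, dims):
        index = index * size + choice
    return index

def index_to_vector(index, dims):
    """Recover the per-dimension choices from a flat index."""
    vec = []
    for size in reversed(dims):
        vec.append(index % size)
        index //= size
    return list(reversed(vec))

n_flat_actions = int(np.prod(dims))  # 18 actions: the product of the dimensions
assert index_to_vector(vector_to_index([2, 1, 0], dims), dims) == [2, 1, 0]
```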
Continuous Bounded Action Spaces (CAS) Similarly to DAS, CAS can have a single dimension (a scalar continuous action) or be multi-dimensional (an action is a vector). Continuous spaces are usually associated with robotic setups (Todorov et al., 2012; Levine et al., 2016). Thus, CAS are bounded: they can only take values within a finite interval due to physical constraints (Chou et al., 2017). All Mujoco (Todorov et al., 2012) environments fall under this category.
Different Action Spaces, Different Methods DAS can be tackled by any type of model-free method: value-based algorithms such as Q-Learning (Watkins et al., 1992; Mnih et al., 2015; Hessel et al., 2018), policy gradient (Williams, 1992), or actor-critic methods (Mnih et al., 2016; Schulman et al., 2015; Schulman et al., 2017). On the other hand, CAS can only rely on policy gradient and actor-critic methods (Lillicrap et al., 2016; Fujimoto et al., 2019). Value-learning algorithms cannot tackle continuous domains in their current definition for two reasons. First, greedy action selection takes the max of the value function; this max operation is not straightforward in very large or continuous domains, which makes greedy selection impossible. In Q-learning's case, bootstrapping is also impossible due to the max operation in the update (see Eq. (1.7) and Section 2.1.3.6). Second, a value network's policy is implicit, so selecting actions by sampling from the policy is impossible. Actor-critic methods are designed to solve this problem, the actor converging to an explicit sampler of the critic's distribution (Lazaric et al., 2008).
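As a reminder, a standard form of the tabular Q-learning update (the notation of Eq. (1.7) in this thesis may differ slightly) makes both uses of the max explicit; in a DAS this max is a cheap enumeration over a finite set, whereas in a CAS it becomes an inner optimization problem at every update and every greedy action selection:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ r_t + \gamma \max_{a' \in \mathcal{A}} Q(s_{t+1}, a') - Q(s_t, a_t) \Big]
```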
From Continuous Methods To Discrete On-policy policy gradient methods are less sample-efficient than off-policy (value-based) methods (Ibarz et al., 2021). Value-based techniques benefit from bootstrapping and tend to have less variance, which is more efficient. Off-policy techniques (see Section 1.4.4) can also re-use past trajectories to update the value function, which adds to the efficiency. A last problem with continuous domains is that they are harder to explore. Discrete actions can easily be counted, so adapting the exploration to how often an action was taken is simple. In continuous domains, however, agents need to model an action density and explore less-selected regions, which is challenging. Continuous domains are thus harder to work with, and practitioners, when possible, switch to discrete domains by relying on discretization (Jaśkowski et al., 2018; Andrychowicz et al., 2020; Tavakoli et al., 2018). Tang et al., 2020b compared PPO/TRPO at different discretization levels to their continuous counterparts and showed that DAS are more robust and achieve better asymptotic performance. However, discretization requires careful tuning: a coarse scheme brings inaccuracy, while an over-refined one exponentially increases the size of the action domain (see the sketch at the end of this subsection). Recent articles propose an adaptive discretization for unidimensional continuous spaces, but it still needs to be adapted to higher dimensions.
Designing action space Few theoretical analyses explore how shaping the action space impacts learning. An empirical evaluation assessed how modifications to the action space alleviate the exploration problem (Kanervisto et al., 2020). The broad conclusion is that reducing the action space to the minimum necessary seems to be a viable option. RL algorithms struggle to discover contextual actions, i.e., actions that work only in certain contexts, as discussed in a subsequent section (Section 2.1.3.4) and in a following contribution (Section 2.3). Removing those actions might prevent convergence to the optimal policy but alleviates exploration problems by reducing the search space. For example, capping the force in Mujoco might prevent an agent from running but eases the problem of learning to stand or to walk. Overall, those reduction techniques are done by hand and set in stone, yet they drastically impact learning. Data-driven methods replaced carefully designed feature engineering, and handcrafted action spaces should be addressed in the same spirit.
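The sketch below makes the discretization trade-off mentioned above concrete; it is purely illustrative, and the bounds, dimensionality, and bin counts are assumptions. Refining the grid per dimension multiplies the number of flat actions, which quickly becomes unmanageable for value-based methods:

```python
import itertools
import numpy as np

def discretize_space(low, high, bins):
    """Return the list of flat discrete actions for a box [low, high] space."""
    grids = [np.linspace(l, h, bins) for l, h in zip(low, high)]
    return list(itertools.product(*grids))

# Assumed 3-D bounded torque space, e.g. a small Mujoco-like setup.
low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
coarse = discretize_space(low, high, bins=3)    # 3**3  = 27 actions
fine = discretize_space(low, high, bins=11)     # 11**3 = 1331 actions
print(len(coarse), len(fine))                   # bins ** n_dims grows exponentially
```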

Tackling Large Action Spaces

While using function approximation, the bigger the action space, the more accurate the approximation needs to be (Thrun et al., 1993). As a result, model training is unstable and rarely used actions are overestimated (Bahdanau et al., 2016; Zaremba et al., 2016). This problem poses a significant challenge in many of the cases described earlier: fine discretization, vectorial spaces, or environments where the number of actions is enormous, such as recommender systems, language models, and industrial plants. We introduce three concepts to analyze the complexity of action spaces: minimality, action similarity, and entanglement.

Space Minimality and Contextual Ineffectiveness

A minimal action space is defined as a space that restricts its actions to those used by the optimal policy π*. Piot et al., 2017 elaborates on a similar notion called the associated set-policy. The associated set-policy of a policy π indicates, for each state, the set of actions that might be chosen by π. A minimal action space would be the union of the set-policy of π* (all the actions required to follow π*). For example, the pause button will never be used in 99% of games, so ignoring this action is beneficial. Some actions might be more subtly useless, such as a steering angle never used in a driving game, but overall this type of action is not overrepresented in current setups. We thus introduce a Local Action Space (LAS) A_s defining the set of available actions in state s. A LAS can also be non-minimal if π* does not use certain actions in s. Finding every local action space essentially amounts to discovering the optimal policy, but quickly discovering some LAS and generalizing to unseen states might alleviate sample complexity.
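One possible way to write these notions down is the following; the notation is ours for illustration and may differ from Piot et al., 2017 and from the symbols used elsewhere in this thesis:

```latex
% Set-policy of a policy \pi in state s, and the minimal action space
% induced by the optimal policy \pi^*:
\mathcal{A}^{\pi}_{s} = \{\, a \in \mathcal{A} \;:\; \pi(a \mid s) > 0 \,\},
\qquad
\mathcal{A}_{\min} = \bigcup_{s \in \mathcal{S}} \mathcal{A}^{\pi^{*}}_{s}.
```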
LAS allow the definition of Contextual Ineffectiveness (CI): actions that are non-effective in certain contexts, for example in games where some interactions are triggered only in front of objects ('pick up' or 'open'). Discovering which actions are relevant in which context can drastically increase sample efficiency by easing exploration. A follow-up section (Section 2.1.3.4) describes in which environments CI happens, and a contribution (Section 3.2) addresses contextual effectiveness, using it as an exploration signal. More generally, when considering a class of MDPs, an action set can be minimal for certain instances but not for others. For example, multi-task agents might use a set of actions to solve one task (opening doors) and leave aside the inventory, while another task does not involve doors and requires picking up objects. Thus, taking the goal into account when considering contextual ineffectiveness should be of interest to future work.

Facilitating learning by leveraging action similarity

If many actions trigger similar outcomes, an agent wastes time trying all of them independently. Transferring knowledge between related actions is key to increasing sample efficiency. Action similarity and contextual ineffectiveness are not completely orthogonal: if actions A and B are similar and A is non-effective in particular contexts, it would be appropriate to transfer this knowledge to B. For example, in text-based games, if the action « pick-up sword » does not work, « pick-up katana » is likely to fail too. The following section on action embeddings addresses this problem.


Action Embeddings

Intuition from NLP and Vision Embedding learning is a subset of representation learning (Bengio et al., 2013); it converts discrete variables to a continuous domain. For example, words in NLP are one-hot encoded, but a one-hot representation cannot encode semantic proximity between words. Thus, embeddings have mostly been used in NLP to organize words in a space where similarity and composition can be easily computed (Mikolov et al., 2013). Words like « crown » and « hat » should be close in this space, and « atomic » should be quite far. How can such a space be built from data? A simple word-embedding method could take the following form: two words are converted from one-hot encodings to continuous vectors using an MLP, and the two vectors are brought closer with a mean-squared-error loss whenever the words appear in the same sentence (see the sketch below). In this example, the space is constrained directly, whereas Word2Vec (Mikolov et al., 2013) constrains the space through a downstream completion task, and backpropagating through the network trains the embeddings. Computing embeddings for inputs is now standard in many NLP tasks (Pennington et al., 2014; Brown et al., 2020), but computer vision methods (Akata et al., 2015) also propose to embed labels (or classes) in a classification task. For example, « baboon » and « gibbon » should have closer pixel distributions than « car »; this type of information should ease learning and generalization to new classes.
RL methods could benefit from the same general ideas. For example, in a dialog system (Chandramohan et al., 2010; Gao et al., 2018; Ferreira et al., 2013), each action is a dialog act (or a sentence). Algorithms could benefit from knowing that some actions are similar: « Greetings! » and « Hello » lead to closer outcomes than « Calm down, take a pill. »
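The toy scheme mentioned above can be written in a few lines. This is only a didactic sketch (vocabulary, dimensions, and learning rate are made up); a learnable lookup table stands in for the MLP applied to one-hot inputs, which is what such an MLP reduces to in its simplest linear form:

```python
import numpy as np

# Toy embedding scheme: each word has a learnable vector, and co-occurring
# words are pulled together by gradient descent on a mean-squared-error loss.
rng = np.random.default_rng(0)
vocab = ["crown", "hat", "atomic"]
dim, lr = 8, 0.1
emb = {w: rng.normal(size=dim) for w in vocab}

def mse_step(w1, w2):
    """One gradient step on ||emb[w1] - emb[w2]||^2, pulling the two closer."""
    diff = emb[w1] - emb[w2]
    emb[w1] -= lr * 2 * diff
    emb[w2] += lr * 2 * diff

for _ in range(50):          # "crown" and "hat" co-occur in many sentences
    mse_step("crown", "hat")

print(np.linalg.norm(emb["crown"] - emb["hat"]))     # small distance
print(np.linalg.norm(emb["crown"] - emb["atomic"]))  # still large
```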
Using pre-computed embeddings in RL van Hasselt et al., 2009 describe a method that uses a continuous domain to tackle a discrete action space. In mountain-car, instead of using a set of discrete forces, they compute the policy in the continuous domain and then discretize the action. This enables generalization between similar actions, which reduces sample complexity. However, the action space consists of a unidimensional steering force, which is already almost continuous and smooth; thus, (1) finding the closest action is easy and (2) the continuous domain already encodes similarity. Many domains require a different approach, as action similarity cannot be easily computed and finding the closest discrete action can take time.
Dulac-Arnold et al., 2015 generalize van Hasselt et al., 2009's method to higher-dimensional domains, allowing any pre-defined embeddings. To do so, the policy outputs a continuous action, and a nearest-neighbor algorithm (Fix et al., 1989) finds the closest discrete action (see the sketch at the end of this paragraph). To pre-compute action embeddings, Tessler et al., 2019 use predefined word embeddings (such as Word2Vec) and compose actions using an Orthogonal Matching Pursuit algorithm. The following section proposes to improve upon pre-computed embeddings or to design algorithms that can learn them concurrently with the policy.
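A simplified sketch of the nearest-neighbor lookup in the spirit of Dulac-Arnold et al., 2015 follows; sizes and names are assumptions, and the actual method additionally re-ranks the k nearest candidates with a critic before acting:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 1000, 16
# Pre-computed, frozen embeddings of the discrete actions (one row per action).
action_embeddings = rng.normal(size=(n_actions, dim))

def nearest_discrete_actions(proto_action, k=1):
    """Return the indices of the k discrete actions closest to the proto-action."""
    dists = np.linalg.norm(action_embeddings - proto_action, axis=1)
    return np.argsort(dists)[:k]

proto = rng.normal(size=dim)                 # would come from the actor network
candidates = nearest_discrete_actions(proto, k=5)
print(candidates)                            # k candidate discrete actions
```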
Learning Action Embeddings Tennenholtz et al., 2019 propose Act2Vec, an embedding method using expert demonstrations. Similarly to Word2Vec (Mikolov et al., 2013), they encode actions with respect to their surroundings: actions that appear in the same context should have a similar function and thus be close in the representation space. They show that action embeddings encode a notion of similarity between actions and that clusters represent high-level semantic concepts. They also propose to go beyond 1-step action embeddings and embed sequences of actions (see Fig. 2.3).
Chandak et al., 2019 learn an inverse model (predicting the action taken between s_t and s_{t+1}) using supervised learning and a few collected trajectories to build an action representation. This constrains the embedding to contain information about the transition. Chen et al., 2019 refine this approach by using a probabilistic transition model. Finally, Pritz et al., 2020 build an action representation coupled with a state representation, learned alongside the policy. They also show that embeddings trained alongside the policy transfer quickly to new action domains. We expect to see more work in this direction, as action representations increase efficiency, generalization, and adaptation to new actions.
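As a rough illustration of the inverse-model idea, the sketch below learns an action embedding table by classifying which action was taken between two consecutive states; the architecture, sizes, and loss are assumptions and simplify the cited methods:

```python
import torch
import torch.nn as nn

state_dim, n_actions, emb_dim = 8, 12, 4
action_emb = nn.Embedding(n_actions, emb_dim)       # learned action representations
encoder = nn.Sequential(                            # g(s_t, s_{t+1}) -> embedding
    nn.Linear(2 * state_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim)
)
opt = torch.optim.Adam(
    list(action_emb.parameters()) + list(encoder.parameters()), lr=1e-3
)

def train_step(s_t, s_tp1, a_t):
    """One supervised step on a batch of transitions (s_t, a_t, s_{t+1})."""
    e = encoder(torch.cat([s_t, s_tp1], dim=-1))    # predicted transition embedding
    logits = e @ action_emb.weight.T                # similarity to each action embedding
    loss = nn.functional.cross_entropy(logits, a_t)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Dummy batch standing in for a few collected trajectories.
s_t, s_tp1 = torch.randn(32, state_dim), torch.randn(32, state_dim)
a_t = torch.randint(0, n_actions, (32,))
train_step(s_t, s_tp1, a_t)
```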

Discussion on Contextual Ineffectiveness

We previously described how to handle action similarity; we now look at reducing such large action spaces by detecting and discarding useless actions, as defined in Section 2.1.3.1.
Formally, we defined Contextual Ineffectiveness (CI) as actions that, in some situations, do not modify the state. Learning to quickly ignore this set of actions is key to increasing the sample efficiency of RL algorithms. The subsequent Section 2.3 and Section 3.2 are framed under this paradigm.
Availability known before acting Some environments already remove unavailable actions when they are not required by the context; thus, the action set can vary at every time step. For example, some dialog systems propose a limited set of utterances depending on the conversation's stage. To cope with this challenge, He et al., 2016b; He et al., 2016a propose a Q-Learning approach: they compute a state embedding and an action embedding for each available action, then take the dot product between the two. Since the number of actions is small, action selection and bootstrapping using an argmax are quick.
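A minimal sketch of such a dot-product Q-function over a varying action set is given below; the encoders are simplified placeholders and do not reproduce the exact architecture of He et al., 2016:

```python
import torch
import torch.nn as nn

state_dim, action_feat_dim, emb_dim = 16, 10, 32
state_encoder = nn.Linear(state_dim, emb_dim)     # stand-in for the state network
action_encoder = nn.Linear(action_feat_dim, emb_dim)  # stand-in for the action network

def q_values(state, available_action_feats):
    """Q(s, a) = <f(s), g(a)> for every currently available action."""
    s = state_encoder(state)                      # (emb_dim,)
    a = action_encoder(available_action_feats)    # (n_available, emb_dim)
    return a @ s                                  # (n_available,)

state = torch.randn(state_dim)
available = torch.randn(5, action_feat_dim)       # only 5 utterances this turn
q = q_values(state, available)
greedy = int(q.argmax())   # argmax stays cheap because the available set is small
```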

Table of contents:

1 Deep Reinforcement Learning Cookbook 
1.1 Machine Learning and Deep Learning Background
1.1.1 Machine Learning Ingredients
1.1.2 Example of applications
1.2 Deep Learning
1.2.1 What tools are necessary to follow this thesis?
1.2.2 A brief history of Deep Learning
1.3 Reinforcement Learning
1.3.1 Learning to interact with the environment: Markov Decision Process formalism
1.3.2 How to reinforcement learn?
1.4 Deep Reinforcement Learning (DRL)
1.4.1 Deep Q-Learning (DQN)
1.4.2 Dealing with Partial Observability
1.4.3 Scaling up
1.4.4 Exploration in Reinforcement Learning
1.5 Chapter Conclusion
2 Of Actions And Constraints
2.1 Action Space Zoo
2.1.1 Domains example
2.1.2 Discrete vs Continuous Domains
2.1.3 Tackling Large Action Spaces
2.1.4 Working with Hybrid Action Spaces
2.1.5 More Exotic Action Spaces
2.1.6 Conclusion
2.2 Deep Garcia and Fernandez
2.2.1 Worst Case Min Max formulation
2.2.2 Risk-Sensitive Criterion (CVaR)
2.2.3 Constrained Criterion
2.2.4 Conclusion
2.3 First contribution (IJCNN'19): I'm sorry Dave, I'm Afraid I Can't Do That – Deep Q-learning from Forbidden Actions
2.4 Method
2.4.1 Feedback Signal and MDP-F
2.4.2 Frontier loss
2.5 Experiments
2.5.1 MiniGrid Environment
2.5.2 TextWorld Environment
2.5.3 Model and architecture
2.6 Results
2.7 Conclusion
3 Exploring using Action Relevance
3.1 Press E to Explore
3.1.1 Undirected Methods in Exploration
3.1.2 Directed Methods in Exploration
3.2 Second Contribution (IJCAI’21): Don’t Do What Doesn’t Matter, Intrinsic Motivation from Action Usefulness
3.3 Notation Adjustment
3.4 Don’t Do What Doesn’t Matter!
3.4.1 Intuition
3.4.2 Method
3.4.3 Decay illustration
3.5 Experimental settings
3.5.1 MiniGrid Environment
3.5.2 Experimental Setting
3.6 Experimental Results
3.6.1 Base environment
3.6.2 Intrinsic exploration behavior
3.6.3 Intrinsic Motivation Pitfalls
3.7 Conclusion
4 Abstraction and Goals 
4.1 Introduction
4.1.1 Vision and Language Navigation
4.2 Goal-Conditioned Reinforcement Learning
4.2.1 Background And Notation
4.2.2 Hindsight Experience Replay
4.2.3 HER variants
4.3 Third Contribution (SCCI'20): HIGhER: Improving instruction following with Hindsight Generation for Experience Replay
4.4 Hindsight Generation for Experience Replay
4.5 Related Methods
4.5.1 Conditioned Language Policy
4.5.2 IRL for instruction following
4.6 Experiments
4.6.1 Experimental Setting
4.6.2 Instruction Generator Analysis
4.6.3 HIGhER for Instruction Following
4.6.4 Discussion
4.6.5 Language Learned by the Instruction Generator
4.6.6 Complementary Experiment
4.6.7 Limitations
4.7 Conclusion
5 Turning Supervised Learning Into Multi-Turn Interactive Tasks 
5.1 Active Learning and Reinforcement Learning
5.2 Sequential Representation Learning
5.3 Fourth Contribution (INTERSPEECH'20): Machine of Few Words – Interactive Speaker Recognition with Reinforcement Learning
5.4 Interactive Speaker Recognition Game
5.4.1 Game Rules
5.4.2 Game notation
5.4.3 Modelling the Speaker Recognition Module
5.5 Speaker Recognition as an RL Problem
5.5.1 Markov Decision Process
5.5.2 Enquirer optimization Process
5.6 Experimental Protocol
5.6.1 Dataset
5.6.2 Audio Processing
5.6.3 Speaker Recognition Neural Modules
5.7 Experiments
5.7.1 Guesser Evaluation
5.7.2 Enquirer Evaluation
5.8 Conclusions and Future Directions
List of Acronyms
List of Symbols
