Environment and attention-action coordination


We designed a virtual pick-and-place task in a grid world environment, shown in Figure 2.1, to evaluate our agent. Two infinite sources of blocks are located in two corners of the grid, while two boxes occupy the remaining corners. A move against a wall has no effect, and the agent (the hand in Figure 2.1) starts each experiment at the center of the grid. The agent state consists of 11 features: its x and y coordinates in the room, whether it carries a block, and the x and y coordinates of each of the four objects in the grid.
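The 11-feature state described above can be sketched as a simple vector encoding; the function name, grid size, and object positions below are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

GRID_SIZE = 5  # assumed grid size, for illustration only

def encode_state(agent_xy, carrying, source_a, source_b, box_a, box_b):
    """Return the 11-dimensional state vector described in the text:
    agent x, agent y, carrying flag, then (x, y) of the four objects."""
    xa, ya = agent_xy
    features = [xa, ya, int(carrying)]
    for (x, y) in (source_a, source_b, box_a, box_b):
        features.extend([x, y])
    return np.array(features, dtype=np.float32)

# Agent at the center, holding nothing; sources and boxes in the corners.
state = encode_state((2, 2), False, (0, 0), (4, 0), (0, 4), (4, 4))
assert state.shape == (11,)
```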

Experiments on task-oriented exploration

Our first set of experiments aims to determine whether intrinsic motivation and gaze following can act in synergy, and in particular studies how the relative weights of the gaze-following incentive and of curiosity affect the discovery and achievement of a classically defined RL task in a new environment.
Task definition The task here is called any: the agent must repeatedly go to any block, pick it up, carry it to any box, and place it inside. The tutor’s policy in this environment is fixed for each task. Each time the agent ends up holding nothing during exploration, the tutor picks one of the two sources of blocks at random, and keeps looking at that source until the agent takes a block from either source. When the agent picks up a block, irrespective of whether it comes from the source the tutor looked at, the tutor chooses a box at random and looks at it until the agent has placed the block inside. This policy is of course an oversimplification of natural gaze mechanisms. It can be understood as a way to model how a caregiver could point towards an object for a child to interact with, and keep sending attentional signals in the direction of that object until the child has indeed interacted with it.
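The fixed tutor policy described above can be sketched as a small state machine; the object identifiers and the single boolean input are illustrative assumptions rather than the actual interface used in the experiments.

```python
import random

# Hypothetical object identifiers for the two sources and two boxes.
SOURCES = ["source_a", "source_b"]
BOXES = ["box_a", "box_b"]

class Tutor:
    """Sketch of the fixed tutor gaze policy described in the text."""

    def __init__(self):
        self.gaze = None  # object currently looked at

    def step(self, agent_carries_block):
        if not agent_carries_block:
            # Agent holds nothing: pick a source at random and keep
            # looking at it until a block is picked up.
            if self.gaze not in SOURCES:
                self.gaze = random.choice(SOURCES)
        else:
            # Agent holds a block: pick a box at random and keep
            # looking at it until the block is placed inside.
            if self.gaze not in BOXES:
                self.gaze = random.choice(BOXES)
        return self.gaze
```

Note that the gaze only changes when the agent's carrying status changes, which models the tutor's persistence until the agent interacts with the indicated object.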

Reward-free environment

The previous experiments were conducted in the context of learning to achieve a specific task, defined by rewards provided by the environment itself. As discussed in the introduction, this kind of reward is becoming a limitation for RL agents addressing open-ended learning. On top of that, predetermined reward functions have a few undesirable side effects.
Undesirable side effects of reward engineering A first issue appears when an agent with strong learning capabilities ends up exploiting flaws in the reward design to accumulate reward without actually performing the corresponding task (Irpan 2018). This phenomenon has become more frequent with the increasing use of powerful function approximators in deep RL, and often leads authors to define overly complicated reward functions that prevent such failure modes (Popov et al. 2017). A second, more common issue arises when the reward definition leads the agent to fall into and remain trapped in poor local optima during exploration (Irpan 2018). While such behaviors are mainly due to the difficulty of the exploration-exploitation dilemma, they also show that the fixed motivation coming from a reward signal designed once and for all by an engineer may be suboptimal for exploration, compared to a more adaptive motivation signal as advocated by developmental approaches.
In part to avoid these effects, and more generally because of these limitations of extrinsic rewards, we now carry out experiments with our intrinsically motivated gaze-following agent in a setup where no reward comes from the environment, in order to evaluate whether it can still reproduce the task the tutor is indicating with its social signals.

Goal-conditioned RL as multi-task RL

We have presented deep and goal-conditioned RL, and we now consider the possibilities offered by this augmented RL paradigm. First, we want to relate goal-conditioned RL to multi-task RL. Essentially, intention-directed learning is multi-task learning where the task is explicitly and internally represented by the agent, and where the agent's actions are conditioned on this representation. Let us examine this claim in detail.

Clearly identical transition functions

We now consider a second case of multi-task RL, in which the transition functions associated with the tasks are identical, this time by construction rather than after formal considerations, but where each task is associated with a specific reward function and the visited state distributions clearly overlap between tasks, contrary to the previous case.
This description of the setting may seem opaque, but it covers most multi-task continuous control environments: a simulated arm, for example, can learn to achieve two different tasks such as pushing and picking, or try to reach two different goals, but in all cases the learning situations are defined by different reward functions while nothing prevents the agent from passing through the same state when pursuing different tasks. To avoid the perceptual aliasing problem mentioned in the previous case, we must differentiate the two learning situations in some way. Goal-conditioned RL operates this differentiation by parameterizing the learning situation, over goals for example, and by augmenting the state space of the problem with this parameterization. In doing so, goal-conditioned RL transforms a setting of multiple overlapping MDPs into a standard multi-task setting formally identical to that of the first case.
Indeed, once the state space is augmented with a parameterization p of the learning situation, we can define a single reward function over the augmented state space, r(s, p) (the transition function was already common), as in Dynamic Goal Learning and UVFAs. The different tasks then only differ by their initial augmented state distribution, while the steady-state distribution of any policy remains entirely separated from that of another task with another parameterization.
The main difference this time is that this last property is due more to the particular nature of task parameterizations with respect to states (they do not change under the agent's actions) than to the actual non-augmented transition function of the environment.
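The state augmentation discussed above can be sketched in a few lines; the concatenation scheme and the threshold-based goal-reaching reward below are illustrative assumptions, not the specific formulation used in Dynamic Goal Learning or UVFAs.

```python
import numpy as np

def augment(state, goal):
    """Augmented state: concatenation of the state s and the
    parameterization p of the learning situation."""
    return np.concatenate([state, goal])

def reward(aug_state, state_dim, eps=0.05):
    """A single reward function r(s, p) defined on the augmented space:
    1 when the relevant state components match the goal within eps
    (an assumed goal-reaching criterion, for illustration)."""
    s, p = aug_state[:state_dim], aug_state[state_dim:]
    return float(np.linalg.norm(s[:len(p)] - p) < eps)

s = np.array([0.1, 0.2, 0.0])   # non-augmented state
p = np.array([0.1, 0.2])        # goal parameterization
aug = augment(s, p)
assert aug.shape == (5,)
assert reward(aug, state_dim=3) == 1.0
```

Because p is fixed within an episode, two tasks with different parameterizations never share an augmented state, which is the separation property discussed above.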


Table of contents :

1 Introduction 
1.1 Motivations
1.1.1 A short history of RL
1.1.2 Limitations of RL and deep RL
1.1.3 Definition and challenges of developmental robotics
1.1.4 The origins of developmental robotics
1.1.5 Developmental robotics and RL
1.2 Dissertation focus
1.2.1 Towards developmental interactive RL
1.2.2 Towards task-independent RL
1.2.3 Conclusion
1.3 Outline of the dissertation
1.3.1 Attention coordination and novelty search for task solving
1.3.2 Intentionality and competence progress for control
1.3.3 Competence progress and imitation learning for object manipulation
1.3.4 Discussion and conclusion
2 Combining novelty search and gaze-following 
2.1 Background
2.1.1 Reinforcement learning background
2.1.2 Texplore
2.2 Methods
2.2.1 Environment and attention-action coordination
2.2.2 Gaze following motivation
2.2.3 Final algorithm
2.3 Experiments on task-oriented exploration
2.4 Reward-free environment
2.4.1 Experiments
2.5 Analysis of the results
2.5.1 Globally related work
2.5.2 A first limitation
2.6 Conclusion
3 Goal-directed learning 
3.1 Goal-conditioned RL
3.1.1 RL and deep RL
3.1.2 Goal-conditioned RL
3.1.3 Goal-conditioned RL as multi-task RL
3.2 Curriculum learning
3.2.1 The Learning Progress Hypothesis
3.2.2 SAGG-RIAC plus goal-conditioned RL
3.2.3 Curriculum learning: related work
3.2.4 Limitation of an analysis
3.2.5 A theoretical standpoint
3.2.6 Conclusions
3.3 Accuracy-based curriculum learning in RL for control
3.3.1 Motivations
3.3.2 Methods
3.3.3 Results
3.3.4 Discussion
3.4 Conclusions
4 Curriculum Learning for Imitation and Control 
4.1 Multi-task, multi-goal learning
4.1.2 Limitations
4.2 CLIC
4.2.1 Motivations
4.2.2 Outline
4.3 Environments
4.3.1 Modeling objects
4.3.2 A second agent Bob
4.3.3 Environments instances
4.4 Methods
4.4.1 Multi-object control
4.4.2 Single-object control
4.4.3 Imitation
4.4.4 Curriculum learning
4.4.5 Summary and parameters
4.5 Results
4.5.1 Imitating Bob
4.5.2 Following Bob’s teaching
4.5.3 Ignoring non-reproducible behaviors from Bob
4.5.4 Ignoring Bob’s demonstrations for mastered objects
4.6 Analysis
4.6.1 Related work: Socially Guided Intrinsic Motivation
4.6.2 Related work: other approaches
4.7 Conclusion
5 Discussion 
5.1 Contributions
5.1.1 A systemic approach to intelligent agents
5.1.2 Intrinsic motivation, attention synchrony and RL
5.1.3 Intention generation, intention selection and RL
5.1.4 Object control, intention selection, imitation and RL
5.2 Environments: limitations and perspectives
5.2.1 Arguments for ad hoc simple environments
5.2.2 Limitations
5.2.3 Perspectives
5.3 Agents: limitations and perspectives
5.3.1 Off-policy learning biases
5.3.2 Interaction limitations and perspectives
5.4 A discussion on objects
5.4.1 Object control: desired properties
5.4.2 Existing models
5.5 Environment control: a meta-RL point of view
5.5.1 Meta-RL
5.5.2 Goal-conditioned RL as meta-RL?
5.5.3 Curriculum learning as unsupervised meta-RL
5.5.4 Goal space identification
5.6 Observational and hierarchical RL: a global perspective
5.6.1 Hierarchical RL
5.6.2 Observational learning
6 Conclusion 

