Robots as Embodied Agents
In this section we take a closer look at the characteristics that define a robot, because these characteristics matter when addressing the question of how we can enable robots to learn. Robots can be considered a special type of agent. An agent is an entity that is capable of making decisions. While these decisions could in theory be random, in most cases they are based on information gathered by or provided to the agent. Rational agents try to achieve the best outcome with respect to some objectives. While there might be useful applications of non-rational agents, we will consider agents to be rational.
Following this definition, robots are a certain type of agent: robots are embodied agents. Technically, a robot could be considered a (software) agent that reads and processes information coming from sensors and controls certain hardware. However, we will consider all these parts together as integral parts making up one robot entity. In this sense, a robot is more than just an agent, and its embodiment comes with its own problems and advantages.
The most obvious and striking difference between a robot and a virtual agent is that the robot can interact with the real world. The capability to interact with and manipulate objects in the environment has been used to learn different interesting tasks such as locomotion (Jun Nakanishi et al., 2004), the game "ball in a cup" (Kober, Mohler, and Peters, 2008) and table tennis (Muelling, Kober, and Peters, 2010). The capability to interact with the real world makes robots well suited to automating tiring or even dangerous tasks that would otherwise need to be executed manually by humans. Thus, these kinds of robots are widely deployed in industry.
However, robots can be used not only for automation, but also in a social context. Research on social robots, that is, robots that are able to communicate and engage in social interactions with humans, has recently received more attention (Fong, Nourbakhsh, and Dautenhahn, 2003; Bütepage and Kragic, 2017; Dautenhahn, 2007). While most striking, the capability to physically interact is not the only advantage of using robots over virtual agents. The mere physical presence of a robot can already be an advantage. The work of Leyzberg et al. (2012) shows that the use of a robot increased learning gains for a human learner in comparison with a purely virtual system. Furthermore, a review on social robots for education (Belpaeme et al., 2018) identifies three advantages of robots over virtual systems. The first two, already mentioned above, are the capability to interact with the real world and increased learning gains for the human learner. The third advantage is that users show more social behavior that is beneficial for learning.
While using robots over virtual systems comes with advantages, it also comes with its own challenges and problems. These problems are of two different types. The first type contains problems that directly concern the hardware. The second type contains restrictions on the software. These restrictions derive from the fact that hardware is used, but do not concern the hardware directly.
One considerable aspect that concerns the hardware directly is cost: robotic systems are usually considerably more expensive than virtual systems. Not only are the acquisition costs higher, but so are the maintenance costs, since robots are exposed to wear and tear. They can break and malfunction for mechanical reasons, and unfortunately they often do so at inopportune moments. Even if they function properly, conducting robot experiments is time consuming. Somebody has to be present to ensure a smooth execution and to verify that nothing goes wrong. Ensuring that multiple experiments in a row have the exact same conditions is difficult, and running multiple experiments in parallel is even more so. Furthermore, depending on the robot, malfunctions can be physically dangerous to humans interacting with or operating the robot.
On the software side, the problem is that assumptions typically made in machine learning do not hold in robotics. Usually, it can be assumed neither that the true state is fully observable nor that the data is noise-free.
In addition, the state and action spaces are high-dimensional and continuous (Kober, Bagnell, and Peters, 2013). While it is possible to simulate the robot, it is unrealistic to expect that the real robot will exactly match the simulated behavior. As a consequence, the algorithms that are used need to be robust with respect to models that do not capture all details of the real system correctly (Kober, Bagnell, and Peters, 2013).
Overview of Approaches to Robot Learning
Figure 3.1: The exploration-control spectrum, ranging from autonomous exploration to human control: reward function, evaluative feedback, corrective feedback, guidance, instruction, demonstration. Najar and Chetouani (2021).
Having gained a better understanding of the characteristics of a robot, we now turn to an overview of commonly used approaches to enable robots to learn. These approaches can be located on the exploration-control spectrum (Najar and Chetouani, 2021; Breazeal and Thomaz, 2008) as shown in Fig. 3.1. On the left side of the spectrum we find approaches where the agent learns autonomously, such as RL (Sutton and Barto, 1998). RL provides a mathematical framework for implementing the idea of trial-and-error learning and has a broad corpus of research, particularly in robotics (Kober, Bagnell, and Peters, 2013). Classical reinforcement learning relies purely on the agent to explore the effects of its actions on the environment. On this side of the spectrum the agent has high autonomy and learns by itself.
When moving towards the right of the spectrum, the influence of the human on the learning process increases. Coming from classical reinforcement learning, we move to approaches that integrate feedback that the agent receives from a human tutor on the actions it has taken. These approaches are often combined with RL. However, how to integrate the feedback into the learning algorithms is a research question of its own (e.g. Knox and Stone, 2012b; Li et al., 2019).
If we move further along the spectrum, we find guidance and instruction. These approaches limit the set of possible actions or suggest optimal actions (Thomaz and Breazeal, 2006a). At the right end of the spectrum we find the idea of demonstrations. This idea is implemented in the LfD framework (Argall et al., 2009; Calinon, 2019). LfD is a commonly applied approach for robots learning new skills from humans, in which a human demonstrator shows the robot how to solve a certain task. The robot learns from these demonstrations how to solve this particular task.
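The trial-and-error idea behind classical RL can be illustrated with a minimal sketch of tabular Q-learning. The corridor environment, the function names and all parameter values below are assumptions chosen for illustration only; they are not taken from the cited works, and real robotic tasks involve continuous, noisy state and action spaces that this toy example ignores.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    The agent starts in cell 0 and receives a reward of 1 when it
    reaches the right-most cell, which ends the episode.
    """
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0: step left, action 1: step right
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy exploration; ties between equal values are broken randomly.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max(range(2), key=lambda a: (q[state][a], rng.random()))
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Q-learning update: move the estimate towards the bootstrapped target.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Return the greedy action for every non-terminal state."""
    return [max(range(2), key=lambda a: row[a]) for row in q[:-1]]


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

In this sketch the agent receives no human input at all: everything it learns comes from its own exploration, which is what places classical RL at the autonomous end of the spectrum.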
Except for classical reinforcement learning, all approaches on the spectrum can be counted as interactive learning methods. In interactive learning approaches, the teaching signals can be conveyed to an agent via a variety of teaching channels such as natural language (Paléologue et al., 2018; Cruz et al., 2015; Kuhlmann et al., 2004), computer vision (Atkeson and Schaal, 1997; Najar, Sigaud, and Chetouani, 2019), computer code (Maclin et al., 2005; Torrey et al., 2006), artificial interfaces (Abbeel, Coates, and Ng, 2010; Suay and Chernova, 2011; Knox, Stone, and Breazeal, 2013) or physical interaction (Akgun et al., 2012). Najar and Chetouani (2021) identify two main categories of teaching signals based on how they are produced: advice and demonstration. While these teaching signals could use the same channel, they are fundamentally different, as a demonstration requires task execution and advice does not. In other words, demonstrations rely mainly (if not exclusively) on the task channel characteristics of the communication channel, while advice relies mainly on social channel characteristics (see Section 2.4).
Furthermore, Najar and Chetouani (2021) define advice as "teaching signals that can be communicated by the teacher to the learning system without executing the task". Based on these considerations, Najar and Chetouani (2021) propose the following taxonomy of advice:
• General advice can be used to provide prior information on the task before the learning starts. It can be split into general constraints and general instructions.
    • General constraints include information about the task such as domain concepts, behavioral constraints and performance heuristics.
    • General instructions explicitly specify what actions to perform. They can be provided either in the form of if-then rules or as detailed action plans.
• Contextual advice is provided during the task. It depends on the current state of the teacher-agent setting. It can be split into guidance and feedback.
    • Guidance informs about future actions. In the most specific sense, it aims at limiting the set of all possible actions to a subset that is favored by the teacher.
        • Contextual instructions are a particular type of guidance where only one action is suggested by the teacher.
    • Feedback informs about past actions taken by the agent. It can be split into corrective and evaluative feedback.
        • Corrective feedback can consist of either a corrective instruction or a corrective demonstration.
        • Evaluative feedback can be provided in different forms. These include scalar values, binary values, positive reinforcers or categorical information. Preferences between alternatives can also be counted as evaluative feedback.
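The hierarchy of this taxonomy can be written down as a small nested data structure. The following is only an illustrative sketch: the dictionary layout, the root label "advice" and the helper `path_to` are assumptions made here, while the category names are those of the list above.

```python
# Nested dictionary mirroring the taxonomy of advice; leaves map to empty dicts.
ADVICE_TAXONOMY = {
    "advice": {
        "general advice": {
            "general constraints": {},
            "general instructions": {},
        },
        "contextual advice": {
            "guidance": {
                "contextual instructions": {},
            },
            "feedback": {
                "corrective feedback": {},
                "evaluative feedback": {},
            },
        },
    }
}


def path_to(label, tree=ADVICE_TAXONOMY, prefix=()):
    """Return the chain of categories leading to `label`, or None if it is absent."""
    for name, children in tree.items():
        current = prefix + (name,)
        if name == label:
            return current
        found = path_to(label, children, current)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    print(" > ".join(path_to("evaluative feedback")))
```

Demonstrations do not appear in this structure, since they form the second main category of teaching signals next to advice.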
Humans Teaching Robots
Vollmer and Schillingmann (2018) provide a review of studies that present teaching interactions between a robot learner and a human teacher and that also report on the human teaching behavior. This research is sparse: while the authors did not claim exhaustiveness, they found only 18 papers matching these criteria. While all of these papers studied teaching interactions, in only five (28%) of the studies did the robot actually learn something. They mention two possible reasons: the high implementation effort of a suitable learning algorithm and the introduction of undesired variability into the study.
Khan, Mutlu, and X. Zhu (2011) investigate how humans choose examples to teach a robot a task that corresponds to a 1-dimensional classification task (as explained in Section 4.3). The task consists of ordering pictures of objects along a line according to how graspable they are, and then providing examples to teach this graspability to a robot. The authors found the following three strategies: the extreme strategy, which corresponds to curriculum learning (examples on both extreme sides); the positive-only strategy, where people gave only positive examples; and the linear strategy, where people moved from left to right (or vice versa). Furthermore, the work mentions the boundary strategy (examples on both sides close to the decision boundary), which corresponds to the strategy predicted by the teaching dimension. However, the authors could not find empirical evidence for this strategy.
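To make the four strategies concrete, the following sketch applies them to a toy 1-dimensional threshold task. It is a simplified reading of the strategies described above, not the procedure of the original study: the item positions, the threshold, the function names and the midpoint learner are all assumptions introduced here for illustration.

```python
def is_positive(x, threshold):
    """Ground-truth label: items at or above the threshold are positive."""
    return x >= threshold


def extreme_strategy(items, threshold):
    """Show the outermost item on each side of the line."""
    ordered = sorted(items)
    return [ordered[0], ordered[-1]]


def boundary_strategy(items, threshold):
    """Show the two items closest to the decision boundary, one per side."""
    negatives = [x for x in items if not is_positive(x, threshold)]
    positives = [x for x in items if is_positive(x, threshold)]
    return [max(negatives), min(positives)]


def positive_only_strategy(items, threshold):
    """Show only positive items."""
    return sorted(x for x in items if is_positive(x, threshold))


def linear_strategy(items, threshold):
    """Show all items in order from left to right."""
    return sorted(items)


def estimate_threshold(examples, threshold):
    """Hypothetical learner: midpoint between largest negative and smallest positive example.

    The threshold argument is used only to label the examples shown by the teacher.
    Returns None when one of the two classes was never shown.
    """
    negatives = [x for x in examples if not is_positive(x, threshold)]
    positives = [x for x in examples if is_positive(x, threshold)]
    if not negatives or not positives:
        return None
    return (max(negatives) + min(positives)) / 2


if __name__ == "__main__":
    items, threshold = list(range(11)), 3.5
    for strategy in (extreme_strategy, boundary_strategy,
                     positive_only_strategy, linear_strategy):
        shown = strategy(items, threshold)
        print(strategy.__name__, shown, estimate_threshold(shown, threshold))
```

For this particular learner, the two boundary examples already pin down the threshold, whereas the two extreme examples leave it in the middle of the line and positive-only examples leave it undetermined.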
Cakmak and Lopes (2012b) extend AT to optimally teaching sequential decision tasks. In this work an IRL agent learns from human demonstrations. The authors find that natural teaching behavior is normally sub-optimal, but that spontaneous optimal teaching is possible. Furthermore, they find that instructing people on how to provide optimal examples improves teaching behavior. The improvement shows in the reduced uncertainty in the estimation of the rewards.
Similarly, the work of Cakmak and Thomaz (2014) investigates human teaching behavior in three classification tasks: faces, animals and gestures. The authors find that natural teaching is not optimal and hypothesize that, because human teaching is largely optimized for human learning, human teachers might not understand the inner workings of an artificial learner. To improve the teaching behavior they propose teaching guidance, and they show that their system, which guides the human in selecting teaching examples, increases the accuracy of the artificial learner.
The work of Sena, Zhao, and Howard (2018) also addresses the problem of how to provide a set of good-quality demonstrations by giving teaching guidance to the human. In this work the authors apply teaching guidance to a task in which the robot has to learn a trajectory from a starting zone to a goal. They furthermore give visual feedback on the learner model after the teaching phase. Their approach improves the teaching efficiency, which they determine as the ratio of generalisation performance to the required number of demonstrations, by approximately 180%.
While we see that there is emerging research on human teaching behavior towards robots or virtual agents, this research is still sparse and limited to simple tasks. Some of this research focuses on teaching humans how to be better teachers, and research that aims at understanding human behavior and how to better learn from it is even sparser. In Chapter 7 we address this by investigating human teaching behavior towards a robot in a sensorimotor task.
Model Application to Implemented Research
In the previous section we described our specific model with its full interaction loop. In this section we describe how the specific model relates to our research questions and how we (partially) apply it in our implemented research.
In our user study (see Chapter 7), we focus on the human side of the communication by addressing the questions "Do humans make use of social channel characteristics when teaching robots a sensorimotor task?" (Q1) and "Are negative demonstrations useful to enrich approaches that use demonstrations to learn?" (Q2).
In the first experiment of the user study (see Chapter 7) we address the question "Does human behavior change when teaching a robot how to solve a task as opposed to just solving the task?" (Q1a). We do this by comparing two conditions: in the first condition we ask humans to solve a sensorimotor task, and in the second condition we ask humans to teach the robot how to solve a sensorimotor task.
According to our hypothesis, in the solving condition humans will use sensorimotor actions only to modify the environment and will not try to communicate anything else to the learner (here the robot). The communication model for the solving condition is shown in Fig. 6.4. In the teaching condition humans will make use of task as well as social channel characteristics. The specific model for the teaching condition is shown in Fig. 6.5. In order to address Q2, the teaching condition included negative demonstrations.
In the second experiment of our user study (see Chapter 7) we address the question "Do humans perceive this teaching behavior as more informative than the solving behavior?" (Q1b). We do this by showing the demonstrations we collected in the first experiment to new participants. In this setting the human becomes the learner, but does not know which condition the data came from. However, the teacher here is not a robot, since the participants knew that they were shown data created by humans. The model for the human perception is shown in Fig. 6.6.
Integrating Observer Feedback on Legibility into Interactive RL
In this work we are interested in combining an RL system with an observer that reasons about the goals of the learner, in order to increase the legibility of the learned trajectories. To achieve this, we use an MDP in combination with reward shaping (see Section 3.4) to model the learning problem. We add the observer by modeling it with different strategies for estimating how likely it is that the agent is heading for the target goal.
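One possible way to connect an observer's estimate to reward shaping is sketched below. This is an assumption-laden illustration and not the method used in this work: the observer model (a softmax over negative Manhattan distances to each candidate goal), the use of potential-based shaping with the observer's belief as potential, and all names and parameter values are hypothetical.

```python
import math


def manhattan(a, b):
    """Manhattan distance between two grid cells given as (x, y) tuples."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def observer_belief(state, goals, target, beta=1.0):
    """Hypothetical observer: probability that the agent is heading for `target`.

    Goals that are closer to the current state are considered more likely.
    """
    scores = {goal: math.exp(-beta * manhattan(state, goal)) for goal in goals}
    return scores[target] / sum(scores.values())


def shaped_reward(env_reward, state, next_state, goals, target,
                  gamma=0.95, weight=1.0, beta=1.0):
    """Add a potential-based shaping term, using the observer's belief as potential."""
    phi = observer_belief(state, goals, target, beta)
    phi_next = observer_belief(next_state, goals, target, beta)
    return env_reward + weight * (gamma * phi_next - phi)


if __name__ == "__main__":
    goals = [(0, 4), (4, 4)]
    target = (4, 4)
    start = (2, 0)  # equally far from both goals
    print(shaped_reward(0.0, start, (3, 0), goals, target))  # step towards the target side
    print(shaped_reward(0.0, start, (1, 0), goals, target))  # step towards the other goal
```

Under these assumptions, a step that makes the target goal more apparent to the observer earns a positive shaping term, while a step that makes the other goal look more likely is penalized.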
Table of contents:
List of Figures
List of Tables
List of Acronyms
1.2 Research Approach
1.3 Thesis Outline
1.6 The Animatas Project
II Background and Related Work
2 Cognition and Communication
2.2 The Code Model
2.3 Theory of Mind
2.4 Social- and Task Channel
2.5 Ostensive-Inferential Communication
2.6 Sensorimotor Communication
3 Approaches to Robot Learning
3.2 Robots as Embodied Agents
3.3 Overview of Approaches to Robot Learning
3.4 Reinforcement Learning
3.5 Learning from Demonstration
4 Teaching Machines and Robots
4.3 Machine Teaching
4.4 Humans Teaching Robots
5 Observer Related Metrics
III Implementation of Research
6 Communication Model
6.2 General Communication Model
6.3 Specific Approach
6.3.1 Specific Model
6.3.2 Model Application to Implemented Research
7 User Study on Human Teaching Behavior Towards Robots in a Sensorimotor Task
7.2.2 Experiment 1
7.2.3 Experiment 2
8 Augmenting RL with Social Channel Usage
8.2 Integrating Observer Feedback on Legibility into Interactive RL
8.2.1 Interactive RL
8.2.3 Modeling the Observer
8.3.1 Environment 1
8.3.2 Environments 2 – 5
9 Discussion and Conclusion
9.1 Summary of Contributions
9.2 General Limitations of the Approach