
## Value function and policy

Once a decision problem is defined as an MDP, a solution can be expressed as a policy $\pi : S \to A$ that specifies, for each state of the MDP, which action the agent should take. Given a policy $\pi$, one can define a value function $V^\pi : S \to \mathbb{R}$ that gives, for each state $s$, the cumulative discounted reward the agent can expect when starting from $s$ and subsequently following $\pi$. More formally,

$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \,\middle|\, s_0 = s\right] \tag{2.1}$$

where the discount factor $0 \le \gamma < 1$ expresses the preference for obtaining a reward immediately rather than later (e.g. getting a milkshake today rather than in ten days). This function can be defined recursively as

$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s') \tag{2.2}$$
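The recursive definition above (Equation 2.2) can be turned into a simple iterative evaluation of a fixed policy. The following is a minimal sketch, not the author's implementation: the three-state MDP, its state and action names, the rewards and the tolerance `tol` are all hypothetical values chosen for illustration.

```python
# Hypothetical 3-state MDP; every name and number below is an illustrative assumption.
GAMMA = 0.9

# policy[s] -> action chosen in state s
policy = {"s0": "go", "s1": "go", "s2": "stay"}

# R[(s, a)] -> immediate reward
R = {("s0", "go"): 0.0, ("s1", "go"): 1.0, ("s2", "stay"): 0.0}

# P[(s, a)] -> {next state: probability}
P = {
    ("s0", "go"): {"s1": 1.0},
    ("s1", "go"): {"s2": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}


def evaluate_policy(policy, R, P, gamma, tol=1e-10):
    """Apply the recursive definition repeatedly until the values stop changing."""
    V = {s: 0.0 for s in policy}
    while True:
        largest_change = 0.0
        for s, a in policy.items():
            new_v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            largest_change = max(largest_change, abs(new_v - V[s]))
            V[s] = new_v
        if largest_change < tol:
            return V


V_pi = evaluate_policy(policy, R, P, GAMMA)
print(V_pi)
```

In this toy problem the reward obtained from `s1` is worth less when seen from `s0`, because it is delayed by one step and therefore multiplied by the discount factor.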

The aim of this formalism is to find an optimal policy $\pi^*$ that maximizes the agent’s cumulative rewards over time, that is, $\forall \pi, \forall s \in S : V^{\pi^*}(s) \ge V^\pi(s)$. Note that while there can be multiple optimal policies, they all share a unique optimal value function

$$V^*(s) = \max_{\pi} V^\pi(s) \tag{2.3}$$
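To make the distinction between optimal policies and the optimal value function concrete, the sketch below enumerates every deterministic policy of a tiny MDP by brute force and takes the maximum of their value functions. This is only feasible for toy problems and is not how optimal policies are computed in practice. The two-state MDP, its action names and the fixed number of evaluation sweeps are assumptions made for this example.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP (illustrative assumption).
GAMMA = 0.5
STATES = ["A", "B"]
ACTIONS = ["left", "right"]

R = {("A", "left"): 0.0, ("A", "right"): 1.0,
     ("B", "left"): 0.0, ("B", "right"): 0.0}
P = {("A", "left"): {"A": 1.0}, ("A", "right"): {"B": 1.0},
     ("B", "left"): {"B": 1.0}, ("B", "right"): {"B": 1.0}}


def evaluate_policy(policy, n_sweeps=200):
    """Approximate the value function of a policy by repeated sweeps over the states."""
    V = {s: 0.0 for s in STATES}
    for _ in range(n_sweeps):
        for s in STATES:
            a = policy[s]
            V[s] = R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())
    return V


# Every deterministic policy: one action per state.
all_policies = [dict(zip(STATES, choice))
                for choice in product(ACTIONS, repeat=len(STATES))]
values = [evaluate_policy(pi) for pi in all_policies]

# Optimal value function: state-wise maximum over policies.
V_star = {s: max(V[s] for V in values) for s in STATES}

# Policies whose value function matches V_star in every state.
optimal_policies = [pi for pi, V in zip(all_policies, values)
                    if all(abs(V[s] - V_star[s]) < 1e-9 for s in STATES)]

print(V_star)
print(optimal_policies)
```

Here both actions are equivalent in state `B`, so two distinct policies are optimal while the optimal value function remains unique.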

### Reinforcement learning

Dynamic programming algorithms are efficient methods when the problem is fully known. However, it is often the case, especially in neuroscience experiments, that agents do not have complete knowledge of their environment and need to interact with and move within it to acquire the information they need to eventually solve the problem.

The computational field of reinforcement learning [SB98] addresses this limitation. The agent starts with no prior knowledge about the environment and learns from the consequences of its actions through trial and error. Over time, it learns the optimal value function and/or an optimal policy. Interestingly, although they are based on the MDP formalism, such methods are well suited to changing environments, as they can revise their beliefs about the world’s dynamics over time. This is a useful property in animal conditioning experiments.

Figure 2.3: The agent has no prior knowledge of the world’s dynamics (black box) and acquires it through trial-and-error interactions, taking an action $a$ and observing the resulting new state $s'$ and possible reward $r$. Model-Based algorithms learn a model of the world from observations (Model learning), from which they infer a value function/policy (Planning). Model-Free algorithms directly learn a value function/policy from observations (Direct RL). These values/policies are used by action selection mechanisms to select the next action.

There are two main categories of RL algorithms, which use different pathways to achieve the same goal (Figure 2.3). Model-Based algorithms (Section 2.3.2) incrementally learn an internal model of the world from experience, from which they can infer values over states that help to guide behaviour. Model-Free algorithms (Section 2.3.1) learn these values directly from experience without relying on an internal model. While these algorithms are usually seen as the full process of learning, planning and acting, they mainly define the first two steps, which can then be combined with various action selection mechanisms (Section 2.3.3) that only need values over state-action pairs to work.

**Model-Free algorithms**

In the following section, we present three Model-Free (MF) algorithms (Actor-Critic, Q-Learning and SARSA) that have been extensively linked to instrumental and Pavlovian phenomena. What MF algorithms have in common is that they incrementally learn value functions through trial and error, without the help of an internal model of the world. These algorithms derive from the Temporal Difference Learning principle.

The Temporal Difference Learning principle (TD-Learning) [SB87; Sut88] offers a way to estimate the value function over time through experience with the environment. It neither requires knowledge of T or R nor builds a representation of them. It essentially relies on the key concept of Reward Prediction Error (RPE). The RPE signal is the difference between the agent’s current estimate of the value function $\hat{V}_t(s_t)$, i.e. its expectation, and the value of its last observation $r_t + \gamma \hat{V}_t(s_{t+1})$.

Given the recursive definition of the value function (Equation 2.2), and provided that $R(s, a)$ and $P(s' \mid s, a)$ can be approximated by the last observation $\langle r_t, s_{t+1} \rangle$, there should be no difference (i.e. a null RPE) if the value function $\hat{V}_t$ is correct. This signal is formally defined as

$$\delta_t \leftarrow \underbrace{r_t + \gamma \hat{V}_t(s_{t+1})}_{V(\text{observation})} - \underbrace{\hat{V}_t(s_t)}_{V(\text{expectation})} \tag{2.11}$$

where $r_t$ is the reward received after performing action $a_t$ in state $s_t$ and ending in the new state $s_{t+1}$.
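The RPE can be illustrated with a short sketch that computes $\delta_t$ on a deterministic two-step episode and uses it to correct the value estimates. The update rule used here, adding a fraction `ALPHA` of the RPE to the current estimate, is the standard TD(0) update and is an assumption of this example, as are the state names, the rewards and the number of episodes.

```python
GAMMA = 0.9
ALPHA = 0.1  # learning rate: an assumption of this sketch, not given in the text above


def td_error(V, s, r, s_next, gamma):
    """Reward prediction error: V(observation) minus V(expectation)."""
    return r + gamma * V[s_next] - V[s]


# Hypothetical deterministic episode: start -> middle -> end, reward 1 on the last step.
# Each entry is (s_t, r_t, s_{t+1}); "end" is terminal and its value stays at 0.
episode = [("start", 0.0, "middle"), ("middle", 1.0, "end")]

V_hat = {"start": 0.0, "middle": 0.0, "end": 0.0}
first_deltas = None

for i in range(2000):
    deltas = []
    for s, r, s_next in episode:
        delta = td_error(V_hat, s, r, s_next, GAMMA)
        V_hat[s] += ALPHA * delta  # move the expectation towards the observation
        deltas.append(delta)
    if i == 0:
        first_deltas = deltas

last_deltas = deltas
print(first_deltas, last_deltas, V_hat)
```

On the first episode the reward is entirely unexpected, so the RPE on the rewarded step is large. Once the estimates have converged, the RPE vanishes on both steps, as expected when the value function is correct.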

#### Extensions of the classical framework – Factored representations

Markov Decision Processes have benefited from various extensions to address very different problems, usually leading to new or revised versions of reinforcement learning algorithms. MDPs have been extended to continuous time and actions [Bai93; Duf95; Doy00], to partially observable environments (POMDPs) [Jaa+95; Hau00], and to allow factored representations [Bou+00; Deg+06; VB08]. Another example is the concept of hierarchical RL, where problems can be defined at multiple levels of detail, such that sequences of actions can be defined as subroutines (options) to be played as a single action at a higher level [Sut+99; Bot+09; Bot12; Diu+13]. Most of these extensions have fed back into the field of neuroscience and informed some animal conditioning investigations [Daw+03; Daw+06c; Bot+09; Doy00; RF+11]. However, to our knowledge, factored representations for reinforcement learning have been left aside.

The original algorithms in the literature that use factored representations rely on MB learning principles, while in the present work we develop an algorithm based on MF learning principles. This part of the manuscript will thus describe the original algorithms mainly for information, in order to explain the principles on which we implemented factored representations for MF learning, without sticking to the original formalism.

The idea of factorization comes from the need to deal with large-scale problems. The standard MDP representation and classical algorithms do not scale well to high-dimensional spaces and end up requiring too much storage space or computation time, a phenomenon named the curse of dimensionality [Bel57]. We illustrate the principle of factorization and the associated algorithms through the common CoffeeRobot example task [Bou+95; Bou+00], in which a robot needs to cross a street to buy a coffee and deliver it back to an employee, earning extra credit if it does not get wet in case of rain.

Real application problems are often described through a set of parameters that describe the current state of the system. Hence, the set of states S can formally be described through a set of random variables $X = \{X_1, \ldots, X_n\}$, where each variable $X_i$ can take several values. A state is therefore an instantiation of X. It is also commonly the case that the random variables are binary, that is, $X_i \in \{0, 1\}$. In such a case, states can be defined by the variables that are active/present in the situation they describe. With factored representations, the information embedded within states is explicitly made available to the algorithms. The CoffeeRobot task is described by a set of 6 binary variables: the robot is wet W, the robot has an umbrella U, the robot is in the office O (outside otherwise), it is raining R, the robot has the coffee RC or the employee has it EC. Hence, the set $\langle RC, R, O \rangle$ describes the state where the robot has a coffee and is in the office while it is raining, but has not yet delivered the coffee to the owner, and is neither wet nor carrying an umbrella. While simple at first sight, this toy problem already has $2^6 = 64$ states.
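The factored description of states can be sketched in a few lines. The example below represents each state as the set of its active variables and enumerates all instantiations of the six binary variables. The variable names come from the text; the helper `describe` and the choice of `frozenset` as the data structure are assumptions made for illustration.

```python
from itertools import product

# The six binary variables of the CoffeeRobot task, as listed in the text.
VARIABLES = ("W", "U", "O", "R", "RC", "EC")

# A state is the set of variables that are active in the situation it describes.
all_states = [
    frozenset(v for v, active in zip(VARIABLES, bits) if active)
    for bits in product((False, True), repeat=len(VARIABLES))
]

# The state <RC, R, O>: robot has the coffee, it is raining, robot is in the office.
example_state = frozenset({"RC", "R", "O"})


def describe(state):
    """Hypothetical helper: expand a state into an explicit variable -> value mapping."""
    return {v: (v in state) for v in VARIABLES}


print(len(all_states))
print(describe(example_state))
```

Listing only the active variables keeps individual states compact, but the number of states still doubles with every variable added, which is the scaling problem that factorization aims to address.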

Actions are still described as in the standard MDP framework. For example, the robot can go to the next location (Go), buy a coffee (buyC), deliver the coffee to the owner (delC) and get an umbrella if in the office (getU).

T and R are also usually redefined to take advantage of the factored representation, describing problems through Factored MDPs [Bou+95; Bou+00; Deg+06; VB08]. Factored MDPs use Dynamic Bayesian Networks [DK89] to define dependencies between variables, combined with compact conditional probability distributions described through trees. The important idea is that some aspects of the task are independent of others. In our example, the fact that it is raining has no impact on the success of delivering the coffee (Figure 2.5).
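The independence argument can be illustrated with a sketch of a single tree-structured conditional probability for the delC action. The parents chosen for EC, the tree shape and the success probability of 0.8 are all made-up values for this example and are not taken from the cited Factored MDP papers.

```python
# Variables of the CoffeeRobot task, as listed in the text.
VARIABLES = ("W", "U", "O", "R", "RC", "EC")

# Assumed parents of EC at the next step under action delC (illustrative only).
PARENTS_OF_EC = ("O", "RC", "EC")


def p_ec_next(state):
    """P(EC = 1 at the next step | state, delC), written as a small decision tree.

    Only the assumed parents are ever inspected; R, W and U are ignored.
    The probabilities are hypothetical.
    """
    if state["EC"]:
        return 1.0  # the employee already has the coffee
    if state["O"] and state["RC"]:
        return 0.8  # made-up chance that delivery succeeds
    return 0.0      # robot is outside or has no coffee to deliver


def make_state(**active):
    """Hypothetical helper: build a full state, all variables False unless given."""
    state = {v: False for v in VARIABLES}
    state.update(active)
    return state


dry = make_state(O=True, RC=True)
rainy = make_state(O=True, RC=True, R=True)
same_probability = p_ec_next(dry) == p_ec_next(rainy)

flat_rows = 2 ** len(VARIABLES)  # rows needed by a flat table over full states
tree_leaves = 3                  # leaves of the tree above

print(same_probability, flat_rows, tree_leaves)
```

Because rain is not among the parents of EC, the delivery probability is identical in the dry and rainy states, and the tree needs far fewer entries than a table indexed by full states.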

**Habitual versus Goal-Directed behaviours**

It is now well accepted that animals’ behaviours rely on two learning processes when involved in instrumental tasks, one Goal-Directed (GD) and the other Habitual [AD81; Ada82; Fan+13; Ash+10; Tho+10; KH12; Bal+07; BO09; Bro+11]. A behaviour is considered Goal-Directed if it clearly (1) links actions to their consequences and (2) is guided by a desirable outcome, such that it quickly adapts to changing situations or to changes in the animal’s motivational state. A behaviour is considered Habitual when it does not meet these conditions, i.e. when it is decoupled from the expected results of actions, whether their consequences or the resulting outcomes.

In simple and stable instrumental tasks, Goal-Directed and Habitual behaviours cannot be distinguished, as they produce indistinguishable outputs. It has been shown that behaviour usually shifts from Goal-Directed to Habitual with overtraining (extensive training on the same task, over multiple days and with multiple trials per day) [AD81].

This property has been deeply investigated through outcome devaluation and contingency degradation, two processes (implemented in many different ways) that help to distinguish between both phenomena.

**Behavioural expression and Neuronal correlates**

While the transition from Goal-Directed towards Habitual behaviours with overtraining is a standard phenomenon in the literature [AD81; Ada82; DB94; Dic+95; Tri+09; Val+07; CR85; BD98b; KC03; BD98a; CR86; DM89], other experiments have shown that overtraining is not decisive in determining which process is currently driving the behaviour [KD10; YK06]. At the behavioural level, it has been shown that stress seems to induce Habitual behaviours [SW09; SW11], as do limited working-memory capacities [Ott+13] and distractions [Foe+06]. The characteristics of the responses (pulling a chain versus pressing a lever) have also been suggested to elicit different behaviours [Fau+05].

At the biological level, a significant number of studies have examined the importance of the Dorsal Striatum (DS) for instrumental conditioning [BO09]. Using lesions, pharmacological interventions or brain recordings, studies have shown correlates of Goal-Directed behaviours in the Dorsomedial Striatum (DMS) [Boo+09; Wit+09; Glä+10; KC03; Wun+12; Yin+05; Ske+14; Fan+13]. In contrast, the expression of Habitual behaviours has been correlated with the Dorsolateral Striatum (DLS) [Wun+12; Yin+04; Yin+06; YK06; Fan+13; Ske+14]. Lesion studies have shown that damage to the prefrontal cortex could induce a switch between systems [OB05; KC03; Smi+13; MC01; BB00; Rag+02]. Furthermore, Goal-Directed and Habitual capacities have been shown to differ across individuals [Ska+13]. Hence, while Habitual and Goal-Directed behaviours have clearly been identified and shown to rely, at least partially, on different brain regions, how one system comes to control behaviour is still unclear. The current literature suggests a potentially complex integration or competition mechanism.

**Pavlovian-instrumental interactions**

Pavlovian and instrumental conditioning are usually studied separately, on the assumption that protocols are sufficient to engage one type of conditioning and not the other. However, while Pavlovian and instrumental conditioning have been shown to rely, at least partially, on different brain mechanisms, their complete separation is not so clear [Yin+08; Mee+12; LO12; Mee+10; Mai09].

Some Pavlovian phenomena have counterparts in the instrumental world, for example recovery phenomena [Nak+00; Tod+12; Bou+12], overexpectation [LN98] or contextual effects [Bou+14; Mar+13]. Some Pavlovian procedures use conditioned stimuli (e.g. a lever) that are used as operant objects in instrumental tasks [Fla+11b]. Conversely, instrumental tasks often embed cues commonly used as conditioned stimuli (e.g. sounds) to inform animals about the different phases of the task. Finally, some phenomena clearly show that the two types of conditioning can interact in a very tight and complex way.

These phenomena have recently been the focus of an increasing number of studies [Lov83; Hal+01; HG03; CB05; CB11; Tal+08; Car+13; Hol+10]. In this section, we briefly present some of the major interaction phenomena (Section 3.4.1), their neural correlates (Section 3.4.2), and the few computational models that have been developed to account for them (Section 3.4.3).

**Table of contents:**

Acknowledgements

List of Figures

List of Tables

French extended abstract / Résumé étendu

Introduction

Observations

Results

Discussion

**1 Introduction**

1.1 Motivations

1.2 Objectives

1.3 Methods

1.4 Organization

**2 Reinforcement learning**

2.1 Introduction

2.2 Markov Decision Processes

2.3 Reinforcement learning

2.4 Extensions of the classical framework – Factored representations

**3 Animal conditioning**

3.1 Introduction

3.2 Instrumental conditioning

3.3 Pavlovian conditioning

3.4 Pavlovian-instrumental interactions

**4 Synthesis and working hypothesis**

**5 Modelling individual differences in autoshaping CRs (article)**

Introduction

Results

Discussion

Methods

Supporting Information

**6 Predictions from a computational model of autoshaping (article)**

Abstract

Introduction

Material and methods

Results

Discussion

**7 Model of individual differences in negative automaintenance (article)**

Abstract

Introduction

Methods

Results

Discussion

**8 Discussion**

8.1 Contributions synthesis

8.2 Limits and perspectives

8.3 Concluding remarks

**Publications**

**Bibliography**