Get Complete Project Material File(s) Now! »

## Training points design

We discuss here the choice of the training measure and the sets (n)n=0;:::;N1 used to compute the numerical approximations in Regression Monte Carlo and Quantization. Two cases are considered in this section. The first one is a knowledge-based selection, relevant when the controller knows with a certain degree of confidence where the process has to be driven in order to optimize her reward functional. The second case, on the other hand, is when the controller has no idea where or how to drive the process to optimize the reward functional.

### Exploitation only strategy

In the knowledge-based setting there is no need for exhaustive and expensive (in time mainly) exploration of the state space, and the controller can directly choose training sets constructed from distributions that assign more points to the parts of the state space where the optimal process is likely to be driven.

In practice, at time tn, assuming we know that the optimal process is likely to stay in the ball centered around the point mn and with radius rn, we chose a training measure n centered around mn as, for example N(mn; r2 n), and build the training set as sample of the latter. In the Regress- Later setting this can be done straightforwardly, while Control Randomization requires one to select

#### Numerical methods

a measure for the random control such that the controlled process Z is driven in such area of the state space. Taking samples according to to build grids makes them random. Another choice, which we used in the Quantization-based algorithm, is to use the (deterministic) optimal grid of N(mn; 2 n) with reduced size (typically take 50 points for a problem in dimension 1, 250 for one of dimension 2 when 2 n = 1,. . . ), which can be found at www.quantize.maths-fi.com, to reduce the size of the training set and alleviate the complexity of the algorithms.

Remark 4.3.5. As the reader will see, we chose the training sets based on the “exploitation only strategy” procedure, i.e. by guessing where to drive optimally the process, when solving the Liquidation Problem introduced in Subsection 4.4.1.

**Explore first, exploit later**

Explore first: If the agent has no idea of where to drive the process to receive large rewards, she can always proceed to an exploration step to discover favorable subsets of the state space. To do so, the n, for n = 0; : : : ;N 1, can be built as uniform grids that cover a large part of the state space, or can be chosen uniform on such domain. It is essential to explore far enough to have a well understanding of where to drive and where not to drive the process.

Exploit later: The estimates for the optimal controls at time tn; n = 0; : : : ;N 1, that come up from the Explore first step, are relatively good in the way that they manage to avoid the wrong areas of state space when driving the process. However, the training sets that have been used to compute the estimated optimal control are too sparse to ensure accuracy on the estimation. In order to improve the accuracy, the natural idea is to build new training sets by simulating M times the process using the estimates on the optimal strategy computed from the Explore first step, and then proceed to another estimation of the optimal strategies using the new training sets. This trick can be seen as a two steps algorithm that improves the estimate of the optimal control.

Remark 4.3.6. In Control Randomization, multiple runs of the method are often needed to obtain precise estimates, because the initial choice of the dummy control could drive the training points far from where the optimal control would have driven them. In practice, after having computed an approximate policy backward in time, such policy is used to drive M simulations of the process forward in time, which in turn produce control paths that can be fed as a random controls in a new backward procedure, leading to more accurate results. Remark 4.3.7. We applied the “explore first, exploit later” idea to solve the Portfolio Optimization problem introduced in Subsection 4.4.1.

**Optimal control searching**

Assume in this section that we already have the estimates bVt(tk; ) for the value function at time tk, for k = n + 1; : : : ;N, and want to estimate V (tn; ) the value function at time tn.

The optimal control searching task consists in optimizing the function6 bQ n : (z; ) 7! f(z; a)t + ^Ea n;z bVt(tn+1;Ztn+1) over the control space A, for each z 2 n, and where we denote by ^Ea n;z bVt(tn+1;Ztn+1) an approximation of Ea n;z bVt(tn+1;Ztn+1) using Regress-Later, or Control Randomization or Quantizationbased methods (see Subsection 4.3.2). Once again, we remind that importance of this task is motivated by the dynamic programming principle stating that for all n = 0; : : : ;N 1, we can approximate the value function at time n as follows bVt(tn; z) = sup a2A bQ n(z; a).

**High cardinality/continuous control space**

If we assume differentiability almost everywhere, as follows from the semi-linear approximation in Quantization, and most choices of basis functions in Regression Monte Carlo, we can carry on the optimization step by using some gradient-based algorithm for optimization of differentiable functions. Actually, many optimizing algorithms (Brent, Golden-section Search, Newton gradient-descent,. . . ) are already implemented in standard libraries of most programming languages like Python (see, e.g., package scipy.optimize), Julia (see, e.g., package Optim.jl), C and C++ (see, e.g., package NLopt). Remark 4.3.9. When the control is of dimension 1, polynomials of order smaller than 5 are employed as basis functions in Regression Monte Carlo as well as for the running reward f. The optimal control can then be computed analytically as a function of the regression coefficients, since every polynomial equation of order smaller than 4 can be solved by radicals. Concretely, in all the examples considered in Section 4.4, we used the Golden-section Search or the Brent methods when testing Quantization-based algorithm to find the optimal controls at each point of the grids. These algorithms were very accurate to find the optimal controls, and we made use of Remark 4.3.9 to find the optimal controls using the Regress-Later-based algorithm.

**Table of contents :**

Abstract

**1 Introduction **

1.1 Quantization and regression methods for MDPs. Applications to market-making and McKean-Vlasov-type control problem with partial information

1.1.1 Quantization and regression methods for Markov decision processes (MDPs).

1.1.2 Application of the MDPs to market-making (Chapter 3)

1.1.3 Application of the MDPs to McKean-Vlasov type control problems (Chapter 4)

1.2 Neural networks for MDPs

1.2.1 Representation of functions by neural networks & training

1.2.2 Neural networks for MDPs

1.2.3 Contributions (Chapters 5 and 6)

1.2.4 Perspectives

1.3 Neural networks for PDEs and BSDEs

1.3.1 General framework

1.3.2 Numerical methods available in the literature

1.3.3 Contributions (Chapter 7)

1.3.4 Perspectives

**2 Introduction (in French) **

2.1 Méthodes de Quantification et de régression pour les MDPs. Application au marketmaking et aux problèmes de contrôle de type McKean-Vlasov avec information partielle 30

2.1.1 Méthodes de Quantification et de régression pour les problèmes de décisions Markoviennes (MDP)

2.1.2 Application des MDPs au market-making (Chapitre 3)

2.1.3 Application des MDPs aux problèmes de contrôle de type McKean-Vlasov avec information partielle (Chapitre 4)

2.2 Réseaux de neurones pour les MDPs

2.2.1 Représentation de fonctions par réseaux de neurones & apprentissage

2.2.2 Revue de littérature sur les réseaux de neurones pour les MDPs

2.2.3 Contributions (Chapitres 5 et 6)

2.2.4 Perspectives

2.3 Réseaux de neurones pour les EDPs et les EDSRs

2.3.1 Cadre général

2.3.2 Méthodes numériques existantes dans la littérature

2.3.3 Contributions (Chapitre 7)

2.3.4 Perspectives

**I Quantization and regression-type methods for MDPs. Application to market making and Mckean-Vlasov control problems. **

**3 Algorithmic trading in a microstructural limit order book model **

3.1 Introduction

3.2 Model setup

3.2.1 Order book representation

3.2.2 Market maker strategies

3.3 Existence and characterization of the optimal strategy

3.3.1 Definition and well-posedness of the value function

3.3.2 Markov Decision Process formulation of the market-making problem

3.3.3 Proof of Theorem 3.3.1

3.4 Numerical Algorithm

3.4.1 Framework

3.4.2 Presentation and rate of convergence of the Qknn algorithm

3.4.3 Qknn agorithm applied to the order book control problem (3.3.1)

3.4.4 Numerical results

3.5 Model extension to Hawkes Processes

3.A From uncontrolled to controlled intensity

3.B Dynamics of the controlled order book (simplified version)

3.B.1 Dynamics of Xt et Yt

3.B.2 Dynamics of the at et bt

3.B.3 Dynamics of nat and nbt

3.B.4 Dynamics of pa and pb

3.B.5 Dynamics of ra and rb

3.C Proof of Theorem 3.4.1 and Corollary 3.4.1

3.D Zador’s Theorem

**4 A class of finite-dimensional numerically solvable McKean-Vlasov control problems**

4.1 Introduction

4.2 Polynomial McKean-Vlasov control problem

4.2.1 Main assumptions

4.2.2 Markovian embedding

4.3 Numerical methods

4.3.1 Value and Performance iteration

4.3.2 Approximation of conditional expectations

4.3.3 Training points design

4.3.4 Optimal control searching

4.3.5 Upper and lower bounds

4.3.6 Pseudo-codes

4.4 Applications and numerical results

4.4.1 Portfolio Optimization under drift uncertainty

4.4.2 A model of interbank systemic risk with partial observation

4.5 Conclusion

**II Deep learning for Markov decision processes (MDPs) **

**5 Deep neural networks algorithms for stochastic control problems on finite horizon: convergence analysis **

5.1 Introduction

5.2 Preliminaries on DNN and SGD

5.2.1 Neural network approximations

5.2.2 Stochastic optimization in DNN

5.3 Description of the algorithms

5.3.1 Control learning by performance iteration

5.3.2 Control learning by hybrid iteration

5.3.3 Training sets design

5.3.4 Comparison of the algorithms

5.4 Convergence analysis

5.4.1 Control learning by performance iteration (NNcontPI)

5.4.2 Hybrid-Now algorithm

5.4.3 Hybrid-LaterQ algorithm

5.5 Conclusion

5.A Localization

5.B Forward evaluation of the optimal controls in AM

5.C Proof of Lemma 5.4.1

5.D Proof of Lemma 5.4.2

5.E Function approximation by neural networks

5.F Proof of Lemma 5.4.3

5.G Proof of Lemma 5.4.4

5.H Some useful Lemmas for the proof of Theorem 5.4.2

**6 Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications **

6.1 Introduction

6.2 Algorithms

6.2.1 Control Learning by Performance Iteration

6.2.2 Control and value function learning by double DNN

6.2.3 Quantization with k-nearest-neighbors: Qknn-algorithm

6.3 Numerical applications

6.3.1 A semilinear PDE

6.3.2 Option hedging

6.3.3 Valuation of energy storage

6.3.4 Microgrid management

6.4 Discussion and conclusion

**III Deep learning for BSDEs and PDEs **

**7 Some machine learning schemes for high-dimensional nonlinear PDEs **

7.1 Introduction

7.2 Neural networks as function approximators

7.3 Deep learning-based schemes for semi-linear PDEs

7.3.1 The deep BSDE scheme of [HJE17]

7.3.2 New schemes: DBDP1 and DBDP2

7.3.3 Extension to variational inequalities: scheme RDBDP

7.4 Convergence analysis

7.4.1 Convergence of DBDP1

7.4.2 Convergence of DBDP2

7.4.3 Convergence of RDBDP

7.5 Numerical results

7.5.1 PDEs with bounded solution and simple structure

7.5.2 PDEs with unbounded solution and more complex structure

7.5.3 Application to American options

**Bibliography **