
## Training and Regularizing Deep Neural Networks

In this section, we first introduce the general learning framework of Deep Learning (DL) before focusing on Convolutional Neural Network (ConvNet) architectures. We then go over common techniques to improve the generalization abilities of those models.

Figure 2.1: General overview of Machine Learning (ML) training. Using examples from a dataset of pairs (input, target), the predictive model learns to make the correct predictions by minimizing a training loss measuring the error made by the model.

### Deep Learning framework

Machine Learning (ML). ML is a broad domain proposing models that learn to solve a task from examples used to improve themselves. In this thesis, we work mostly on models trained to predict a semantic label. For example, given pictures of cats and dogs, we can learn a model that distinguishes them. Let us go over a typical process used to train an ML model, represented by Figure 2.1.

ML proposes to train a model $f$, of parameters $w \in \mathcal{W}$, taking an input $x \in \mathcal{X}$ to produce a prediction $\hat{y}$. Knowing the ground-truth label $y \in \mathcal{Y}$ associated with $x$, we can quantify the prediction error of the model by defining a loss function $\mathcal{L}_{task}(\hat{y}, y)$. Our goal is therefore to find the optimal parameters $w^*$ that minimize the expectation of the loss:

$$w^* = \arg\min_{w} \; \mathbb{E}_{(x,y)} \, \mathcal{L}_{task}(\hat{y}, y) = \arg\min_{w} \; \mathbb{E}_{(x,y)} \Big[ \mathcal{L}_{task}\big(f_w(x), y\big) \Big]. \tag{2.1}$$

To solve this optimization problem, we use a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}),\ i = 1 \dots N_{train}\}$ from which we can sample pairs $(x, y)$ used to estimate the expectation with Monte Carlo sampling. We then use an optimization algorithm to minimize this empirical loss over the dataset:

$$w^* = \arg\min_{w} \sum_{i=1}^{N_{train}} \mathcal{L}_{task}\big(f_w(x^{(i)}), y^{(i)}\big). \tag{2.2}$$

Deep Learning (DL). DL is a subset of ML models using Deep Neural Networks (DNNs), initially inspired by a simple model of the neuron proposed by McCulloch and Pitts (1943). Most of the time, we use feed-forward Neural Networks (NNs), where the model $f$ is a succession (more precisely a directed acyclic graph) of mathematical transformations called layers, transforming $x$ into a succession of representations $h_\ell$ for each layer $\ell$. The most common layers are dense layers, which consist of a linear transformation of the input, $h_\ell = w_\ell h_{\ell-1} + b_\ell$; and non-linear activation layers, which can be any function making the model non-linear. Nowadays we mostly use the Rectified Linear Unit (ReLU) activation, $\max(0, h)$, but the hyperbolic tangent ($\tanh$) and the sigmoid, $e^h / (e^h + 1)$, remain popular options, and many more exist (Nwankpa et al. 2018). Thanks to their depth, i.e. their number of layers, DNNs are able to transform raw input data into more and more complex representations, and thus perform representation learning (Bengio et al. 2013), where the model learns by itself which features are the most useful to model the input data for the task at hand.
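To make the notation concrete, here is a minimal sketch of a feed-forward model built from dense and ReLU layers, together with the empirical loss of Equation 2.2. It is only an illustration: the squared-error task loss, the tiny hand-picked weights and the helper names (`dense`, `forward`, `empirical_loss`) are assumptions made for this example, not choices taken from the text.

```python
import math

def relu(h):
    # ReLU activation: max(0, h), applied element-wise
    return [max(0.0, v) for v in h]

def sigmoid(h):
    # Sigmoid activation: e^h / (e^h + 1), written as 1 / (1 + e^-h)
    return [1.0 / (1.0 + math.exp(-v)) for v in h]

def dense(w, b, h_prev):
    # Dense layer: h_l = w_l h_{l-1} + b_l, with w given as a list of rows
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, h_prev)) + b_i
            for row, b_i in zip(w, b)]

def forward(params, x):
    # f_w(x): alternate dense layers and ReLU, no activation after the last layer
    h = x
    for index, (w, b) in enumerate(params):
        h = dense(w, b, h)
        if index < len(params) - 1:
            h = relu(h)
    return h

def squared_loss(y_hat, y):
    # Illustrative task loss (an assumption of this sketch)
    return sum((a - t) ** 2 for a, t in zip(y_hat, y))

def empirical_loss(params, dataset):
    # Sum of the task loss over all training pairs, as in Equation 2.2
    return sum(squared_loss(forward(params, x), y) for x, y in dataset)

# Hypothetical two-layer model and two training pairs
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer: 2 inputs -> 2 units
    ([[1.0, 1.0]], [0.0]),                    # output layer: 2 units -> 1 output
]
dataset = [([1.0, 2.0], [1.5]), ([2.0, 1.0], [2.0])]

print(empirical_loss(params, dataset))
```

Training then amounts to searching for the `params` that make `empirical_loss` as small as possible.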

Neural Networks (NNs) are trained using gradient back-propagation (Rumelhart et al. 1988). Using the chain rule, this makes it possible to progressively compute the gradient $\nabla_w \mathcal{L}$ of the loss $\mathcal{L}$ with respect to all the weights $w$. Using a gradient descent algorithm, we can then update the weights in a direction that decreases the value of the loss, so that, over the course of training, we progressively reach a minimum of the objective function:

$$w \leftarrow w - \eta \, \nabla_w \mathcal{L}, \tag{2.3}$$

where $\eta$ is the learning rate.
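As a small illustration of this update rule, the sketch below runs gradient descent on a one-parameter model $f_w(x) = w x$ with a squared-error loss. The model, the data, the step size `lr` and the number of steps are all assumptions chosen for the example, and the gradient is written by hand rather than obtained by back-propagation.

```python
def loss(w, data):
    # Squared-error loss of the model f_w(x) = w * x, summed over the data
    return sum((w * x - y) ** 2 for x, y in data)

def grad(w, data):
    # Analytic gradient of the loss with respect to w
    return sum(2.0 * (w * x - y) * x for x, y in data)

def gradient_descent(w, data, lr=0.01, steps=200):
    # Repeatedly apply the update w <- w - lr * grad(w)
    for _ in range(steps):
        w = w - lr * grad(w, data)
    return w

# Hypothetical data generated by y = 2 x, so the minimum is at w = 2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_final = gradient_descent(0.0, data)
print(w_final, loss(w_final, data))
```

In a real DNN the loss is non-convex and `grad` would be computed by back-propagation through all layers, but the update step has the same form.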

Numerous gradient descent algorithms exist, the simplest one being Stochastic Gradient Descent (SGD) (Bottou 2010), with variants designed to speed up convergence and to find a better minimum, since DNN training losses are non-convex and have many local minima. Well-known methods include SGD with momentum (Rumelhart et al. 1988), RMSProp (Hinton et al. 2012), AdaGrad (Duchi et al. 2011) and Adam (Kingma and Ba 2015).
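The momentum variant can be sketched in a few lines. The version below uses the classical "heavy-ball" formulation, where a velocity `v` accumulates past gradients; the hyperparameters `lr` and `mu`, the toy data and the function names are assumptions for illustration. To keep the example deterministic it uses the full-batch gradient, whereas SGD would pass a gradient estimated on a random mini-batch as `grad_fn`.

```python
def momentum_descent(grad_fn, w, lr=0.01, mu=0.9, steps=500):
    # Heavy-ball momentum: the velocity v is a decaying sum of past gradients
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad_fn(w)
        w = w + v
    return w

# Hypothetical data generated by y = 2 x for the model f_w(x) = w * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def full_batch_grad(w):
    # Gradient of the summed squared error; a mini-batch estimate could be used instead
    return sum(2.0 * (w * x - y) * x for x, y in data)

w_final = momentum_descent(full_batch_grad, 0.0)
print(w_final)
```

Adaptive methods such as RMSProp or Adam follow the same loop structure but rescale the step per parameter using running statistics of the gradient.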

#### Convolutional architectures

As we have seen, Deep Learning (DL) became particularly popular for Computer Vision (CV) in 2012 when AlexNet (Krizhevsky et al. 2012) won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). This model is a Convolutional Neural Network (ConvNet), a type of NN specifically designed for CV tasks. A typical ConvNet, such as VGG-16 (Simonyan and Zisserman 2015) represented in Figure 2.2, is composed of convolutional layers used in place of most or even all of the dense layers of a traditional DNN. Indeed, applying a 2D convolution to an image makes it possible to process only small, local patches of information, in the same way regardless of their position in the image. Thus, at the beginning of the network, convolutions only look for small patterns in the input image. Deeper in the network, pooling layers (or strided convolutions) progressively aggregate the spatial information, bringing closer together the patterns found by previous layers. The next convolutions therefore have a larger receptive field (Luo et al. 2016) and can assemble the small patterns into bigger and more semantic ones. This behavior can be observed by investigating how trained ConvNets represent information, as studied by Olah et al. (2017) and previously illustrated in Figure 1.2. We can see that the model first detects contours, which are assembled into textures, shapes and objects to finally produce the semantic prediction. Interestingly, the visual cortex is said to work in a similar fashion to this succession of convolutions and pooling (Hubel and Wiesel 1962).
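The following sketch illustrates the two operations on a toy example: a 2D convolution that looks for a small local pattern (here a vertical edge) and a max-pooling step that aggregates the resulting feature map. The image, the hand-written kernel and the function names are assumptions made for the illustration; in a ConvNet the kernels are learned, and, as in most DL libraries, the "convolution" is implemented as a cross-correlation without padding.

```python
def conv2d(image, kernel):
    # Slide the kernel over every local patch of the image (no padding, stride 1)
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(fmap, size=2):
    # Keep the maximum of each non-overlapping size x size window
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# Hypothetical 5x5 image: dark on the left (0), bright on the right (1)
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel responding to a dark-to-bright vertical edge
kernel = [[-1, 1],
          [-1, 1]]

features = conv2d(image, kernel)   # 4x4 map, non-zero only along the edge
pooled = max_pool2d(features)      # 2x2 map summarizing where the edge was found
print(features)
print(pooled)
```

Each value of `pooled` depends on a 3x3 region of the input, although the kernel only sees 2x2 patches: this is the growth of the receptive field described above.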

The first ConvNets trained by back-propagation and designed for CV date back decades, for example LeNet-5 (LeCun et al. 1998), which classifies handwritten digits. However, as we mentioned, it is only recently that they became the state-of-the-art approach for CV. The first notable network is AlexNet (Krizhevsky et al. 2012), designed to classify natural images from ImageNet (Russakovsky et al. 2015), which won ILSVRC. In the following years, numerous new ConvNet architectures were proposed and won this competition. Popular architectures include VGG-16.

### Regularizing DNNs with priors

We have seen that DNNs are trained to find the optimal parameters in order to best predict the labels of the samples in the training dataset $\mathcal{D}$. However, a model that perfectly predicts those labels does not necessarily produce the best results on unseen data. For example, if $f_w$ can model a very complex function compared to the number of samples in $\mathcal{D}$, the model can learn by heart the labels $y^{(i)}$ associated with $x^{(i)}$ without being able to generalize to new samples (Vapnik and Chervonenkis 1972), which is called overfitting.

Because of their complexity, DL models are highly subject to this risk of overfitting the training set while lacking generalization capabilities on the test set. Indeed, they are known to be universal approximators (Lu et al. 2017) and can produce overly complex decision boundaries. Techniques to control the training of these models have therefore been developed since the beginning of DNN research.

To overcome this issue, we use regularization, which can take multiple forms. A common one is to add a new loss term $\Omega_{regul}(w, x, y)$ describing preferred solutions, for example simpler or smoother decision functions (Vapnik 1992). Instead of using Equation 2.2, we thus have:

$$\min_{w} \; \underbrace{\mathcal{L}(\mathcal{D}; w)}_{\text{complete}} = \mathbb{E}_{(x,y) \in \mathcal{D}} \Big[ \mathcal{L}_{task}\big(f_w(x), y\big) + \Omega_{regul}(w, x, y) \Big]. \tag{2.4}$$
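As one concrete instance of such a regularized objective, the sketch below adds an L2 penalty on the weights (often called weight decay) to a squared-error task loss. This particular penalty, the linear model, the coefficient `lam` and the function names are assumptions chosen for illustration; the regularization term of Equation 2.4 is more general, since it may also depend on $x$ and $y$.

```python
def task_loss(w, x, y):
    # Squared error of a linear model f_w(x) = w . x (illustrative choice)
    y_hat = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return (y_hat - y) ** 2

def l2_penalty(w, lam):
    # Regularization term preferring small weights, i.e. simpler functions
    return lam * sum(w_i * w_i for w_i in w)

def complete_loss(w, dataset, lam):
    # Average over the dataset of task loss + regularization term
    total = sum(task_loss(w, x, y) + l2_penalty(w, lam) for x, y in dataset)
    return total / len(dataset)

# Hypothetical weights and data
w = [1.0, 2.0]
dataset = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]

print(complete_loss(w, dataset, lam=0.0))  # task loss only
print(complete_loss(w, dataset, lam=0.1))  # task loss plus penalty
```

Increasing `lam` shifts the minimum of the complete loss toward smaller weights, trading some training accuracy for a simpler solution.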

#### Architectural changes

By changing the structure of the model $f$, it is possible to introduce many priors and make the model behave in more desirable ways. This can be done through the choice of the architecture’s building blocks, the design of explicitly invariant models, and the addition of particular layers like dropout and Batch Normalization (BN).

Architecture design. The choice of basic layers used in a model is a first way to introduce prior knowledge in the model. The type of layers, their number, organization, sizes, etc., are all factors that are chosen based on prior knowledge about the complexity of the learning problem and how to solve it. In particular, convolutional layers can be seen as an “infinitely strong prior” (Goodfellow et al. 2016, chapter 9) because they force very sparse connections between the neurons of the input and output representations of the layer. Convolution and max-pooling also add, respectively, equivariance and local invariance properties with respect to translation.
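These two properties can be checked on a one-dimensional toy example. In the sketch below, shifting the input of a convolution shifts its output by the same amount (equivariance), while the output of max-pooling is unchanged as long as the shift keeps the activation inside the same pooling window (local invariance). The signals, the kernel and the helper names are assumptions for illustration, and the equality holds here because the signal is zero near the borders.

```python
def conv1d(signal, kernel):
    # 1D convolution (cross-correlation), no padding, stride 1
    k = len(kernel)
    return [sum(signal[i + a] * kernel[a] for a in range(k))
            for i in range(len(signal) - k + 1)]

def shift(signal, s):
    # Translate the signal s positions to the right, padding with zeros
    return [0] * s + signal[:len(signal) - s]

def max_pool1d(signal, size):
    # Maximum over non-overlapping windows of the given size
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

signal = [0, 0, 1, 3, 1, 0, 0, 0, 0]
kernel = [1, 2, 1]

# Equivariance: convolving a shifted input gives the shifted output
assert conv1d(shift(signal, 1), kernel) == shift(conv1d(signal, kernel), 1)

# Local invariance: a small shift inside the pooling window changes nothing...
fmap = [9, 0, 0, 0, 0, 0]
assert max_pool1d(shift(fmap, 1), 3) == max_pool1d(fmap, 3)
# ...but a shift crossing into the next window does change the pooled output
assert max_pool1d(shift(fmap, 3), 3) != max_pool1d(fmap, 3)
```

The last assertion is a reminder that pooling only gives invariance to small translations, which is why the text speaks of local invariance.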

If we know the factors to which representations should be invariant, this knowledge can also be explicitly embedded in the architecture. For example, Mallat (2012) and Bruna and Mallat (2013) use properties of the wavelet scattering transform to obtain invariance to various types of transformations; Dieleman et al. (2015) propose a ConvNet that is invariant to rotations, using an approach similar to Data Augmentation (DA) done in parallel; Cohen and Welling (2016) define convolutional and pooling layers that are equivariant to mathematical groups of geometric transformations; Mehr et al. (2018) propose an Auto-Encoder (AE) that is explicitly invariant to the pose of a 3D object, etc. Of course, those approaches can be very powerful when the factors to which invariance is important are well known, but this is rarely the case. For example, in the context of natural image recognition, a lot of variability exists in the shape, texture, position and scale of the objects; such variations are complicated to model explicitly.

Masking connections. In order to better deal with the large number of connections, i.e. weights, that a DNN has, Srivastava et al. (2014) propose to randomly drop some units, and thus their connections, with a different random sample for each batch during training. This method is called dropout and was shown to be effective on various DNNs and data (image, text, speech). One interpretation is that dropout prevents the co-adaptation of the neurons, encouraging their independence and producing more robust representations. Another interpretation is that this random effect makes the model behave as an ensemble of many models that are averaged when the model is used for prediction. Variations such as DropConnect (Wan et al. 2013) or DropBlock (Ghiasi et al. 2018) were also proposed to refine the idea. Similar ideas propose to add sparsity to the connections of existing architectures, such as L. Zhu et al. (2018), who propose to remove residual connections in ResNets.
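A minimal sketch of dropout applied to a vector of activations is given below. It uses the common "inverted" variant, which rescales the kept activations during training so that nothing needs to change at prediction time; this differs slightly from the original formulation, which instead rescales the weights at test time. The function name, the drop probability `p_drop` and the explicit random generator `rng` are assumptions of this example.

```python
import random

def dropout(h, p_drop, rng, training=True):
    # At prediction time (or with p_drop = 0) the activations pass through unchanged
    if not training or p_drop == 0.0:
        return list(h)
    keep = 1.0 - p_drop
    # Each unit is kept with probability `keep` and rescaled by 1 / keep,
    # so that the expected value of every activation is preserved
    return [v / keep if rng.random() < keep else 0.0 for v in h]

rng = random.Random(0)           # seeded generator, for reproducibility
h = [1.0, 2.0, 3.0, 4.0]

masked = dropout(h, 0.5, rng)    # a new random mask is drawn at every call
unchanged = dropout(h, 0.5, rng, training=False)
print(masked, unchanged)
```

Because a fresh mask is sampled at each call, every batch effectively trains a different sub-network, which is the basis of the ensemble interpretation mentioned above.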
