
## Deep Learning

By convention, and only within this thesis, Deep Learning refers to deep Neural Network Models (NNMs) or Convolutional Neural Network (CNN) models, although Deep Learning also covers other deep architectures that differ from neural network models.

C.1 Neural networks. Let $1 \le l \le L$ denote the index of a layer in a deep neural network and let $n^{(l)}$ denote the number of features of the hidden representation $z^{(l)}$, that is, $z^{(l)} \in \mathbb{R}^{n^{(l)}}$. In layer $(l)$, $z^{(l)}$ is computed as follows:

C.2 Convolutional neural networks. Since a vanilla neural network model (equation II.27) is composed only of Fully Connected layers, it has several shortcomings. Firstly, the units of a signal vector or the pixels of an image are treated as independent, so the model does not take into account the topological structure (here, the locality) among these units or pixels. Secondly, in a deep vanilla NNM, Fully Connected layers quickly increase the number of model parameters, which can lead to over-fitting. Finally, a Fully Connected layer (seen as a filter) is not robust to some transformations, such as translation. To overcome these limitations, [Fukushima 1980] proposed the Convolutional Neural Network (CNN), which was later improved and refined by [Lecun et al. 1998]. For a simple and general view, we present CNNs in the case of images.

Let us consider a square hidden representation $z^{(l)} \in \mathbb{R}^{K^{(l)} \times H^{(l)} \times H^{(l)}}$, where $K^{(l)}$ is the number of channels and $H^{(l)}$ is both the height and the width in layer $(l)$. A convolutional layer with parameters $(W^{(l)}, b^{(l)})$ is thus a transition between layers $(l)$ and $(l+1)$. $W^{(l)}$ consists of $K^{(l+1)}$ square convolutional filters, each of size $K^{(l)} \times I^{(l)} \times I^{(l)}$, where $I^{(l)} \times I^{(l)}$ is called the kernel size. In short, $W^{(l)} \in \mathbb{R}^{K^{(l+1)} \times K^{(l)} \times I^{(l)} \times I^{(l)}}$. The bias is represented by $b^{(l)} \in \mathbb{R}^{K^{(l+1)}}$. After performing $K^{(l+1)}$ convolutions, the output $z^{(l+1)}$ has $K^{(l+1)}$ channels, or in short $z^{(l+1)} \in \mathbb{R}^{K^{(l+1)} \times H^{(l+1)} \times H^{(l+1)}}$.

where $k = 1, \dots, K^{(l+1)}$. Figure II.4 illustrates the convolutional operator in CNNs. In the first CNN models, the architecture consists of alternating convolutional and Max-Pooling layers at the beginning, then a Flatten layer followed by Fully Connected layers at the end (figure II.5).
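The shapes involved in a convolutional layer can be checked with a minimal pure-Python sketch. It assumes the usual cross-correlation form with stride 1, no padding and no activation function, which may differ in detail from the convention used in this thesis; the helper names `conv_layer` and `shape` are illustrative only.

```python
def conv_layer(z, W, b):
    """Naive convolutional layer (assumed: stride 1, no padding, no activation).

    z: nested lists of shape K_in x H x H
    W: nested lists of shape K_out x K_in x I x I
    b: list of length K_out
    Returns nested lists of shape K_out x H_out x H_out, with H_out = H - I + 1.
    """
    K_in, H = len(z), len(z[0])
    K_out, I = len(W), len(W[0][0])
    H_out = H - I + 1
    z_next = []
    for k in range(K_out):                 # one output channel per filter
        channel = []
        for i in range(H_out):
            row = []
            for j in range(H_out):
                s = b[k]
                for c in range(K_in):      # sum over input channels
                    for u in range(I):     # and over the I x I window
                        for v in range(I):
                            s += W[k][c][u][v] * z[c][i + u][j + v]
                row.append(s)
            channel.append(row)
        z_next.append(channel)
    return z_next


def shape(t):
    """Shape of a regular nested list, as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


# Toy sizes (arbitrary): K_in = 3 channels, H = 8, K_out = 4 filters, I = 3.
z = [[[1.0] * 8 for _ in range(8)] for _ in range(3)]
W = [[[[0.1] * 3 for _ in range(3)] for _ in range(3)] for _ in range(4)]
b = [0.0] * 4
print(shape(conv_layer(z, W, b)))
```

Under these assumptions the spatial size shrinks from $H$ to $H - I + 1$; padding or a different stride would change that relation.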

Figure II.5: An example inspired by the LeNet architecture. Source: developed from [Stutz 2016].

In practice, conventional images are often hundreds by hundreds of pixels (even more in some specific fields) and the number of categories (classes) can reach the thousands, instead of $32 \times 32$ pixels with 10 categories as in figure II.5. Therefore, we need to increase the capacity of CNN models, usually by going deeper or by increasing the number of channels, which yields architectures of higher complexity. One could also think of enlarging kernel sizes. However, this may very quickly lead to an immense number of model parameters and may not be efficient for conventional images. For NNMs and CNNs of high complexity, two classical problems need to be considered:

– Vanishing gradient. This problem concerns deep architectures trained with gradient-based optimizers. In this case, backward gradients in upstream layers are very small, which implies that the parameters in these layers remain almost fixed during training. For example, the derivatives of activation functions such as tanh or sigmoid are bounded by 1 in magnitude (they lie in $(0, 1]$ for tanh and in $(0, 1/4]$ for sigmoid), so that when backward gradients are computed using the chain rule, they get fainter as one moves towards upstream layers.

– Over-fitting. This problem concerns neural network models that have a large number of parameters compared to the number of training samples. Although the training set and the test set are sampled from the same data distribution, such a model may display a low training loss and a high test loss.
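To illustrate the vanishing gradient item above, the following sketch multiplies the activation derivatives met along a chain of layers, as the chain rule does. It is a simplification: the weight matrices are ignored, and the depth (20) and the pre-activation value (1.0) are arbitrary assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # at most 0.25, reached at x = 0


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2    # at most 1, reached at x = 0


def chain_factor(grad_fn, pre_activations):
    """Product of activation derivatives along a chain of layers."""
    factor = 1.0
    for x in pre_activations:
        factor *= grad_fn(x)
    return factor


# Hypothetical pre-activation value at each of 20 layers.
pre_acts = [1.0] * 20
print("sigmoid:", chain_factor(sigmoid_grad, pre_acts))
print("tanh:   ", chain_factor(tanh_grad, pre_acts))
```

Both products shrink geometrically with depth, which is why the parameters of upstream layers barely move during training.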

Solutions to the vanishing gradient problem through various CNN architectures are detailed in section II.C.3 (Standard CNNs). The over-fitting problem is tackled through various regularization strategies:

– Through the data, e.g. input data augmentation (rotation, translation, flip, color transformations, ...). A review of the data augmentation field can be found in [Shorten & Khoshgoftaar 2019]. Regularization can also be applied to the outputs, using techniques such as Label Smoothing [Szegedy et al. 2015], which slightly modifies the target, e.g. from [1,0,0] to [0.8,0.1,0.1]. Instead of manually modifying the target label, another way to perform Label Smoothing is to slightly maximize the entropy of the predicted output [Pereyra et al. 2017]. Note that Label Smoothing applies only to labelled samples.

– Through the model parameters, e.g. weight decay [Krogh & Hertz 1992] or other strategies, which are presented in section IV.A.2.

– Through the optimization scheme, e.g. Early Stopping [Morgan & Bourlard 1990, Prechelt 1997].
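As an illustration of the Label Smoothing example in the list above, the sketch below mixes a one-hot target with the uniform distribution over the classes. The function names and the smoothing parameter `eps` are assumptions made for this example; with three classes, `eps = 0.3` reproduces the target [0.8, 0.1, 0.1] up to floating-point rounding.

```python
import math


def smooth_labels(one_hot, eps):
    """Mix a one-hot target with the uniform distribution over the classes."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


def cross_entropy(target, probs):
    """Cross-entropy between a (possibly smoothed) target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


hard = [1, 0, 0]
soft = smooth_labels(hard, eps=0.3)
pred = [0.9, 0.05, 0.05]              # hypothetical predicted probabilities

print(soft)
print(cross_entropy(hard, pred), cross_entropy(soft, pred))
```

With the smoothed target, a very confident prediction is penalized more than with the hard target, which discourages over-confident outputs.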


C.3 Standard CNNs. In this subsection, we present some popular CNN architectures, which serve as the backbone of many applications. Figure II.6 gives a simple illustration of these CNN architectures.

AlexNet. This architecture, proposed by [Krizhevsky et al. 2012], revived CNNs after the work of [Lecun et al. 1998] (LeNet). It improves on LeNet in several respects. First, AlexNet is deeper and has many more parameters than LeNet (60M vs 60K); training it was made possible by GPUs. Second, AlexNet uses the non-saturating ReLU activation function [Nair & Hinton 2010], $\max(0, x)$, which proved more efficient than the tanh or sigmoid functions. This is because the derivative of ReLU is exactly 1 for $x > 0$, whereas the derivatives of tanh and sigmoid are at most 1 and shrink towards 0 as the unit saturates. Thus ReLU copes better with the vanishing gradient problem and even provides faster learning. In addition, ReLU outputs exactly 0 when $x \le 0$, which yields sparse representations, and sparse representations seem to be more efficient for regularization than dense ones.
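The two properties of ReLU mentioned above (a derivative of exactly 1 on active units and exact zeros elsewhere) can be seen in a few lines. The derivative at $x = 0$ is set to 0 here, which is a common convention and an assumption of this sketch; the input values are arbitrary.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Convention assumed here: the derivative at x = 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0


pre_acts = [-2.0, -0.5, 0.0, 0.5, 2.0]      # arbitrary pre-activation values
acts = [relu(x) for x in pre_acts]

# Fraction of units whose output is exactly zero (sparsity of the representation).
sparsity = sum(a == 0.0 for a in acts) / len(acts)

# Along a chain of active units, the product of derivatives stays at 1.
factor = 1.0
for _ in range(20):
    factor *= relu_grad(1.0)

print(acts, sparsity, factor)
```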

VGG. This approach [Simonyan & Zisserman 2014] aims at improving on AlexNet. First, it is deeper and larger than AlexNet (about twice as many parametric layers and twice as many parameters). Second, the large kernel sizes ($11 \times 11$ or $5 \times 5$) in some of the first convolutional layers of AlexNet are replaced by multiple $3 \times 3$ ones, which helps to better learn the texture of the data.
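A commonly cited argument for this replacement, not spelled out above, is that a stack of stride-1 $3 \times 3$ convolutions covers the same receptive field as one larger kernel while using fewer weights. The sketch below checks this arithmetic; the channel count of 64 is an arbitrary assumption, biases are ignored, and the function names are illustrative.

```python
def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field of n stacked stride-1 convolutions with the same kernel size."""
    return n_layers * (kernel_size - 1) + 1


def conv_weights(k_out, k_in, kernel_size):
    """Number of weights (biases ignored) of a square convolutional layer."""
    return k_out * k_in * kernel_size * kernel_size


channels = 64                                      # assumed equal in and out
one_5x5 = conv_weights(channels, channels, 5)
two_3x3 = 2 * conv_weights(channels, channels, 3)

print(stacked_receptive_field(3, 2), one_5x5, two_3x3)
```

Two stacked $3 \times 3$ layers see a $5 \times 5$ window, and five of them see an $11 \times 11$ window, with an extra non-linearity between each pair of layers.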

Inception. Also known as GoogleNet [Szegedy et al. 2014a], this architecture appeared at almost the same time as VGG. Again, it aims at improving on AlexNet through several significant modifications. Firstly, $1 \times 1$ convolutional filters are applied to reduce the number of channels before applying larger convolutional filters. As a short comparison, suppose we want to pass from a hidden representation $z^{(l)}$ of 32 channels to the next one $z^{(l+1)}$ of 64 channels with $3 \times 3$ convolutional filters: we then need $W^{(l)}$ of size $64 \times 32 \times 3 \times 3 = 18432$ parameters. If instead we apply $1 \times 1$ convolutional filters, we first pass from $z^{(l)}$ of 32 channels to an intermediate representation of 16 channels with $W_0^{(l+1)}$ of size $16 \times 32 \times 1 \times 1 = 512$ parameters. Then we pass from the intermediate representation of 16 channels to $z^{(l+1)}$ of 64 channels with $W_1^{(l+1)}$ of size $64 \times 16 \times 3 \times 3 = 9216$ parameters. Consequently, we need only $512 + 9216 = 9728$ parameters instead of 18432, which considerably reduces the number of parameters.
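The parameter counts in this comparison can be reproduced with a short sketch. As in the text, biases are ignored; the function name `conv_weights` is illustrative.

```python
def conv_weights(k_out, k_in, kernel_size):
    """Number of weights (biases ignored) of a square convolutional layer."""
    return k_out * k_in * kernel_size ** 2


# Direct path: 32 channels -> 64 channels with 3 x 3 filters.
direct = conv_weights(64, 32, 3)

# Bottleneck path: 32 -> 16 channels with 1 x 1 filters, then 16 -> 64 with 3 x 3 filters.
reduce_step = conv_weights(16, 32, 1)
expand_step = conv_weights(64, 16, 3)
bottleneck = reduce_step + expand_step

print(direct, reduce_step, expand_step, bottleneck)
```

In this example the bottleneck path needs roughly half the weights of the direct path, and the saving grows when the intermediate number of channels is reduced further.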

**Table of contents :**

**I Introduction **

A Context

B Learning paradigms

C Motivations

D Manuscript outlines

**II Methodological pillars **

A Dictionary learning

A.1 Sparse coding

A.2 Dictionary update

A.3 Conclusion on dictionary learning

B Supervised dictionary learning

B.1 SDL with internal classifier

B.2 SDL with atoms discriminative

B.3 Conclusion about SDL

C Deep Learning

C.1 Neural networks

C.2 Convolutional neural networks

C.3 Standard CNNs

C.4 Optimizers

D Manifold learning

**III Semi-supervised dictionary learning **

A Introduction

A.1 Generalities

A.2 Related works

B Proposed method

B.1 Construction of objective function

B.2 Optimization

B.3 Numerical experiments

B.4 Conclusion about proposed method

C Conclusion

**IV Semi-supervised deep learning **

A Related works

A.1 Notations

A.2 Auxiliary task as regularization

A.3 Pseudo labeling

A.4 Generative models

A.5 Virtual Adversarial Training

A.6 Holistic methods

A.7 Partial conclusion for semi-supervised neural networks

B Manifold attack

B.1 Individual attack point versus data points

B.2 Attack points as data augmentation

B.3 Pairwise manifold learning

B.4 Settings of anchor points and initialization of virtual points

C Applications of manifold attack

C.1 Manifold learning on a small dataset

C.2 Robustness to adversarial examples

C.3 Semi-supervised manifold attack

C.4 Conclusion about manifold attack

V General Conclusion