Catastrophic forgetting in Continual Learning
Artificial neural networks (ANNs) learn to perform a task (e.g. categorizing a given set of data into classes) by finding an “optimal” point in the parameter space. When ANNs subsequently learn a new task (e.g. categorizing a new set of data into a new class), their parameters move to a new solution point that allows them to perform the new task. Catastrophic forgetting McCloskey and Cohen (1989); French (1999) arises when the new set of parameters is completely inappropriate for the previously learned tasks. It is mainly a consequence of not taking previously learned tasks into consideration: unless regularized in some way, the gradient descent algorithm adapts all parameters of an artificial neural network to the new task without considering previous tasks. Catastrophic forgetting is related to the stability-plasticity dilemma French (1997); Abraham and Robins (2005), which is a more general problem in neural networks. Learning models require both plasticity to learn new tasks and stability to prevent forgetting previously learned tasks.
In continual learning, the objective is to overcome the catastrophic forgetting problem by looking for a trade-off between stability and plasticity. For instance, a fully stable system could transform all new information into lifelong memories and learn new things until a memory budget is filled. Therefore, it is necessary to distinguish valuable memories from useless ones and, consequently, the plasticity would help to forget what is not crucial. There are very rare cases of an “unlimited” memory budget in humans who can remember an uncannily large number of experiences; not without adverse effects Van Bree (2016).
The catastrophic forgetting problem in ANNs has been addressed in cognitive sciences since the late 1980s and early 1990s McCloskey and Cohen (1989); Robins (1995). The recent development of deep neural networks has renewed interest in this field. This challenge is now addressed, with no particular distinction, as continual learning Shin et al. (2017); Parisi et al. (2019), sequential learning McCloskey and Cohen (1989); Aljundi et al. (2018), lifelong learning Rannen et al. (2017); Aljundi et al. (2017); Chaudhry et al. (2018b) and incremental learning Rebuffi et al. (2017); Chaudhry et al. (2018a). Broadly, all of them aim at learning new tasks from a continuous stream of data without forgetting previous tasks. For clarity and simplicity, we will use the expression continual learning. The final goal in continual learning is to employ the tasks learned in the past to help future problem-solving.
The easiest way to overcome catastrophic forgetting in continual learning is to learn new training samples jointly with old ones to avoid forgetting previously seen patterns. In this way, the best and most straightforward solution is to store all the previously seen samples; however, this solution is unrealistic for three main reasons: i. large memory footprint requirements are often impracticable for real applications or edge devices, ii. privacy issues are usually a concern when storing raw proprietary data, iii. complete retraining of each new set of incoming data can be infeasible on large scales. Although it is possible to overcome catastrophic forgetting by replaying only a certain amount of previously learned examples, the amount of stored examples plays a critical role and privacy issues remain a problem. Alternatively, it is also possible to replay synthetic samples from a data generator instead of storing examples. Regardless of the provenance of the old samples, a more challenging and open question is which samples should be replayed to improve the performance in continual learning.
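The idea of replaying only a bounded number of stored examples can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from the works cited here: the class name `ReplayBuffer`, its methods, and the use of reservoir sampling to decide which samples stay in memory are all choices made for the example, and the question of which samples are best to keep remains open.

```python
import random

class ReplayBuffer:
    """Fixed-capacity memory of past samples, filled by reservoir sampling (an assumption)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Every sample seen so far has the same probability of being in memory.
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_batch, n_replay):
        # Join the new samples with a random draw of stored old samples.
        n_replay = min(n_replay, len(self.samples))
        return list(new_batch) + self.rng.sample(self.samples, n_replay)

buffer = ReplayBuffer(capacity=10)
for i in range(100):  # stream of samples from an old task
    buffer.add(("old", i))
batch = buffer.mixed_batch([("new", 0), ("new", 1)], n_replay=4)
```

The memory footprint is fixed by `capacity`, which makes the trade-off discussed above explicit: a larger buffer keeps more of the old distribution but costs more storage and still holds raw data.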
Early steps to alleviate catastrophic forgetting have mainly focused on replaying old activity patterns while learning new data. The hippocampal-neocortical system has, to some extent, inspired this general concept. It consists in replaying what the neural network has learned in the past as pairs made of an input stimulus (i) and the corresponding output activation (o) of the network. The input-output activation patterns represent the knowledge of the neural network in a stable state (i.e. before being updated). In addition, the replay of the input-output activation patterns contains enough information to prevent the ANN from catastrophically forgetting previously learned tasks (Robins (1995); Ans and Rousset (1997); French (1997); Li and Hoiem (2017); Rebuffi et al. (2017); Buzzega et al. (2020)). These early attempts have shown that replaying input-output activation patterns can successfully alleviate catastrophic forgetting not only in classification tasks but also in the continual learning of mathematical operations Ans and Rousset (2000).
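The construction of input-output activation patterns can be sketched as follows. This is only an illustration under stated assumptions: the one-layer linear `frozen_net`, the uniform random stimuli and the helper name `make_pseudo_patterns` are hypothetical and do not reproduce the procedure of any cited work.

```python
import random

def frozen_net(x, weights):
    """Hypothetical network in its stable state: one linear unit per output."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def make_pseudo_patterns(weights, n_patterns, n_inputs, seed=0):
    """Pair each random input stimulus with the output activation it produces."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_patterns):
        stimulus = [rng.uniform(0.0, 1.0) for _ in range(n_inputs)]
        activation = frozen_net(stimulus, weights)  # recorded before any update
        patterns.append((stimulus, activation))
    return patterns

old_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
replay_set = make_pseudo_patterns(old_weights, n_patterns=4, n_inputs=3)
```

The resulting `replay_set` would then be interleaved with the new task's data, so that the updated network is also trained to reproduce the old input-output behaviour.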
Following the resurgence of neural networks in 2012, the problem of catastrophic forgetting in continual learning has received increased attention. Consequently, these early strategies have given rise to several current research efforts that focus primarily on improving the process of knowledge retrieval from input-output activation patterns (Liu et al. (2020); Shim et al. (2020)). Despite all the current advances and broader strategies, there is still an important open question in this area that was flagged up several years ago French (1999):
“How best to optimize the input stimulus used to recover information. Are there ways to improve ‘quality’ of the input stimulus so that they better reflect the originally learned regularities in the environment?”. This thesis attempts to answer this question through an extensive study.
Focus of the thesis
It is well established that replaying, as previously acquired knowledge, what the neural network has learned through an input stimulus and its corresponding activation output helps to alleviate catastrophic forgetting. Among the strategies employed in the literature, it is possible to identify three widespread ones to acquire the input stimuli. The first one is a buffer strategy that stores a portion of learned examples to be used later to capture old knowledge through input-output activation patterns. The second strategy consists in generating synthetic samples by modeling the input distribution. Both strategies implicitly assume that samples that resemble the input distribution are necessary for optimal activation patterns. The third strategy consists in capturing the input-output activation patterns through random stimuli. However, only a few works have focused on an optimal model to improve the knowledge retrieval process through input-output activation patterns. Without focusing specifically on the structure of input stimuli, this work concentrates on finding a model with an optimal architecture to improve the knowledge retrieval process in continual learning tasks.
Content of the thesis
The main goal of this thesis is to study the catastrophic forgetting problem in artificial neural networks and to propose strategies to alleviate this problem. The primary motivation is to allow artificial neural networks to accumulate previously learned tasks over time and to forget what is not crucial. This is achieved by considering the mapping function of the input-output activation patterns (i.e. knowledge) as a basis for building a “memory” without relying solely on what ANNs have previously seen. The second objective of this thesis is to obtain a model that is agnostic with respect to the dataset and with respect to what is and will be learned. That is to say, a model that does not depend on external information but simply on acquired knowledge. In parallel, our goal is to improve the knowledge retrieval process of the input-output activation patterns to optimally consolidate the learned patterns into lifelong memories.
In Chapter 2, we provide a brief introduction to neural networks and autoencoders. In particular, we present the main concepts of generative models that allow us to distinguish between our contributions and the generative models. Then, we present one of the transversal tools of our work: the process of knowledge distillation and transfer. Next, we present some trends in continual learning and the corresponding jargon. Finally, we describe learning workflows, frequently used evaluation metrics and the datasets used in this thesis.
In Chapter ??, we take a comprehensive look at the state of the art of catastrophic forgetting from its inception to the present. We place continual learning approaches in an Atlas, trying to identify clusters of approaches with similar solutions. We analyze the 12 clusters found with seven characteristics of continual learning and we detail their limitations, challenges and utilities. We point out some side effects that have emerged in recent years when questioning the difference between catastrophic forgetting, continual learning and their evaluation methods. Next, we outline possible avenues for exploring continual learning.
In Chapter ??, we demonstrate that an autoencoder can remember and generate what it has learned. We also discuss its subsequent linkage to an auto-associative memory. In particular, we provide mathematical proofs and empirical evidence to show the memory capacity of autoencoders. Then, we present an iterative sampling process called reinjection that allows us to sample the training distribution learned by the autoencoder. We build a workflow with its algorithm to exploit the generated samples to transfer knowledge from a trained neural network to an untrained one. Finally, we extend the autoencoder property to a hybrid model that replicates and classifies a given input. We empirically demonstrate that the hybrid model retains the autoencoder property and generates input-output activation patterns (i.e. samples with corresponding labels) useful for knowledge transfer and continual learning.
In Chapter ??, we present a dual-memory system for continual learning. Specifically, we pair two hybrid models; the first one learns a new task while the second one generates and captures the previous knowledge with input-output patterns. During the continual learning of the new task, the generated input-output activation patterns are replayed and serve to alleviate forgetting. This dual memory framework provides a specific data-free system to alleviate catastrophic forgetting in ANNs with pseudo-samples (i.e. synthetic samples). This solution is suitable for applications where privacy is essential. After analyzing its strengths and limitations, we propose to endow this model with a memory buffer that yields a Combined replay solution. Combined replay combines the rehearsal and the pseudo-rehearsal methods by exploiting the strengths of the hybrid model. In this way, the knowledge retrieval (i.e. the generation of the input-output patterns) process is improved and the forgetting is effectively reduced. This chapter also shows the difficulties in adequately improving the knowledge retrieval process in ANNs and it proposes an effective solution for continual learning problems.
In Chapter ??, we present a study that is the culmination of the experiments conducted throughout this thesis and the previous chapters. During the consolidation process, knowledge is often captured with samples from the distribution, but it is unclear what kind of samples to use. This chapter is a longitudinal study of the question “where is the knowledge in continual learning?”, which we try to answer empirically. More precisely, this chapter uses several different sources of samples during the consolidation step, based on five hypotheses. In trying to answer what information is beneficial to consolidate and to capture during the consolidation process, we point out which samples are beneficial for continual learning.
In Chapter ??, we conclude with the main contributions and the perspectives for extending the results and understanding of this thesis.
List of publications and patents
PUBLICATIONS The results presented in this doctoral thesis and in complementary works are summarised in the following contributions.
• Generalization of iterative sampling in autoencoders Solinas et al. (2020).
• Beneficial effect of combined replay for continual learning Solinas et al. (2021).
• Impact of Spatial Frequency Based Constraints on Adversarial Robustness Bernhard et al. (2021).
• Dream Net: a privacy-preserving continual learning model for facial emotion recognition. (Workshop in International Conference on Affective Computing & Intelligent Interaction 2021).
• Impact of reverberation through deep neural networks on adversarial perturbations (IEEE International Conference on Machine Learning and Applications 2021).
• Beneficial effects of reinjections for continual learning. (SN Computer Science)
• Where is the knowledge in continual learning? (Work completed, pending approval by the patent committee).
• Patent 1: A data-free transfer knowledge mechanism for neural networks
• Patent 2: A data-free continual learning mechanism to alleviate catastrophic forgetting
• Patent 3: Iterative sampling for an anomaly detection mechanism
– Continual learning survey: past, present and future.
– Patent 4: Improving the generation of synthetic data for continual learning
In supervised learning, a training dataset $D = (x_i, y_i)$ represents a mapping function $F : x_i \to y_i$ defined by observations $x_i$ and their corresponding labels $y_i$, which are sampled i.i.d. from a distribution $P_{x,y}$. Then, a neural network is employed to learn a mapping function $f$ that approximates $F$. The objective of a neural network is to correctly match inputs $x_i$ to a target output $y_i$ by adapting its parameters $\theta$. For example, in a digit classification problem, $x_i$ consists of digit images while $y_i$ corresponds to the digit category.
A neural network classifier learns a mapping function $y = f(x; \theta)$ in two steps, adapting the parameters $\theta$ to minimize the error between model predictions and the ground truth.
The first step is called feedforward propagation or inference. Input information $x$ flows through the neural network, passing through the intermediate computations up to the output. In most cases, the mapping $f : x \to y$ is a composition of functions, $f = g_3 \circ g_2 \circ g_1$, where the $g_l$ are layers connected in a chain $y = f(x) = g_3(g_2(g_1(x)))$. Each layer is parametrized by a weight vector $\theta_l$ that is employed to weigh the input before it is transformed by an activation function.
At each hidden layer, the activation function transforms the weighted sum of the inputs into an output. The hidden layers can also contain bias parameters, usually employed to shift the weighted sum of the input. A hidden layer is often represented as $g_l(x) = \varphi(\theta_l; x)$ where $\theta_l$ is the parameter of the layer and $\varphi$ is the activation function of the layer. A neural network comprises an input layer $g_1$, which is the first layer of the network, an output layer $g_3$ and several intermediate layers. The last layer, the output layer, is where the prediction is expected. The total length of the chain gives the depth of the model, and this is where the term deep learning comes from.
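The chained computation of the forward pass can be sketched in a few lines. This is a minimal illustration, assuming fully connected layers with a weighted sum, a bias and an element-wise activation; the helper names (`layer`, `forward`) and the toy weights are hypothetical.

```python
import math

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer(theta, bias, x, activation):
    """One layer: activation applied to the weighted sum of the inputs plus a bias."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(theta, bias)]

def forward(x, layers):
    """Chain the layers so that the output is g3(g2(g1(x)))."""
    for theta, bias, activation in layers:
        x = layer(theta, bias, x, activation)
    return x

layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], relu),  # g1
    ([[1.0, -1.0]], [0.5], relu),                  # g2
    ([[2.0]], [0.0], sigmoid),                     # g3 (output layer)
]
y = forward([1.0, 2.0], layers)
```

The number of entries in `layers` is the depth of the model in the sense used above.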
The training data provided comprises observations of F evaluated at different points. Each sample xi is accompanied by a label yi = F (xi) that represents the desired output for the output layer. Altogether, the label specifies what the output layer must do at each observation. However, the output behavior of the intermediate layers is not directly specified by the training data, so they are called hidden layers.
The second step is called training and consists in backpropagating an error signal over the model parameters Riedmiller and Braun (1993). A loss function is used to compute the error made by the model in its predictions; that is, the loss is high when the model is doing a poor job and low when it is performing well. The error signal measures the difference between the predicted labels and the desired true labels. Then, the error value is used to adjust the model parameters to reduce the prediction discrepancy. The gradient of the loss function with respect to the parameters of each layer, propagated backwards from the output layer, provides the update rule for those parameters. The learning process is repeated until convergence, for which an optimizer iteratively computes the gradient on batches randomly sampled from the training set. When a neural network finds an “optimal” set of parameters, it is ready to be deployed and to make predictions on unseen examples.
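The training step can be sketched on the smallest possible model. This is an illustration under stated assumptions rather than the training procedure used in this thesis: a single sigmoid unit, a cross-entropy loss, plain mini-batch gradient descent with a fixed learning rate, and a toy logical-OR dataset; the names `predict`, `loss` and `train` are hypothetical.

```python
import math
import random

def predict(theta, bias, x):
    z = sum(w * xi for w, xi in zip(theta, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def loss(y, p, eps=1e-12):
    # High when the prediction p is far from the label y, low when it is close.
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def train(data, lr=0.5, steps=2000, batch_size=2, seed=0):
    rng = random.Random(seed)
    theta, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # batch randomly sampled from the training set
        grad_theta, grad_bias = [0.0, 0.0], 0.0
        for x, y in batch:
            error = predict(theta, bias, x) - y  # gradient of the loss w.r.t. the weighted sum
            grad_theta = [g + error * xi for g, xi in zip(grad_theta, x)]
            grad_bias += error
        # Move the parameters against the averaged gradient.
        theta = [w - lr * g / batch_size for w, g in zip(theta, grad_theta)]
        bias -= lr * grad_bias / batch_size
    return theta, bias

data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]  # logical OR
theta, bias = train(data)
mean_loss = sum(loss(y, predict(theta, bias, x)) for x, y in data) / len(data)
```

In a multi-layer network the same update is applied to every layer, with the error term obtained by propagating the gradient backwards through the chain.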
A wide range of features allows ANNs to learn $f$ and thus find a valid mapping between inputs and outputs. For example, to evade linear constraints, non-linear functions $\varphi$ are used in the hidden layers instead of linear transformations because they provide higher degrees of freedom that enable models “to understand” the non-linear relationships between the examples $x$ and the corresponding labels $y$. Non-linear transformations make it possible to turn non-linearly separable problems into linearly separable ones. More specifically, non-linear functions allow the input space to be folded so that the space can be divided into small linear regions Pascanu et al. (2013). The non-linear transformation, the right loss function and an appropriate number of parameters in the hidden layers allow a neural network to approximate any mapping function. This is the origin of the term universal approximator Hornik et al. (1989). For a detailed introduction to neural networks, please refer to Goodfellow et al.
In continual learning, the learned mapping function $f(\theta; x)$ is continually updated and evaluated over time. When evaluating the performance of the classifier, what is judged behind the scenes is the degree of degradation of the learned mapping function with respect to the original mapping function $F$. Ideally, the mapping function would not degrade over time, but it does so due to catastrophic forgetting. Therefore, metrics showing the evolution of $f(\theta; x)$ over learning steps allow us to identify how well a continual learning approach maintains the desired mapping function.
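One common way to track this evolution is to record, after each learning step, the accuracy on every task seen so far. The sketch below assumes such an accuracy table and two frequently used summaries (average accuracy and a forgetting measure); the function names are hypothetical, the numbers are made up purely for illustration, and the metrics actually used in this thesis may be defined differently.

```python
def average_accuracy(acc, step):
    """Mean accuracy over the tasks seen up to and including `step`."""
    seen = acc[step][: step + 1]
    return sum(seen) / len(seen)

def forgetting(acc, task):
    """Drop between the best earlier accuracy on `task` and its final accuracy.

    Only defined for tasks learned before the last step.
    """
    best_earlier = max(row[task] for row in acc[task:-1])
    return best_earlier - acc[-1][task]

# acc[t][j]: accuracy on task j measured after learning step t (illustrative values).
acc = [
    [0.95, 0.00, 0.00],
    [0.80, 0.93, 0.00],
    [0.60, 0.85, 0.94],
]
final_average = average_accuracy(acc, step=2)
forgetting_task0 = forgetting(acc, task=0)
```

A forgetting value close to zero indicates that the mapping learned for that task has been preserved across later learning steps.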
While a classifier neural network learns a mapping function defined by the observations and their corresponding labels, autoencoders aim at replicating the input at the output layer. The mapping function is self-defined by the input samples and the autoencoder aims to obtain an output similar to its input. An autoencoder often comprises an encoder part $e(x)$ that maps the input $x$ into a code $z$ and a decoder part $d(z)$ that maps the code $z$ into the replicated input. Therefore, an autoencoder is usually represented with two mapping functions, $\hat{x} = d(e(x))$: first from input to code, $z = e(x)$, and then from code to replication, $\hat{x} = d(z)$, where $\hat{x}$ is the replicated vector. In this way, autoencoders are neural networks that minimize a loss function over the input data and its replication, as in Equation 2.1.
$$L(x, \hat{x}) = -\left(x \log(d(e(x))) + (1 - x) \log(1 - d(e(x)))\right) \tag{2.1}$$
where $x$ is the input vector of the training distribution and $\hat{x}$ is the predicted output of the autoencoder. Note that in the simplified equation, vector values are evaluated, so $1$ represents a vector of the same dimension as the input. Depending on the input distribution and the model architecture, autoencoders can be trained with several loss functions (e.g. mean squared error). Unless indicated otherwise, throughout this work we employ the binary cross-entropy to train the autoencoders.
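The binary cross-entropy of Equation 2.1 can be written out component by component. The sketch below assumes that the per-component terms are summed into a single scalar and that the replication is clipped away from 0 and 1 for numerical safety; both choices, and the function name, are assumptions of the example.

```python
import math

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Sum over components of -(x log(x_hat) + (1 - x) log(1 - x_hat))."""
    total = 0.0
    for xi, xh in zip(x, x_hat):
        xh = min(max(xh, eps), 1.0 - eps)  # avoid log(0)
        total -= xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh)
    return total

x = [1.0, 0.0, 1.0]
good = binary_cross_entropy(x, [0.9, 0.1, 0.8])  # replication close to the input
poor = binary_cross_entropy(x, [0.4, 0.6, 0.3])  # replication far from the input
```

As expected for a reconstruction loss, `good` is smaller than `poor`.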
There is no unique recipe for autoencoders, and their use is broad and covers many applications. For example, depending on the dimension of the latent code $z$, the latent code can define a compaction or a dilation of the input information. Some regularized autoencoders also include prior knowledge to structure the shape of the latent code while improving the replications and increasing generative capabilities. However, autoencoders are not limited to utilizing a latent code or to the classic encoding-decoding behaviour. As long as a neural network replicates inputs into outputs and the loss function is evaluated between inputs and replications, it can be considered an auto-encoder neural network.
In chapter 3, we revisit some other characteristics of autoencoders that go beyond the encoding-decoding process. These characteristics are exploited later in chapter 4 to build continual learning solutions.
In machine learning, real-world data may not be accessible for various reasons (e.g. privacy, lack of data, memory footprint issues, etc.). In these situations, generative models can be trained to model a given training distribution and be used to synthesize artificial data, as variational auto-encoders Kingma and Welling (2013) and adversarial auto-encoders Goodfellow et al. (2014a) do. In this section we present variational auto-encoders and adversarial auto-encoders, two popular generative autoencoders that are briefly revisited in chapter 3.
A Variational Auto-Encoder (VAE) aims to regularize the latent space of the auto-encoder by encouraging the latent space to follow a target distribution (e.g. a Gaussian distribution $\mathcal{N}(0, 1)$). In this way, the input data is densely located in specific areas of the latent space. An interesting feature of VAEs is that the latent space is continuous and complete Spinner et al. (2018), which allows points sampled from the latent space to be turned into new samples through the decoding part.
A variational auto-encoder is an auto-encoder with a major difference in the encoder part. While the auto-encoder simply encodes the input data into a latent variable $z$, the variational auto-encoder encodes the input data into a distribution that is constrained to follow a prior (i.e. the target distribution). To do so, the encoder part of the variational auto-encoder is trained to minimize the Kullback-Leibler divergence between an encoded distribution $Q$ and the target distribution $P$. Given the two distributions $P$ and $Q$ defined on the same probability space $X$, the KL divergence $D_{KL}(Q \| P)$ indicates the amount of information that is lost when $P$ is used to approximate $Q$, as in Equation 2.2.
$$D_{KL}(Q \| P) = \sum_{x \in X} Q(x) \log \frac{Q(x)}{P(x)} \tag{2.2}$$
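For discrete distributions, Equation 2.2 translates directly into a short function. This is a minimal sketch assuming both distributions are given as lists of probabilities over the same outcomes, with terms where Q(x) = 0 contributing nothing; the function name is hypothetical.

```python
import math

def kl_divergence(q, p):
    """D_KL(Q || P) = sum over x of Q(x) * log(Q(x) / P(x))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

q = [0.5, 0.5]
p = [0.9, 0.1]
forward_kl = kl_divergence(q, p)
reverse_kl = kl_divergence(p, q)
```

The two values differ, which illustrates that the KL divergence is not symmetric in its arguments, and `kl_divergence(q, q)` is zero.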
To set up the encoded distribution with mean and variance parameters, the encoder outputs two vectors: a mean vector $\mu$ and a standard deviation vector $\sigma$. To sample $z$ from $\mu$ and $\sigma$, the encoder uses a scale transformation of the distribution: it randomly samples $\epsilon$ from $\mathcal{N}(0, 1)$ and computes $z$ from $\mu$, $\sigma$ and $\epsilon$ as in Equation 2.3.
$$z = \sigma \cdot \epsilon + \mu \tag{2.3}$$
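The sampling step of Equation 2.3 can be sketched as follows, assuming the mean and standard deviation vectors are plain lists and the product is taken element-wise; the function name `sample_latent` is hypothetical.

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw noise from N(0, 1) and shift/scale it: z = sigma * noise + mu."""
    noise = [rng.gauss(0.0, 1.0) for _ in mu]
    return [s * e + m for m, s, e in zip(mu, sigma, noise)]

rng = random.Random(0)
z = sample_latent([2.0, -1.0], [0.5, 0.1], rng)
draws = [sample_latent([2.0], [0.5], rng)[0] for _ in range(10000)]
empirical_mean = sum(draws) / len(draws)
```

Because the randomness is isolated in the noise term, the sample remains a simple function of the mean and standard deviation, which is what makes the encoder trainable by gradient descent.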
While the KL divergence between $Q$ and $P$ encourages every data sample to fit the target distribution, the reconstruction error allows the model to differentiate the data points from different zones. For instance, all the samples in the latent space will overlap into the target distribution if the reconstruction loss is not minimized. In this way, the right balance between reconstruction and regularization allows a variational auto-encoder to compress the information in a structured latent space and to act as a generative model. A variational auto-encoder is usually trained to minimize the loss described in Equation 2.4.
$$L_{VAE} = BCE(x, \hat{x}) + D_{KL}(\mathcal{N}(\mu(x), \sigma(x)) \| \mathcal{N}(0, 1)) \tag{2.4}$$
where BCE corresponds to Equation 2.1 used in a classic autoencoder. In this particular case, the target distribution is $\mathcal{N}(0, 1)$; however, it can be replaced by a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ or more complex distributions.
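The two terms of the VAE loss can be sketched together. The example assumes a diagonal Gaussian encoder whose second parameter is a standard deviation, for which the KL divergence to a standard normal has the well-known closed form 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2)); the function names and the unweighted sum of the two terms are assumptions of the example.

```python
import math

def bce(x, x_hat, eps=1e-12):
    """Reconstruction term: binary cross-entropy summed over components."""
    total = 0.0
    for xi, xh in zip(x, x_hat):
        xh = min(max(xh, eps), 1.0 - eps)
        total -= xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh)
    return total

def kl_to_standard_normal(mu, sigma):
    """Regularization term: KL between N(mu, sigma^2) and N(0, 1), per dimension."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def vae_loss(x, x_hat, mu, sigma):
    return bce(x, x_hat) + kl_to_standard_normal(mu, sigma)

matched = vae_loss([1.0, 0.0], [0.9, 0.1], mu=[0.0, 0.0], sigma=[1.0, 1.0])
shifted = vae_loss([1.0, 0.0], [0.9, 0.1], mu=[1.0, 0.0], sigma=[1.0, 1.0])
```

With identical reconstructions, `shifted` exceeds `matched` by exactly the KL penalty for moving the encoded mean away from the target distribution.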
Another way of obtaining a structured latent space is through an adversarial model. The combination of an auto-encoder with an adversarial loss function is known as Adversarial Auto-Encoder (AAE). Both VAEs and AAEs follow the same objective and implement variational inference; however, they differ in how they impose a prior on the latent space. In the case of AAEs, the prior is learned through an adversarial model.
Adversarial examples are a particular case of samples that pose a significant problem, addressed by a very active research community as shown by the recent explosion in the number of published articles in the field of adversarial machine learning. According to Goodfellow et al. (2014b), adversarial examples are “inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence”.
When training a classifier neural network, it is easy to think of the mapping function $F$ as task decision boundaries given by the dataset $(x_i, y_i) \in D$. As a classifier, it must correctly classify the samples of the dataset and it can do so by building decision boundaries that approximate the task boundaries (i.e. $f(x) \approx F(x)$). A trained neural network yields an optimal solution when the task decision boundaries and the model decision boundaries coincide closely enough to classify inputs into the right classes. One common explanation for adversarial examples is based on the usually imperfect matching between task decision boundaries and model decision boundaries. Adversarial examples would be crafted to take advantage of this imprecision of the classifiers. Note that, even if the imprecision is small, it always exists; so, it is always possible to craft adversarial examples if an attacker has access to the model. Although such a crafted sample may not be perceived as modified, the model treats it as a sample of an incorrect class.
Adversarial attacks aim to find a tiny perturbation $\delta$, often constrained by a norm (e.g. $\ell_1$, $\ell_2$, $\ell_\infty$), and then add that perturbation to a legitimate input $x_i \in D = \{x, y\}_{i=1}^{N}$ to craft an adversarial example. The adversarial example $x'_i = x_i + \delta$ is, in terms of distance, quite close to the legitimate input $x_i$, but it is classified differently by the neural network: $f(x_i) \neq f(x'_i)$. For instance, a simple adversarial example can be obtained by employing the perturbation of Equation 2.5.
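Independently of the specific perturbation referenced above, the definition of an adversarial example can be illustrated on a toy model. The sketch below is an assumption-laden example, not a reproduction of Equation 2.5: it uses a hypothetical two-input linear classifier and a perturbation (called `delta` in the code) that moves every input component by a fixed budget `epsilon` in the direction that lowers the score of the current class, so that its maximum-norm is bounded by `epsilon`.

```python
def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if score(w, b, x) > 0.0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def linf_perturbation(w, b, x, epsilon):
    """Shift each component by epsilon against the currently predicted class."""
    push = -1.0 if classify(w, b, x) == 1 else 1.0
    # For a linear model, the gradient of the score w.r.t. the input is w itself.
    return [push * epsilon * sign(wi) for wi in w]

w, b = [1.0, -0.5], -0.3  # hypothetical trained parameters
x = [1.0, 1.0]            # legitimate input, classified as class 1
delta = linf_perturbation(w, b, x, epsilon=0.2)
x_adv = [xi + di for xi, di in zip(x, delta)]
```

Here `x_adv` stays within 0.2 of `x` in every component, yet `classify(w, b, x_adv)` differs from `classify(w, b, x)`, which is exactly the condition stated above.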
Table of contents:
1.1 Catastrophic forgetting in Continual Learning
1.2 Focus of the thesis
1.3 Content of the thesis
1.3.1 List of publications and patents
2.2 Neural Networks
2.2.2 Adversarial examples
2.2.3 Feature extraction
2.3 Continual learning
2.3.1 Multi-head vs Single-head settings
List of Figures