Time Series Modelling
The statistical analysis of time series has been a challenge for a very long time [Yule1927] and forecasting them is an even older problem as the review by history showed. Indeed, the very wide range of possible applications in modelling time series lead many resear-chers to focus on this domain. Applications range from economics to smart cities, from weather forecasting to physical processes and many more. In this work, we used model-ling mostly to obtain a supervision and to ensure that our learned representations are good. We mostly performed forecasting and generation with a few tasks of classification.
Formally, for a time series x, if we denote xt the series at index t and xt0 1 the t 1 first time step of the series, time series forecasting can be defined as finding the model f such that : = f(x0t 1) + t (2.1).
where t is the one-step ahead forecast error of x, also called the shock of the series at time t. It is also the error made by the forecasting model and should ideally be 0. These models can be used whether or not the time series are of the same length or not.
Because these problems have been studied for a very long time, some methods to solve them have been used for many years. We will focus on the main approaches of forecasting and classification : the Autoregressive (AR) and Moving average (MA) models for forecasting, the tree-based models for forecasting and classification and a variety of models specific to classification. Many of these models use intermediate representation to perform their task. Because these representations are often key to successful classi-fication and forecasting, we will then present the methods used to construct alternate representations.
Autoregressive and Moving Average Models
The concept of stationarity is central in time series forecasting and is especially impor-tant in the case of AR and MA models. Intuitively, it means that the way a time series is generated does not change over time and therefore that if we are able to model the beginning of the series, this model should hold for the rest of it. Formally, a series x is (weakly) stationary if its mean and variance are invariant under translation, meaning that 8t E(xt) = , (where is a constant) and Cov(xt; xt u) is only a function of u.
AR models assume that current values are linearly dependant of past values limited to a period. Using ??, this means that f is a linear combination of a few past values. The number of past values to use is called the period p of the AR model. The parameters of this model are the p weights of the linear combination (plus a bias).
MA models assume that the current value is linearly dependant on the prediction error of our model for the past values. Using ??, the focus is on the shock . This time, the prediction is made with a linear combination of the past shock. The period is noted q, which is the number of time steps to look in the past for this shock. The parameters of this model are the q weights of the linear combination (plus a bias).
The Autoregressive Moving Average (ARMA) model is a combination of both. The current value is a linear combination of the p previous values and the q previous shocks. This model has p + q weights (and a bias).
As such, AR, MA and ARMA models are adapted to stationary time series. They also work best when the time series are not correlated and independent of each other. However, time series are not necessarily stationary.
To overcome this shortcoming, Autoregressive Integrated Moving Average (ARIMA) models were created by box1968some. This model combined the ARMA model and adds an integrating process (integrating in this case is used as the opposite of differentiating). Indeed, one way to make a series stationary is to differentiate it (subtracting the t 1 time step from the t) and losing the first time step in the process.
The ARIMA model takes three hyper-parameters : p, the lag order (how far in the past to look for data) d the degree of differentiation (the number of times the raw observations are differentiated) and q the moving average (how far to look in the past for shock) :
where the and are the learned parameters of the model (the AR part and the MA parts respectively).
Many other variations exist around ARIMA models, like SARIMA (that takes into ac-count seasonality), ARFIMA (that performs more sophisticated differentiation) and also factorisation of all these models (adaptation of these models for multivariate time series). We will not get into more details here. The advantage of models such as the ones presen-ted here is that, once built, forecasting is quite easy. The disadvantage of these models is that they assume linear relations between time steps while some series do not hold this property.
The Prophet tool
The forecasting of time series using models relying on autocorrelation function and parametric models (such as AR models) represents only one approach of time series analysis. When the series cannot be approximate by a linear one, for example when studying climate change, other approaches have to be considered. One such class of models worth noting are based on Generalized Additive Model (GAM). The idea is to decompose the model in several components, each corresponding to different features of the time series to model. The well-known Prophet tool introduced by prophet is based on such decomposition. Time series are decomposed into three components : one account for periodicity (season, month, year, etc), another accounts for holidays (which are not necessarily yearly) and the last component account for non-periodic changes in the time series. The Prophet tool helps analysts forecast time series, mostly by a fine-grained use of seasonality that is refinable by the experts.
Another class of models used in time series is based on trees. Tree based models have the advantage that they can be used for both forecasting and classification.
The most famous example of tree-based model is probably the random forest [random_forest].
This ensemble model combines multiple trees to create a ‘forest’.
Trees [decision_tree] are supervised models that partition data very efficiently. Starting at the first node, at each step, a single aspect of the data is examined and the process goes forth with another node that depends on the result of this examination. The result (and therefore the node transition) can be stochastic or deterministic. They were initially made by hand with nodes representing very clear decision or examination. However, using data and training, the nodes and the transitions associated with them can be learned. Trees can be used in time series to perform either classification or forecasting.
The main advantage of trees is that they are explainable. Each decision can be unders-tood by following the nodes and examination that led to it. The main disadvantage of the trees is that they are often inaccurate as they grossly overfit the data and fail to capture its complexity.
Figure 2.1 – Two examples of trees : one for forecasting, one for classification. On the left a decision tree to perform classification of emails. Credits to decision_tree_image. On the right a regression tree that predicts the ave-rage number of golf players from the weather. Credits to regression_tree.
To remedy this problem, random_forest presented random forests. Random forests combine multiple trees to mitigate their inaccuracy. This, in turn, renders random forest much less explainable. The idea behind random forest is to learn an ensemble of trees that are uncorrelated. The result of the whole model is based on a ‘vote’ of these trees. For classification the random forest outputs the class with the majority of votes. For regression, the random forest output an average of all the trees’ results.
The method to learn decorellated trees is called bagging, for Bootstrap AGGregatING [bagging]. Instead of training random trees on the entire dataset, many subsets of the dataset are sampled (uniformly and with replacement) and a tree is trained for each ‘subdataset’. This reduces the correlation between trees and reduces overfitting in each tree. This process reduces the variance of the model without increasing its bias.
If one (or a few features) are strong predictors for the target, the trees will learn to rely on those features and become correlated. This is why, in addition to the dataset sampling of bagging, random forests also use random subspace method (also called feature bagging) presented by feature_bag. In this process, features are also sampled (uniformly and with replacement) for each subdataset. This means that each tree is trained on a subset of features of a subset of examples. This further reduces correlation between trees.
These two bagging methods allow reducing the variance of the model and therefore overfitting.
Another tree based algorithm is the Gradient Boosted Decision Trees (GBDT) [friedman2001]. Boosting is the method where sequentially connected weak learners are combined to create a strong learner. In this case, the weak learners are trees. The difference with ran-dom forest is that instead of the trees being learned in parallel, they are learned in series. Boosting replaces bagging. In boosting, each tree attempts to minimise the error made by the previous trees. This error is measured by a loss function that allows computing residuals. These residuals are used as the loss to train the next tree. This forces each tree to focus only on the examples that are not being handled well (e.g. misclassified examples). Because the trees are learned sequentially instead of in parallel, training is slower than for random forest. The danger of overfitting is also more important than in random forest. Adding too many trees will increase the risk of overfitting, whereas random forests’ accuracy don’t suffer from more trees. GBDT are, however, more accurate than random forest.
An improvement of GBDT is XGBoost (eXtreme Gradient Boosting) [Chen_2016]. Ac-cording to its creator, Tianqi Chen : ‘The name xgboost, though, actually refers to the engineering goal to push the limit of computation resources for boosted tree algorithms. Which is the reason why many people use xgboost.’ This algorithm is not fundamentally different from GBDT but it adds hardware optimisation and allows for more parallelisation through distributed computing. It also adds some desirable features that improve stabi-lity, training time and accuracy. Amongst them are smart tree pruning and regularisation to decrease the risk of overfitting. It has become very popular in the last few years by winning many machine learning competitions such as Kaggle competitions. At the time of writing, it is the state of the art in many industrial challenges.
Other Algorithms Used for Time Series Classification
There are many other classes of model that can perform forecasting or generation. Too many for us to discuss here. In forecasting, the state of the art of non-deep learning methods lies in AR models and XBGoost. We will briefly mention other state-of-the-art methods for time series classification (that do not use deep learning) inspired by tsc.
time series modelling 15
Another method based on random forest is the Time Series Forest (TSF) [tsf]. Instead of training trees directly on the time series, this algorithm first divides the examples of the dataset into segments of random lengths and random positions. Then, for each segment, a selection of features are computed : the mean, the standard deviation and the slope. The trees are trained using these extracted features.
Another family of approaches that can be used to classify time series is inspired from the dictionary. The main idea of these methods is to pass a sliding window over each series to obtain a constant number of values for each location of the window. Each of these values is transformed into a letter and the concatenation of these letter forms words. The most famous representant of this approaches is the Bag of SFA Symbols (BOSS) method [Schfer2014TheBI] that has been declined in several versions such as BOSS ensemble or Contractable-BOSS.
There are many more approaches to classify time series, such as Random Interval Spec-tral Ensemble (RISE) [RISE] or Shapelet Transform Classifier (STC) [tsc ; bagnall2020tale]. The pattern we see emerging with these types of models is intermediate representations. These models take as input the series and output the result. However, in many cases, the series undergoes some kind of transformation and the decision is made based on this transformed input. Some methods have focused on how to best construct the intermediate representation of a series.
Time Series Representation
Time series can be transformed in many ways. The transformation can be local (only on some parts of the input) or global (on the whole series). The series can be transformed in another series or in primitive values. For example, the simplest method using primitive consists in representing a series by its mean and variance. A slightly more complex method is to represent it by a histogram of its values [histogram].
Frequency analysis is a more complex tool of time series analysis. It can use global or local approaches and results in primitive or time series. It is especially useful when time series are seasonal. Finding the period can make forecasting very easy if the series is strictly seasonal. In addition, if a series is not strictly seasonal, most forecasting models that take seasonality into account will need its seasonality as input.
The most common transformation of series to obtain a frequency representation is the Fourier transform. This approach is based on the assumption that most time series are periodic even if the periodic behaviour is very complex. In any case, a non-periodic function can be considered as a periodic function with an infinite period. Time series are decomposed into a (possibly infinite) weighted sum of trigonometric functions of different periods (also called harmonics). The weight of each harmonic indicates how prevalent a frequency is in the signal. Though not usually used directly for forecasting (except when only relying on seasonality), Fourier transform can be very useful in understanding time series.
Fourier transform only represents the weight of harmonics in the entire time series. There is no way to see that a harmonic is very present in the beginning of a series and absent at the end. The Short-time Fourier transform (STFT) draws inspiration from the Fourier transform. The idea is to divide a series into shorter segments (often overlapping) and apply the Fourier transform to each segment. This reveals the sinusoidal frequency of each local section. The result is a two-dimensional representation (for a univariate time series) with one dimension being the frequency and the other the location (in time) this frequency appeared on. This is often represented as a spectrogram like in ??. One drawback of this method is that the length of the segments (and their overlapping parts) has a high influence on the result and must be chosen carefully. Just as the Fourier transform, it is not used to perform forecasting directly but it gives a richer representation of a signal, that captures both frequency and location information.
A slightly more complex time-frequency analysis is done via wavelets. Just like STFT, Discrete Wavelet Transform (DWT) performs frequency analysis to capture both frequency and location information. Instead of detecting sinusoidal harmonics, DWT uses different kernels that can detect more complex patterns. To perform wavelet transform, a family of wavelets (based on a kernel) is applied to the series. The result is a unidimensional series that represents the weights of the coefficients of the wavelet family as a function of these coefficients. This new representation can then be used to perform forecasting, often with Box-Jenkins models. A representation of the result of a wavelet transform is shown in ?? It has shown great success in domains such as EEG analysis [eeg1 ; eeg2 ; eeg3] where the raw data is extremely hard to interpret but also in other domains such as car sales prediction [car_sales], gas prices forecasting [gas_price], solar and wind energy production [solar] and many others [review_wave].
Deep Learning Approaches
Deep learning approaches aim to learn representations of complex objects, such as time series or images. In a supervised case, the goal of these representations is to obtain better performances on tasks such as classification or forecasting through non-linear features extraction and combination. Algorithms which focus on representation learning as a goal (such as Principal Component Analysis (PCA) and other dimensionality reduction algorithms) that have proved very efficient (even for time series) will be presented in the next section. In this section, we focus on deep learning models that have been used specifically for time series analysis.
Time series forecasting in deep learning is mostly based on two architectures : RNN and CNN. We will first present the RNN and its most common representative : the Long Short-Term Memory (LSTM). As we did not use it in our work, we will succinctly present the CNN before presenting the most recent works that used a combination of both. The goal of these models remains forecasting. However, this forecasting is explicitly performed by learning intermediate representations. They allow us to part from domain knowledge and to learn only from data, without using carefully crafted features designed by experts.
Figure 2.4 – « An example of computing the discrete Haar wavelet coefficients for a sound signal of someone saying : ‘I Love Wavelets.’ The original waveform is shown in blue in the upper left, and the wavelet coefficients are shown in black in the upper right. Along the bottom is shown three zoomed-in re-gions of the wavelet coefficients for different ranges.
In the recent years, the emergence of RNN has changed many things. Their ability to keep the memory of past trends for a long time makes them extremely powerful in time series forecasting.
Because they only need data and almost no expertise on the nature of the time series, they have become very popular. They allow for the analysis and forecasting of time series with almost no domain knowledge. They learn features on their own, replacing the ones made by experts.
RNNs differs from classical neural networks because they have recurrent connections. For each input data, the networks perform the same function. This function, however, depends not only on the input data but also on the results of the previous computation. Instead of having only one output, RNNs have two. One is the ‘classical’ output of the neural network, the other is an internal state that can be used as memory and is trans-ferred to the next input. It is worth noting that the two outputs can have the same value. This mechanic allows them to take into account the temporal nature of the data, which classical neural networks cannot do.
time series modelling 19
RNN can be used in four distinct settings. The one-to-one, the one-to-many, the many-to-one and the many-to-many. The one-to-one is the classical settings for neural networks, where both inputs and outputs are of the same size (for classification of fixed size input). The one-to-many has inputs of fixed size but outputs of varying size. The many-to-one has inputs of varying size but outputs of fixed size (classification of sentences for example). Finally, many-to-many has inputs and outputs of varying sizes. This last mode can be split into two modes depending on the synchronisation. Indeed, there can be a shift between inputs and outputs if, for example, the whole input needs to be read before the outputs can be generated (in translation for example). The ?? illustrates the different modes. The most interesting mode for our work is the many-to-one. It allows representing sequences of varying length into an embedding of fixed dimensions. This embedding can then be used in other models that cannot handle inputs of varying sizes.
Figure 2.6 – « Each rectangle is a vector and arrows represent functions (e.g. matrix mul-tiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN’s state (more on this soon). From left to right : (1) Simplest mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captio-ning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation : an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). No-tice that in every case there are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.
More formally, RNN cells have a hidden state, usually noted h. A cell is comprised of two weight matrices, Wh and Win and their associated bias ( bin and bh). At each time step, it receives the previous hidden state ht 1 and the input xt. The output of the cell is also its updated hidden state :
ht = tanh(Winxt + bin + Whht 1 + bh) (2.4).
Because of their recurrent nature, RNNs are trained using Backpropagation through time (BPTT). This technique unfolds the RNN to compute the gradient, taking into consideration that the same parameters (Win, bin, Wh and bh) are used for each time step. This algorithm, however, can sometimes lead to very inefficient learning, due to an issue called gradient vanishing.
Gradient vanishing occurs when the BPTT makes the gradient progressively smaller as it goes back in time. This means that the earlier input computations will not meaningfully impact the parameters update. This issue arises because the derivative of tanh is smaller than 1 and the chaining rule will make this decreasing effect compound.
Though less common, gradient exploding is worth mentioning. It is the exact opposite of the previous issue. Instead of becoming too small, gradients become too large. This issue arises when the norm of the weight matrices are too large. It is less common than vanishing gradients and it is also easier to fix and diagnose, and therefore often not considered as important an issue.
Another limitation of the classical RNN (partially due to the vanishing gradient) is that although they can theoretically keep memory of past input for as long as needed, in practice, when the output must depend on an older, distant input, RNNs tend to have forgotten the needed information. As the gap between the time step of the input and the time step where it is needed grows, the RNN becomes unable to connect the information. This connection between older inputs to later outputs is called long-term dependency.
It is to alleviate these two problems of RNNs (long-term dependency and gradient vanishing) that LSTMs were created by LSTM. LSTMs are a gated version of RNNs that allows for the selection of what information to keep or forget from memory and input. This allows them to remember information for a very long time. The difference with classical RNNs is that instead of performing a simple matrix multiplication followed by activation (one layer of neural networks) their internal behaviour are much more complex and contains four internal layers of neural networks. The internal state of a LSTM is also not the same as its output. Some information is kept internally for the next step but not outputed. ?? shows a representation of a cell of a LSTM.
time series modelling 21
input gate decides what to keep from the input. Combined with a tanh layer, it creates a new candidate value for the cell state. The input and forget gates outputs are then used to update the cell state (by forgetting some information through the forget gate and adding information from the input transform by the input gate). Finally, the output gate uses this new cell state in combination with the input to create the output of the cell.
This solves the gradient vanishing problem because the BPTT is done on four different parts of the cell. Specifically, the forget gate’s vector, combined with the addition of four different terms, make it very unlikely that the gradient would tend toward 0. It therefore reduces the probability of vanishing gradient. This mechanism allows LSTM to keep past events in memory for many time steps.
LSTM is one of the most widely used recurrent model [review_lstm]. Its applications range from time series modelling to Natural Language Processing (NLP).
We will focus here on their application in the scope of time series modelling, with some impressive results achieved on classification and forecasting tasks.
LSTMs have been used to classify time series [review_lstm]. One such common appli-cation is the prediction of faults. In turbine, the authors specifically used an LSTM-based framework in order to avoid reliance on experts. They outperformed the traditional me-thods of fault classification in wind turbines by using only raw data time series coming from sensors. A study of how to design flexible fault diagnosis models using LSTM was published by faultdia, with a case study on nuclear energy.
Unlike the historical approaches, LSTMs make no assumptions about the stationarity, the linear dependence or the sequence correlation. This made them very useful in many forecasting tasks where they vastly outperformed traditional benchmarks such as ran-dom forest or standard logistic regression [review_lstm]. They have proved superior to traditional baselines in various applications such as building energy usage prediction [building], carbon emission levels [carbon], health predictive analytics [healthpredictive] and many other [review_lstm]. LSTMs can also be used with input data of various natures. For example, in taxiNY, they combined time series observation with text input in order to predict taxi demand in New York. The method used there can easily be adapted to all sorts of applications that need handling of heterogenous data.
Our goal remains to learn representations. As it is difficult to supervise directly, the use of the forecasting task allows us to have a natural supervision. Indeed, the sequential nature of the data and the model makes the prediction of the next values the most intuitive task. Thanks to their hidden content, LSTM (and RNNs in general) learn representations of the time series. At any given time step, the hidden layer of the network contains information of the input time series thus far.
Hybridisation of CNNs and RNNs
Another architecture that can be used for time series forecasting is CNN. CNNs are neural networks that alternate layers of convolutions and layer of pooling and often end with classic fully connected layers. ?? shows a CNN built to perform image classification.
Table of contents :
1.1 Context and goals of this work
2 related work
3 time series’s generation and forecasting from disentangled representations of contexts
3.1 Related work
3.2 Model .
4 driver and context recognition from can data using disentangled representations
4.1 Driving Behaviour Assessment
4.2 Our Data Driven Approach Based on CAN Data
5.1 Where we began
5.2 Where we are now