Computational complexity of the shapelet discovery

Get Complete Project Material File(s) Now! »

Machine Learning on Time Series

In this chapter, we introduce the time series mining field. We begin with the definitions of the main concepts and notations used in this manuscript. Then, we propose an overview of the main time series mining tasks. Training machine learning models on time series raise issues. We will discuss them and describe two main strategies developed in the literature to overcome them: all time series mining approaches make use of time series representations and distance or similarity measures. Our work, and thus this manuscript, is mainly focused on time series classification.

Definitions & Notations

We introduce here the concept of time series and the notations used in this manuscript. A time series is an ordered sequence of real variables, resulting from the observation of an underlying process from measurements usually made at uniformly spaced time instants according to a given sampling rate [Esling and Agon, 2012]. A time series can gather all the observations of a process, which can result in a very long sequence. A time series can also result of a stream and be semi-infinite.
Machine learning techniques typically consider feature vectors with independent or uncorrelated variables. This is not the case with time series where data are gathered sequentially in time with repeated observations of a system: successive measurements are considered correlated and the time-order is vital [Reinert, 2010]. The correlation across time of the measurements lead to specific issues and challenges in machine learning, in particular to represent the meaningful information, usually encoded in shapes or trends spanning over several points in time.

OVERVIEW OF THE TIME SERIES MINING FIELD

Overview of the time series mining field

The time series mining field can be seen as an instance of the data mining field applied to time series. It involves machine learning techniques to discover automatically meaningful knowledge from sets of time series. Some time series mining tasks, such as classification, perform predictions based on a representation of the time series. We propose in this section an overview of the diﬀerent time series mining tasks.

Motif discovery

To introduce the motif discovery task we must define the concept of subsequence. A subsequence s is a contiguous set of points extracted from a time series Tn;m starting at position i of length l such that:
A motif is a subsequence with specific properties. Motif discovery is the process that returns a motif or a set of motifs from a time series or a set of time series. Several types of motifs exist.
Recurrent motif A recurrent motif is a subsequence that appears recurrently in a longer time series or in a set of time series without trivial matches (overlapping areas) [Lin et al., 2002]. Recurrent motifs can also be defined as high density in the space of subsequences [Minnen et al., 2007], their discovery then consist in locating such regions.
Infrequent or surprising motif & anomaly detection An infrequent or surprising mo-tif is a subsequence that has never been seen or whose frequency of occurrence is significantly lower than other subsequences. Several concepts exist around this def-inition, such as the time series discords [Keogh et al., 2005, Keogh et al., 2007] defined as “subsequences maximally diﬀerent to all the rest of the subsequences (…) most unusual subsequences within a time series” or defined as the detection of previ-ously unknown patterns [Ratanamahatana et al., 2010]. This kind of motif is related with anomaly detection in time series, which aims to discover abnormal subsequences [Weiss, 2004, Leng, 2009, Fujimaki et al., 2009], defined as a motif whose frequency of occurrence diﬀers substantially from expected, given previously seen data [Keogh et al., 2002].
Task-specific relevant motif Motif discovery can be driven with a specific task in mind. For instance, it may be expected from discovered motifs to support a classification task. In this case, a motif or a set of motifs is said to be discriminant of class labels. The time series shapelet [Ye and Keogh, 2009] is one instance of this definition, among others [Zhang et al., 2009].
Many publications discuss the concept of time series motifs with often a proposition of algorithm. Such algorithms usually intend to solve the computational complexity of the motif discovery, since the naive discovery based on exhaustive subsequence enumeration is prohibitive [Keogh et al., 2002, Weiss, 2004, Mueen, 2013]. The precise definition of recur-rent, surprising or relevant is not obvious, many articles propose a criterion or a heuristic to detect motif among subsequences according to their own definition: the definition depends on the task to solve [Ratanamahatana et al., 2010]. A deeper review on motif discovery is performed in section 3.4.

OVERVIEW OF THE TIME SERIES MINING FIELD

The motif is a central concept in time series mining since it allows the capture of temporal shapes.

Time series retrieval

Given a query time series Tn, time series retrieval aims to find the most similar time series in a dataset D based on their information content [Agrawal et al., 1993a, Faloutsos et al., 1994]. The querying can be performed on complete time series or on time series subsequences. In the latter case, every subsequence of the series that matches the query is returned. These two querying approaches are named respectively whole series matching and subsequence matching [Faloutsos et al., 1994, Keogh et al., 2001, Ratanamahatana et al., 2010]. Subsequence matching can be seen as a whole time series matching problem, when time series are divided into subsequences of arbitrary length segments or by motif extraction.
One major issue is the computational complexity of the retrieval. Many approaches have been proposed [Agrawal et al., 1993a, An et al., 2003, Faloutsos et al., 1997], and more recent ones [Wang et al., 2013b, Camerra et al., 2014, Zoumpatianos et al., 2016], based on a combination of time series representation (to reduce the dimensionality of the time series and accelerate the querying process) approximate similarity measures (to quickly discard irrelevant candidates without false dismissal, cf. lower bounding principle) and eﬃcient indexing.
Time series retrieval has caught most of the research attention in time series mining [Esling and Agon, 2012].

Clustering

Time series clustering aims to find groups of similar time series in a dataset. These groups are usually found by assembling similar time series in order to decrease the variance inside groups of similar time series and increase the variance between them. A survey on time series clustering can be found in [Liao, 2005]. The main issue of most clustering approaches is to determine the number of clusters. Since clusters are not predefined, a domain expert may often be required to interpret the obtained results [Ratanamahatana et al., 2010].
Clustering is divided into whole sequence clustering and subsequence clustering [Ratanama-hatana et al., 2010]. In whole sequence clustering, the whole time series are grouped into clusters. When the raw time series are considered, specific distance measures can be used to handle distortions and length diﬀerences with classical clustering algorithms. In subsequence clustering, subsequences are extracted from time series to perform the cluster-ing. Literature suggests that extracting subsequences must be handled with care to avoid meaningless results [Keogh and Lin, 2005]. In particular, subsequence clustering may not provide meaningful results if the clustering is performed on the entire set of subsequences. Relevant subsequences should be extracted from the time series prior to clustering, for instance with a motif discovery algorithm [Chiu et al., 2003].
Time series clustering can be a stage for several other time series mining tasks, such as summarization or motifs discovery algorithm [Keogh et al., 2007].

Temporal pattern mining – Rule discovery

Rule discovery aims to relate patterns in a time series to other patterns from the same time series or others [Das et al., 1998]. One approach is based on the discretization of time series into sequences of symbols. Then association rule mining algorithms are used to discover relationships between symbols (ie. patterns). Such algorithms have been developed to mine association rules between sets of items in large databases [Agrawal et al., 1993b]. [Das et al., 1998], for instance, has extended it to discretized time series. Instead of association rules, decision tree can be learned from motifs extracted from the time series [Ohsaki and Sato, 2002].

Anomaly detection

Anomaly detection is defined as the problem of detecting “patterns in data that do not conform to expected behavior” [Kandhari, 2009]. In a previous paragraph, we mentioned the discovery of motifs on a definition based on abnormal subsequences.

Summarization

A time series is usually long with complex information, an automatic summary may be useful to get insights on the time series. Simple statistics (mean, standard deviation, etc.) are often insuﬃcient or inaccurate to describe suitably a time series [Ratanamahatana et al., 2010]. A specific processing has to be used to summarize time series. For instance, anomaly detection and motif discovery can be used to get a summary of the time series content: anomalous, interesting or repeating motifs are reported. Summarization can also be seen as a special case of clustering, where time series are mapped to clusters, each cluster having a simple description or a prototype to provide a higher view on the series.
Related with the summarization task, visualization tools of time series have been de-veloped in the literature (Time Searcher, Calendar-based visualization, Spiral, VizTree) [Ratanamahatana et al., 2010].

Classification

Time series data also supervised approaches, such as classification. Time series classifi-cation aims to predict a label given a time series in input of a model. The main task consists in the discovery of discriminative features from the time series to distinguish the classes from each other. Time series classification is used to perform for instance pat-tern recognition, spam filtering, medical diagnosis (classifying ECG, EEG, physiological measurements), anomaly or intrusion detection, transaction sequence data in a bank or detecting malfunction in industry applications [Ratanamahatana et al., 2010, Xing et al., 2010].
We can distinguish the weak classification from the strong classification.

RELATIONSHIPS BETWEEN FIELDS

Weak Classification One single class is associated with a time series. It is probably the most common classification scenario in the literature. An example is shown figure 2.6.
Strong Classification Several labels can be used to describe with more granularity a time series. An example is the record of a patient’s health over time: various condi-tions may alternate from wealthy periods to illness along one time series describing a physiological parameter [Xing et al., 2010]. In this case a sequence of labels is associated with a time series instead of one single global label.

Relationships between time series mining, time series analysis and signal processing

READ Historical Highlights and Standardization Initiatives

Other scientific communities are involved in the development of techniques to process and analyze time series, for instance time series analysis and signal processing.
Time series analysis aims to develop techniques to analyze and model the temporal structure of time series to describe an underlying phenomena and possibly to forecast future values. The correlation of successive points in a time series is assumed. When it is possible to model this dependency (autocorrelation, trend, seasonality, etc.), it is possible to fit a model and forecast the next values of a series. Signal processing is a very broad field with many objectives: for instance to improve signal quality, to compress it for storage or transmission or detect a particular pattern.
Time series mining is at the intersection of these fields and machine learning: as we will see chapter 3, the application of machine learning techniques on time series data benefits from feature extraction and transformation techniques designed in other communities.

Time series mining raises specific issues

The tasks presented in the previous chapter are typical of the machine learning field, how-ever time series have specificities that prevent us to apply common approaches developed in the general machine learning literature. The reasons for this are discussed below.

A time series is not a suitable feature vector for machine learning

In general, a raw time series cannot be considered as suitable to feed a machine learning algorithm for several reasons.
In statistical decision theory, a feature vector X is usually a vector of m real-valued random variables such as X 2 Rm. In supervised learning, it exists an output variable Y , whose domain depends on the application (for instance a finite set of values Y 2 C in classification or Y 2 R in regression) such as X and Y are linked by an unknown joint distribution P r(X; Y ) that is approximated with a function f such as f(X) ! Y . The function f is chosen according to hypothesis made on the data distribution and f is fitted in order to optimize a loss function L(Y; f(X)) to penalize prediction errors. The feature vector X is expected to be low-dimensional in order to avoid the curse of dimensionality phenomenon that aﬀects performances of f because instances are located in a sparse feature-space [Hastie et al., 2009]. [Hegger et al., 1998] discusses the impact of high dimensionality to build a meaningful feature space from time series to perform time series analysis: with time series, the density of vectors is small and decreases exponentially with the dimension. To counter this eﬀect an exponentially increasing number of instances in the dataset is necessary. Additionally, the relative position of a random variable in the feature vector X is not taken into account to fit f.
Time series data is by nature high dimensional, thousands points by instance or more is common. A major characteristic of time series is the correlation between successive points in time, with two immediate consequences. First, the preservation of the order of the time series points (ie. the random variables of X) is essential. Secondly, the intrinsic dimensionality of the relevant information contained in a time series is typically much lower than the whole time series dimensionality [Ratanamahatana et al., 2010], since they are correlated, many points are redundant.
To illustrate the specific issues while learning from time series, let’s take a naive ap-proach. We consider the whole time series as a feature vector in input of any classical machine learning algorithm (decision tree, SVM, neural network, etc.). Each point of the time series is a dimension of the feature vector as illustrated figure 2.7.
A given dimension of X is expected to be a random variable with the same distribution across instances of the dataset. It means that time series have to be aligned across the dataset. This strong assumption is hard to meet for several reasons, in particular because of the distortions a time series can suﬀer.
• Time series from various dataset’s instances may not share the same length: the resulting feature vectors would not have the same number of dimensions.
It is usually not possible to change the input size of a machine learning algorithm on-the-fly. Each time series that doesn’t fit the input size would require an adjustment:

TIME SERIES MINING RAISES SPECIFIC ISSUES

• A time series can be multivariate: the previous issues are reinforced in this case.
Beside these issues and the curse of dimensionality, the high dimensionality of the time series is both a computing and a storage challenge. The information contained in a time series may be eﬃciently compressed using an adequate representation of the data. An adequate representation presents several advantages: beyond lower storage and computing requirements, a compact representation of the time series can ease the learning process by highlighting the relevant information and decreasing the dimensionality of the problem. The question of the representation of the time series information is discussed in detail In the next section, we detail the main approaches developed in the literature to train machine learning algorithms from time series data to handle these issues.

Solutions to train machine learning algorithms on time series

Training predictors from time series often requires a quantification of the similarity between time series. A predictor will try to learn a mapping to label or cluster identically similar time series. The issue is on what grounds to decide that time series content is similar. The literature oﬀers many propositions, but as mentioned in [DeBarr and Lin, 2007] there is not one single approach that performs best for all datasets. The main reason resides in the complexity and the heterogeneity of the information contained in time series.
To compare time series, literature usually makes use of two complementary concepts:
• A time series representation transforms a time series into another time series or a feature vector. The objectives are to highlight the relevant information contained in the original time series, to remove noise, to handle distortions and usually to reduce the dimensionality of the data to decrease the complexity of further computations. A representation exposes the features on which time series will be compared.
• A distance measure quantify the similarity between time series, either based on the raw time series or their representations. In the latter case, all the time series must ob-viously share the same representation procedure. Complex distance measures usually aims to handle time series distortions.
Based on these two concepts, two main strategies emerge from the literature to apply machine learning algorithms on time series data. We call them time-based and feature-based approaches.
Time-based approaches consider the whole time series and apply distance measures on the time series to quantify their similarity. A time series representation can be used, it typically produces another time series of lower dimension to decrease the computational requirements.
Feature-based approaches transform the time series into a vector of features. The resulting time series representation is not yet a time series, but a set of features manually or automatically designed to extract local or global characteristics of the time series.
Figure 2.10 illustrates these concepts. The following sections present an overview of the two approaches. Since our work is focused on time series classification, we mainly review time-based and feature-based classification approaches.
Time-based classifiers are suitable when the time series are well segmented on the phe-nomenon to describe so that a matching, performed over the whole time series in the time domain using distance measures, is relevant. This approach has a lot to do with time series retrieval. It has been claimed that one instance of the time-based classification, the nearest neighbor (k N N), is diﬃcult to beat when it is associated with the right distance measure to handle the time series distortions [Batista et al., 2011]. However, this approach assumes that the whole time series are comparable and thus perfectly segmented around the phenomenon of interest. A k N N returns the k nearest time series in a dataset for an unobserved time series in input. Since the time series are associated with a label, the unobserved time series to label is associated with the majority class of its k neighbors. The 1 N N in association with the Euclidean distance [Keogh and Kasetty, 2002] has been shown competitive in comparison with more complex distance measures (for eg. the Dynamic Time Warping -DTW) when the number of instances in the dataset is large [Ding et al., 2008].
Time-based classification requires from all the time series in the dataset to share the same length and the relevant patterns to be aligned. However, some distortions can be handled with suitable distance measures. In fact, most of the literature on time-based approaches focuses on distance measures to improve their capacity to handle misaligned time series and distortions using for instance elastic distance measures [Lines et al., 2014].
In the next paragraphs we propose a brief overview of the distortions a time series can suﬀer and the families of distance measures that have been developed to overcome them.

Table of contents :

1 Introduction
I Learning from Time Series
2 Machine Learning on Time Series
2.1 Definitions & Notations
2.2 Overview of the time series mining field
2.2.1 Motif discovery
2.2.2 Time series retrieval
2.2.3 Clustering
2.2.4 Temporal pattern mining – Rule discovery
2.2.5 Anomaly detection
2.2.6 Summarization
2.2.7 Classification
2.3 Relationships between fields
2.4 Time series mining raises specific issues
2.4.1 A time series is not a suitable feature vector for machine learning
2.5 Train machine learning algorithms on time series
2.5.1 Time-based classification
2.5.2 Feature-based classification
2.6 Conclusions
3 Time Series Representations
3.1 Concept of time series representation
3.2 Time-based representations
3.2.1 Piecewise Representations
3.2.2 Symbolic representations
3.2.3 Transform-based representations
3.3 Feature-based representations
3.3.1 Overall principle
3.3.2 Brief overview of features from time series analysis
3.4 Motif-based representations
3.4.1 Recurrent motif
3.4.2 Surprising or anomalous motif
3.4.3 Discriminant motif
3.4.4 Set of motifs and Sequence-based representation
3.5 Ensemble of representations
3.6 Conclusions
II Our Contribution: a Discriminant Motif-Based Representation
4 Motif Discovery for Classification
4.1 Time series shapelet principle
4.2 Computational complexity of the shapelet discovery
4.2.1 Early abandon & Pruning non-promising candidates
4.2.2 Distance caching
4.2.3 Discovery from a rough representation of the time series
4.2.4 Alternative quality measures
4.2.5 Learning shapelet using gradient descent
4.2.6 Infrequent subsequences as shapelet candidates
4.2.7 Avoid the evaluation of similar candidates
4.3 Various shapelet-based algorithms
4.3.1 The original approach: the shapelet-tree
4.3.2 Variants of the shapelet-tree
4.3.3 Shapelet transform
4.3.4 Other distance measures
4.3.5 Shapelet on multivariate time series
4.3.6 Early time series classification
4.4 Conclusions
5 Discriminant Motif-Based Representation
5.1 Notations
5.2 Subsequence transformation principle
5.3 Motif-based representation
5.4 Conclusions