Temporal pattern mining – Rule discovery
Rule discovery aims to relate patterns in a time series to other patterns from the same time series or from other series [Das et al., 1998]. One approach is based on the discretization of time series into sequences of symbols. Association rule mining algorithms are then used to discover relationships between symbols (i.e. patterns). Such algorithms were originally developed to mine association rules between sets of items in large databases [Agrawal et al., 1993b]; [Das et al., 1998], for instance, extended them to discretized time series. Instead of association rules, decision trees can also be learned from motifs extracted from the time series [Ohsaki and Sato, 2002].
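As a minimal sketch of the discretization step that precedes rule mining, the toy code below (the helper names and the threshold binning are illustrative, not from the cited papers) maps values to symbols and scores a simple "A is immediately followed by B" rule by its confidence:

```python
# Illustrative sketch: discretize a series into symbols, then score a
# one-step association rule between symbols by its confidence.

def discretize(series, bins=(0.0,)):
    """Map each value to a symbol by threshold binning ('a' below, 'b' above, ...)."""
    symbols = []
    for x in series:
        idx = sum(x > t for t in bins)  # index of the bin the value falls in
        symbols.append(chr(ord('a') + idx))
    return symbols

def rule_confidence(symbols, antecedent, consequent):
    """Confidence of the rule 'antecedent is immediately followed by consequent'."""
    pairs = list(zip(symbols, symbols[1:]))
    n_ante = sum(1 for s in symbols[:-1] if s == antecedent)
    n_both = sum(1 for a, b in pairs if a == antecedent and b == consequent)
    return n_both / n_ante if n_ante else 0.0

series = [-1.0, 0.5, 0.7, -0.2, 0.9, 0.1]
syms = discretize(series)              # ['a', 'b', 'b', 'a', 'b', 'b']
conf = rule_confidence(syms, 'a', 'b') # 1.0: 'a' is always followed by 'b' here
```

Real approaches such as [Das et al., 1998] use windowed subsequences and clustering rather than simple value binning, but the principle, symbols first, rules second, is the same.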
Relationships between time series mining, time series analysis and signal processing
Other scientific communities are involved in the development of techniques to process and analyze time series, for instance time series analysis and signal processing.
Time series analysis aims to develop techniques to analyze and model the temporal structure of time series, in order to describe an underlying phenomenon and possibly to forecast future values. The correlation of successive points in a time series is assumed. When this dependency can be modeled (autocorrelation, trend, seasonality, etc.), it is possible to fit a model and forecast the next values of a series. Signal processing is a very broad field with many objectives: for instance improving signal quality, compressing a signal for storage or transmission, or detecting a particular pattern. Time series mining is at the intersection of these fields and machine learning: as we will see in Chapter 3, the application of machine learning techniques to time series data benefits from feature extraction and transformation techniques designed in other communities.
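The autocorrelation mentioned above can be sketched in a few lines; the snippet below (a simple biased sample estimator, written from scratch for illustration) correlates a series with a lagged copy of itself:

```python
import math

def autocorrelation(series, lag):
    """Biased sample autocorrelation of a series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

# A pure sine wave of period 20 is strongly autocorrelated at lag 20:
wave = [math.sin(2 * math.pi * t / 20) for t in range(200)]
r = autocorrelation(wave, 20)  # 0.9 with this estimator (9 of 10 periods overlap)
```

A forecasting model such as an AR process exploits exactly this kind of dependency between a point and its predecessors.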
A time series is not a suitable feature vector for machine learning
In general, a raw time series is not suitable input for a machine learning algorithm, for several reasons.
In statistical decision theory, a feature vector X is usually a vector of m real-valued random variables, such that X ∈ ℝ^m. In supervised learning, there exists an output variable Y, whose domain depends on the application (for instance a finite set of values Y ∈ C in classification, or Y ∈ ℝ in regression), such that X and Y are linked by an unknown joint distribution Pr(X, Y) that is approximated with a function f such that f(X) → Y. The function f is chosen according to hypotheses made on the data distribution, and f is fitted to optimize a loss function L(Y, f(X)) that penalizes prediction errors. The feature vector X is expected to be low-dimensional in order to avoid the curse of dimensionality, a phenomenon that affects the performance of f because instances are located in a sparse feature space [Hastie et al., 2009]. [Hegger et al., 1998] discusses the impact of high dimensionality when building a meaningful feature space from time series for time series analysis: with time series, the density of vectors is small and decreases exponentially with the dimension. To counter this effect, an exponentially increasing number of instances in the dataset is necessary. Additionally, the relative position of a random variable in the feature vector X is not taken into account when fitting f.
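The sparsity argument can be observed directly with a small simulation (toy illustration, not from the cited works): with a fixed sample size, the average distance between points drawn uniformly in the unit hypercube grows with the dimension, so the space around each instance empties out.

```python
import math
import random

def mean_pairwise_distance(n_points, dim, seed=0):
    """Average Euclidean distance between points drawn uniformly in [0,1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return sum(dists) / len(dists)

low = mean_pairwise_distance(100, 2)      # roughly 0.5 in 2 dimensions
high = mean_pairwise_distance(100, 1000)  # roughly 13 in 1000 dimensions
```

With the same 100 instances, neighborhoods that are dense in 2 dimensions become extremely sparse in 1000: this is the regime a raw time series of a thousand points puts a learner in.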
Time series data is by nature high dimensional: thousands of points per instance or more is common. A major characteristic of time series is the correlation between successive points in time, with two immediate consequences. First, preserving the order of the time series points (i.e. the random variables of X) is essential. Second, the intrinsic dimensionality of the relevant information contained in a time series is typically much lower than the dimensionality of the whole time series [Ratanamahatana et al., 2010]: since successive points are correlated, many of them are redundant.
To illustrate the specific issues that arise when learning from time series, let us take a naive approach. We consider the whole time series as the feature vector fed to any classical machine learning algorithm (decision tree, SVM, neural network, etc.). Each point of the time series is a dimension of the feature vector, as illustrated in Figure 2.7.
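The naive approach can be sketched as follows, here with a 1-nearest-neighbor classifier written from scratch on toy data (classes, values, and names are purely illustrative):

```python
import math

def nn_classify(train_X, train_y, x):
    """Return the label of the training series closest to x (point-wise Euclidean)."""
    best = min(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    return best[1]

# Two toy classes: flat series vs. series containing a bump.
flat = [[0.0] * 8, [0.1] * 8]
bump = [[0, 0, 1, 1, 0, 0, 0, 0], [0, 0, 0.9, 1.1, 0, 0, 0, 0]]
X = flat + bump
y = ["flat", "flat", "bump", "bump"]

# A well-aligned query is classified correctly:
label = nn_classify(X, y, [0, 0, 1.0, 1.0, 0, 0, 0, 0.05])
```

This works only because the query's bump occupies the same dimensions as in the training series; the next paragraphs show why that assumption is fragile.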
A given dimension of X is expected to be a random variable with the same distribution across the instances of the dataset. This means that time series have to be aligned across the dataset. This strong assumption is hard to meet for several reasons, in particular because of the distortions a time series can suffer.
• Time series from different instances of a dataset may not share the same length: the resulting feature vectors would not have the same number of dimensions.
It is usually not possible to change the input size of a machine learning algorithm on the fly, so each time series that does not fit the input size would require an adjustment.
• Time series measure the occurrence of a phenomenon: either the recording must have been performed with perfect timing, or a posterior segmentation of the time series is required to isolate the phenomenon from the raw measurements, so as to obtain a feature vector X where each dimension samples the same stage of the phenomenon across instances. For instance, if a set of time series stores city temperatures, the first point of each time series must share the same time-stamp, such as “January mean temperature”, and so on. [Hu et al., 2013] highlights this issue: the literature usually assumes that the beginning and ending points of an interesting pattern can be correctly identified, while this is unjustified. Also, the phenomenon of interest may be a localized subsequence within a larger time series. This subsequence may be out of phase across instances and appear at random positions in the time series. Figure 2.8 illustrates this point: the red motif, shifted, would appear in different positions in the feature vector. Motif discovery is a complex task, and dedicated approaches with appropriate processing steps are needed to form a proper feature vector.
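The alignment issue above can be made concrete with a toy example (illustrative data): the same motif, shifted by a few points, produces a large point-wise distance even though both series contain exactly the same pattern.

```python
import math

motif = [1.0, 2.0, 1.0]

def embed(motif, position, length=10):
    """Place the motif at a given position inside an otherwise flat series."""
    series = [0.0] * length
    series[position:position + len(motif)] = motif
    return series

a = embed(motif, 2)  # motif starts at index 2
b = embed(motif, 6)  # same motif, shifted to index 6

same_pattern_dist = math.dist(a, b)           # large, despite identical content
aligned_dist = math.dist(a, embed(motif, 2))  # 0.0 when perfectly aligned
```

A point-wise distance on the raw feature vector thus considers two identical, shifted motifs as very different, which is exactly what motif-aware approaches are designed to avoid.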
Solutions to train machine learning algorithms on time series
Training predictors on time series often requires quantifying the similarity between time series: a predictor will try to learn a mapping that assigns the same label or cluster to similar time series. The issue is deciding on what grounds the content of two time series is similar. The literature offers many propositions but, as mentioned in [DeBarr and Lin, 2007], no single approach performs best for all datasets. The main reason resides in the complexity and the heterogeneity of the information contained in time series.
To compare time series, literature usually makes use of two complementary concepts:
• A time series representation transforms a time series into another time series or a feature vector. The objectives are to highlight the relevant information contained in the original time series, to remove noise, to handle distortions and usually to reduce the dimensionality of the data to decrease the complexity of further computations. A representation exposes the features on which time series will be compared.
• A distance measure quantifies the similarity between time series, based either on the raw time series or on their representations. In the latter case, all the time series must obviously share the same representation procedure. Complex distance measures usually aim to handle time series distortions.
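The two concepts combine naturally; as a sketch (names and parameter choices are illustrative), the snippet below uses a simple piecewise aggregate approximation (PAA) as the representation and the Euclidean distance as the measure applied to the reduced series:

```python
import math

def paa(series, n_segments):
    """Piecewise aggregate approximation: reduce a series to segment means."""
    w = len(series) // n_segments  # assumes the length is divisible by n_segments
    return [sum(series[i * w:(i + 1) * w]) / w for i in range(n_segments)]

s1 = [0, 0, 1, 1, 4, 4, 1, 1]
s2 = [0, 1, 1, 2, 3, 5, 2, 0]

r1, r2 = paa(s1, 4), paa(s2, 4)  # 8 points reduced to 4 segment means each
d = math.dist(r1, r2)            # distance computed on the representation
```

The representation halves the dimensionality here; on real data the reduction factor is typically much larger, which is what makes further computations tractable.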
Based on these two concepts, two main strategies emerge from the literature to apply machine learning algorithms on time series data. We call them time-based and feature-based approaches.
Time-based approaches consider the whole time series and apply distance measures directly to the time series to quantify their similarity. A time series representation can be used; it typically produces another time series of lower dimension to decrease the computational requirements.
Feature-based approaches transform the time series into a vector of features. The resulting representation is no longer a time series but a set of features, manually or automatically designed, that extract local or global characteristics of the time series.
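A minimal sketch of the feature-based strategy (the particular choice of descriptors below is purely illustrative) maps each series to a small, fixed-size vector of global characteristics:

```python
import math

def global_features(series):
    """Map a series of any length to a fixed-size vector of global descriptors."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [mean, std, min(series), max(series)]

features = global_features([0, 1, 2, 3, 4, 3, 2, 1])
# The output has 4 dimensions regardless of the input length, so series of
# different lengths become directly comparable feature vectors.
```

Note that such global descriptors discard the ordering of the points entirely; Chapter 3 reviews richer features, including local ones, that retain more of the temporal structure.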
Figure 2.10 illustrates these concepts. The following sections present an overview of the two approaches. Since our work is focused on time series classification, we mainly review time-based and feature-based classification approaches.
Table of contents:
I Learning from Time Series
2 Machine Learning on Time Series
2.1 Definitions & Notations
2.2 Overview of the time series mining field
2.2.1 Motif discovery
2.2.2 Time series retrieval
2.2.4 Temporal pattern mining – Rule discovery
2.2.5 Anomaly detection
2.3 Relationships between fields
2.4 Time series mining raises specific issues
2.4.1 A time series is not a suitable feature vector for machine learning
2.5 Train machine learning algorithms on time series
2.5.1 Time-based classification
2.5.2 Feature-based classification
3 Time Series Representations
3.1 Concept of time series representation
3.2 Time-based representations
3.2.1 Piecewise Representations
3.2.2 Symbolic representations
3.2.3 Transform-based representations
3.3 Feature-based representations
3.3.1 Overall principle
3.3.2 Brief overview of features from time series analysis
3.4 Motif-based representations
3.4.1 Recurrent motif
3.4.2 Surprising or anomalous motif
3.4.3 Discriminant motif
3.4.4 Set of motifs and Sequence-based representation
3.5 Ensemble of representations
II Our Contribution: a Discriminant Motif-Based Representation
4 Motif Discovery for Classification
4.1 Time series shapelet principle
4.2 Computational complexity of the shapelet discovery
4.2.1 Early abandon & Pruning non-promising candidates
4.2.2 Distance caching
4.2.3 Discovery from a rough representation of the time series
4.2.4 Alternative quality measures
4.2.5 Learning shapelet using gradient descent
4.2.6 Infrequent subsequences as shapelet candidates
4.2.7 Avoid the evaluation of similar candidates
4.3 Various shapelet-based algorithms
4.3.1 The original approach: the shapelet-tree
4.3.2 Variants of the shapelet-tree
4.3.3 Shapelet transform
4.3.4 Other distance measures
4.3.5 Shapelet on multivariate time series
4.3.6 Early time series classification
5 Discriminant Motif-Based Representation
5.2 Subsequence transformation principle
5.3 Motif-based representation
6 Scalable Discovery of Discriminant Motifs
6.1 An intractable exhaustive discovery among S
6.2 Subsequence redundancy in S
6.3 A random sub-sampling of S is a solution
6.4 Discussion on |Ŝ|, the number of subsequences to draw
6.5 Experimentation: impact of random subsampling
7.1 Discovery as a feature selection problem
III Industrial Applications
8 Presentation of the industrial use cases
8.1 Context of the industrial use cases
8.1.1 Steel production & Process monitoring
8.1.2 Types of data
8.1.3 Industrial problematic formalization
8.2 Description of the use cases
8.2.1 1st use case: sliver defect, detection of inclusions at continuous casting
8.2.2 2nd use case: detection of mechanical properties scattering
9 Benchmark on the industrial use cases
9.1 Experimental procedure
9.1.1 Feature vector engineering for the time series
9.1.2 Learning stack
9.1.3 Classification performance evaluation
9.2.1 Classification performances
9.2.2 Illustration of discovered EAST-shapelets
9.3 Computational performances
10 Conclusions & Perspectives