consistent online time series clustering algorithm

Get Complete Project Material File(s) Now! »

Stationary ergodic framework

The main and only statistical assumption that we make is that each time series is generated by a stationary ergodic process distribution. Intuitively, stationarity means that the time index does not bear any information. That is, the time index at which the observations have started to be recorded is not important. Ergodicity means that asymptotically, any realization of the time series reveals complete information about its distribution. The assumption that a process distribution is stationary ergodic is one of the weakest assumptions in statistics. Indeed, by the ergodic decomposition, any stationary process can be represented as a mixture of stationary ergodic processes. This means that any realization of a stationary process can be considered a realization of a stationary ergodic process (Billingsley, 1961). By this argument, we can safely state that the setting considered in this thesis is characterized by no other assumption than stationarity, which is in turn extremely mild. Note that we make no such assumptions as independence within the samples in each sequence, or between the different sequences.
Moreover, the process distributions need not satisfy any mixing conditions: the data may have long-range dependence, and infinite memory.
Clearly this framework, however general, is still not assumption-free and cannot possibly address all real-world applications, as not all time series are expected to be modelled by stationary ergodic process distributions. For example, consider time series corresponding to video recordings of human locomotion. While the act of “running”
can be thought of as an ergodic motion, a “single jump” can not. Note, however, that the stationary ergodic assumption is still rather mild as compared to the statistical assumptions typically made in the literature. As a result, this framework can be suitable for a broader range of applications. At the same time, this general setting calls for a very different kind of analysis. The methods designed for the more restrictive settings typically considered in the literature, make extensive use of rates of convergence. In contrast, rates of convergence (even of frequencies to respective probabilities) are impossible to obtain for stationary ergodic process distributions; see for example, (Shields, 1996). This entails such strong impossibility results as the non-existence of consistent tests for homogeneity. Specifically, as shown in (Ryabko, 2010b), given two samples generated by stationary ergodic process distributions, it is impossible to distinguish between the case where they are generated by the same source or by two different ones; this result holds even if the given samples are binary-valued. Due to these restrictions, for many statistical problems it is unclear whether consistent solutions even exist in this setting. Thus, one of the main challenges when considering the stationary ergodic framework consists in finding the appropriate problem formulations that, without the need for stronger assumptions, can admit consistent solutions. Moreover, the theoretical guarantees obtained for algorithms involving this class of processes, cannot be beyond asymptotic results. As a result, the algorithms developed for this framework are forced not to rely on rates of convergence. A possible downside of this scenario is that the asymptotic results obtained for these methods cannot be strengthened and in particular, rates of decrease of error cannot be obtained. On the bright side, however, this limitation may be seen as an advantage in the sense that the rate-free methods designed to work in this general framework are potentially applicable to a much wider range of real-world scenarios.

Proposed formulations and main results

In this section we give an overview of the proposed formulations along with a brief description of our results. A summary of our main contributions is also provided in Section 1.5. In light of the theoretical impossibility results discussed above, not every statistical problem involving time series can be expected to admit a consistent solution in the stationary ergodic framework. Some examples of the problems that, as follows from the results of (Ryabko, 2010b), are provably impossible to be solved in this setting include time series clustering for unknown number of clusters, change point detection, and change point estimation when the number of change points is unknown. In this thesis we consider two classical problems namely, change point estimation and online time series clustering, introduced in Sections 1.2.1 and 1.2.2 respectively. For each problem we provide appropriate formulations that, as we demonstrate, admit asymptotically correct solutions under the assumption that the data are stationary ergodic. An overview of our formulations follows.

Change point estimation

Change point estimation involves an a posteriori analysis of a given heterogenous time series, as a means to locate the point in time where the process distribution that generates the samples has abruptly changed. This is an important tool in many practical applications. Indeed, solutions to many real-world problems are directly or indirectly related to the inference of changes in distribution of sequential observations. A number of application areas include bioinformatics (Picard et al., 2005, Vert and Bleakley, 2010), financial modeling (Talih and Hengartner, 2005, Lavielle and Teyssiere, 2007), network traffic data analysis (Tartakovsky et al., 2006, Lévy-Leduc and Roueff, 2009, Lung-Yut-Fong et al., 2012), fraud and anomaly detection (Bolton and Hand, 2002, Akoglu and Faloutsos, 2010) and speech segmentation (Fox et al., 2011).
The problem of change point estimation may be introduced as follows. A given sequence x := X1; : : : ;Xn is formed as the concatenation of +1 non-overlapping segments where the consecutive segments are generated by different process distributions.
The index where one segment ends and another starts is called a change point. Specifically,a change point refers to the index at which the process distribution of x has changed. Thus, x has change points, which are unknown and have to be estimated. In this work we consider the problem of change point estimation in the general stationary ergodic paradigm discussed in Section 1.1. Our only assumption is that each segment is generated by some (unknown) stationary ergodic process distribution. The joint distribution over the samples can be otherwise arbitrary. The marginal distributions of any given fixed size (e.g. single-dimensional marginals) before and after a change point are allowed to be the same. For example the mean, variance etc. may remain unchanged throughout the entire sequence. This means that we consider the most general and perhaps the most natural notion of change, that is, change in the distribution of the data. We also do not make any conditions on the densities of the marginal distributions (the densities may not exist). Since little assumptions are made on how the data are generated, this setting is especially suitable for highly dependent time series, potentially accommodating a variety of new applications.
The objective is to construct algorithms that simultaneously estimate all change points in x. The algorithms must be asymptotically consistent in the sense that their estimation error on all change points must be arbitrarily small if the sequence is sufficiently long. Note that the asymptotic regime is with respect to the length n of the sequence. In other words, the problem is considered offline and the sequence does not grow with time. As follows from the theoretical impossibility results discussed earlier, the number of change points cannot be estimated under these general assumptions.
Usually in the literature, stronger statistical assumptions are imposed on the process distributions so that with the aid of rates of convergence it would be possible to estimate . However, rates of convergence are not available in this setting; thus, a natural question would be whether there exist formulations for which consistent solutions can be obtained, without the need to place any restrictions (beyond the stationarity and ergodicity) on the process distributions. To address this question, we consider three different formulations of change point problem, for each of which we provide a consistent solution; our methods are presented in Chapter 3. An overview of our formulations follows.
1. In the first formulation, we assume that the correct number of change points is known and provided to the algorithm. The objective is to have an algorithm that is asymptotically consistent in the general statistical framework described above. This scenario may arise in many applications. For example, the objective of a medical study may be to investigate the causes of seizures, using electroencephalography (EEG) recordings of patients. The number of seizures that each patient has suffered may be known a priori, and the goal may be to identify the changes in the brain activity as reflected by the EEG recordings. Another example involves searching within a long text, or a long speech signal, in order to identify the points at which the subject of discussion, the writers or the speakers have changed. In many situations, while such change points are unknown, the total number of such changes may be known in advance.
We propose an efficient change point estimation algorithm, and prove it to be asymptotically consistent, under the assumption that the process distributions that generate the data are stationary ergodic, and that the correct number of change points is provided.
2. In the second formulation, we assume that is unknown, but we make no attempt to estimate it. Instead, our goal is to produce a list of at least candidate estimates, sorted in such a way that its first elements converge to the true change points. We call a method that achieves this goal for long enough x, an asymptotically consistent “list-estimator”. The idea of producing a list of potential answers as a relaxation to producing the exact solution has previously appeared in the literature. In error-control coding theory, a so-called “list-decoder” is used to produce a set of possible codewords as output (Sudan, 2000). A list model for clustering has been proposed by (M. et al., 2008) where the objective is to produce a small number of clusterings such that at least one has an arbitrarily small error with respect to the ground-truth. Note that our objective is somewhat stronger since the produced list of candidate change points is sorted in such a way that its first elements are consistent estimates of the true change points.
List-estimation can prove useful in practice. Indeed, in many scenarios the number of change points can be unknown. However, once provided with the sorted list produced by a consistent list-estimator, the domain expert will be able to identify the true change points much faster. A list-estimator can be especially useful when the input sequence is long. This is due to the fact that after list-estimation, a much smaller portion of the data (i.e. that specified by the output of the list-estimator) will be left for analysis. If the list-estimator is consistent, the first elements converge to the true change points.
At this point, the task of the expert would be merely to reject the tail of redundant estimates. This in turn requires a smaller amount of time and effort as compared to the brute-force inspection of the entire data. Let us revisit the example of inspecting a long text (or speech recording), as a means to identify the points in the data that correspond to changes in topic. Assume that, in contrast to the scenario discussed above, we have no a priori knowledge of the number of change points in this case. Thus, every point in the sequence is a potential change point. Using a list-estimator we can identify the changes by confining our inspection to the list-estimator’s output, potentially saving substantial time and effort in the overall analysis. Specifically, noting that the first elements of the list correspond to the true change points, we can sequentially inspect the list of candidate estimates produced by the list-estimator, stopping the process as soon as an estimate does not seem to correspond to a true change of topic. In much the same way, the list-estimator can be used in almost all other such applications as video surveillance, bioinformatics, market analysis, etc., where it is required to locate an unknown number of change points within a long sequence of highly dependent observations and an expert is available to further inspect the list of candidate estimates.
We show that consistent list-estimation is possible under the assumption that the process distributions generating the time series are stationary ergodic. Specifically, we propose an efficient list-estimator that generates a (sorted) list of change point estimates. As we further show, the proposed method is asymptotically consistent: the first elements of its output list converge to the true change points, provided that the unknown process distributions that generate the data are stationary ergodic.

Table of contents :

1 Introduction
1.1 Stationary ergodic framework
1.2 Proposed formulations and main results
1.3 Methodology
1.4 Related work
1.5 Summary of main contributions
1.6 Organization
2 Preliminaries
2.1 Notations and definitions
2.2 The Distributional distance and its empirical estimates
2.3 Some theoretical impossibility results in the stationary ergodic framework
3 Change point analysis
3.1 Introduction
3.2 Problem formulation
3.3 Main results
3.4 Proofs
4 Time-series clustering
4.1 Introduction
4.2 Preliminaries
4.3 Main result: a consistent online time series clustering algorithm
4.4 Proof of Theorem 4.3.1
5 Experimental evaluations
5.1 Synthetic time series generation
5.2 Change point estimation
5.3 Time series clustering
6 Discussion
6.1 Change point analysis
6.2 Online clustering
6.3 Extensions of the general framework
Bibliography