Gaussian Mixture Models and the EM algorithm


BACKGROUND

In this chapter we present the minimum background required to follow the rest of the thesis. Because we addressed several problems on different types of data, we had to investigate a wide range of methods and algorithms; here, however, we describe only the key components used in each problem. For comprehensive material in the area, the reader may refer to standard Machine Learning books such as Bishop (2006), Barber (2012), Goodfellow et al. (2016), Koller and Friedman (2009) and Theodoridis (2015).

learning from data

Machine learning is about extracting knowledge from data in order to create models that can perform tasks effectively. A typical Machine Learning application consists of the following parts:
A task and a metric for evaluation. The task comes inherently with a given real problem and the metric quantifies the effectiveness of a solution.
A model family that we believe is capable of solving the problem at hand. The selection of the model type depends on several factors, such as the amount of available training data (i.e. the size of the given dataset), the complexity of the task and knowledge about its performance on similar problems Caruana and Niculescu-Mizil (2006).
A dataset on which the best model will be trained in order to solve the task, aiming at the best performance with respect to the evaluation metric.
A loss function that quantifies the goodness of fit. In contrast to the evaluation metric, the loss function is differentiable and usually model-specific Bottou (2010).
An optimization algorithm to train the model. The models consist of parameters (usually denoted θ), whose values determine the value of the loss function. Thus, an optimization strategy is required in order to select the parameter set that minimizes the loss function.
Note the difference between the first and the fourth point. In several cases these functions can be the same; for example, in house price prediction, the Mean Squared Error (MSE) can be both the task objective and the loss function, as in the sketch below.
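To make the five components above concrete, the following is a minimal Python sketch on hypothetical regression data (not taken from the thesis), in which the MSE serves both as the evaluation metric and as the differentiable loss minimized over the parameters θ by gradient descent:

import numpy as np

# Hypothetical dataset: 100 samples with 3 features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def mse(y_true, y_pred):
    # Here the evaluation metric and the loss function coincide.
    return np.mean((y_true - y_pred) ** 2)

theta = np.zeros(3)          # model parameters (theta)
learning_rate = 0.1
for _ in range(200):         # optimization algorithm: gradient descent
    gradient = -2.0 * X.T @ (y - X @ theta) / len(y)
    theta = theta - learning_rate * gradient

print("evaluation (MSE):", mse(y, X @ theta))

The loop implements plain gradient descent, discussed later in this chapter; any other optimizer minimizing the same loss could be substituted.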
In Machine Learning we create models that can best capture the available data (i.e. a dataset) in order to perform a specific task. A dataset X consists of a usually finite number of samples x_i, i = 1, ..., n, such as images, documents or users, depending on the application.

Supervised Learning and Evaluation Metrics

Supervised learning occurs when the model is trained using input-output pairs, as in classification and regression. In this scenario, the dataset consists of two subsets, the feature set X and the label set Y, both having the same cardinality (|X| = |Y|), which is equal to the number of available samples.
On the other hand, unsupervised learning occurs when the model is trained using only the feature set X. The most popular unsupervised task is clustering, since the cluster labels are not known in advance.
In the case of supervised learning, since the labels are known in advance, we can directly assess the performance of our approach using standard metrics. When the labels are continuous (Regression), y_i ∈ R^d, y_i ∈ Y, the popular performance metrics are the MSE, the Mean Absolute Error (MAE) and their weighted alternatives, where each sample comes with a different weight.
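As a small illustration (with made-up values, not data from the thesis), these regression metrics and their sample-weighted variants can be computed as follows:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 8.0])
w      = np.array([1.0, 0.5, 2.0, 1.0])   # hypothetical per-sample weights

mse  = np.mean((y_true - y_pred) ** 2)                     # Mean Squared Error
mae  = np.mean(np.abs(y_true - y_pred))                    # Mean Absolute Error
wmse = np.average((y_true - y_pred) ** 2, weights=w)       # weighted MSE
wmae = np.average(np.abs(y_true - y_pred), weights=w)      # weighted MAE

print(mse, mae, wmse, wmae)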
When the labels correspond to distinct categories (Classification), several metrics have been proposed and they are mostly related to the distribution of the classes: Accuracy, Precision/Recall and F1-Score are among the most popular ones. In order to calculate these metrics, one has to compute the confusion matrix (Table 2.1).
In the context of predictive maintenance, one of the most important problems is to predict equipment failures. If we consider this problem as a binary classification task, our errors correspond to either false alarms (False Positives) or missed failures (False Negatives). Depending on the application, we have to decide how important each type of error is and balance False Positives against False Negatives according to industrial parameters such as the expected cost. For example, in cases where alarms trigger costly maintenance processes, avoiding many False Positives is very important.
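The following Python sketch, on synthetic failure labels (not thesis data), computes the confusion-matrix entries and the derived Precision, Recall and F1-Score, and shows how moving the decision threshold trades false alarms for missed failures:

import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])                 # 1 = failure
scores = np.array([.1, .4, .8, .3, .6, .35, .2, .7, .9, .05])     # model scores

def report(threshold):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed failures
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"thr={threshold:.2f}  TP={tp} FP={fp} FN={fn} TN={tn}  "
          f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

for thr in (0.3, 0.5, 0.7):   # a stricter threshold reduces false alarms
    report(thr)

Raising the threshold lowers the number of False Positives at the cost of more False Negatives, which is exactly the trade-off that the expected maintenance cost should arbitrate.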

probability

Survival Analysis

Survival analysis Miller (2011) is widely used to extract temporal information about events. The most common applications appear in biology and medicine, where typical events of interest are deaths in a population, responses to treatment, etc. Furthermore, in the context of survival analysis, questions of major interest are whether there exist variables (covariates and interactions among them) that affect the time to an event, and whether two or more populations exhibit a significant difference in the time of occurrence of the event of interest. In predictive maintenance, events usually correspond to equipment failures, and in engineering applications survival analysis is called reliability theory.


Survival data and Censoring

A typical dataset in survival analysis consists of subjects and their corresponding times to the event of interest (i.e. the survival times) and, occasionally, other variables associated with the subjects, if available. Figure 2.1 shows a small example of a survival dataset, consisting of 6 subjects (AC1, ..., AC6) and their times to the event of interest, e, (t1, ..., t6). All the subjects' observations are aligned in time and begin at t = 0.
A very important concept in survival analysis is censoring. Censoring occurs when we are not sure about the exact time of the event’s occurrence due to incomplete data. For example, if the event of interest does not occur for a subject in the population during the observation period, then we know that the survival time for this subject is at least as long as the observation period. This is probably the most common case of censoring. The three types of censoring are:
right censoring When the time to the event of interest is more than some known value. For example, when the observation period ended without observing the event.
left censoring When the time to the event of interest is less than some known value.
interval censoring When the time to the event of interest is within a known interval.
In survival analysis applications it is very important to handle censored data properly and, therefore, all types of censoring have been extensively studied Klein and Moeschberger (2005). However, in the case of equipment failures we deal mostly with right censoring, as in the sketch below.
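As a minimal sketch of how right-censored data are typically encoded (the subject names AC1 to AC6 follow Figure 2.1; the observation window and failure times are hypothetical), each subject receives an observed duration and an event indicator:

import pandas as pd

observation_end = 100.0                       # end of the observation period (hypothetical)
failure_times = {"AC1": 30.0, "AC2": 75.0,    # observed failure times; None = no failure seen
                 "AC3": None, "AC4": 50.0,
                 "AC5": None, "AC6": 90.0}

rows = []
for subject, t in failure_times.items():
    if t is not None and t <= observation_end:
        rows.append({"subject": subject, "duration": t, "event": 1})
    else:
        # Right-censored: we only know the survival time is at least observation_end.
        rows.append({"subject": subject, "duration": observation_end, "event": 0})

survival_data = pd.DataFrame(rows)
print(survival_data)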

Survival and Hazard Functions

The key components of survival analysis are the survival and the hazard functions. The survival function (2.1), S(t), is the probability of the event occurring after time t (or, equivalently, of surviving at least until time t), i.e. the probability that a failure has not occurred by time t,

S(t) = p(T > t).    (2.1)
The most important properties of the survival function are presented in Table 2.2. In brief, it is a non-increasing function with respect to time and at the beginning of the observations it always equals 1, because of the assumption that no subject dies at t = 0.
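These properties can be seen in a short sketch (assumed example, fully observed failure times and no censoring): the empirical estimate of S(t) = p(T > t) starts at 1 and never increases. With censored data, the Kaplan-Meier estimator used in Chapter 3 is the appropriate tool instead.

import numpy as np

failure_times = np.array([12.0, 30.0, 30.0, 55.0, 80.0])   # hypothetical, no censoring

def empirical_survival(t, times=failure_times):
    # S(t) = p(T > t), estimated as the fraction of failures occurring after time t.
    return np.mean(times > t)

for t in (0.0, 20.0, 40.0, 90.0):
    print(f"S({t:g}) = {empirical_survival(t):.2f}")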

Table of contents :

1 introduction 
1.1 Scope of the Thesis
1.1.1 Predictive Maintenance
1.1.2 Time Series Data
1.2 Data Related to Aircraft Operation
1.2.1 Tools and Libraries
1.3 Overview of Contributions
1.4 Outline of the Thesis
2 background 
2.1 Learning from Data
2.1.1 Supervised Learning and Evaluation Metrics
2.2 Probability
2.2.1 Survival Analysis
2.2.2 Survival data and Censoring
2.2.3 Gaussian Mixture Models and the EM algorithm
2.3 Regression
2.3.1 Random Forests
2.3.2 Model Evaluation
2.3.3 Hyperparameter Selection
2.4 Learning as Optimization
2.4.1 The Gradient Descent
2.4.2 Convex Quadratic Programming
3 survival analysis for failure-log exploration 
3.1 Introduction
3.1.1 Random Variables in event logs
3.1.2 Building a Dataset for Survival Analysis
3.2 Time Interval Between Failures
3.2.1 Kaplan-Meier Method
3.2.2 Cox Proportional Hazards
3.3 Studying inter-event temporal differences
3.4 Summary
4 failure prediction in post flight reports
4.1 Introduction
4.2 Related Work
4.3 Event Log Data & Preprocessing
4.3.1 Preprocessing
4.4 Methodology
4.4.1 Multiple Instance Learning Setup
4.4.2 Prediction
4.4.3 Method summary
4.4.4 Parameters
4.5 Experimental Setup
4.5.1 Dataset
4.5.2 Training, Validation and Test
4.5.3 Baseline Algorithm
4.5.4 Evaluation at the episode level
4.6 Results
4.6.1 Bag-level Performance
4.6.2 Episode-Level Performance
4.6.3 Decision threshold selection
4.6.4 False Positives
4.6.5 Model Interpretation
4.7 Conclusions and future work
4.7.1 Infusion and Impact
5 logbook data preprocessing
5.1 Related Work
5.2 Logbook Data in Aviation
5.2.1 Data Description
5.2.2 Cleaning the Logbook
5.3 The Importance of Logbook Data
5.4 Context-aware Spell Correction via Word Embeddings
5.4.1 Word Embeddings & the skip-gram Model
5.4.2 Creating word embeddings from logbook entries
5.5 Logbook Cleaning using Word Embeddings
5.5.1 Mapping spelling errors to correct words
5.5.2 Method Summary
5.6 Information Extraction
5.7 Conclusions and future work
6 component condition assessment using time series data
6.1 Degradation
6.2 Related Work
6.3 Dataset
6.4 Modeling Degradation with GMMs
6.5 Time series decomposition
6.5.1 Quadratic Programming Formulation
6.5.2 Reformulating the Optimization Problem
6.6 Condition Assessment
6.7 Evaluation
6.7.1 Discussion
6.8 Conclusions
7 discussion 
7.1 Summary of Contributions
7.2 Future Directions
notation
acronyms

