Detecting pairwise co-occurrences using hypothesis testing-based approaches: Null models and T-Patterns algorithm

Other projects related to railway predictive maintenance

Numerous projects have been developed in the railway domain that concern not only railway infrastructure and vehicles but other applications as well. For example, in (Ignesti et al., 2013), the authors presented an innovative Weigh-in-Motion (WIM) algorithm that estimates the vertical axle loads of railway vehicles in order to assess the risk posed by vehicle loading. Constant evaluation of axle load conditions is especially important for freight wagons, which are more prone to unbalanced loads; these can be extremely dangerous both for the running safety of the vehicle and for the integrity of the infrastructure. Such an evaluation can readily identify potentially dangerous overloads or defects of the rolling surfaces. When an overload is detected, the axle is identified and monitored with non-destructive controls to prevent the propagation of potentially dangerous fatigue cracks. Other examples include the work in (Liu et al., 2011), where the Apriori algorithm is applied to railway tunnel lining condition monitoring data in order to extract frequent association rules that might help enhance tunnel maintenance efforts. Also, in (Vettori et al., 2013), a localization algorithm is developed for railway vehicles that could improve on the classical odometry algorithms in terms of speed and position estimation accuracy.
Due to the high cost of train delays and the complexity of schedule modifications, many approaches have been proposed in recent years to predict train delays and optimize scheduling. For example, in (Cule et al., 2011), a closed-episode mining algorithm, CLOSEPI, was applied to a dataset containing the times at which trains pass through characteristic points in the Belgian railway network. The aim was to detect interesting patterns that help improve overall train punctuality and reduce delays. (Flier et al., 2009) tried to discover dependencies between train delays in order to support planners in improving timetables. Similar projects were carried out in the Netherlands (Goverde, 2011; Nie and Hansen, 2005; Weeda and Hofstra, 2008), Switzerland (Flier et al., 2009), Germany (Conte and Shobel, 2007), Italy (De Fabris et al., 2008) and Denmark (Richter, 2010), most of them based on association rule mining or classification techniques.
In the next section, we present the applicative context of this thesis.

Applicative context of the thesis: TrainTracer

TrainTracer is a state-of-the-art centralized fleet management (CFM) software conceived by Alstom to collect and process real-time data sent by fleets of trains equipped with on-board sensors monitoring various subsystems such as the auxiliary converter, doors, brakes, power circuit and tilt. Figure 2.3 is a graphical illustration of Alstom's TrainTracer™. Commercial trains are equipped with positioning (GPS) and communications systems as well as on-board sensors that monitor the condition of various subsystems on the train and provide a real-time flow of data. This data is transferred wirelessly to centralized servers, where it is stored, exploited and analyzed by the support team, maintainers and operators through secured intranet/internet access, providing both centralized fleet management and unified train maintenance (UFM).
Figure 2.3: Graphical illustration of Alstom's TrainTracer™. Commercial trains are equipped with positioning (GPS) and communications systems as well as on-board sensors that monitor the condition of various subsystems on the train and provide a real-time flow of data. This data is transferred wirelessly to centralized servers, where it is stored and exploited.

TrainTracer Data

The real data on which this thesis work is performed was provided by Alstom Transport, a subsidiary of Alstom. It consists of a 6-month extract from the TrainTracer database: a series of timestamped events covering the period from July 2010 to January 2011. These events were sent by the Trainmaster Command Control (TMCC) of a fleet of Pendolino trains that are currently active. Each event is coupled with context variables providing physical, geographical and technical information about the environment at the time of occurrence. These variables can be boolean, numeric or alphabetical. In total, 9,046,212 events were sent in the 6-month period.
Although all events are sent by the same unit (TMCC) installed on the vehicles, they provide information on many subsystems, ranging across safety, electrical, mechanical and service functions (see Figure 2.4). The data extract contains 1112 distinct event types with varying frequencies and distributions. Each event type is identified by a unique numerical code.

Event Criticality Categories

Events belonging to the same subsystem may not have the same critical importance. Certain events are normative (periodic signals indicating a functional state) or simply informative (error messages, driver information messages), while others indicate serious failures, the crossing of thresholds fixed by operators, or even unauthorized driver actions. For this reason, technical experts divided the events into intervention categories describing their importance in terms of the critical need for intervention.

The most critical category is that of events indicating critical failures that require the driver to stop or slow down the train immediately, or to redirect it towards the nearest depot for corrective maintenance. Example: the « Pantograph Tilt Failure » event. These events require a high level of driver action, and we therefore refer to their category as « Driver Action High ».

Target Events

As mentioned before, events are sent by sensors monitoring subsystems of diverse nature: passenger safety, power, communications, lights, doors, tilt, traction, etc. Among all events, those requiring an immediate corrective maintenance action are considered target events, that is, mainly all « Driver Action High » events. In this work, we are particularly interested in the subsystems related to tilt and traction. The tilt system is a mechanism that counteracts the discomfort that centrifugal force causes passengers as the train rounds a curve at high speed, and thus enables a train to increase its speed on regular rail tracks. The traction system is the mechanism responsible for the train's movement. Railways were at first powered by steam engines. The first electric railway motor did not appear until the mid-19th century; however, its use was limited by high infrastructure costs. Diesel engines were not used for railways until the 20th century, but the evolution of electric motors for railways and the development of electrification in the mid-20th century paved the way back for electric motors, which nowadays power practically all commercial locomotives (Faure, 2004; Iwnicki, 2006). Tilt and traction failure events are considered to be among the most critical, as they are highly likely to cause a mandatory stop or slowdown of the train and hence to impact the commercial service and induce a chain of costly delays in the train schedule.
In the data extract at our disposal, tilt and traction « Driver Action High » failure events occur with variable frequencies and constitute a tiny portion (0.5%) of all events. Some of them occur fewer than 50 times across the whole fleet of trains within the 6-month observation period.

Raw data with challenging constraints

In order to obtain a first overview of the data and to identify the unique characteristics of target events, a graphical user interface (GUI) was developed in the Matlab environment. This interface enabled the visualization of histograms of event frequencies, per train unit as well as over the whole data, and provided statistics about event counts and inter-event times (Figure 2.5).
Another graphical interface was developed by a master's degree intern (Randriamanamihaga, 2012) working on the same data and was also used to visualize the set of sequences preceding the occurrences of a given target event. This interface is shown in Figure 2.6. Figure 2.7 is one of many examples of data visualization we can obtain. In this figure, we can visualize a sequence of type $(S_T, t_T - t, t_T)$, where $S_T$ is the sequence of events preceding target event $(T, t_T)$.
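As an illustration, the extraction of the sequence of events preceding a target event can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: events are represented as hypothetical (code, timestamp) pairs sorted by time, and the preceding sequence is taken to be the events falling in a window of length t that ends at the target time. The function name `preceding_window` and the toy data are illustrative and do not come from the TrainTracer tools.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical event layout: (event code, timestamp in seconds).
Event = Tuple[str, float]


def preceding_window(events: List[Event], target_time: float, t: float) -> List[Event]:
    """Return the events whose timestamp lies in [target_time - t, target_time).

    Assumes `events` is already sorted by timestamp.
    """
    times = [ts for _, ts in events]
    lo = bisect_left(times, target_time - t)  # first event inside the window
    hi = bisect_left(times, target_time)      # first event at or after the target
    return events[lo:hi]


# Toy sequence: the target event "T" occurs at time 60.
events = [("A", 10.0), ("B", 40.0), ("C", 55.0), ("T", 60.0), ("D", 70.0)]
window = preceding_window(events, target_time=60.0, t=30.0)
print(window)
```

The binary search keeps each lookup logarithmic in the sequence length, which matters when the same extraction is repeated for every occurrence of a target event.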
The variation in event frequencies is remarkable. Some events are very frequent while others are very rare. Out of the 1112 event types existing in the data, 444 (about 40%) occurred fewer than 100 times across the fleet of trains in the whole 6-month observation period (see Figure 2.8). These events, although rare, make the data mining process more complex.
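The count of rare event types can be reproduced with a simple frequency table. The sketch below is illustrative only: the function name `rare_event_types` is hypothetical, the toy list of codes is invented for the example, and the threshold is lowered from 100 to 3 so that the toy data stays small.

```python
from collections import Counter
from typing import Iterable, List


def rare_event_types(codes: Iterable[str], min_count: int) -> List[str]:
    """Return the event codes that occur fewer than `min_count` times."""
    counts = Counter(codes)
    return sorted(code for code, n in counts.items() if n < min_count)


# Toy data: one frequent code and two rare ones (codes are illustrative).
codes = ["1308"] * 5 + ["2001"] * 2 + ["3050"]
rare = rare_event_types(codes, min_count=3)
share = len(rare) / len(set(codes))
print(rare, share)
```

On the real extract, the same computation with `min_count=100` would correspond to the proportion of rare event types reported above.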
Another major constraint is the heavy redundancy of data. A sequence $w \downarrow [\{A\}]$ of the same event $A$ is called redundant (or bursty), see Figure 2.9, if the same event occurs multiple times within a short lapse of time (on the order of seconds, for example). More formally, if $w \downarrow [\{A\}] = \langle (A, t_1), (A, t_2), \ldots, (A, t_n) \rangle$ is a sequence of $n$ $A$ events subject to a burst, then
$$\exists\, t = t_{\mathrm{fusion}} \ \text{such that} \ \forall (i, j) \in \{1, \ldots, n\}^2, \quad |t_i - t_j| \le t_{\mathrm{fusion}} \tag{2.1}$$
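The burst condition of Equation (2.1) can be checked directly, since requiring every pairwise gap to be bounded by the fusion threshold is equivalent to bounding the gap between the earliest and the latest occurrence. The following Python sketch assumes timestamps in seconds; the function name `is_burst` is hypothetical, and treating a single occurrence as "not a burst" is our own convention rather than part of the definition.

```python
from typing import Sequence


def is_burst(timestamps: Sequence[float], t_fusion: float) -> bool:
    """Check whether all occurrences of one event type fit Equation (2.1).

    All pairwise differences are at most t_fusion exactly when
    max(timestamps) - min(timestamps) is at most t_fusion.
    """
    if len(timestamps) < 2:
        return False  # assumption: a burst needs at least two occurrences
    return max(timestamps) - min(timestamps) <= t_fusion


print(is_burst([100.0, 100.4, 101.2], t_fusion=6.0))  # occurrences packed together
print(is_burst([100.0, 100.4, 130.0], t_fusion=6.0))  # last occurrence is far away
```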
There are many reasons why these bursts occur. For example, the same event can be detected and sent by sensors on multiple vehicles at nearly the same time. Clearly, only one of these events needs to be retained, since the others contribute no additional useful information. Bursts may also be due to an emission error caused by a hardware or software failure, or to a reception error caused by similar factors.

Figure 2.10 illustrates data bursts in a sequence. We can identify two types of bursts. The first type consists of a very dense occurrence of multiple types of events within a short time lapse. Such bursts can occur normally or be due to a signalling or reception error. The second type consists of a very dense occurrence of a single event type within a short period of time, usually also due to a signalling or reception error (an event sent or received multiple times). Bursty events can generally be identified by the typical shape of their histogram of inter-event times, depicted in Figure 2.11. This histogram has a peak of occurrences (usually from 0 to 15 seconds) that we can relate to bursts. For example, 70% of all occurrences of code 1308 (an event belonging to category 4 that appears in the data 150,000 times) are separated by less than one second.
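The statistic underlying such a histogram can be sketched in a few lines of Python. The example below is illustrative only: it assumes the timestamps of a single event type are available in seconds, and the helper names `inter_event_times` and `fraction_below` as well as the toy values are hypothetical.

```python
from typing import List, Sequence


def inter_event_times(timestamps: Sequence[float]) -> List[float]:
    """Return the gaps between consecutive occurrences, after sorting by time."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def fraction_below(gaps: Sequence[float], threshold: float) -> float:
    """Return the share of gaps strictly shorter than `threshold` seconds."""
    if not gaps:
        return 0.0
    return sum(1 for gap in gaps if gap < threshold) / len(gaps)


# Toy timestamps for one event type: two dense groups and one isolated event.
timestamps = [0.0, 0.5, 0.75, 20.0, 20.25, 300.0]
gaps = inter_event_times(timestamps)
print(gaps)
print(fraction_below(gaps, threshold=1.0))
```

A large share of sub-second gaps for a given code is the kind of signature that points to bursts rather than to genuinely repeated events.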

Cleaning bursts

Several pre-treatment measures have been implemented to increase the efficiency of the data mining algorithms to be applied. For instance, 13 normative events that are also very frequent were deleted from the data. Erroneous event records with missing data or outlier timestamps were also excluded from the mining process. The work by (Randriamanamihaga, 2012), carried out during a master's internship on the TrainTracer data, tackled the burst cleaning problem. It applied tools based on finite probabilistic mixture models and merged events of the same type occurring very closely in time (within 6 seconds, keeping the first occurrence only) to decrease the number of bursts. This cleaning process decreased the size of the data to 6 million events (down from roughly 9 million), limited the number of distinct event codes to 493 (instead of 1112), and reduced the number of available target events to 13 (instead of 46). Although a significant proportion of the data was lost, the quality of the data to be mined was enhanced, which leads to a better assessment of the applied algorithms and the results obtained. For this reason, the resulting « cleaned » data was used in this thesis work.
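The merging step can be illustrated with a short Python sketch. It is not the cleaning tool used in the internship work, only a minimal version under stated assumptions: events are hypothetical (code, timestamp) pairs, an occurrence is dropped when it falls within the fusion threshold of the last kept occurrence of the same code, and the comparison is inclusive. The mixture-model part of the cleaning is not covered.

```python
from typing import Dict, List, Tuple

# Hypothetical event layout: (event code, timestamp in seconds).
Event = Tuple[str, float]


def merge_bursts(events: List[Event], t_fusion: float = 6.0) -> List[Event]:
    """Keep the first occurrence of each burst and drop the repeats.

    An event is dropped when the last kept event with the same code
    occurred at most `t_fusion` seconds earlier (assumed convention).
    """
    last_kept: Dict[str, float] = {}
    cleaned: List[Event] = []
    for code, ts in sorted(events, key=lambda event: event[1]):
        if code in last_kept and ts - last_kept[code] <= t_fusion:
            continue  # repeat inside the burst: discard
        last_kept[code] = ts
        cleaned.append((code, ts))
    return cleaned


raw = [("A", 0.0), ("A", 1.0), ("B", 2.0), ("A", 5.0), ("A", 7.0), ("B", 30.0)]
cleaned = merge_bursts(raw)
print(cleaned)
```

Comparing against the last kept occurrence, rather than the immediately preceding one, prevents a long chain of closely spaced repeats from being collapsed into a single event spanning far more than the threshold.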

Positioning our work

In the railway domain, instrumented probe vehicles equipped with dedicated sensors are used for the inspection and monitoring of train vehicle subsystems. Maintenance procedures have since been optimized so as to rely on the operational state of the system (condition-based maintenance) instead of being schedule-based. Very recently, commercial trains have also been equipped with sensors in order to perform various measurements. The advantage of this system is that data can be collected more frequently and at any time. However, the high number of commercial trains to be equipped demands a trade-off between the cost of the equipment and its performance if sensors are to be installed on all train components. The quality of these sensors directly affects the frequency of data bursts and the level of signal noise, both of which make the data more challenging to analyze. These sensors provide a real-time flow of data consisting of geo-referenced events, along with their spatial and temporal coordinates. Once ordered with respect to time, these events can be considered as long temporal sequences that can be mined for possible relationships.

Table of contents :

1 Introduction
1.1 Context and Problematic
1.2 Positioning, objectives and case study of the thesis
1.3 Organization of the dissertation
2 Applicative context: Predictive maintenance to maximize rolling stock availability
2.1 Introduction
2.2 Data Mining: Definition and Process Overview
2.3 Railway Context
2.3.1 Existing Maintenance Policies
2.3.2 Data mining applied to the railway domain: A survey
2.4 Applicative context of the thesis: TrainTracer
2.4.1 TrainTracer Data
2.4.2 Raw data with challenging constraints
2.4.3 Cleaning bursts
2.5 Positioning our work
2.5.1 Approach 1: Association Analysis
2.5.2 Approach 2: Classification
3 Detecting pairwise co-occurrences using hypothesis testing-based approaches: Null models and T-Patterns algorithm
3.1 Introduction
3.2 Association analysis
3.2.1 Introduction
3.2.2 Association Rule Discovery: Basic notations, Initial problem
3.3 Null models
3.3.1 Formalism
3.3.2 Co-occurrence scores
3.3.3 Randomizing data: Null models
3.3.4 Calculating p-values
3.3.5 Proposed Methodology: Double Null Models
3.4 T-Patterns algorithm
3.5 Deriving rules from discovered co-occurrences
3.5.1 Interestingness measures in data mining
3.5.2 Objective interestingness measures
3.5.3 Subjective Interestingness measures
3.6 Experiments on Synthetic Data
3.6.1 Generation Protocol
3.6.2 Experiments
3.7 Experiments on Real Data
3.8 Conclusion
4 Weighted Episode Rule Mining Between Infrequent Events
4.1 Introduction
4.2 Episode rule Mining in Sequences
4.2.1 Notations and Terminology
4.2.2 Literature review
4.3 Weighted Association Rule Mining: Relevant Literature
4.4 The Weighted Association Rule Mining Problem
4.5 Adapting the WARM problem for temporal sequences
4.5.1 Preliminary definitions
4.5.2 WINEPI algorithm
4.5.3 Weighted WINEPI algorithm
4.5.4 Calculating weights using Valency Model
4.5.5 Adapting Weighted WINEPI to include infrequent events
4.5.6 Adapting Weighted WINEPI to focus on target events: Oriented Weighted WINEPI
4.5.7 Experiments on synthetic data
4.5.8 Experiments on real data
4.6 Conclusion
5 Pattern recognition approaches for predicting target events
5.1 Pattern Recognition
5.1.1 Introduction
5.1.2 Principle
5.1.3 Preprocessing of data
5.1.4 Learning and classification
5.2 Supervised Learning Approaches
5.2.1 K-Nearest Neighbours Classifier
5.2.2 Naive Bayes
5.2.3 Support Vector Machines
5.2.4 Artificial Neural Networks
5.3 Transforming data sequence into a labelled observation matrix
5.4 Hypothesis testing: choosing the most significant attributes
5.5 Experimental Results
5.5.1 Choice of performance measures
5.5.2 Choice of scanning window w
5.5.3 Performance of algorithms
5.6 Conclusion
6 Conclusion and Perspectives
6.1 Conclusion
6.2 Future Research Directions