Detection in the Random Distortion Testing framework and application to mechanical ventilation system monitoring


Sequence-based frequency-derived features

The primary structure of a protein is given by its sequence of amino acids. This sequence is usually represented as a string of characters drawn from an alphabet of 20 letters, one for each of the 20 different amino acids. By properly mapping these character strings to numerical sequences, time series analysis can be applied to design very high-throughput methods. This conversion from symbolic to numerical sequences may rely on assigning to each amino acid a numerical value that represents one of its physico-chemical or biochemical properties. A number of such indices have been introduced in the literature (more than 500 indices can be found in the AAIndex database [Kawashima et al., 2008]). Among them, the electron-ion interaction pseudo-potential (EIIP) values [Cosic, 1994] and the ionization constant (IC) parameters [Cosic and Pirogova, 1998] have been shown to be highly relevant to protein bioactivity. For each amino acid, the EIIP value describes the average energy state of all valence electrons of its atoms. It can be calculated using the general model of pseudo-potential [Veljkovic and Slavic, 1972]:
$$\langle k + q \,|\, w \,|\, k \rangle = \frac{0.25\, Z \sin(1.04\, \pi Z)}{2\pi} \qquad (2.1)$$
where $q$ is the change of momentum $k$ of the delocalized electron in its interaction with the potential $w$, and $Z$ is the average number of valence electrons per atom. Take the calculation of the EIIP value for asparagine (ASN) as an example. Its residue (-CH2CONH2) is composed of 2 carbon (C), 1 oxygen (O), 1 nitrogen (N) and 4 hydrogen (H) atoms, which carry 4, 6, 5 and 1 valence electrons respectively. Therefore, the average number of valence electrons per atom is $Z = (2 \times 4 + 1 \times 6 + 1 \times 5 + 4 \times 1)/(2 + 1 + 1 + 4) = 23/8$. Substituting this value of $Z$ into formula (2.1) gives an EIIP value of 0.0036 for asparagine. The IC value of an amino acid HA measures its acid dissociation constant from the corresponding ionization reaction $\mathrm{HA} \rightleftharpoons \mathrm{H^+} + \mathrm{A^-}$, computed as follows:
$$\mathrm{p}K_a = -\log_{10} K_a \qquad (2.2)$$
with
$$K_a = \frac{[\mathrm{H^+}]\,[\mathrm{A^-}]}{[\mathrm{HA}]} \qquad (2.3)$$
where $[\mathrm{H^+}]$, $[\mathrm{A^-}]$ and $[\mathrm{HA}]$ are respectively the concentrations of positively charged ions, negatively charged ions and reactant in the solution. The EIIP and IC values for the 20 naturally occurring amino acids are listed in Table 2.1. These two indices have proved very successful in the so-called Resonant Recognition Model [Cosic, 1994; Cosic, 1997; Cosic and Pirogova, 1998] (cf. Section 2.4.2) for gaining insight into the physical characterization of protein interactions and of protein hotspots. In our work, these indices are used to obtain numerical sequences for further DSP analysis.
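The asparagine example and the pKa definition above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names (`average_valence`, `eiip`, `pka`) are hypothetical, and the valence electron counts are the standard values for C, O, N and H.

```python
import math

# Valence electrons per atom (standard chemistry values).
VALENCE = {"C": 4, "O": 6, "N": 5, "H": 1}

def average_valence(composition):
    """Average number of valence electrons per atom (the Z of Eq. 2.1)."""
    total_electrons = sum(VALENCE[atom] * n for atom, n in composition.items())
    total_atoms = sum(composition.values())
    return total_electrons / total_atoms

def eiip(z):
    """Pseudo-potential of Eq. (2.1): 0.25 * Z * sin(1.04 * pi * Z) / (2 * pi)."""
    return 0.25 * z * math.sin(1.04 * math.pi * z) / (2 * math.pi)

def pka(h_plus, a_minus, ha):
    """Eqs. (2.2)-(2.3): pKa = -log10(Ka), with Ka = [H+][A-] / [HA]."""
    ka = h_plus * a_minus / ha
    return -math.log10(ka)

# Asparagine residue -CH2CONH2: 2 C, 1 O, 1 N, 4 H (as stated in the text).
asn_residue = {"C": 2, "O": 1, "N": 1, "H": 4}
z_asn = average_valence(asn_residue)   # 23/8
eiip_asn = eiip(z_asn)
print(z_asn, round(eiip_asn, 4))
```

Rounded to four decimals, `eiip_asn` reproduces the value 0.0036 quoted for asparagine. The concentrations passed to `pka` are whatever the caller supplies; no amino acid IC values are assumed here.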

In-silico alanine scanning and frequency-based features

Experimental alanine scanning mutagenesis (ASM) has been shown to be an extremely useful tool for analyzing interactions in protein interfaces (see [Wells, 1991; Kortemme et al., 2004] amongst others). This technique involves mutating an amino acid residue to alanine (i.e. deleting the side chain beyond the $C_\beta$ carbon atom) and then evaluating the effects of this mutation on the affinity of the protein interaction. These effects can be measured by the change in binding free energy ($\Delta\Delta G$) of the protein-target complex. Although experimental ASM is very powerful in identifying hotspot residues, it remains too expensive and laborious to be easily applied to large-scale analysis, despite many advances in molecular biology.
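A computational counterpart of the experiment can be sketched as follows: replace the numerical value of one residue by that of alanine, then compare the spectra of the original and mutated sequences. The sketch below is illustrative only and rests on several assumptions not given in this excerpt: the peak changes are taken at the three largest peaks of the original spectrum, the sub-bands are eight equal-width bands, and all names (`power_spectrum`, `alanine_scan_features`) and numerical values are hypothetical placeholders.

```python
import cmath
import math

def power_spectrum(x):
    """Power spectrum |X[k]|^2, k = 1..N//2, of the mean-removed sequence."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def alanine_scan_features(x, position, alanine_value, n_peaks=3, n_bands=8):
    """Spectral changes caused by replacing x[position] with alanine's value.

    Returns n_peaks peak changes, n_bands sub-band energy changes and one
    global energy change (12 values with the default settings).
    """
    mutated = list(x)
    mutated[position] = alanine_value
    wild, mut = power_spectrum(x), power_spectrum(mutated)

    # Assumption: peaks are the n_peaks largest bins of the original spectrum.
    top = sorted(range(len(wild)), key=lambda k: wild[k], reverse=True)[:n_peaks]
    peak_changes = [mut[k] - wild[k] for k in top]

    # Assumption: sub-bands are equal-width partitions of the spectrum.
    edges = [round(b * len(wild) / n_bands) for b in range(n_bands + 1)]
    band_changes = [sum(mut[lo:hi]) - sum(wild[lo:hi])
                    for lo, hi in zip(edges, edges[1:])]

    global_change = sum(mut) - sum(wild)
    return peak_changes + band_changes + [global_change]

# Placeholder numerical sequence (NOT real EIIP or IC values).
sequence = [0.01 * ((7 * i) % 13) for i in range(32)]
ALANINE_PLACEHOLDER = 0.0373  # illustrative value; consult Table 2.1
features = alanine_scan_features(sequence, 10, ALANINE_PLACEHOLDER)
print(len(features))
```

Applying the same function once with an EIIP-mapped sequence and once with an IC-mapped sequence would give two vectors of 12 values per residue. Because the sub-bands partition the spectrum, the eight sub-band changes sum to the global energy change.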

Learning-based hotspot identification

To detect hotspots computationally, a learning-based recognition scheme is proposed. In this study, we use Random Forest (RF) [Breiman, 2001] as the classifier, since it is among the most powerful techniques for supervised classification. The detection is carried out on the basis of two different families of protein hotspot descriptors: the proposed features derived from frequency characteristics of the protein's amino acid sequence, and state-of-the-art features computed from the known 3D structure of the considered proteins and/or complexes. These two families of descriptors can be used separately or together, depending on the availability of the prerequisite knowledge of the 3D structure. The evaluation is carried out on a dataset, comparing the detection performance yielded by each feature family and by their combination. This comparison illustrates, on the one hand, the relevance of the proposed descriptors for hotspot identification and, on the other hand, the success of the proposed detection in the machine learning framework. The evaluated features are described first, the considered hotspot dataset is then presented, and finally the detection results are reported.


Evaluated features

As mentioned above, we consider the following two sets of protein hotspot descriptors as the input of the RF classifier in the learning-based hotspot identification scheme.

Frequency-derived features of amino acid sequences

The frequency-based features presented in Section 2.2.2, that is, the 3 highest spectrum peak changes, the 8 sub-band energy changes and the global energy change, are considered. Using these 12 measures with both EIIP and IC values, a set of 24 different features is computed. The descriptors that best discriminate hotspots from other residues are then selected. This reduces the dimensionality of the feature space without affecting the original semantics of the descriptors, so that the result can still be interpreted by domain experts [Saeys et al., 2007]. In this study, the selection is performed with a decision tree-based feature ranking technique [Cardie, 1993]. The technique involves growing a decision tree on a sample set (cf. Section 1.3 for more details) and then pruning it at a certain level. During the growing process, a decision tree, by its nature, selects the best feature (in the sense of maximizing the information gain) each time a node is split. In the pruning phase, nodes that provide less entropy gain are eliminated. Therefore, the features associated with the internal nodes that remain after pruning are considered the most relevant. Using the Matlab treefit routine, the decision tree built on samples extracted from [Tuncbag et al., 2009] showed that the 3 highest spectrum peak changes using EIIP, the energy change in the 7th sub-band using EIIP and the global energy change using IC are the most appropriate candidates. These selected descriptors form a 5-dimensional vector, called the sequence-based frequency-derived features in the sequel.
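The information-gain criterion behind this ranking can be illustrated with a minimal sketch. It scores each feature by the best single threshold split, which is only the first step of tree growing. It does not reproduce the full grow-then-prune procedure or the Matlab treefit routine. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_information_gain(values, labels):
    """Largest information gain over all threshold splits on one feature."""
    parent = entropy(labels)
    n = len(labels)
    best = 0.0
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        gain = (parent
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        best = max(best, gain)
    return best

def rank_features(samples, labels):
    """Feature indices sorted by decreasing best-split information gain."""
    n_features = len(samples[0])
    gains = [best_information_gain([s[j] for s in samples], labels)
             for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return order, gains

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
toy_samples = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
toy_labels = [0, 0, 1, 1]   # 1 = hotspot, 0 = non-hotspot
order, gains = rank_features(toy_samples, toy_labels)
print(order, gains)
```

In a full decision tree the same criterion is applied recursively at every node, and pruning then discards the splits with the smallest gain. The features still used by the pruned tree are the ones retained.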

Table of contents:

Acknowledgements
Summary
Table of contents
Abbreviations
List of figures
List of tables
Abstract
Résumé
General introduction
I Statistical decision in Machine Learning framework
Application: Protein interface hotspot detection
1. Statistical decision in Machine Learning – Random Forests (RF)
1.1. Classification tree
1.2. Bagging predictors
1.3. Random Forests
2. Protein interface hotspot detection
2.1. Introduction
2.2. Sequence-based frequency-derived features
2.2.1. Conversion to numerical sequence
2.2.2. In-silico alanine scanning and frequency-based features
2.3. Learning-based hotspot identification
2.3.1. Evaluated features
2.3.2. Dataset
2.3.3. Hotspot identification performance assessment results
2.4. Discussion
2.4.1. Relevance of sequence-based frequency-derived features with respect to previous work
2.4.2. Physico-chemical interpretation of the proposed features
2.4.3. Comparison to other DSP-based hotspot detection methods
2.4.4. Future work
II Detection in the Random Distortion Testing framework and application to mechanical ventilation system monitoring
3. Random Distortion Testing and Signal detection
3.1. Preliminary material
3.2. Distortion Testing
3.2.1. Deterministic case (DDT)
3.2.2. Random case (RDT)
3.3. Signal detection in RDT framework
4. Detection of signal deviation/distortion using RDT
4.1. Detection of signal deviations at specific instants – Extension of RDT in sequential detection framework
4.1.1. Detection at one single critical instant
4.1.2. Repeated detections at multiple critical instants with extension of RDT in sequential detection framework
4.2. Change point detection
4.3. Detection of signal distortion in a time interval
5. Application to mechanical ventilation system monitoring: AutoPEEP/Asynchrony detection
5.1. Introduction
5.2. Automatic detection of AutoPEEP
5.2.1. System overview
5.2.2. Detectors
5.2.3. Phase change detection
5.2.4. Estimations
5.3. Detection performance assessment
5.3.1. Simulations
5.3.2. Emulations with a respiratory system analog
5.3.3. Analysis of clinical data
5.4. Extension to detection of asynchrony
5.4.1. Trigger timing related asynchrony
5.4.2. Waveform related asynchrony
5.5. Discussions
5.5.1. Automatic detection of ventilatory support failure
5.5.2. Real-time remote monitoring framework
5.5.3. Virtual ventilatory support simulator
Conclusion
General conclusion and perspectives
Appendix A. Constraint violation of Neyman-Pearson likelihood test under model mismatch
Appendix B. The convergence of the two thresholds in the proposed dual-threshold test
Appendix C. Gaussianity of the aggregated noise when using the waveform vector
Appendix D. Virtual ventilatory support simulator
Bibliography