Statistical decision in Machine Learning framework Application: Protein interface hotspot detection


Conversion to numerical sequence

The primary structure of a protein is given by the associated sequence of amino acids. This sequence is often represented by a string of characters drawn from an alphabet of 20 single characters representing the 20 different amino acids. By properly mapping these character strings into numerical sequences, time series analysis can be applied to design very high-throughput methods. This conversion from symbolic to numerical sequences may rely on assigning to each amino acid numerical values that represent its physico-chemical and biochemical properties. A number of such indices have been introduced in the literature (more than 500 indices can be found in the AAIndex database [Kawashima et al., 2008]). Among them, the electron-ion interaction pseudo-potential (EIIP) values [Cosic, 1994] and the ionization constant (IC) parameters [Cosic and Pirogova, 1998] have been shown to be very relevant to protein bioactivity. For each amino acid, the EIIP value describes the average energy states of all valence electrons of its atoms. It can be calculated using the general model of pseudo-potential [Veljkovic and Slavic, 1972]:

$\langle k + q \,|\, w \,|\, k \rangle = \dfrac{0.25\, Z \sin(1.04\, \pi Z)}{2\pi}$ (2.1)

where $q$ is the change of momentum $k$ of the delocalized electron in the interaction with potential $w$, and $Z$ is the average number of valence electrons of an atom. Let us take the calculation of the EIIP value for Asparagine (ASN) as an example. Its residue (-CH2CONH2) is composed of 2 carbon (C), 1 oxygen (O), 1 nitrogen (N) and 4 hydrogen (H) atoms. Therefore, the average number of valence electrons per atom is $Z = (2 \cdot 4 + 1 \cdot 6 + 1 \cdot 5 + 4 \cdot 1)/(2 + 1 + 1 + 4) = 23/8$. By substituting this value of $Z$ into formula (2.1) to compute the pseudo-potential, the EIIP value for Asparagine (ASN) is then 0.0036. The IC value of an amino acid HA measures its acid dissociation constant from the corresponding ionization reaction $HA \rightleftharpoons H^+ + A^-$, computed as follows:

$pK_a = -\log_{10} K_a$ (2.2)

with

$K_a = \dfrac{[H^+][A^-]}{[HA]}$ (2.3)

where $[H^+]$, $[A^-]$ and $[HA]$ are respectively the concentrations of positively charged ions, negatively charged ions and reactant in the solution. The EIIP and IC values for the 20 amino acids occurring in nature are listed in Table 2.1. These two indices have been shown to be very successful in the so-called Resonant Recognition Model (RRM) [Cosic, 1994, Cosic, 1997, Cosic and Pirogova, 1998] (cf. Section 2.4.2) in providing insight into the physical characterization of protein interactions as well as protein hotspots. In our work, these indices will be used to obtain numerical sequences for further DSP analysis.
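The Asparagine example can be checked numerically. The following is a minimal sketch, assuming that the factors lost from the printed formula (2.1) are $\pi$ (this reading reproduces the quoted value 0.0036). The function names and the `VALENCE` table are illustrative and not taken from the original work.

```python
import math

# Valence electron counts of the atoms found in amino-acid side chains.
VALENCE = {"H": 1, "C": 4, "N": 5, "O": 6, "S": 6}

def average_valence(composition):
    """Average number of valence electrons per atom (Z in Eq. 2.1)."""
    electrons = sum(VALENCE[atom] * count for atom, count in composition.items())
    atoms = sum(composition.values())
    return electrons / atoms

def pseudo_potential(z):
    """Right-hand side of Eq. (2.1): 0.25 * Z * sin(1.04 * pi * Z) / (2 * pi)."""
    return 0.25 * z * math.sin(1.04 * math.pi * z) / (2 * math.pi)

# Asparagine residue -CH2CONH2: 2 C, 1 O, 1 N, 4 H.
asn = {"C": 2, "O": 1, "N": 1, "H": 4}
z_asn = average_valence(asn)        # 23/8
eiip_asn = pseudo_potential(z_asn)
print(f"Z = {z_asn}, EIIP = {eiip_asn:.4f}")  # prints "Z = 2.875, EIIP = 0.0036"
```

The sketch only evaluates the right-hand side of (2.1) for one residue. Any sign convention applied when tabulating the values for other residues is not covered here.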

In-silico alanine scanning and frequency-based features

Experimental alanine scanning mutagenesis (ASM) has been shown to be an extremely useful tool for analyzing interactions in protein interfaces (see [Wells, 1991, Kortemme et al., 2004] amongst others). This technique involves mutating an amino acid residue to alanine (i.e. deleting the side chain beyond the Cβ carbon atom) and then evaluating the effects of this mutation on the affinity of the protein interaction. These effects can be measured by the change in binding free energy (ΔΔG) of the protein-target complex. Although experimental ASM is very powerful in identifying hotspot residues, it is still too expensive and laborious to be easily applied to large-scale analysis, despite many advances in molecular biology.

Physico-chemical interpretation of the proposed features

The analysis of frequency-based features of 1D numerical representations of the protein amino acid sequence was initially motivated by the RRM [Cosic, 1994], a physico-mathematical model which was originally introduced as an attempt to gain insight into the selectivity of protein interactions. By assigning to each amino acid a physical parameter value relevant to the protein bioactivity and analyzing the resulting numerical sequence, the RRM has revealed the existence of frequency characteristics that describe how a protein can recognize its target in an interaction. From the RRM perspective, proteins of the same family, sharing the same biological function, also share some frequency-based features. In particular, their frequency spectra exhibit a common characteristic frequency [Cosic, 1997]. This characteristic frequency was identified from the consensus spectrum, which is defined as the multiple cross-spectrum function of the Fourier transforms of all the sequences of the protein family as in [Cosic, 1997]:
$M(n) = |X_1(n)| \cdot |X_2(n)| \cdots |X_K(n)|, \quad n = 0, 1, \ldots, N-1,$
where $X_i(n)$, $i = 1, 2, \ldots, K$, are the discrete Fourier transform coefficients of the numerical representation of the i-th protein sequence of the family, $K$ is the number of family sequences and $N$ is the length of the longest sequence. Shorter sequences are padded with their mean value to reach the same length $N$. Figure 2.4 reports the consensus spectrum of the fibroblast growth factor (FGF) family. This consensus spectrum clearly exhibits a characteristic frequency at $f_c = 0.4567$, which is significantly present in all the sequences of the FGF family.
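The consensus spectrum defined above can be sketched in a few lines. This is an illustrative example only: the two input sequences are hypothetical toy signals, not EIIP-coded proteins, and the direct O(N²) DFT is used simply to keep the example free of external dependencies.

```python
import cmath

def dft_magnitudes(x):
    """Magnitudes |X(n)|, n = 0..N-1, of the discrete Fourier transform of x."""
    n_points = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * n * t / n_points)
                for t in range(n_points)))
        for n in range(n_points)
    ]

def pad_with_mean(x, length):
    """Extend x to `length` samples by appending its mean value."""
    mean = sum(x) / len(x)
    return list(x) + [mean] * (length - len(x))

def consensus_spectrum(sequences):
    """Pointwise product of the DFT magnitudes of all mean-padded sequences."""
    n_points = max(len(s) for s in sequences)
    spectrum = [1.0] * n_points
    for s in sequences:
        mags = dft_magnitudes(pad_with_mean(s, n_points))
        spectrum = [acc * m for acc, m in zip(spectrum, mags)]
    return spectrum

# Hypothetical toy signals sharing a period-4 component.
s1 = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
s2 = [2.0, 0.0, -2.0, 0.0, 2.0, 0.0]
spec = consensus_spectrum([s1, s2])

# Search the non-redundant half of the spectrum, skipping the DC term n = 0.
peak = max(range(1, len(spec) // 2 + 1), key=lambda n: spec[n])
print("peak index:", peak, "normalized frequency:", peak / len(spec))
```

Because the product is taken over all sequences, a frequency stands out only if it carries energy in every member of the family, which is the property the characteristic frequency relies on.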
It was conjectured in [Cosic, 1997] that these characteristic frequencies are associated with the common function of the proteins of a given family. Since hotspots are the key positions that determine the protein function, they were defined by Cosic et al. [Cosic, 1997] as the residues that are most affected by any change made to the amplitude spectrum at the characteristic frequency corresponding to the protein biological function. Although some evidence of correlation between the hotspots defined by the RRM and those detected by ASM was reported [Ramachandran et al., 2004, Ramachandran and Antoniou, 2008, Sahu and Panda, 2009], the evaluation of recognition performance was limited to very few examples. Besides, earlier applications of the RRM required the functional family of the protein to be known in order to compute the corresponding characteristic frequency. Our approach does not impose such a constraint. Rather than a purely DSP-based approach aimed at detecting local residues associated with the characteristic frequency, we combine DSP tools and mutagenesis principles. We locally determine frequency-related energy changes resulting from the computational mutation of residue subsets to alanines. Since it takes the alanine mutations as a reference model, our procedure can be applied to newly sequenced or unclassified proteins, which might enlarge its potential application domain. Moreover, we have reported an actual evaluation of hotspot recognition performance with respect to a reference database of experimental ASM hotspots, which is, to the best of our knowledge, the largest available validated dataset of hotspots.
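The idea of computationally mutating residues to alanine and measuring a frequency-related energy change can be illustrated as follows. This is a sketch under stated assumptions, not the feature definition used in this work: the energy measure (sum of squared amplitude-spectrum differences), the window parameter `half_width` and the toy peptide are all hypothetical, and the EIIP values are those commonly tabulated in the RRM literature and should be checked against Table 2.1.

```python
import cmath

# EIIP values as commonly tabulated in the RRM literature (assumption: to be
# checked against Table 2.1 of the source before any real use).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

def encode(sequence, index=EIIP):
    """Map a one-letter amino-acid string to a numerical sequence."""
    return [index[aa] for aa in sequence]

def amplitude_spectrum(x):
    """DFT magnitudes of x (direct O(N^2) evaluation, fine for short sequences)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n)]

def alanine_mutant(sequence, center, half_width=0):
    """Replace the residues within `half_width` of `center` by alanine."""
    lo = max(0, center - half_width)
    hi = min(len(sequence), center + half_width + 1)
    return sequence[:lo] + "A" * (hi - lo) + sequence[hi:]

def spectral_energy_change(sequence, center, half_width=0):
    """Hypothetical feature: squared change of the amplitude spectrum after mutation."""
    wild = amplitude_spectrum(encode(sequence))
    mutant = amplitude_spectrum(encode(alanine_mutant(sequence, center, half_width)))
    return sum((w - m) ** 2 for w, m in zip(wild, mutant))

peptide = "GDKALNDRF"  # hypothetical toy peptide, not a real protein
changes = [spectral_energy_change(peptide, i) for i in range(len(peptide))]
for residue, change in zip(peptide, changes):
    print(residue, f"{change:.5f}")
```

A residue that is already an alanine yields a zero change by construction, while residues whose index value differs strongly from that of alanine produce larger changes. Such per-residue values could then serve as inputs to a classifier.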
Our results bring new evidence to support the conjecture of Cosic et al. [Cosic, 1997] that protein hotspots are associated with frequency features of physico-chemical characteristics of the amino acid sequence. Whereas this statement was analyzed in [Cosic, 1997] for the RRM model associated with electron-ion interaction potentials, we have shown here that protein hotspots may also involve specific frequency-related features for other physico-chemical characteristics such as ionization constants. Future work should further investigate, from both the computational and the biophysical point of view, the characterization and the interpretation of such frequency-related properties of proteins and associated hotspots.


Comparison to other DSP-based hotspot detection methods

As mentioned above, motivated by the finding of the protein characteristic frequency in [Cosic, 1994], many studies, such as [Ramachandran et al., 2004, Ramachandran and Antoniou, 2008, Deergha Rao and Swamy, 2008, Sahu and Panda, 2009], have addressed the hotspot detection problem with digital signal processing (DSP) techniques. Basically, by analyzing the signal representing the considered amino acid sequence in a transform domain, these approaches attempt to locate the portion of the signal that contributes the most to the characteristic frequency and thereby to identify hotspots. For example, in [Ramachandran et al., 2004], a short-time Fourier transform was used and high-energy regions in the time-frequency spectrum were investigated. Similarly, the wavelet transform was considered in [Deergha Rao and Swamy, 2008] and the S-transform in [Sahu and Panda, 2009]. Although illustrated on a few well-known protein families, the detection results presented by these approaches were somewhat limited. Given the dimension of the problem, such methods based on a single descriptor, which characterizes specific high-energy regions in the transform domain, can hardly provide a good solution in practice. The approach proposed in this work overcomes this limitation by allowing multiple descriptors to be combined. These descriptors can be of various natures and can result from different measurements and processing steps. The descriptors yielded by the transformations mentioned in this section could also be included. In this respect, the machine-learning-based method presented in this chapter yields relevant detection results.
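To make the single-descriptor idea concrete, the sketch below slides a rectangular window along a signal and measures the energy at one chosen frequency, in the spirit of a short-time Fourier analysis. It is not the implementation of any of the cited methods; the function name, the window shape and the toy signal are assumptions made for illustration.

```python
import cmath

def windowed_energy(x, freq, width):
    """Energy at normalized frequency `freq` in each length-`width` window of x."""
    energies = []
    for start in range(len(x) - width + 1):
        coeff = sum(x[start + t] * cmath.exp(-2j * cmath.pi * freq * t)
                    for t in range(width))
        energies.append(abs(coeff) ** 2)
    return energies

# Hypothetical toy signal: zero everywhere except a burst of period 4
# (normalized frequency 0.25) occupying samples 8 to 15.
burst = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
signal = [0.0] * 8 + burst + [0.0] * 8

energies = windowed_energy(signal, freq=0.25, width=8)
best_start = max(range(len(energies)), key=energies.__getitem__)
print("window with the highest energy starts at sample", best_start)
```

The window that exactly covers the burst carries the highest energy at the target frequency. A detector built on this single quantity depends entirely on the choice of frequency and window length, which illustrates why combining several descriptors can be preferable.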

Table of contents:

Acknowledgements
Summary of contents
Table of contents
Abbreviations
List of figures
List of tables
Abstract
Abstract (in French)
General introduction
I Statistical decision in Machine Learning framework Application: Protein interface hotspot detection
1. Statistical decision in Machine Learning – Random Forests (RF)
1.1. Classification tree
1.2. Bagging predictors
1.3. Random Forests
2. Protein interface hotspot detection
2.1. Introduction
2.2. Sequence-based frequency-derived features
2.2.1. Conversion to numerical sequence
2.2.2. In-silico alanine scanning and frequency-based features
2.3. Learning-based hotspot identification
2.3.1. Evaluated features
2.3.2. Dataset
2.3.3. Hotspot identification performance assessment results
2.4. Discussion
2.4.1. Relevance of sequence-based frequency-derived features with respect to previous work
2.4.2. Physico-chemical interpretation of the proposed features
2.4.3. Comparison to other DSP-based hotspot detection methods
2.4.4. Future work
II Detection in the Random Distortion Testing framework and application to mechanical ventilation system monitoring
3. Random Distortion Testing and Signal detection
3.1. Preliminary material
3.2. Distortion Testing
3.2.1. Deterministic case (DDT)
3.2.2. Random case (RDT)
3.3. Signal detection in RDT framework
4. Detection of signal deviation/distortion using RDT
4.1. Detection of signal deviations at specific instants – Extension of RDT in sequential detection framework
4.1.1. Detection at one single critical instant
4.1.2. Repeated detections at multiple critical instants with extension of RDT in sequential detection framework
4.2. Change point detection
4.3. Detection of signal distortion in a time interval
5. Application to mechanical ventilation system monitoring: AutoPEEP/Asynchrony detection
5.1. Introduction
5.2. Automatic detection of AutoPEEP
5.2.1. System overview
5.2.2. Detectors
5.2.3. Phase change detection
5.2.4. Estimations
5.3. Detection performance assessment
5.3.1. Simulations
5.3.2. Emulations with a respiratory system analog
5.3.3. Analysis of clinical data
5.4. Extension to detection of asynchrony
5.4.1. Trigger timing related asynchrony
5.4.2. Waveform related asynchrony
5.5. Discussions
5.5.1. Automatic detection of ventilatory support failure
5.5.2. Real-time remote monitoring framework
5.5.3. Virtual ventilatory support simulator
Conclusion
General conclusion and perspectives
Appendix A. Constraint violation of Neyman-Pearson likelihood test under model mismatch
Appendix B. The convergence of the two thresholds in the proposed dual-threshold test
Appendix C. Gaussianity of the aggregated noise when using the waveform vector
Appendix D. Virtual ventilatory support simulator
Bibliography