Protein binding pocket similarity measure based on comparison of clouds of atoms in 3D

Principle of ligand-based approaches

Many "rational" drug design efforts are based on the principle that structurally similar compounds are more likely to exhibit similar properties. Indeed, the observation that common substructural fragments lead to similar biological activities can be quantified from database analysis. A variety of methods, known collectively as Quantitative Structure-Activity Relationship (QSAR) methods, have been developed, essentially to search for similarities between molecules in large databases of compounds whose properties are known. The discovery of such a relationship allows the prediction of the physical and chemical properties of biologically active compounds, and supports the development of new theories to understand the phenomena observed.
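As a concrete illustration of this similarity principle, the following minimal sketch (not part of the original text; the fragment sets are hypothetical) computes the Tanimoto coefficient, a standard measure of 2D structural similarity, between two molecules represented as sets of substructural fragment identifiers:

    def tanimoto(fp_a, fp_b):
        """Tanimoto coefficient between two substructure fingerprints,
        each represented as a set of fragment identifiers."""
        union = len(fp_a | fp_b)
        return len(fp_a & fp_b) / union if union else 0.0

    # Hypothetical fragment sets for two molecules:
    mol_1 = {"benzene ring", "carboxyl", "hydroxyl"}
    mol_2 = {"benzene ring", "carboxyl", "amine"}
    print(tanimoto(mol_1, mol_2))  # 0.5: two shared fragments out of four overall

Under the similarity principle, the higher this coefficient, the more likely the two molecules are to share biological activities.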
Once a QSAR model has been built to encode this relationship between the chemical space and a given biological activity, it can guide the synthesis of new molecules, limiting the number of compounds to synthesize and test. The relationship between the structures of molecules and their properties or activities is usually established using statistical learning methods. The usual techniques are based on the characterization of molecules through a set of 1D, 2D or 3D descriptors. A model relating the descriptors that encode a molecule to its biological activity is built from a training dataset of molecules for which this activity is known. This model can then be used to predict the activity of a new molecule. Numerous studies show that the affinity of chemically diverse ligands cannot be predicted accurately (57); it is, however, reasonable to hope to discriminate ligand affinities in the nanomolar, micromolar and millimolar ranges. Decades of research in the fields of statistics and machine learning have provided a profusion of methods for that purpose. Their detailed presentation is far beyond the scope of this section, and we invite interested readers to refer to the classical textbooks (58; 59) for a thorough introduction. In this section we only give general methodological and historical considerations about their application in chemoinformatics.
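To make this descriptor-to-activity workflow concrete, here is a minimal sketch of a QSAR regression model, assuming scikit-learn as the toolkit (the text does not prescribe one); the descriptor matrix and activity values below are synthetic placeholders for real molecular data:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Synthetic placeholders: 100 molecules, each described by 20 descriptors,
    # with a known quantitative activity (e.g. a pIC50) used for training.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))                                 # 1D/2D/3D descriptors
    y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=100)  # activities

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = Ridge(alpha=1.0).fit(X_train, y_train)  # learn descriptors -> activity
    print("R^2 on unseen molecules:", model.score(X_test, y_test))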
Models can be grouped into two main categories depending on the nature of the property to be predicted. Models predicting quantitative properties, such as the degree of binding to a target, are known as regression models. Classification models, on the other hand, predict qualitative properties. In SAR analysis, most of the properties considered are in essence quantitative, but the prediction problem is often cast into the binary classification framework by introducing a threshold above which molecules are said to be globally active, and below which globally inactive. In the following, the term classification implicitly stands for such binary classification.
In order to build the model, the pool of molecules with known activity is usually split into a training set and a test set. The training set is used to learn the model. The learning problem consists in constructing a model that is able to predict the biological property on the molecules of the training set, without over-learning on it. This overfitting phenomenon can for instance be controlled using cross-validation techniques, which quantify the ability of the model to predict a subset of the training set that was left out during the learning phase. The test set is used to evaluate the generalization properties of the learned model, that is, its ability to make correct predictions on a set of unseen molecules. Different criteria can be used for this evaluation. In regression, it is typically quantified by the correlation between the predicted and the true activity values. In the classification framework, a standard criterion is the accuracy of the classifier, expressed as the fraction of correctly classified compounds. However, if one of the two classes is over-represented in the training set, and/or the costs of misclassification differ, it might be safer to consider the true and false positive and negative rates of classification. The true positive (resp. negative) rate accounts for the fraction of compounds of the positive (resp. negative) class that are correctly predicted, and the false positive (resp. negative) rate accounts for the fraction of compounds of the negative (resp. positive) class that are misclassified. In virtual screening applications for instance, where we typically do not want to misclassify a potentially active compound, models with low false negative rates are favored, even if they come at the expense of an increased false positive rate.
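The following sketch, again assuming scikit-learn and synthetic placeholder data, illustrates this protocol: activities are thresholded into active/inactive labels, a classifier is cross-validated on the training set, and the true/false positive and negative rates are read off the confusion matrix of the held-out test set:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_score, train_test_split

    # Synthetic placeholders: descriptors and a quantitative activity that is
    # thresholded to cast the problem into binary classification.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 20))
    activity = X[:, 0] + rng.normal(scale=0.5, size=200)
    y = (activity > 0.0).astype(int)          # 1 = globally active, 0 = inactive

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)

    clf = LogisticRegression()
    # Cross-validation on the training set controls overfitting.
    print("5-fold CV accuracy:", cross_val_score(clf, X_train, y_train, cv=5).mean())

    clf.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    print("true positive rate :", tp / (tp + fn))  # actives correctly predicted
    print("false negative rate:", fn / (tp + fn))  # actives missed: costly in screening
    print("false positive rate:", fp / (fp + tn))  # inactives wrongly flagged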
Because they usually require a limited set of uncorrelated variables as input, applying these models in chemoinformatics requires summarizing the information about the molecules into a limited set of features, which may not be a trivial task given the vast number of possible molecular descriptors. A popular way to address this problem in chemoinformatics is to rely on principal component analysis (PCA), which defines a limited set of uncorrelated variables from linear combinations of the initial pool of features, so as to account for most of their informative content. Alternatively, feature selection methods can be used to identify, among an initial pool of features, a subset relevant to the property to be predicted. Because molecular descriptors are sometimes costly to compute, a potential advantage of feature selection methods over PCA-based approaches is that they reduce the number of descriptors to be computed for the prediction of new compounds.
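Both routes can be sketched in a few lines, again assuming scikit-learn and synthetic placeholder descriptors; note how feature selection, unlike PCA, returns the indices of the only descriptors that would need to be computed for new compounds:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_regression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 50))            # 50 raw, possibly correlated descriptors
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)   # inject a correlated pair
    y = X[:, 0] + rng.normal(scale=0.1, size=100)     # hypothetical activity

    # PCA: a few uncorrelated components retaining most of the information.
    pca = PCA(n_components=10).fit(X)
    X_reduced = pca.transform(X)
    print("variance retained:", pca.explained_variance_ratio_.sum())

    # Feature selection: keep only the 10 descriptors most related to y, so
    # only these need to be computed for new compounds.
    selector = SelectKBest(f_regression, k=10).fit(X, y)
    print("descriptors to compute:", selector.get_support(indices=True))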

Introduction to SVM in virtual screening

One important contribution of this thesis is to explore the use of machine learning algorithms within the newly introduced chemogenomic framework, in order to predict protein-ligand interactions. The principle of chemogenomic approaches will be presented in chapter 3. Although this principle is quite simple (i.e. similar proteins are expected to bind similar ligands), to our knowledge, only a very limited number of studies propose computational methods able to handle chemogenomic data and to perform predictions. The main reason is that these data are not trivial to generate, to manipulate, and to use as input to computational methods in a manner relevant for making predictions.
We have proposed to use kernel methods in the context of Support Vector Machines (SVM) because, as will be explained in chapter 3, they allow easy manipulation and calculation in the chemogenomic space (i.e. the chemical space of small molecules joined to the biological space of proteins). Presenting the full mathematical framework of SVM and kernel methods is out of the scope of this thesis. In this section, we therefore only briefly introduce kernel methods in the context of SVM, since this type of algorithm underlies the computational method we propose to handle chemogenomic data.
The support vector machine algorithm was initially introduced in 1992 by Vapnik and coworkers in the binary classification framework (69; 70). Over the last decade, this method has gained considerable attention in the machine learning community, which led to the emergence of a whole family of statistical learning algorithms called kernel methods (71; 72). SVM has been successfully applied in many real-world applications, including, for instance, optical character recognition (73), text mining (74) and bioinformatics (75), often outperforming state-of-the-art approaches.
In recent years, support vector machines and kernel methods have also gained considerable attention in chemoinformatics. They generally offer good performance for supervised classification problems, and provide a flexible and computationally efficient framework to include relevant information and prior knowledge about the data and about the problems to be handled. We start this section with a brief introduction to SVM in the binary classification framework, and then highlight the particular role played by kernel functions in the learning algorithm.
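As a foretaste of the role played by the kernel, here is a minimal sketch, assuming scikit-learn and toy random fingerprints, of an SVM classifier trained with a precomputed Tanimoto kernel over binary fingerprints (a similarity measure known to be a valid kernel); the thesis's own kernels for the chemogenomic space are introduced in chapter 3:

    import numpy as np
    from sklearn.svm import SVC

    def tanimoto_kernel(A, B):
        """Tanimoto similarity between the rows of two binary fingerprint
        matrices; returns a (len(A), len(B)) Gram matrix."""
        inter = A @ B.T
        return inter / (A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter)

    # Toy random fingerprints and labels, standing in for real screening data.
    rng = np.random.default_rng(3)
    X = (rng.random((80, 64)) < 0.3).astype(float)
    y = (X[:, :8].sum(axis=1) > 2).astype(int)

    clf = SVC(kernel="precomputed").fit(tanimoto_kernel(X, X), y)

    X_new = (rng.random((5, 64)) < 0.3).astype(float)
    K_new = tanimoto_kernel(X_new, X)   # similarities to the training molecules
    print(clf.predict(K_new))           # predicted activity classes

Swapping in a different similarity measure only requires changing the kernel function, which is precisely the flexibility exploited in the chemogenomic setting.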

GPCRs and signal transduction

On the basis of homology with rhodopsin, GPCRs are predicted to be integral membrane proteins sharing a common global topology that consists of seven transmembrane alpha helices, an intracellular C-terminus, an extracellular N-terminus, three intracellular loops and three extracellular loops. The receptor folds into a tertiary structure resembling a barrel, with the seven transmembrane helices forming a cavity within the plasma membrane that serves as a ligand-binding domain, often covered by the second extracellular loop (EL-2). This gives rise to their other names, the 7-TM receptors or the heptahelical receptors (Figure 3.1).

Table of contents:

1 Introduction 
2 Background 
2.1 Molecules of life
2.2 Proteins, machinery of life
2.2.1 Enzymes
2.2.2 Receptors
2.3 Small molecules
2.4 Interactions between proteins and small molecules
2.5 Experimental methods to study protein-ligand interactions
2.5.1 Non structural approaches
2.5.1.1 Spectroscopic methods (UV, fluorescence)
2.5.1.2 Isothermal Titration Calorimetry (ITC)
2.5.1.3 Surface Plasmon Resonance (SPR)
2.5.2 Structural approaches
2.5.2.1 X-ray diffraction
2.5.2.2 Nuclear magnetic resonance (NMR)
2.6 Experimental high throughput screening (HTS)
2.7 Virtual screening
2.7.1 The molecule library
2.7.2 Structure-based methods
2.7.3 Ligand-based approaches
2.7.3.1 Descriptors
2.7.3.2 Principle of ligand-based approaches
2.7.4 Introduction to SVM in virtual screening
3 Virtual screening of GPCRs: an in silico chemogenomic approach 
3.1 Introduction to GPCRs
3.2 GPCRs and signal transduction
3.3 Targeting GPCRs
3.4 Introduction to in silico chemogenomic approach
3.5 Methods
3.5.1 Encoding the chemogenomic space within the SVM framework
3.5.2 Descriptors and similarity measures for small molecules
3.5.3 Descriptors and similarity measures for GPCRs
3.5.4 Data description
3.5.4.1 Filtering the GLIDA database
3.5.4.2 Building of the learning dataset
3.6 Results
3.6.1 Performance of protein kernels
3.6.2 Performance of ligand kernels
3.6.3 Impact of the number of training points on the prediction performance
3.6.4 Prediction performance of the chemogenomic approach on orphan GPCRs
3.7 Discussion
3.8 Conclusion
3.9 Additional files
4 Protein binding pocket similarity measure based on comparison of clouds of atoms in 3D 
4.1 Background
4.2 Methods
4.2.1 Convolution kernel between clouds of atoms
4.2.2 Related methods
4.2.3 Performance criteria
4.2.4 Data
4.3 Results
4.3.1 Kahraman Dataset
4.3.2 Homogeneous dataset (HD)
4.4 Discussion
4.5 Conclusion
4.6 Additional files
5 Discussion and Perspectives 
5.1 Description of the chemogenomic space
5.1.1 Description of the chemical space
5.1.2 Description of the biological space
5.1.2.1 Sequence-based approaches
5.1.2.2 Structure-based approaches
5.2 Extension of the proposed methods to SVM-based chemogenomics methods
5.3 Other structure-based kernels for proteins
5.4 The learning database
5.5 Extension to proteins of unknown structures by homology modeling
5.6 Relation with docking
5.7 From prediction of protein-ligand interactions to prediction of biological effects
Bibliography
