Prediction of the Potency of Mammalian Cyclooxygenase Inhibitors with Ensemble Proteochemometric Modelling


Synergy between ligand and target space

An analysis of the drug-target interaction network demonstrated that a given ligand interacts with six protein targets on average at therapeutic concentrations [Mestres et al. (2009)]. Targets with correlated bioactivity profiles might be related or distant from a sequence similarity standpoint. It has recently been shown that the classification of class A GPCRs based on ligand activity differs considerably from that obtained with a classic description of proteins based upon sequence alignments [Lin et al. (2013); Westen and Overington (2013)]. Hence, full sequence similarity derived from multiple sequence alignments does not generally correlate with similar ligand affinity. Nevertheless, kinases exhibiting a sequence identity higher than 60% tend to have similar ATP-binding sites and hence tend to be inhibited by similar compounds [Vieth et al. (2005)]. Similarly, compound binding is more conserved between human and rat orthologous proteins than between paralogues [Kruger and Overington (2012); Westen et al. (2012)]. Thus, to better understand intra-family and inter-species selectivity, both the target and the compound space need to be considered simultaneously.
In ligand space, chemogenomic approaches relying only on ligand data have shown that there is an unequal distribution of ligand data. This is due to the fact that some target classes (e.g. GPCRs or kinases) have been traditionally regarded as more interesting from a medicinal chemistry standpoint, and are thus overrepresented in bioactivity databases [Gregori-Puigjané and Mestres (2008b)]. Moreover, while some chemogenomic methods implicitly consider target information using bioactivity profiles of groups of similar ligands, i.e. the interaction between these compounds and a panel of targets, they are outperformed by techniques that explicitly consider target information [Gregori-Puigjané and Mestres (2008a); Rognan (2007)]. In addition, bioactivity profiles for related compounds are not always available.
In target space, techniques have been employed that exploit the available structural or sequence information and rely on groups of related targets in order to identify possible off-target effects and drug specificity for a particular target of interest [Rognan (2007)]. Based on the inverse similarity principle, related proteins are likely to interact with similar compounds. As in the previous case, the unavailability of data also constitutes a limitation for target-based chemogenomics. The combination of ligand and target data allows the creation of predictive models that can rationalize, e.g., viral or cancer cell-line selectivity, whereas models exclusively based on ligands cannot explain the role of the target in selectivity [Westen et al. (2011b)]. Merging data from ligand and target sources within a single machine learning model allows the prediction of the most suitable pharmacological treatment for a given genotype (personalized medicine), which ligand-only and protein-only approaches are not able to do. This is precisely the underlying principle of proteochemometrics (PCM), which employs both ligand and target features simultaneously, and which therefore enables the deconvolution of both the target and the chemical spaces in parallel [Lapinsh, Prusis, and Gutcaits (2001); Westen et al. (2011a)].

PCM as a practical approach to use chemogenomics data

PCM modelling is a computational technique which combines both ligand information and target information within a single predictive model in order to predict an output variable of interest (usually the activity of a molecule in a particular biological assay) [Lapinsh, Prusis, and Gutcaits (2001); Westen et al. (2011a)]. It is this combination of orthogonal information that sets PCM apart from both QSAR and chemogenomics [Horst et al. (2011); Rognan (2007)]. Generally, the term target refers to proteins, since the majority of PCM models in the literature have been devoted to the study of the activity of compounds on protein targets. Yet, target can also refer to a certain protein binding pocket (to allow distinction between binding modes, protein conformations, or allosteric/orthosteric binding) [Bahar, Chennubhotla, and Tobi (2007)], or even to a cell-line [Menden et al. (2013)]. Each binding site and each binding mode can be regarded (computationally) as a different target.
A PCM model is trained on a data set composed of a series of targets and compounds, where compounds have been measured on as many targets as possible (illustrated in Figure .1.1). The simultaneous modelling of the target and the ligand space permits a better understanding of complex drug-target interactions (e.g. selectivity) [Keiser et al. (2007); Ning, Rangwala, and Karypis (2009); Paolini et al. (2006); Wassermann, Geppert, and Bajorath (2009)] than would be possible with chemogenomics. Indeed, modelling compound and target data together allows the effect of target and chemical variability on bioactivity to be evaluated (e.g. the effect of protein mutations or of chemical substructures). Thus, the aim of PCM is the complete modelling of the compound-target interaction space (Figure .1.1), including the prediction of the bioactivity of novel compounds on yet untested targets.
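The data layout described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited study: the descriptor values, the identifiers (cpd_A, target_1) and the helper build_pcm_rows are hypothetical, and real PCM work would use, for example, compound fingerprints and sequence-derived target descriptors. Each measured compound-target pair becomes one training row formed by concatenating the compound descriptors with the target descriptors.

```python
# Hypothetical toy descriptors (assumed values, for illustration only).
compound_desc = {
    "cpd_A": [1.0, 0.0, 1.0],
    "cpd_B": [0.0, 1.0, 1.0],
}
target_desc = {
    "target_1": [0.2, 0.8],
    "target_2": [0.9, 0.1],
}

# Measured (compound, target, activity) triples. The compound-target
# matrix is usually incomplete: cpd_B was not measured on target_2.
measurements = [
    ("cpd_A", "target_1", 7.2),
    ("cpd_A", "target_2", 5.1),
    ("cpd_B", "target_1", 6.4),
]


def build_pcm_rows(measurements, compound_desc, target_desc):
    """Return (X, y) with one row per measured compound-target pair."""
    X, y = [], []
    for compound, target, activity in measurements:
        # Concatenate ligand and target descriptors into a single row.
        X.append(compound_desc[compound] + target_desc[target])
        y.append(activity)
    return X, y


X, y = build_pcm_rows(measurements, compound_desc, target_desc)
for row, activity in zip(X, y):
    print(row, activity)
```

Because the target descriptors are part of every row, a model trained on X can in principle be queried for the unmeasured pair (cpd_B, target_2) by building the corresponding concatenated row.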
Initial attempts to incorporate descriptions of several proteins and their ligands in a single QSAR model involved modelling of the interaction between mutated glucocorticoid receptors and Deoxyribonucleic Acid (DNA) [Tomic, Nilsson, and Wade (2000); Zilliacus et al. (1992)]. The first full-scale PCM study involving different proteins was devoted to the interaction of chimeric melanocortin receptors with chimeric peptides at Uppsala University [Prusis et al. (2001)]. The name proteochemometrics was coined later by the same research group [Lapinsh, Prusis, and Gutcaits (2001)]. Since then, PCM has been applied to diverse data sets (Table .1.1) [Bock and Gough (2005); Lapinsh et al. (2002)]. While the current chapter focuses on recent developments in the field between 2010 and 2013, a comprehensive discussion of PCM-related work before 2010 has been presented in a previous review by Westen et al. (2011a), to which we refer the reader.

PCM in predicting ligand binding free energy

The application of PCM to docking might not be directly obvious. Yet, the concepts used in PCM, quantitatively relating ligand and protein-side descriptors to affinity/activity, very much resemble empirical scoring functions. Molecular docking has led to the discovery of active compounds [Laine et al. (2010)], yet it suffers from several well-described limitations, among which is the relatively low performance in the prediction of interaction energies [Yuriev, Agostino, and Ramsland (2011); Yuriev and Ramsland (2013)]. In contrast, PCM models can predict the difference in Gibbs free energy (ΔG = RT ln Kd, with Kd the dissociation constant) between the initial state, where the protein and the compound do not interact, and the final ligand-target complex. Therefore, the principles of PCM can be applied to develop PCM-based scoring functions.
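As a worked example of the link between affinity and free energy, the sketch below converts a dissociation constant into a standard binding free energy using the standard thermodynamic relation ΔG = RT ln(Kd / 1 M), which equals -RT ln Ka for the association constant Ka. The function names, the temperature of 298.15 K and the 1 M standard state are assumptions made for illustration and are not taken from the cited studies.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kJ/mol) from Kd in mol/L.

    Assumes a 1 M standard state, so the argument of the logarithm is
    the numerical value of Kd in mol/L.
    """
    return R_KJ * temperature_k * math.log(kd_molar)


def pkd_to_free_energy(pkd, temperature_k=298.15):
    """Same quantity computed from pKd = -log10(Kd)."""
    return -R_KJ * temperature_k * math.log(10) * pkd


# A nanomolar binder (Kd = 1e-9 M, i.e. pKd = 9).
print(round(binding_free_energy(1e-9), 1))
# Free energy equivalent of one pKd unit at 298.15 K.
print(round(R_KJ * 298.15 * math.log(10), 2))
```

Under these assumptions a nanomolar binder corresponds to roughly -51 kJ/mol, and one log unit of affinity to about 5.7 kJ/mol, which gives a sense of scale for errors reported in pKd units.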
Kramer and Gedeck (2011a) demonstrated this concept by building a structure-based PCM scoring function. Their method trains a bagged stepwise multiple linear regression (MLR) model on a subset of 1,387 protein-ligand complexes extracted from the PDBbind09-CN database [Wang et al. (2004)]. Subsequently, a new compound-target interaction descriptor based upon distance-binned Crippen-like atom type pairs was introduced. The best model outperformed commercially available scoring functions assessed on the PDBbind09 database and was able to explain 48% of the variance of the external set, with an RMSE of 1.44 pKd units. Although similar methods had been previously proposed [Artemenko (2008); Das, Krein, and Breneman (2010); Deng, Breneman, and Embrechts (2004); Sotriffer et al. (2008); Zhang, Golbraikh, and Tropsha (2006)], this was the first study in which a sufficiently large validation was carried out to ascertain the model's predictive power. Additionally, the implementation of bagged stepwise MLR and PLS enabled the evaluation of the importance of ligand and target descriptors for the PCM model. Similarly, a subsequent study reported the development of a scoring function based upon the CSAR-NRC HiQ benchmark data set (http://csardock.org) [Kramer and Gedeck (2011b)]. The best model exhibited acceptable statistics, with a cross-validated R² = 0.55 and RMSE = 1.49 [Kramer and Gedeck (ibid.)]. Finally, Koppisetty et al. (2013) were able to predict, for the first time, ligand binding free energies in which the enthalpic and entropic contributions for a given binding event were deconvoluted. Therein, the authors demonstrated the importance of adding ligand descriptors (QikProp and LigParse, calculated in the Schrödinger suite [Schrödinger (2013)]) to the models in addition to 3D ligand-protein interaction descriptors.
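The idea of bagging a linear model can be illustrated with a deliberately simplified sketch. It is not the scoring function of Kramer and Gedeck: it uses a single hypothetical descriptor and ordinary least squares without stepwise variable selection, and all data values and function names are invented for illustration. Each model is fitted on a bootstrap resample of the training pairs, and the final prediction is the average over all models.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate resample (all x identical): predict the mean.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bagged_fit(xs, ys, n_models=25, seed=0):
    """Fit one line per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Average the predictions of all bootstrap models."""
    return sum(s * x + b for s, b in models) / len(models)


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5


# Hypothetical toy data: one interaction descriptor versus pKd.
descriptor = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
pkd = [4.9, 5.6, 5.8, 6.7, 7.1, 7.4, 8.2, 8.4]

models = bagged_fit(descriptor, pkd)
fitted = [bagged_predict(models, x) for x in descriptor]
print(round(rmse(pkd, fitted), 2))
```

The RMSE printed here is computed on the training data and is therefore optimistic. The studies above report performance on external or cross-validated data, which is the relevant comparison.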
As demonstrated above, PCM overlaps with methods originating from the structure-based field, because PCM in principle describes any method that relates ligand features and protein/target features on a large scale to an output variable of interest. Another source of complementary information is that from divergent and convergent homologous sequences, which allows PCM models to extrapolate the bioactivity of ligands to the same protein target in different species, as shown below.


PCM as an approach to extrapolate bioactivity data between species

Given that PCM considers bioactivity data from related targets, these related targets can also include similar targets from different species. Within a group of related targets, a distinction can be made from an evolutionary standpoint between gene pairs originating from intra-species gene duplication events (paralogy, within species) and those originating from speciation events (orthology, across species) [Koonin (2005)]. Since orthologous genes tend to maintain the original function, binding modes also tend to be more conserved than in paralogues, where the original protein function is less conserved.
This has also been shown to hold for the affinities of ligands binding to these orthologues in a recent analysis of bioactivity data by Kruger and Overington (2012). The authors demonstrated that the same small molecule exhibits similar binding affinities when acting on orthologues (though some exceptions were found, e.g. the histamine H3 receptor). Moreover, the authors verified that larger differences in binding affinity are observed for paralogues than for orthologues by analyzing the differences in binding for a total of 20,309 compounds on 516 human targets, with 651 being the final number of orthologous pairs. These observations aid in optimizing ligands for their interaction with conserved residues across a given protein family, making them more desirable lead compounds (by avoiding their interaction with unrelated targets) [Lounkine et al. (2012)]. In the field of PCM, Lapinsh et al. (2002) demonstrated for the first time the capability of PCM to successfully combine the pKi values of 23 organic compounds on 17 human (paralogues) and 4 rat (orthologues) aminergic GPCRs. The authors were able to deconvolute the binding site interactions into two types, namely those involved in specificity and those involved in affinity. Therefore, compound design can be envisioned from the viewpoint of affinity or specificity. Similarly, the contribution to compound affinity of the TM regions involved in the interactions between aminergic GPCRs and compounds was also quantified. For example, TM regions 2, 3, 4, 6 and 7 are responsible for low overall affinity in 2 receptors; however, the same regions are positive contributors to overall high affinity in 1 receptors. Westen et al. (2012) built on this by including in a PCM model bioactivity data from four human and rat adenosine receptors (A1, A2A, A2B and A3).
The authors screened a commercial chemical library composed of 791,162 compounds with the most predictive PCM model obtained, which exhibited Q² and RMSE values of 0.73 and 0.61 pKi units, respectively. Prospective experimental validation led to the discovery of new high-affinity inhibitors, including a compound with a pKi value of 8.1 on the A1 receptor. Finally (chapter .5), ensemble PCM was applied to model the pIC50 values of 3,228 distinct compounds on 11 mammalian cyclooxygenases (COX) [Cortes-Ciriano et al. (2015)]. The final ensemble PCM model, trained on the cross-validation predictions of a panel of 282 RF, SVM and Gradient Boosting Machine (GBM) models, each trained with different hyperparameter values, led to predictions on the test set with RMSE and R₀² values of 0.71 and 0.65, respectively. Additionally, the description of compounds with unhashed Morgan fingerprints permitted a chemically meaningful model interpretation, which highlighted chemical moieties responsible for selectivity towards COX-2, in agreement with the literature [Cortes-Ciriano et al. (ibid.)].
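The way an ensemble yields both a prediction and an estimate of its uncertainty can be sketched as follows. This is a simplification and an assumption on our part: it uses an unweighted average of the base-model predictions rather than a model trained on cross-validation predictions as in the study above, and the pIC50 values and the function name ensemble_predict are hypothetical. The spread of the base-model predictions for a compound serves as a rough indicator of how much the models disagree.

```python
import statistics


def ensemble_predict(model_predictions):
    """Return per-compound mean and sample standard deviation.

    model_predictions[m][i] is the prediction of model m for compound i.
    At least two models are required for the standard deviation.
    """
    means, spreads = [], []
    # zip(*...) regroups the values so that each tuple holds all model
    # predictions for one compound.
    for per_compound in zip(*model_predictions):
        means.append(statistics.mean(per_compound))
        spreads.append(statistics.stdev(per_compound))
    return means, spreads


# Hypothetical pIC50 predictions of three base models for four compounds.
predictions = [
    [6.1, 7.4, 5.0, 8.2],  # e.g. a random forest
    [6.5, 7.0, 5.4, 7.8],  # e.g. a support vector machine
    [6.3, 7.2, 4.9, 8.3],  # e.g. a gradient boosting machine
]

means, spreads = ensemble_predict(predictions)
for mean, spread in zip(means, spreads):
    print(f"{mean:.2f} +/- {spread:.2f}")
```

A larger spread suggests a less reliable prediction, which is the intuition behind using the ensemble standard deviation to define confidence intervals.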

Table of contents :

.1 Polypharmacology Modelling Using Proteochemometrics (PCM)
.1.1 Introduction
.1.1.1 Available bioactivity data is growing: but can we make sense of it?
.1.1.2 Synergy between ligand and target space
.1.1.3 PCM as a practical approach to use chemogenomics data
.1.1.4 Practical relevance of PCM
.1.2 Machine Learning in PCM
.1.2.1 Support Vector Machines (SVM)
.1.2.2 Random Forests (RF)
.1.2.3 Gaussian Processes (GP)
.1.2.4 Collaborative Filtering (CF)
.1.3 PCM Applied to Protein Target Families
.1.3.1 G protein-coupled receptors
.1.3.2 Kinases
.1.3.3 Histone modification and DNA methylation
.1.3.4 Viral mutants
.1.4 Novel Techniques and Applications in PCM
.1.4.1 Novel target similarity measure
.1.4.2 Including 3D information of protein targets in PCM
.1.4.3 PCM in predicting ligand binding free energy
.1.4.4 PCM as an approach to extrapolate bioactivity data between species
.1.4.5 PCM applied to pharmacogenomics and toxicogenomics data
.1.4.6 Other potential PCM applications
.1.5 PCM Limitations
.1.6 Conclusions
.2 Predictive Bioactivity Modelling
.2.1 Compound standardization
.2.2 Descriptors
.2.2.1 Target descriptors
.2.2.2 Ligand descriptors
.2.2.3 Cross-term descriptors
.2.3 Statistical Preprocessing
.2.4 Generation of PCM Models
.2.5 Commonly used Algorithms
.2.6 Validation of PCM Models
.2.6.1 Statistical metrics
.2.7 Assessment of Maximum and Minimum Achievable Model Performance
.2.8 Conformal Prediction
.2.8.1 Regression
.2.8.2 Classification
Proteochemometric Modelling in a Bayesian Framework
.3 Proteochemometric Modelling in a Bayesian Framework
.3.1 Introduction
.3.2 Materials and Methods
.3.2.1 Data sets
.3.2.2 Descriptors
.3.2.3 Modelling with Bayesian inference
.3.2.4 Computational details
.3.2.5 Assessment of maximum model performance
.3.2.6 Interpretation of ligand substructures
.3.3 Results
.3.3.1 Model validation
.3.3.2 Predicted confidence intervals follow the cumulative density function of the Gaussian distribution
.3.3.3 Analysis of GP performance per target
.3.3.4 Model interpretation of ligand descriptors
.3.4 Discussion
.3.5 Conclusion
Benchmarking the Influence of Simulated Experimental Errors in QSAR
.4 Benchmarking the Influence of Simulated Experimental Errors in QSAR
.4.1 Introduction
.4.2 Materials and Methods
.4.2.1 Data sets
.4.2.2 Data sets
.4.2.3 Molecular Representation
.4.2.4 Molecular Representation
.4.2.5 Compound Descriptors
.4.2.6 Model generation
.4.2.7 Machine Learning Implementation
.4.2.8 Simulation of Noisy Bioactivities
.4.2.9 Experimental Design
.4.3 Results
.4.4 Discussion
Prediction of the Potency of Mammalian Cyclooxygenase Inhibitors with Ensemble Proteochemometric Modelling
.5 Isoform Selectivity Prediction: COX
.5.1 Introduction
.5.2 Materials and Methods
.5.2.1 Data set
.5.2.2 Descriptors
.5.2.3 Machine learning implementation
.5.2.4 Model generation
.5.2.5 Model validation
.5.2.6 Assessment of maximum model performance
.5.2.7 Ensemble modelling
.5.2.8 Estimation of the error of individual predictions
.5.2.9 Interpretation of compound substructures
.5.3 Results
.5.3.1 Analysis of the chemical and the target space
.5.3.2 PCM validation
.5.3.3 PCM models are in agreement with the maximum achievable performance
.5.3.4 PCM outperforms both Family QSAR and Family QSAM on this data set
.5.3.5 PCM outperforms individual QSAR models
.5.3.6 Model ensembles exhibit higher performance than single PCM models
.5.3.7 The ensemble standard deviation enables the definition of informative confidence intervals
.5.3.8 Ensemble modelling enables the prediction of uncorrelated human COX inhibitor bioactivity profiles
.5.3.9 Model performance per target is related to compound diversity
.5.3.10 Interpretation of compound substructures
.5.4 Discussion
.5.5 Conclusion
Large-scale Cancer Cell-Line Sensitivity Prediction
.6 Large-scale prediction of growth inhibition patterns on the NCI60 cancer cell-line panel
.6.1 Introduction
.6.2 Materials and Methods
.6.2.1 Data sets
.6.2.2 Compound descriptors
.6.2.3 Compound clustering
.6.2.4 Model generation
.6.2.5 Model validation
.6.2.6 Conformal prediction
.6.2.7 Pathway-drug associations
.6.2.8 Comparison to previous methods
.6.3 Results
.6.3.1 Summary of the cell-line profiling data set views
.6.4 Discussion
Bibliography