End-to-end feature extraction for chemogenomics


Molecular bioactivity prediction to assist the modern drug discovery process

The identification of therapeutic targets relies on overall knowledge of the disease under consideration. Several approaches have been developed to assist target identification. Differential gene expression analysis (which can also be translated into differential biological pathway expression analysis) aims at detecting the genes whose expression levels are the most perturbed in patients compared with healthy controls. Such differential analysis can also be performed on other types of data describing gene « activity », such as copy number variation, mutation status, or epigenetic factors like methylation. In chapter 5, we will provide an example of how such data can be fruitfully combined in the context of breast cancer. Once the target is identified, a range of computational methods can help identify ligands.

Computational approaches for molecular bioactivity prediction

One of the main methods is « docking » [13, 14, 15]. This technique is a 3D-structure-based approach in which the binding energy between a small molecule and the target protein is estimated according to a set of molecular mechanics equations that model the physical interactions between the protein binding pocket and the ligand. Docking methods usually combine two components: a search algorithm that positions the ligand within the pocket, and a scoring function that evaluates the strength of the interaction. This score is then used to rank the molecules.
The application of docking is limited by the fact that it requires the 3D structure of the target, which is unavailable for many important protein targets such as membrane receptors. These 3D structures are obtained experimentally, either by X-ray crystallography, which requires a crystallised form of the protein, or by nuclear magnetic resonance, which is essentially limited to small proteins. Therefore, in practice, docking cannot be applied at large scale in the protein space, although it can be used at large scale in the chemical space.
More generally, to be applicable at the proteome scale, prediction methods need to be independent of 3D-structural data. An alternative is to develop more integrative approaches that take into account widely available data such as target protein sequences, drug chemical structures, and known drug-target network information.
Such a methodology is primarily inspired by the field of machine learning (ML). Given a learning data set, machine learning methods discover some underlying rule about the data and provide a « model » (or a « predictor ») that can subsequently be used on new data. More precisely, if $n$ examples (or equivalently data points, observations, or samples) are given, these $n$ points $\{x_1, \dots, x_n\}$ form the data set. Each data point is described by a $p$-dimensional vector $x_i \in \mathbb{R}^p$, $i \in \{1, \dots, n\}$. The $p$ coordinates describing the data samples are called features, or equivalently descriptors, variables or attributes. When the predicted output is a real value (for instance, the affinity of an interaction), the ML predictor is called a regression model. When the predicted output is categorical (for example, a binary statement: interaction or no interaction), the ML predictor is called a classification model, or simply a classifier.
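The distinction between a regression model and a classifier can be illustrated with a minimal sketch. The toy data set, the function names and the choice of a k-nearest-neighbour predictor below are assumptions made for illustration only; they are not the models studied in this thesis.

```python
from collections import Counter
from math import dist

# Toy data set (illustrative values): n = 4 examples, each described by a
# p = 2 dimensional feature vector.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
affinity = [1.0, 2.0, 8.0, 9.0]   # real-valued output  -> regression
interacts = [0, 0, 1, 1]          # categorical output  -> classification


def nearest(X, query, k):
    """Indices of the k training points closest to `query` (Euclidean distance)."""
    return sorted(range(len(X)), key=lambda i: dist(X[i], query))[:k]


def knn_regress(X, y, query, k=2):
    """Regression model: average the outputs of the k nearest neighbours."""
    idx = nearest(X, query, k)
    return sum(y[i] for i in idx) / len(idx)


def knn_classify(X, y, query, k=3):
    """Classifier: majority label among the k nearest neighbours."""
    idx = nearest(X, query, k)
    return Counter(y[i] for i in idx).most_common(1)[0][0]
```

Both predictors share the same feature vectors; only the nature of the output (a real value or a category) changes the type of model.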
Fortunately, the amount of data and information produced by chemical research has grown large enough that applying ML methods to predict interactions between chemicals and proteins at genome scale has become possible. « Drug-target interaction » (DTI) prediction, or equivalently drug « virtual screening » (VS), can be viewed as the classification of protein/molecule pairs as interacting or non-interacting.
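As a rough sketch of this formulation, the pair-level data set can be assembled as follows. All identifiers are hypothetical, and labelling every unlisted pair as non-interacting is a simplifying assumption of the example: in practice, an unlisted pair may simply never have been tested.

```python
from itertools import product

# Hypothetical identifiers, for illustration only.
proteins = ["P1", "P2"]
molecules = ["m1", "m2", "m3"]
known_interactions = {("P1", "m1"), ("P2", "m3")}


def build_pair_dataset(proteins, molecules, known_interactions):
    """Return (pairs, labels), with one example per (protein, molecule) pair.

    Label 1 marks a known interaction. Label 0 marks every other pair, which
    is an assumption of this sketch rather than an experimental fact.
    """
    pairs = list(product(proteins, molecules))
    labels = [1 if pair in known_interactions else 0 for pair in pairs]
    return pairs, labels


pairs, labels = build_pair_dataset(proteins, molecules, known_interactions)
```

A classifier is then trained on numerical representations of these pairs rather than on the identifiers themselves.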

Numerical encoding for molecules and proteins

Computationally manipulating and analysing proteins or small molecules raises the problem of representing these data as vectors or, in other words, of defining a set of binary or real-valued descriptors for these data and stacking them to form a vector. This process of converting raw data into a form more suitable for an algorithm is called featurisation.
The representation of the data must be suitable in the sense that it should contain information relevant to the prediction problem considered, in order to help the algorithm map the input representation to its output. The choice of representation for a specific problem is not trivial and is, in practice, the main lever for improving performance.
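A minimal sketch of featurisation, assuming a protein given as a one-letter amino acid sequence, is the amino acid composition: each of the 20 standard residues contributes one real-valued descriptor. The function name and the example sequence are illustrative; this is only one simple descriptor among many possible ones.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Featurise a protein sequence as a 20-dimensional real-valued vector.

    Entry j is the fraction of standard residues equal to AMINO_ACIDS[j].
    Characters outside the 20 standard letters are ignored.
    """
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acid")
    return [count / total for count in counts]


# Illustrative sequence fragment (not taken from any dataset of the thesis).
features = amino_acid_composition("MKTAYIAKQR")
```

Such a vector discards the order of the residues, which illustrates why the choice of representation matters: information that is not encoded cannot be used by the learning algorithm.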

Table of contents:

I Context
1 Introduction
1.1 Introduction to therapeutic research
1.1.1 Historical perspectives on drug discovery
1.1.2 The modern drug discovery process
1.2 Molecular bioactivity prediction to assist the modern drug discovery process
1.2.1 Computational approaches for molecular bioactivity prediction
1.2.2 Virtual screening assists the drug discovery process at various stages
1.3 Challenges in molecular bioactivity prediction
1.4 Contributions of the thesis
2 Representation and storage of molecules and proteins
2.1 Numerical encoding for molecules and proteins
2.1.1 Molecule numerical descriptors
2.1.2 Protein numerical descriptors
2.2 Similarity measures for molecules and proteins
2.2.1 Protein similarity measures
2.2.2 Graph-based chemical similarity measures
2.2.3 Other chemical similarity measures
2.3 Data and toolkits for drug virtual screening
2.3.1 Publicly available databases
2.3.2 Gold standard datasets
2.3.3 Freely available libraries and toolkits
II Drug virtual screening from expert-based representations for molecules and proteins
3 Drug virtual screening with expert-based representations
3.1 Drug virtual screening frameworks
3.1.1 Ligand-based and chemogenomic frameworks
3.1.2 Singletask and multitask frameworks
3.2 Feature-based and similarity-based approaches for chemogenomics
3.2.1 Definitions of feature-based and similarity-based approaches
3.2.2 State-of-the-art in feature-based and similarity-based approaches for chemogenomics
4 A kernel-based approach for chemogenomics
4.1 Materials and methods
4.1.1 Kernel methods for chemogenomics
4.1.2 Protein kernels
4.1.3 Molecule kernels
4.1.4 Evaluation of prediction performance
4.1.5 Datasets
4.2 Results and discussion
4.2.1 Kernel selection and parametrisation
4.2.2 Performance of multitask approaches in orphan situations
4.2.3 Impact of the similarity of the training examples to the test set
4.2.4 Multitask approaches on reduced training sets
4.2.5 Impact of the distance of the intra-task examples to the query pair
4.2.6 Specificity prediction within families of proteins
4.3 Illustration of the method on withdrawn drugs
4.4 Discussion: comparison to other methods
4.5 Conclusion
5 Two applications of drug virtual screening
5.1 Drug virtual screening for cystic fibrosis research
5.1.1 Introduction
5.1.2 Impact of mutations in the CFTR gene on the function of CFTR protein
5.1.3 Therapies for the rescue of CFTR processing or activation
5.1.4 Prediction of protein targets using chemogenomics for CFTR new modulators
5.1.5 Conclusion and perspectives
5.2 Virtual screening for triple-negative breast cancer
5.2.1 Introduction
5.2.2 Characterisation of the hits
5.2.3 Identification of cancer biomarkers and drug mechanisms of action via drug response assays
III Drug virtual screening with end-to-end extracted representations for molecules and proteins
6 End-to-end encoding of graphs and sequences
6.1 End-to-end encoding of sequences
6.1.1 End-to-end encoding of protein sequences
6.1.2 End-to-end encoding of SMILES molecular representation
6.2 End-to-end encoding of undirected graphs: applications to molecular graphs
6.2.1 Graph convolutional neural networks
6.2.2 Aggregation functions for graph convolutional networks
6.2.3 Graph-level representation combination for graph convolutional networks
6.2.4 Conclusion for our studies
7 Deep-learning for drug virtual screening
7.1 Deep learning-based feature encoders for ligand-based drug virtual screening
7.1.1 Feed-forward neural networks on expert-based features for ligand-based drug virtual screening
7.1.2 End-to-end feature extraction with graph neural networks for ligand-based drug virtual screening
7.2 Molecular graph and protein sequence encoders for chemogenomics
7.3 Summary and perspectives
8 End-to-end feature extraction for chemogenomics
8.1 Improving end-to-end extracted representation
8.1.1 Graph convolutional network architecture
8.1.2 Multitask learning for end-to-end feature extraction
8.2 Improving end-to-end extracted representations for chemogenomics
8.2.1 Materials
8.2.2 Methods
8.2.3 Results and discussion: comparison of the chemogenomic neural network to baselines
8.2.4 Evaluation of graph convolutional network architectures
8.2.5 Evaluation of protein sequence network architectures
8.2.6 Evaluation of the neural architecture combining end-to-end extracted representations of molecules and proteins
8.2.7 Evaluation of the combination of data-blinded and data-driven features
8.2.8 Evaluation of curriculum learning for chemogenomics
8.3 Conclusions
IV Perspectives
9 Short-term perspectives for chemogenomics
9.1 Representation learning for drug-target interaction prediction with 3-dimensional data
9.2 Integration of heterogeneous data sources
10 Conclusion
V Appendices
A Machine learning in brief
A.1 Machine learning basics
A.1.1 Supervised learning
A.1.2 Other machine learning settings
A.1.3 Feature pre-processing and engineering
A.1.4 Model complexity: bias-variance trade-off and regularisation
A.2 Evaluation metrics and procedures for supervised learning
A.2.1 Model evaluation
A.2.2 Model selection
A.3 Linear models
A.3.1 Linear regression
A.3.2 Logistic regression
A.3.3 Regularised models
A.4 Tree based models
A.4.1 Decision Tree
A.4.2 Bagging trees
A.4.3 Random Forest
A.5 Similarity based models
A.5.1 Kernels and Reproducing Kernel Hilbert Space
A.5.2 Kernel trick
A.5.3 Representer Theorem
A.5.4 Kernel Ridge regression
A.5.5 Large Margin Classifier (general framework for classification with kernels)
A.5.6 Support Vector Machines (SVM)
A.6 Unsupervised learning
A.6.1 Dimensionality reduction
A.6.2 Clustering
A.7 Multitask learning
A.7.1 The singletask and multitask learning frameworks
A.7.2 Tasks similarity
A.7.3 Transfer learning
A.7.4 Feature-based multitask and transfer learning without tasks descriptors
A.7.5 Parameter-based multitask and transfer learning without task descriptors
A.7.6 Parameter-based multitask and transfer learning with task descriptors
A.8 Deep learning
A.8.1 Introduction to Artificial neural networks
A.8.2 Useful deep neural network architectures
A.8.3 Multitask learning with deep neural networks
A.8.4 Considerations about the training of deep neural networks
A.8.5 Interpretability and understanding of deep neural networks
B Multi-kernel learning for drug virtual screening
C An historical perspective on graph representation learning
C.1 Graph shallow embeddings
C.2 Graph neural networks
C.3 Pioneering work on molecular graph representation learning
C.3.1 Graph neural network and graph convolutional network seminal studies
C.3.2 Approximate inference on undirected graphs
C.3.3 RNN on directed acyclic graphs
C.3.4 Convolutional neural networks on Lewis formula images
D Analysis of graph representation learning
E Inverse Drug Design via automatic molecule generation
E.1 Generative models
E.2 Molecule generation approaches
E.2.1 Molecular fingerprint generation
E.2.2 SMILES generation
E.2.3 Graph generation
E.3 Biasing the molecule generation process
E.3.1 Guiding molecular generators via overfitting
E.3.2 Guiding molecular generators via incorporation of the property in the input
E.3.3 Guiding molecular SMILES generators’ latent encoding via Bayesian optimisation
E.3.4 Guiding molecular generators via reinforcement learning
E.4 Evaluation of molecule generators
E.4.1 Metrics for de novo drug design models
E.4.2 Brief comparison of de novo drug design models