Deep learning for metagenomics using embeddings


Unsupervised Deep Self-Organising Maps

In the unsupervised setting, feature selection uses no label information, and the algorithm performs only the first step, a forward pass. In this forward pass, we construct a deep structure layer by layer, where each layer consists of the cluster representatives from the previous level. A natural question is whether such unsupervised feature selection can benefit a prediction task. Although we cannot currently provide a theoretical foundation for it, there is an intuition for why deep unsupervised feature selection is expected to perform well, and does perform better in practice. Real data are always noisy, and a “good” clustering or dimensionality reduction can significantly reduce the noise. If features are grouped into clusters of “high quality”, then it is easier to detect a signal in the data, and the generalization performance of the classifier is higher. The hierarchical feature selection acts here as a filter, and a filter with multiple layers seems to perform better than a one-layer filter.
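The layer-wise forward pass can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the 1-D map, the learning-rate and radius schedules, the layer sizes, and all function names (`train_som_1d`, `select_representatives`, `deep_som_selection`) are assumptions made for the example. Each layer trains a small SOM on the surviving feature vectors and keeps, for every unit, the feature closest to that unit.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som_1d(vectors, n_units, epochs=30, seed=0):
    """Train a tiny 1-D SOM and return its unit weight vectors."""
    rng = random.Random(seed)
    # Initialise the units from randomly chosen input vectors.
    units = [list(v) for v in rng.sample(vectors, n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs      # shrinks from 1 towards 0
        rate = 0.5 * decay                # learning rate (assumed schedule)
        radius = (n_units / 2) * decay    # neighbourhood radius on the 1-D grid
        order = list(range(len(vectors)))
        rng.shuffle(order)
        for i in order:
            v = vectors[i]
            # Best-matching unit: the unit closest to the input vector.
            bmu = min(range(n_units), key=lambda u: sq_dist(units[u], v))
            for u in range(n_units):
                gap = abs(u - bmu)
                if gap <= radius:
                    step = rate / (1 + gap)  # weaker pull for farther units
                    units[u] = [w + step * (x - w) for w, x in zip(units[u], v)]
    return units


def select_representatives(vectors, n_units, seed=0):
    """Return sorted indices of the vectors closest to each trained unit."""
    units = train_som_1d(vectors, n_units, seed=seed)
    chosen = {
        min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], unit))
        for unit in units
    }
    return sorted(chosen)


def deep_som_selection(features, layer_sizes, seed=0):
    """features[j] holds the values of variable j across all observations."""
    kept = list(range(len(features)))
    layers = []
    for n_units in layer_sizes:
        n_units = min(n_units, len(kept))
        local = select_representatives([features[j] for j in kept], n_units, seed)
        kept = [kept[j] for j in local]   # map back to original feature indices
        layers.append(kept)
    return layers


# Toy data: 20 variables measured on 8 observations.
rng = random.Random(1)
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
layers = deep_som_selection(features, layer_sizes=[8, 4, 2])
for depth, kept in enumerate(layers, start=1):
    print(f"layer {depth}: kept features {kept}")
```

Because several units can share the same closest feature, a layer may keep fewer features than it has units; each layer is always a subset of the one below it.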

Supervised Deep Self-Organising Maps

The supervised deep SOM feature selection is based mostly on the forward-backward idea. Forward greedy feature selection algorithms greedily pick, at every step, a feature that significantly reduces a cost function. The idea is to progress aggressively at each iteration and to obtain a sparse model. The major problem of this heuristic is that once a feature has been added, it cannot be removed, i.e. the forward pass cannot correct mistakes made in earlier iterations. A solution to this problem is a backward pass, which trains a full, not a sparse, model and greedily removes the features with the smallest impact on the cost function. The backward algorithm on its own is computationally quite expensive, since it starts with a full model [43]. We propose a hierarchical feature selection scheme with SOM, which is outlined in Algorithm 2. The features in the backward step are drawn randomly.
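The generic forward-backward idea can be illustrated with a minimal sketch. It is not Algorithm 2 and does not involve SOM: the cost function, the thresholds `min_gain` and `max_loss`, and the toy "coverage" problem are all assumptions chosen for the example, and the backward step here is deterministic rather than random. It shows how a backward step can remove a feature that the forward step picked early but that later became redundant.

```python
def forward_backward(n_features, cost, min_gain=0.5, max_loss=0.0, max_iter=100):
    """Greedy forward selection with a corrective backward step.

    cost(frozenset_of_indices) returns the value to minimise.
    """
    selected = set()
    current = cost(frozenset(selected))
    for _ in range(max_iter):
        # Forward step: find the single feature whose addition lowers the cost most.
        best_j, best_cost = None, current
        for j in range(n_features):
            if j in selected:
                continue
            c = cost(frozenset(selected | {j}))
            if c < best_cost:
                best_j, best_cost = j, c
        if best_j is None or current - best_cost < min_gain:
            break  # no addition helps enough
        selected.add(best_j)
        current = best_cost
        # Backward step: drop features whose removal costs at most max_loss.
        while len(selected) > 1:
            j, c = min(
                ((j, cost(frozenset(selected - {j}))) for j in sorted(selected)),
                key=lambda pair: pair[1],
            )
            if c - current > max_loss:
                break
            selected.remove(j)
            current = c
    return sorted(selected), current


# Toy cost: number of elements left uncovered by the chosen features.
COVERS = {0: {"a", "b"}, 1: {"a", "c"}, 2: {"b", "d"}}
UNIVERSE = {"a", "b", "c", "d"}


def uncovered(subset):
    covered = set()
    for j in subset:
        covered |= COVERS[j]
    return len(UNIVERSE - covered)


chosen, final_cost = forward_backward(3, uncovered)
print(chosen, final_cost)
```

In this toy problem the forward pass adds feature 0 first; once features 1 and 2 are in the model, feature 0 no longer contributes and the backward step discards it.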

Signatures of Metabolic Health

The biomedical problem we address is a real-world binary classification of obese patients. The aim is to stratify patients in order to choose an effective, personalized medical treatment. The task is motivated by a recent French study [46] of gene-environment interactions carried out to understand the development of obesity.
It was reported that gut microbial gene richness can influence the outcome of a dietary intervention. A quantitative metagenomic analysis stratified patients into two groups: a group with a low gut flora gene count (LGC) and a group with a high gut flora gene count (HGC). The LGC individuals have higher insulin resistance and low-grade inflammation, and gene richness is therefore strongly associated with obesity-driven diseases. Individuals in the low gene count group appeared to have a higher obesity-related cardiometabolic risk than patients in the high gene count group. It was shown [46] that a particular diet can increase gene richness: an increase in gene count was observed in the LGC patients after a 6-week energy-restricted diet. A similar study of Dutch individuals [19] reached a similar conclusion: there is hope that a diet can be used to induce a permanent change in gut flora, and treatment should be phenotype-specific. There is therefore a need to go deeper into these biomedical results and to identify candidate biomarkers associated with cardiometabolic disease (CMD) risk factors and with different stages of CMD evolution.

Dataset description

The MicrObese corpus contains meta-data, adipose tissue gene expression data, and gut flora metagenomic data. For each patient, we know the class to which he or she belongs.
There are two classes, high gene count (HGC) and low gene count (LGC). Our problem is therefore a binary prediction task on heterogeneous data. In total, 49 patients were recruited and examined at the Pitié-Salpêtrière Hospital, Paris, France [46]. However, the adipose tissue gene data contain missing values, and the class (LGC or HGC) is not provided for every patient. We decided to impute the missing adipose tissue values with median values. Patients who were not clearly stratified into the LGC or HGC group were excluded from the analysis. In our experiments we therefore have access to 42 observations (patients). To reduce the substantial noise, and after discussion with pre-clinical researchers, we ran a significance test (Kruskal-Wallis) and kept those variables whose raw p-value (not adjusted for multiple hypothesis testing) is below 0.05.
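The preprocessing described above (median imputation followed by a Kruskal-Wallis filter at a raw p-value of 0.05) can be sketched in plain Python. This is a sketch under stated assumptions: the helper names and the toy table are hypothetical, missing values are represented as `None`, and the p-value uses the chi-square approximation with one degree of freedom, which is valid only for the two-group case (HGC vs. LGC).

```python
import math
from statistics import median


def impute_median(column):
    """Replace None entries by the median of the observed values."""
    observed = [v for v in column if v is not None]
    m = median(observed)
    return [m if v is None else v for v in column]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_two_groups(a, b):
    """Kruskal-Wallis H statistic and approximate p-value for two groups."""
    values = list(a) + list(b)
    n = len(values)
    ranks = average_ranks(values)
    ra, rb = sum(ranks[:len(a)]), sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (ra ** 2 / len(a) + rb ** 2 / len(b)) - 3 * (n + 1)
    # Tie correction factor.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:          # all values identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    # Chi-square survival function with 1 degree of freedom.
    p = math.erfc(math.sqrt(max(h, 0.0) / 2))
    return h, p


def select_features(table, labels, alpha=0.05):
    """Keep the variables whose raw p-value is below alpha."""
    kept = []
    for name, column in table.items():
        filled = impute_median(column)
        hgc = [v for v, y in zip(filled, labels) if y == "HGC"]
        lgc = [v for v, y in zip(filled, labels) if y == "LGC"]
        _, p = kruskal_two_groups(hgc, lgc)
        if p < alpha:
            kept.append(name)
    return kept


# Hypothetical toy table: one informative variable (with a missing value), one noisy.
labels = ["HGC"] * 5 + ["LGC"] * 5
table = {
    "signal": [1, 2, None, 4, 5, 6, 7, 8, 9, 10],
    "noise": [1, 3, 5, 7, 9, 2, 4, 6, 8, 10],
}
selected = select_features(table, labels)
print(selected)
```

Since the p-values are not adjusted for multiple testing, the filter is deliberately permissive; it only removes variables that show no marginal association with the class.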
Figure II.1 shows a hierarchical structure based on SOM. Each upper layer is constructed from the variables that are closest to the unit centers of the previous level. Here we also perform data integration. We carry out feature extraction for four data sources: metagenomic species, environmental data, host variables, and adipose tissue gene expression. We perform feature selection separately for each data source (three layers).
Then we integrate all selected variables in one analysis and obtain a mixed signature (also three layers). Because we would like a well-balanced signature, in which each data type is represented by some features, the SOMs at the lower levels of the hierarchy are constructed per data source: the number of parameters differs greatly between, e.g., the adipose tissue data and the block of environmental variables. Although Figure II.1 provides a schematic overview, the maps in the figure are exactly those obtained in our experiments. It is interesting to see that the lower levels, where the number of parameters is quite large, do not reveal specific structures in the data.
The highest levels, on the contrary, show well-organized clusters. Figure II.2 illustrates the quantization error associated with the hierarchies for each data source and for the mixed hierarchy. In all cases the quantization error decreases from level to level. Figure II.3A illustrates the separation of patients after feature selection, where 1 stands for high gene count patients and 2 for low gene count patients. Note that each cluster may contain several patients.
The framework of Figure II.1 can be applied to the whole MicrObese cohort, i.e. to both the HGC and the LGC data points (we use 10-fold cross-validation in all our classification experiments), but we can also split the data into HGC and LGC data sets and extract a signature for each group. These results, shown in Figure II.4A and B, are of great interest to clinicians and pre-clinical researchers, since the signatures allow them to better characterize the groups of patients. Figure II.4C shows the result of the prediction using the HGC and LGC groups.
The signature, therefore, characterizes the discrimination between the two classes. It is a well-reported fact that biological and medical signatures are rather unstable. See, for instance, [75], which compares more than thirty feature selection methods and shows that the stability of modern state-of-the-art approaches is rather low.


Abundance Bins for metagenomic synthetic images

To discretize abundances, assign them colors, and construct synthetic images, we use different binning (discretization) methods. On the artificial images, each bin is shown in a distinct color taken from the color strip of a heat-map colormap available in Python libraries, such as jet or viridis. The authors of [38] reported that viridis performed well in terms of time and error. The binning method used in this project is unsupervised, i.e. it does not use the target (class) information. In this part, we use EQual Width binning (EQW) over the range [Min, Max]. We test with k = 10 bins (for both color and gray images); for example, if Min = 0 and Max = 1, the width of each interval is w = 0.1.
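As a worked example of the arithmetic above, the following minimal sketch computes equal-width edges for k = 10 on [0, 1] and assigns a few abundances to bins. The function names and the convention that intervals are closed on the left (with the maximum value placed in the last bin) are assumptions; the source does not state how boundary values are handled.

```python
from bisect import bisect_right


def eqw_edges(lo, hi, k=10):
    """Return the k + 1 edges of k equal-width bins over [lo, hi]."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def bin_index(x, edges):
    """Index (0 .. k-1) of the bin containing x.

    Only the interior edges are searched, so values at or beyond the
    outer edges fall into the first or last bin.
    """
    return bisect_right(edges[1:-1], x)


edges = eqw_edges(0.0, 1.0, k=10)
bins = [bin_index(x, edges) for x in (0.0, 0.05, 0.25, 0.95, 1.0)]
print(bins)  # [0, 0, 2, 9, 9]
```

Each bin index can then be mapped to one of k colors or gray levels when drawing the synthetic image.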

Binning based on abundance distribution

In Figure III.2A, the left panel shows the histogram of the original data. The original data follow a zero-inflated distribution, which is typical for metagenomic data [186]. The right panel shows the distribution of the log-transformed data (logarithm base 4), which is closer to a normal distribution. On the logarithmic scale, each break has width 1, which corresponds to a 4-fold increase over the previous bin. As Figure III.2B shows, the histograms of the six datasets of group A on the logarithmic scale (base 4) share the same distribution. From these observations, we hypothesize that the models will perform better with breaks at the values 0, 10^−7, 4×10^−7, 1.6×10^−6, …, 0.0065536, 1. The first bin runs from 0 to 10^−7, which is the minimum species abundance observed in the six datasets of group A, and each subsequent break is four times the preceding one. We call this binning “SPecies Bins” (SPB) in our experiments.
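The SPB breaks can be generated programmatically. The sketch below is illustrative: the function names are hypothetical, and the choice of left-closed intervals is an assumption. Starting at 10^−7 and multiplying by 4 eight times gives 10^−7 × 4^8 = 0.0065536, so together with the outer breaks 0 and 1 there are 11 breaks, i.e. 10 bins.

```python
from bisect import bisect_right


def spb_breaks(min_abundance=1e-7, base=4, n_log_breaks=9):
    """Breaks 0, m, 4m, 16m, ..., m * 4**8, 1 for m = min_abundance."""
    return [0.0] + [min_abundance * base ** i for i in range(n_log_breaks)] + [1.0]


def spb_bin(x, breaks):
    """Index (0 .. 9) of the bin containing abundance x."""
    return bisect_right(breaks[1:-1], x)


breaks = spb_breaks()
for x in (0.0, 2e-7, 0.005, 0.5):
    print(x, "-> bin", spb_bin(x, breaks))
```

Unlike equal-width bins on [0, 1], these breaks spread the many small abundances across several bins instead of collapsing them into the first one.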
To evaluate the efficiency of SPB, we compare the performance of SPB and EQW in Table III.2. EQW bins are computed over the range [Min, Max]. For each cross-validation fold, EQW takes the minimum and maximum of the original abundances in the training set and divides this range into k = 10 bins of equal width. The bins are determined on the training set and then applied to the test set. We observed that SPB outperforms EQW on most of the datasets. Notably, EQW performs considerably better with the CNN than with the FC model (see the models in IV.4.3). EQW with the CNN performs slightly better than SPB on the OBE dataset.
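The per-fold procedure (fit the bins on the training set, then apply them unchanged to the test set) can be sketched as follows. The function names and toy values are hypothetical, and placing test values that fall outside the training range into the first or last bin is an assumption, since the source does not say how such values are treated.

```python
from bisect import bisect_right


def fit_eqw(train_values, k=10):
    """Learn k equal-width bin edges from the training fold only."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def apply_bins(values, edges):
    """Assign each value to a bin using edges learned on the training fold."""
    inner = edges[1:-1]
    return [bisect_right(inner, v) for v in values]


train = [0.0, 0.02, 0.1, 0.4, 0.8]   # training-fold abundances (toy values)
test = [0.01, 0.5, 0.95]             # 0.95 lies above the training maximum

edges = fit_eqw(train)               # Min = 0.0, Max = 0.8, width = 0.08
test_bins = apply_bins(test, edges)
print(test_bins)  # [0, 6, 9]
```

Fitting the edges on the training fold alone keeps information from the test fold out of the preprocessing step.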

AlexNet, ImageNet Classification with Deep Convolutional Neural Networks

Krizhevsky et al. [41] presented a large, deep CNN (60 million parameters) trained on a huge dataset of 1.2 million color images of size 224×224 from ImageNet LSVRC-2010. The architecture consisted of 5 convolutional layers and 3 fully connected layers. They used several novel or unusual features in their architecture:
• ReLU Nonlinearity: The authors stated that the non-saturating nonlinearity f(x) = max(0, x) (referred to as Rectified Linear Units (ReLUs) [86]) makes learning faster than saturating nonlinearities such as Tanh, f(x) = tanh(x), or Sigmoid, f(x) = 1/(1 + e^(−x)). An experiment comparing the training time of ReLU and Tanh showed that the network with ReLUs was six times faster than the network with Tanh.
• Using multiple GPUs: Because of the limited GPU memory (only 3 GB on a single GTX 580 GPU) and the size of the network needed for such a large training set (1.2 million images of 224×224), a single GPU could not hold the network. Hence, the authors distributed the network across 2 GPUs (each GPU held half of the kernels/neurons, and the GPUs communicated only in certain layers).
• Local Response Normalization: The authors observed that a local normalization scheme improved generalization.
• Overlapping Pooling: Traditional pooling in common CNNs uses a z × z window with step = z. In the proposed architecture, the authors used 3×3 pooling with step = 2, which reduced the error rates by 0.4%.

In the overall architecture (Figure IV.3), the first convolutional layer contained 96 filters of size 3×11×11 with a stride of 4 pixels. The second convolutional layer included 256 kernels of size 48×5×5 (on each GPU). After max pooling, the output feature maps were 27×27. The third, fourth, and fifth convolutional layers all used 3×3 kernels, with 384, 384, and 256 filters respectively. Each fully connected layer had 4096 neurons. The softmax in the last layer classified 1000 class labels. To reduce overfitting in such a large convolutional neural network (60 million parameters), the authors applied two primary approaches: Data Augmentation and Dropout [28, 55]. Dropout with probability 0.5 was applied to two fully connected layers. With dropout, roughly twice as many iterations were needed to converge.
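Two of the points above lend themselves to a short numerical illustration: why saturating nonlinearities slow learning, and how the pooling output size is obtained. The sketch below is illustrative only. The helper names are hypothetical, and the 55×55 input to the first pooling layer is an assumption taken from the commonly cited AlexNet layout rather than from the text above. The pooling formula used is out = floor((n − z) / s) + 1.

```python
import math


def relu_grad(x):
    """Derivative of max(0, x): constant 1 for positive inputs."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x): vanishes for large |x| (saturation)."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    """Derivative of 1 / (1 + e^(-x)): also vanishes for large |x|."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def pool_output_size(n, z, s):
    """Side length after z x z pooling with step s on an n x n map (no padding)."""
    return (n - z) // s + 1


# Gradients at a large pre-activation: only ReLU still passes a full gradient.
for name, grad in (("relu", relu_grad), ("tanh", tanh_grad), ("sigmoid", sigmoid_grad)):
    print(name, grad(10.0))

# Overlapping pooling (z = 3, s = 2) on an assumed 55 x 55 feature map.
print(pool_output_size(55, z=3, s=2))
```

With z = 3 and s = 2, the assumed 55×55 map shrinks to 27×27, which matches the feature-map size quoted above.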

Table of contents:

Acknowledgements
Abstract
Résumé
I Introduction
I.1 Motivation
I.2 Brief Overview of Results
I.2.1 Chapter II: Heterogeneous Biomedical Signatures Extraction based on Self-Organising Maps
I.2.2 Chapter III: Visualization approaches for metagenomics
I.2.3 Chapter IV: Deep learning for metagenomics using embeddings
II Feature Selection for heterogeneous data
II.1 Introduction
II.2 Related work
II.3 Deep linear support vector machines
II.4 Self-Organising Maps for feature selection
II.4.1 Unsupervised Deep Self-Organising Maps
II.4.2 Supervised Deep Self-Organising Maps
II.5 Experiment
II.5.1 Signatures of Metabolic Health
II.5.2 Dataset description
II.5.3 Comparison with State-of-the-art Methods
II.6 Closing and remarks
III Visualization Approaches for metagenomics
III.1 Introduction
III.2 Dimensionality reduction algorithms
III.3 Metagenomic data benchmarks
III.4 Met2Img approach
III.4.1 Abundance Bins for metagenomic synthetic images
III.4.1.1 Binning based on abundance distribution
III.4.1.2 Binning based on Quantile Transformation (QTF)
III.4.1.3 Binary Bins
III.4.2 Generation of artificial metagenomic images: Fill-up and Manifold learning algorithms
III.4.2.1 Fill-up
III.4.2.2 Visualization based on dimensionality reduction algorithms
III.4.3 Colormaps for images
III.5 Closing remarks
IV Deep Learning for Metagenomics
IV.1 Introduction
IV.2 Related work
IV.2.1 Machine learning for Metagenomics
IV.2.2 Convolutional Neural Networks
IV.2.2.1 AlexNet, ImageNet Classification with Deep Convolutional Neural Networks
IV.2.2.2 ZFNet, Visualizing and Understanding Convolutional Networks
IV.2.2.3 Inception Architecture
IV.2.2.4 GoogLeNet, Going Deeper with Convolutions
IV.2.2.5 VGGNet, very deep convolutional networks for large-scale image recognition
IV.2.2.6 ResNet, Deep Residual Learning for Image Recognition
IV.3 Metagenomic data benchmarks
IV.4 CNN architectures and models used in the experiments
IV.4.1 Convolutional Neural Networks
IV.4.2 One-dimensional case
IV.4.3 Two-dimensional case
IV.4.4 Experimental Setup
IV.5 Results
IV.5.1 Comparing to the-state-of-the-art (MetAML)
IV.5.1.1 Execution time
IV.5.1.2 The results on 1D data
IV.5.1.3 The results on 2D data
IV.5.1.4 The explanations from LIME and Grad-CAM
IV.5.2 Comparing to shallow learning algorithms
IV.5.3 Applying Met2Img on Sokol’s lab data
IV.5.4 Applying Met2Img on selbal’s datasets
IV.5.5 The results with gene-families abundance
IV.5.5.1 Applying dimensionality reduction algorithms
IV.5.5.2 Comparing to standard machine learning methods
IV.6 Closing remarks
V Conclusion and Perspectives
V.1 Conclusion
V.2 Future Research Directions
Appendices
A The contributions of the thesis
B Taxonomies used in the example illustrated by Figure III.7
C Some other results on datasets in group A
Bibliography