Deep linear support vector machines
The idea to do feature selection using SOM is not new.  introduced a simple greedy heuristic iterative approach for feature selection which includes 4 steps: 1) learn a SOM and label map; 2) if the classes overlap, then add a new feature or replace a feature; 3) if a feature does not improve the separability of the groups, eliminate or replace this feature; 4) retrain and re-label the map. We also propose a feature selection algorithm based on a clustering, and, namely, a SOM. Note, however, that  clusters observations and greed-ily looks for features ameliorating the separation of classes. We, on the contrary, cluster features, and look for best representatives in each feature cluster. Clustering of features has been already considered by [73, 68]. The principal interest was to build classifiers in a semi-supervised manner and to help analysts in their choice of features or attributes. An-other motivation of  was to illuminate relationships between attributes, their relative importance for classification, and to better understand structure in data. Another clus-tering of features was done in .  has introduced an algorithm called FAST which consists of two simple steps: 1) features are clustered (by using graph-theoretic clustering methods); 2) the most representative features somehow strongly related to classes are selected from each cluster to form a subset of new features. This approach is close to our idea. However, we do not estimate any relations to classes while choosing best rep-resentatives from clusters. In this study, we use SOM clustering, however, it is possible to investigate the clustering with medoids for the same purpose. Partitioning around medoids (PAM) is introduced and described in details by ,  . This is another quite eﬃcient and robust clustering, which can be used for a hierarchy construction . In an already classical deep architecture, in convolutional nets, the non-linearities are sig-moid functions, and new representations are learned in supervised mode using gradient descent. The advantages of the convolutional nets and SVM are combined in . Deep structures learn complex mappings by transforming their inputs through multiple layers of nonlinear transformations . There are several motivations to learn such an architecture. It is, first of all, a way to combine supervised and unsupervised learning. Second, there is a number of functions that can be used to compose weakly non-linear transformations.  introduced a multilayer kernel machines, which are based on an iterative procedure, repeated for each layer of a structure: 1) compute principal components in the feature space induced by a nonlinear kernel, and 2) prune components that are less informative from the feature space. Our approach, in its unsupervised mode, is a convolutional net. An interesting parallel between  and us, apart from using SVM, is that SOM is a nonlinear generalization of the PCA. Another avenue of research is controlling structure in data by penalty terms such as lasso-like methods. So,  proposed recently to add some convex constraints to the logistic regression penalized by the L1 norm to produce a sparse model which involves a hierarchy restriction on feature interactions: an interaction is included if one or both features are marginally important. The disadvantage of the method is that the number of features in this approach is very big even for moderate-size applications, since the approach tests all interactions pairwise.
To learn a hierarchical model, a training algorithm has access to n i.i.d. data points. We can either have labeled pairs (Xi; Yi)1≤i≤n , or an unlabeled data set (Xi)1≤i≤n . The Version Tuesday 9th October, 2018, 15:27 input variable or covariate is X ∈ χ , and the class variable is Y ∈ Y , if the problem is supervised. The covariate variables are high-dimensional, and Xi = (Xi,1, …Xi,d) , where d is the dimensionality of the problem. We are interested, in particular, to reduce the number of features in the model, so that the dimensionality of our problem becomes r d , and so that we can carry out a classification task on a much more compact, and probably less noisy, feature space. A deep structure can be learned with an SVM. A version of deep linear SVM which we exploit in our framework, has been introduced by  . The function in the linear case takes the following form (Eq. II.1): L(w) = min 1wT w + C X max(1 − wT xiyi, 0)
The logistic function can be used as an activation function of a deep learning structure. In such a deep supervised learning based on the linear SVM, we have two alternating steps. The forward step: apply logistic regression function to provide probabilistic interpretation and to activate units (weights are fixed). The backward step: compute the gradient presented as eq. (II.3), and update the weights. The gradient is one of the linear SVM, it is convex and diﬀerentiable, and we can apply any standard gradient descent method.
Self-Organising Maps for feature selection
The Self-Organising Map (SOM) is an artificial network associated with the unsupervised learning paradigm . It is famous for its eﬃcient manner to map from a high dimen-sional input space into a more compact space, usually to a two-dimensional output space. The two-dimensional representation is practical for a visualization, since the mapping preserves topological relations between elements on the grid. Moreover, the continuous input space can be mapped into a discrete output space. The SOM belongs to competitive learning methods, since neurons compete to be activated, and, as a result, only one is activated at a time. The winning neuron is called the winner. When the winner is set, all the other neurons have to re-organize themselves. Interestingly, the SOM can be seen as a non-linear generalization of principal component analysis. Given high-dimensional data x ∈ Rd , the connection weights between observations i and the neurons of the grid j can be presented as wj = wij : j = 1, …, K; i = 1; …, n , where K is the number of neurons on the grid. A discriminate function (Eq. II.5)which is widely used, and which we also use Version Tuesday 9th October, 2018, 15:27 in our experiments, is the squared Euclidean distance between an observation x and the weight vector wj, for all j Dj(x) = (xi − wji)2 i=1
The structure learning can be either supervised, or unsupervised. This algorithm includes four steps detailed in Algorithm 1.
Unsupervised Deep Self-Organising Maps
In an unsupervised setting, the feature selection procedure is completely unsupervised, and the algorithm performs only the first step, a forward pass. In this forward pass, we construct a deep structure layer-wise, where each layer consists of the clusters rep-resentatives from the previous level. A natural question which arises is whether such an unsupervised feature selection can be beneficial for a prediction task. Although it is currently impossible to provide a theoretical foundation for it, there is an intuition why a deep unsupervised feature selection is expected to perform and performs better in prac-tice. Real data are always noisy, and a “good” clustering or dimensionality reduction can significantly reduce the noise. If features are tied into clusters of “high quality”, then it is easier to detect a signal from data, and the generalizing classification performance is higher. The hierarchical feature selection plays here a role of a filter, and a filter with multiple layers seems to perform better than a one-layer filter.
Supervised Deep Self-Organising Maps
The supervised deep SOM feature selection is based mostly on the forward-backward idea. Forward greedy feature selection algorithms are based on a greedily picking a feature at every step to significantly reduce a cost function. The idea is to progress aggressively at each iteration, and to get a model which is sparse. The major problem of this heuristic is that once a feature has been added, it cannot be removed, i.e. the forward pass can not correct mistakes done in earlier iterations. A solution to this problem would be a backward pass, which trains a full, not a sparse, model, and removes greedily features with the smallest impact on a cost function. The backward algorithm on its own is computationally quite expensive, since it starts with a full model . We propose a hierarchical feature selection scheme with SOM which is drafted as Algorithm 2. The features in the backward step are drawn randomly.
In this section, we describe our experiments and results on a real rich, and original biomedical data set. To construct the SOMs, we use somtoolbox1 from Matlab. We also use SOM graphics from , .
Signatures of Metabolic Health
The biomedical problem of our interest is a real problem which is a binary classification of obese patients. The aim is to stratify patients in order to choose an eﬃcient appro-priate personalized medical treatment. The task is motivated by a recent French study  of gene-environment interactions carried out to understand the development of obe-sity. It was reported that the gut microbial gene richness can influence the outcome of a dietary intervention. A quantitative metagenomic analysis stratified patients into two groups: group with low gene gut flora count (LGC) and high gene gut flora count (HGC) group. The LGC individuals have a higher insulin resistance and low-grade inflammation, and therefore the gene richness is strongly associated with obesity-driven diseases. The individuals from a low gene count group seemed to have an increased risk to develop obesity-related cardiometabolic risk compared to the patients from the high gene count group. It was shown  that a particular diet is able to increase the gene richness: an increase of genes was observed with the LGC patients after a 6-weeks energy-restricted diet.  conducted a similar study with Dutch individuals, and made a similar con-clusion: there is a hope that a diet can be used to induce a permanent change of gut flora, and that treatment should be phenotype-specific. There is therefore a need to go deeper into these biomedical results and to identify candidate biomarkers associated with cardiometabolic disease (CMD) risk factors and with diﬀerent stages of CMD evolution.
The MicrObese corpus contains meta-data, genes of adipose tissue, and gut flora metage-nomic data. For each patient, we have the information to which class he or she belongs. There are two classes, high gene count (HGC) and low gene count (LGC) classes. There-fore, our problem is a binary prediction task from heterogeneous data. In general, 49 patients have been hired and examined at the Pitié-Salpêtrière Hospital hospital, Paris, France , but as to the genes of the adipose tissue, we faced the problem of miss-ing data, and not for all patients their class, LGC or HGC is provided. We decided to impute missing data by median values for the adipose tissue data. The patients who were not clearly stratified into the LGC or HGC group, were excluded from the analysis. Therefore, in our experiments we have access to 42 observations (patients). To get rid of important noise, after the discussion with pre-clinical researchers, we run a significance test (Kruskal-Wallis), and we keep those variables for which the raw (not adjusted for the multiple hypothesis testing) p-values < 0.05.
Figure II.1 is a hierarchical structure based on SOM. Each upper layer is constructed from variables which are the closest ones to the unit centers of the previous level. Here we also perform data integration. We carry out feature extraction for four data sources
– metagenomic species, environmental data, host variables, and genes expressions for adipose tissue. We do feature selection separately for each data source (three layers). Then we integrate all selected variables in one analysis and obtain a mixed signature (also three layers). Taking into consideration that we would like to get a well-balanced signature, where each data type is presented by some features, the SOM of the lower levels of the hierarchy are constructed per data source, since the number of parameters are extremely diﬀerent in, e.g., adipose tissue data and in the block of environmental variables. Although Figure II.1 provides a schematic overview, the maps on the figure are exactly what we get in our experiments. It is interesting to see that lower levels where the number of parameters is quite big, do not reveal specific structures in data. The highest levels, on the contrary, show well-organized clusters. Figure II.2 illustrates the quantization error associated with hierarchies on each data sources and on the mixed hierarchy. It is easy to see that in all cases the quantization error diminishes. Figure II.3A illustrates the patients separation after the feature selection, where 1 stands for high gene count patients, and 2 for the low gene count ones. Note that each cluster may contain several patients.
The framework of Figure II.1 can be applied to the whole MicroObese cohort, both to the HGC and to the LGC data points (we do the 10-folds cross validation in all our classification experiments), but we can also split the data into the HGC and LGC data sets, and extract signatures for each group. These results that can be found on Figure II.4A and B are very interesting for clinicians and researchers doing pre-clinical research, since these signatures allow them to better characterize the groups of patients.
Figure II.4C shows the result of the prediction using the HGC and LGC groups. The signature, therefore, characterizes the discrimination between two classes. It is a well-reported fact that biological and medical signatures are rather unstable. See, for instance, , where a comparison of more than thirty feature selection methods has been made, and where it has been shown that the stability of modern state-of-the-art approaches is rather low.
Another avenue to analyze signatures, is to construct Bayesian networks and to study the relations between the variables. We carry out feature selection with the deep SOM, and we run a Bayesian network on the selected variables. Figure II.5 reveals the signature relations of the high gene count group and the low gene count group with the Bayesian network. The highest level of the deep SOM structure and the Bayesian networks provide complementary results. If we compare the relations for the HGC group, (Figure II.4A and II.5A), we will see that the SOM clusters and the Bayesian networks quite often provide similar results, however, in some cases they reveal diﬀerent relations between variables of interest. It is interesting, that the number of selected features for the LGC is bigger a better health. (B): Signature of the low gene count group which is associated with higher inflammation. (C): A signature which discriminates high gene count and low gene count groups. than for the HGC. Analyzing Figure Figure II.4B and II.5B, we do the same conclusion: the SOM and the network reveal the same structure in data, with several interesting exceptions. The biomedical analysis of the results is out of scope of this study, and will be done by fundamental biologists and clinicians.
Comparison with State-of-the-art Methods
In this section we show that the proposed feature selection method is eﬃcient compared to the state-of-the art methods such as lasso and elastic net. In our experiments we use the glmnet R package . Figure II.6 shows the 10- folds cross validation error rate on the MicroObese data as a function of the number of non-zero features in the model. We have done unsupervised feature pre-selection with the deep SOM, the structure drafted on Figure II.1. Then we apply the lasso to the pre-selected set of features, and compare the result with the lasso applied to the whole set of parameters (more than 2000 features).
Figure II.6A on the left demonstrates the performance for the lasso applied to all fea-tures, without the unsupervised pre-selection step. On the right of Figure II.6A, we show the performance of our framework. Note that since the pre-selection step is unsupervised, we do not do any overfitting. The accuracy of both methods is comparable, and taking into consideration that the prediction task is quite challenging, the error rate around 0.3
– 0.33 is acceptable. However, the proposed framework applies lasso to a reduced data set, with a hundred of parameters, and not with thousands as the initial set.
Figure II.6B illustrates our result (with the elastic net) which is similar to the one of Figure II.6A. It shows the performance as function of the number of selected features. On the left, we show the error rate for the elastic net (α = 0.5 in the glmnet R package) on all features, and on the right of II.6B, for the proposed approach. The elastic net achieves a higher accuracy than the lasso. As in the previous lasso experiment, we run the elastic net on about 100 variables for our approach, instead of thousands (the result on the left), and we see that the best model is one with about 30 parameters chosen among 100 from the features pre-selected in unsupervised manner. Note that the best model learned from all parameters (on the left of Figure II.6B), but with the comparable error rate, has more, namely, 43 features.
Closing and remarks
Data integration is a challenge, especially in applications where data are high-dimensional, e.g., metagenomics and the metagenomic species, and where the number of observations (patients) is small. We have proposed to reduce dimensionality by a deep approach which is based on SOM, and which learns new compact data layer-wise, in a hierarchical way. We have considered supervised and unsupervised feature selection frameworks, as well as we considered a real data integration challenge. We show that the considered deep SOM approach is eﬃcient on a real medical complex data set, and it is beneficial to combine it with the lasso and the elastic net approaches. The unsupervised feature selection diminishes the computational burden of the standard methods and also leads to the state-of-the art performance. Although the detailed biomedical discussion of the features clustering and of the quality of the obtained signatures is out of scope of this study, and is to be done by biologists doing pre-clinical research, we expect that our framework can help to better stratify patients, develop methods of personalized medicine, and improve diagnosis and prognosis.
Table of contents :
I.2 Brief Overview of Results
I.2.1 Chapter II: Heterogeneous Biomedical Signatures Extraction based on Self-Organising Maps
I.2.2 Chapter III: Visualization approaches for metagenomics
I.2.3 Chapter IV: Deep learning for metagenomics using embeddings
II Feature Selection for heterogeneous data
II.2 Related work
II.3 Deep linear support vector machines
II.4 Self-Organising Maps for feature selection
II.4.1 Unsupervised Deep Self-Organising Maps
II.4.2 Supervised Deep Self-Organising Maps
II.5.1 Signatures of Metabolic Health
II.5.2 Dataset description
II.5.3 Comparison with State-of-the-art Methods
II.6 Closing and remarks
III Visualization Approaches for metagenomics
III.2 Dimensionality reduction algorithms
III.3 Metagenomic data benchmarks
III.4 Met2Img approach
III.4.1 Abundance Bins for metagenomic synthetic images
III.4.1.1 Binning based on abundance distribution
III.4.1.2 Binning based on Quantile Transformation (QTF)
III.4.1.3 Binary Bins
III.4.2 Generation of artificial metagenomic images: Fill-up and Manifold learning algorithms
III.4.2.2 Visualization based on dimensionality reduction algorithms
III.4.3 Colormaps for images
III.5 Closing remarks
IV Deep Learning for Metagenomics
IV.2 Related work
IV.2.1 Machine learning for Metagenomics
IV.2.2 Convolutional Neural Networks
IV.2.2.1 AlexNet, ImageNet Classification with Deep Convolutional Neural Networks
IV.2.2.2 ZFNet, Visualizing and Understanding Convolutional Networks
IV.2.2.3 Inception Architecture
IV.2.2.4 GoogLeNet, Going Deeper with Convolutions
IV.2.2.5 VGGNet, very deep convolutional networks for large-scale image recognition
IV.2.2.6 ResNet, Deep Residual Learning for Image Recognition
IV.3 Metagenomic data benchmarks
IV.4 CNN architectures and models used in the experiments
IV.4.1 Convolutional Neural Networks
IV.4.2 One-dimensional case
IV.4.3 Two-dimensional case
IV.4.4 Experimental Setup
IV.5.1 Comparing to the-state-of-the-art (MetAML)
IV.5.1.1 Execution time
IV.5.1.2 The results on 1D data
IV.5.1.3 The results on 2D data
IV.5.1.4 The explanations from LIME and Grad-CAM
IV.5.2 Comparing to shallow learning algorithms
IV.5.3 Applying Met2Img on Sokol’s lab data
IV.5.4 Applying Met2Img on selbal’s datasets
IV.5.5 The results with gene-families abundance
IV.5.5.1 Applying dimensionality reduction algorithms
IV.5.5.2 Comparing to standard machine learning methods
IV.6 Closing remarks
V Conclusion and Perspectives
V.2 Future Research Directions