Cancer Genomics

Overview of NetNorM

NetNorM takes as input an undirected gene network and raw exome somatic mutation profiles, and outputs a new representation of the mutation profiles that allows better survival prediction and patient stratification from mutations (Fig. 2.1). Here and in what follows, the "raw" mutation profiles refer to the binary patients-by-genes matrix where 1s indicate non-silent somatic point mutations or indels in a patient-gene pair and 0s indicate the absence of such mutations. The new representation of mutation profiles computed with NetNorM also takes the form of a binary patients-by-genes mutation matrix, yet with new properties. While different tumours usually harbour different numbers of mutations, with NetNorM all patient mutation profiles are normalised to the same number k of genes marked as mutated. The final number of mutations k is the only parameter of NetNorM; it can be set by various heuristics, such as the median number of mutations in the original profiles, or optimised by cross-validation for a given task such as survival prediction. In order to represent each tumour by k mutations, NetNorM adds "missing" mutations to samples with fewer than k mutations, and removes "non-essential" mutations from samples with more than k mutations. The "missing" mutations added to a sample with few mutations are the non-mutated genes with the largest number of mutated neighbours in the gene network, while the "non-essential" mutations removed from samples with many mutations are the ones with the smallest degree in the gene network. These choices rely on two simple ideas: on the one hand, genes with many mutated interacting neighbours might be unable to fulfil their functions; on the other hand, mutations in genes with a small number of interacting neighbours might have a minor impact compared to mutations in more connected genes.
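The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `netnorm`, the toy network, the alphabetical tie-breaking rule and the choice to count mutated neighbours once on the original profile are all assumptions made for illustration.

```python
from typing import Dict, Set

Network = Dict[str, Set[str]]  # gene -> set of neighbouring genes (undirected)


def netnorm(mutated: Set[str], network: Network, k: int) -> Set[str]:
    """Return k genes marked as mutated for one patient (illustrative sketch)."""
    # Profiles are restricted to the genes present in the network.
    mutated = {g for g in mutated if g in network}
    if len(mutated) > k:
        # Too many mutations: drop the lowest-degree genes, i.e. keep the
        # k highest-degree ones. Ties are broken alphabetically (assumption).
        ranked = sorted(mutated, key=lambda g: (-len(network[g]), g))
        return set(ranked[:k])
    # Too few mutations: add the non-mutated genes with the most mutated
    # neighbours, counted on the original profile (assumption).
    candidates = [g for g in network if g not in mutated]
    ranked = sorted(candidates, key=lambda g: (-len(network[g] & mutated), g))
    return mutated | set(ranked[: k - len(mutated)])


# Hypothetical toy network with edges A-B, A-C, A-D, B-C and D-E.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

few = netnorm({"B", "C"}, network, k=3)             # A has two mutated neighbours
many = netnorm({"A", "B", "D", "E"}, network, k=3)  # E has the smallest degree
```

In this toy example the patient with two mutations gains a proxy mutation in gene A, and the patient with four mutations loses the mutation in gene E, so both profiles end up with exactly k = 3 genes marked as mutated.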
In this study, we compare NetNorM-processed profiles with the raw mutation data and with profiles processed with network smoothing (NS) [Zhou et al., 2004] (also called network diffusion, or network propagation) followed by quantile normalisation (QN) as implemented in [Hofree et al., 2013]. We refer to this method as NSQN below. Mutation profiles, either raw or processed with NetNorM or NSQN, are restricted to the genes present in the network used. While both NetNorM and NSQN leverage gene network prior knowledge to enhance mutation data, the two methods have fundamental differences. First, NetNorM leverages information about first neighbours in the network only, while NSQN spreads mutation information at a more global scale on the gene network. Second, with NetNorM the normalised profiles all have the same value distribution by construction, since they are all binary vectors with k ones, removing the need for further quantile normalisation which, as we discuss below, is critical for NSQN.
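To make the contrast with NSQN concrete, the sketch below shows one possible form of network smoothing followed by quantile normalisation. It is only an illustration under stated assumptions: the update rule (a degree-normalised propagation mixed with the initial profile), the values of `alpha` and `n_iter`, the tie-breaking rule and the toy data are hypothetical, and the sketch does not reproduce the implementation of [Hofree et al., 2013].

```python
from typing import Dict, List, Set

Network = Dict[str, Set[str]]
Profile = Dict[str, float]


def smooth(profile: Profile, network: Network,
           alpha: float = 0.5, n_iter: int = 50) -> Profile:
    """Spread mutation scores over the network by iterated propagation."""
    f0 = {g: float(profile.get(g, 0.0)) for g in network}
    f = dict(f0)
    for _ in range(n_iter):
        # Each gene receives a degree-normalised share of its neighbours'
        # scores, mixed with its own initial score (assumed update rule).
        f = {g: alpha * sum(f[n] / len(network[n]) for n in network[g])
                + (1 - alpha) * f0[g]
             for g in network}
    return f


def quantile_normalise(profiles: List[Profile]) -> List[Profile]:
    """Give every profile the same value distribution (mean of sorted values)."""
    genes = sorted(profiles[0])
    sorted_values = [sorted(p[g] for g in genes) for p in profiles]
    reference = [sum(col) / len(col) for col in zip(*sorted_values)]
    normalised = []
    for p in profiles:
        # Ties are broken by gene name, an arbitrary choice for this sketch.
        order = sorted(genes, key=lambda g: (p[g], g))
        normalised.append({g: reference[r] for r, g in enumerate(order)})
    return normalised


# Hypothetical toy network (edges A-B, A-C, C-D) and two raw profiles.
network = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
raw = [{"B": 1.0}, {"A": 1.0, "D": 1.0}]
smoothed = [smooth(p, network) for p in raw]
nsqn = quantile_normalise(smoothed)
```

After smoothing, a mutation in gene B gives a non-zero score to gene D, three edges away, which illustrates the more global spread of information. After quantile normalisation both profiles contain the same set of values, a property that NetNorM profiles have by construction.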

Survival prediction

NetNorM provides state-of-the-art prognosis for patient survival based on mutation profiles

To assess the relevance of NetNorM, we first explore the capacity of somatic mutations to predict patient survival. We collected a total of 3,278 full-exome mutation profiles of 8 cancer types from the TCGA portal (Table 2.1), together with censored survival information and clinical data. In parallel we retrieved a gene network to be used as background information for NSQN and NetNorM: Pathway Commons, which integrates a number of pathway and molecular interaction databases [Cerami et al., 2010]. For each cancer type, we use these data to assess how well survival can be predicted from somatic mutations. For that purpose, we perform survival prediction with a sparse survival SVM (see Methods) using either the raw mutation profiles or the profiles processed with NSQN or NetNorM, and assess their performance by cross-validation using the concordance index (CI) on the test sets as performance metric.
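As a reminder of what the concordance index measures, the following sketch computes a common form of it for right-censored data. The function name, the handling of tied predictions (counted as one half) and the exclusion of pairs with tied survival times are assumptions; the exact definition used in this study is the one given in Methods.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[bool],
                      risk: Sequence[float]) -> float:
    """Fraction of comparable patient pairs ordered correctly by the risk score.

    A pair (i, j) is comparable when patient i had an observed event strictly
    before the last follow-up time of patient j. It is concordant when the
    model gives patient i the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the earlier one of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as one half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: the third patient is censored at time 6.
times = [2.0, 4.0, 6.0, 8.0]
events = [True, True, False, True]

perfect = concordance_index(times, events, risk=[4, 3, 2, 1])
reversed_order = concordance_index(times, events, risk=[1, 2, 3, 4])
constant = concordance_index(times, events, risk=[0, 0, 0, 0])
```

A perfectly ordered risk score gives a CI of 1, a perfectly reversed one gives 0, and an uninformative constant score gives 0.5, which is why a CI close to 0.5 is read as no better than a random prediction.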
Figure 2.2 summarises the survival prediction performances for the 8 cancer types, when the sparse survival SVM is fed with the raw mutation profiles, or with the mutation profiles modified by NSQN or NetNorM using Pathway Commons as gene network. For two cancers (LUSC, HNSC), none of the methods manages to outperform a random prediction, questioning the relevance of the mutation information in this context. For OV, BRCA, KIRC and GBM, all three methods are significantly better than random, although the estimated CI remains below 0.56, and we again observe no significant difference between the raw data and the data transformed by NSQN or NetNorM. Finally, the last two cases, SKCM and LUAD, are the only ones for which we reach a median CI above 0.6. In both cases, processing the mutation data with NetNorM significantly improves performances compared to using the raw data or profiles processed with NSQN. More precisely, for LUAD the median CI increases from 0.56 for the raw data and 0.53 for NSQN to 0.62 for NetNorM. In the case of SKCM, the median CI increases from 0.48 for the raw data to 0.52 for NSQN, and to 0.61 for NetNorM. For SKCM, both NetNorM and NSQN are significantly better than the raw data (P < 0.01).
In our experiments, silent mutations are systematically filtered out. To evaluate whether this preprocessing step is actually detrimental or beneficial for the survival prediction task, we performed further experiments where silent mutations are not filtered out (Fig. A.1). We find that considering silent mutations does not improve survival prediction performances compared to the case where they are filtered out. In fact, the performance of NetNorM on LUAD is significantly decreased when silent mutations are taken into account.
To assess the influence of the gene network used on the survival prediction performances, we also repeated our experiments with four other gene networks instead of Pathway Commons: BioGRID [Chatr-aryamontri et al., 2016], HPRD [Prasad et al., 2009], HumanNet [Lee et al., 2011] and STRING [Szklarczyk et al., 2015] (Fig. A.2). For HumanNet and STRING, only the 10% most confident interactions were retained. We observe that no gene network clearly stands out as the best network for all cancers. For two cancers, LUSC and HNSC, performances remain very low, close to a concordance index of 0.5, whatever the method or network used. For three cancers, OV, BRCA and KIRC, NetNorM is the only method to significantly outperform the raw data with at least one network (HumanNet and STRING for OV, HPRD for BRCA, and STRING for KIRC) with a median concordance index above 0.55. For GBM, NSQN is the only method to outperform the raw data (with HumanNet and STRING) with a median concordance index above 0.55. For the two remaining cancers, LUAD and SKCM, the best performances are those obtained with NetNorM using Pathway Commons, with median CI of 0.62 and 0.61 respectively. Across all combinations of cancer, method and network, these two cases are the only ones where the median CI obtained exceeds 0.60.
Finally, as mutations in some genes are known to be associated with survival, such as TP53 mutations in BRCA and HNSC, which are associated with worsened survival [Robles and Harris, 2010], we evaluate the predictive ability of individual genes' mutation status. For each cross-validation fold, the gene giving the best concordance index on the training set is selected and its performance evaluated on the test set. We find that for 5 cancers, the performances of individual genes are similar to those of the survival SVM applied to the whole raw mutation datasets (Fig. A.3). However, for BRCA and HNSC, better survival predictions are obtained using a single gene than using the whole raw mutation profiles. Yet these predictions are not better than those obtained with NetNorM. For these two cases, TP53 is the gene selected in the majority of folds (17/20 for HNSC and 19/20 for BRCA), which is in accordance with existing literature (Table A.1). Lastly, the survival SVM applied to the whole dataset yields significantly better performances than the single-gene approach for LUAD. This means that, contrary to the BRCA and HNSC cases, the linear combinations of genes which are found for LUAD have a predictive power that generalises well to unseen data.
In summary, these results show that for at least 6 out of 8 cancers investigated, somatic mutation profiles have a prognostic value, and that for two of them (SKCM and LUAD) it is possible to improve the prognostic power of mutations by using gene networks and to reach a CI above 0.6. In both cases, NetNorM is significantly better than NSQN.

The biological information encoded in the gene network contributes to the prognosis

To test whether the biological information contained in the gene network plays a role in the improvement of survival predictions for LUAD and SKCM, we evaluate NetNorM and NSQN again using 10 different randomised versions of Pathway Commons for these two cancers. Random networks were obtained by shuffling the node labels of the real network while keeping the structure unchanged. The results, shown in Fig. 2.3, demonstrate that NetNorM performs significantly better with a real network. More precisely, the real network significantly outperforms all random networks for SKCM and 8 out of 10 random networks for LUAD (Wilcoxon signed-rank test with correction for multiple hypothesis testing, FDR 5%). NSQN also performs significantly better with a real network for SKCM (7 out of 10 cases) but not for LUAD (0 out of 10 cases). This last observation is not surprising since NSQN does not improve over the raw data for LUAD, which suggests that the method may have failed to leverage network information in this case. In summary, these results indicate that the improvements obtained with NetNorM and NSQN compared to the raw data do rely on biological information encoded in the network.
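The label-shuffling randomisation can be sketched as follows. The function name `shuffle_labels`, the seeding scheme and the toy network are assumptions made for illustration; only the principle (permute gene labels, keep the graph structure) comes from the text.

```python
import random
from typing import Dict, Set

Network = Dict[str, Set[str]]


def shuffle_labels(network: Network, seed: int) -> Network:
    """Permute gene labels at random while keeping the graph structure."""
    rng = random.Random(seed)
    genes = sorted(network)
    permuted = genes[:]
    rng.shuffle(permuted)
    relabel = dict(zip(genes, permuted))
    # Every edge (g, n) becomes (relabel[g], relabel[n]), so the degree
    # sequence and the topology are unchanged.
    return {relabel[g]: {relabel[n] for n in neighbours}
            for g, neighbours in network.items()}


# Hypothetical toy network with edges A-B, A-C, A-D, B-C and D-E.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

# Ten randomised versions, one per seed (the seeds are arbitrary).
random_networks = [shuffle_labels(network, seed) for seed in range(10)]
```

Because only the labels move, a randomised network has the same hubs in terms of structure, but the hub positions are now occupied by arbitrary genes, which removes the biological meaning of the neighbourhoods.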

Analysis of predictive genes

In order to shed light on the reasons why NetNorM outperforms the raw data and NSQN on survival prediction for SKCM and LUAD, we now analyse more finely the normalisation carried out by NetNorM on the mutation profiles, and why it leads to better prognostic models. For that purpose, we focus on the genes that are selected by the sparse survival SVM in at least 50% of the 20 different train/test splits of cross-validation, after NetNorM normalisation. This leads to 21 frequently selected genes for LUAD and 10 for SKCM (Fig. 2.4). Remembering that NetNorM either removes mutated genes for patients with many mutations, or adds proxy mutations for patients with few mutations, we can assess for each frequently selected gene whether it tends to exhibit proxy mutations or whether it tends to be actually mutated in the tumour. This is done by comparing how frequently it is marked as mutated in the raw data and after NetNorM normalisation (Fig. 2.4, top plot). For both cancers, we observe two clearly distinct groups of frequently selected genes: those that concentrate proxy mutations (which we will call proxy genes, in red in Fig. 2.4), and those to which NetNorM brings only few modifications compared to the raw data, meaning they are usually actually mutated in the tumours (in black in Fig. 2.4).
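The two quantities used in this analysis can be sketched in a few lines. The function names, the toy gene sets and the exact definition of `proxy_fraction` (share of patients marked as mutated after NetNorM who were not mutated in the raw data) are assumptions for illustration; the text does not specify a numerical rule for separating proxy genes from the others.

```python
from collections import Counter
from typing import List, Set


def frequently_selected(selected_per_split: List[Set[str]],
                        min_fraction: float = 0.5) -> Set[str]:
    """Genes selected by the model in at least min_fraction of the splits."""
    counts = Counter(g for genes in selected_per_split for g in genes)
    n_splits = len(selected_per_split)
    return {g for g, c in counts.items() if c / n_splits >= min_fraction}


def proxy_fraction(gene: str, raw_profiles: List[Set[str]],
                   netnorm_profiles: List[Set[str]]) -> float:
    """Among patients marked as mutated after NetNorM, share not mutated in raw data."""
    marked = [i for i, p in enumerate(netnorm_profiles) if gene in p]
    if not marked:
        return 0.0
    proxies = [i for i in marked if gene not in raw_profiles[i]]
    return len(proxies) / len(marked)


# Hypothetical selections over four train/test splits.
splits = [{"TP53", "UBC"}, {"TP53"}, {"TP53", "FLNC"}, {"UBC"}]
frequent = frequently_selected(splits)

# Hypothetical raw and NetNorM-normalised profiles for three patients (k = 2).
raw = [{"TP53"}, {"TP53", "FLNC", "X"}, {"FLNC"}]
normalised = [{"TP53", "UBC"}, {"TP53", "FLNC"}, {"FLNC", "UBC"}]
ubc = proxy_fraction("UBC", raw, normalised)
tp53 = proxy_fraction("TP53", raw, normalised)
```

In this toy example UBC is marked as mutated only through proxy mutations, whereas TP53 is always actually mutated, which mirrors the two groups of genes described above.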

Genes with few modifications imputed by NetNorM

In the case of LUAD, 12 out of the 21 selected genes are non-proxy genes, meaning they tend to be really mutated when they are marked as mutated after NetNorM normalisation. Interestingly, mutations in 5 of these genes are predictive of an increased survival time (corresponding to a positive coefficient in the sparse survival SVM) while mutations in the remaining 7 genes are predictive of a decreased survival time (corresponding to a negative coefficient) (Fig. 2.4, bottom plot). The three most important predictors according to their frequency of selection are NOTCH4, TP53 and CRB1 (selected in all of the 20 folds), and all are predictive of a decreased survival time. TP53 is a well-known cancer gene and has been reported as significantly mutated in LUAD [Collisson et al., 2014; Ding et al., 2008]. NOTCH4 is part of the NOTCH signalling pathway, which has been widely implicated in cancer and shown to act as either an oncogene or a tumour suppressor depending on the context [Ranganathan et al., 2011]. Finally, CRB1 is known to localise at tight junctions but little is known about its role in carcinogenesis [Roh et al., 2002]. Among the remaining genes, LAMA2 (selected in 16 out of 20 folds) has been detected as a driver gene in head and neck squamous cell carcinoma and PCDH18 (selected in 11 out of 20 folds) has been detected as a driver in bladder carcinoma, cutaneous melanoma and in a pan-cancer analysis setting [Gonzalez-Perez et al., 2013].
In the case of SKCM, 9 out of the 10 selected genes are genes with few modifications. This includes 7 genes whose mutations are predictive of a decreased survival time (FLNC, IQGAP2, NPC1L1, NCOA3, LRBA, DSP, PRRC2A), and 2 whose mutations are predictive of an increased survival time (SACS and APOB). Among these genes, NCOA3 (also known as AIB1 or SRC3) is an important oncogene in breast cancer [Anzick, 1997; Lahusen et al., 2009]. Its role in other cancers is unclear; however, it has been shown that overexpression of NCOA3 is a marker of melanoma outcome [Rangel et al., 2006]. LRBA interacts with multiple important signal transduction pathways including EGFR, and its deregulation in several cancer types has been shown to facilitate cancer cell growth [Wang et al., 2004]. Moreover, LRBA expression has been indicated as a clinical outcome predictor in breast cancer [Andres et al., 2013]. Filamin C (FLNC, selected in all of the 20 folds) is a large actin-cross-linking protein which has been shown to inhibit proliferation and metastasis in gastric and prostate cancer cell lines [Qiao et al., 2014]. Desmoplakin (DSP) is required for functional desmosomal adhesion, which has been linked to cancer cell development and progression in several cancers [Chidgey and Dawson, 2007; Dusek and Attardi, 2011]. Moreover, IQGAP2 has been identified as a tumour suppressor gene in hepatocellular carcinoma, gastric and prostate cancers [Xie et al., 2015].

Proxy genes

In addition to somatically mutated genes, several proxy genes, marked as mutated by the NetNorM procedure, are often selected by the survival model. The proxy genes for LUAD are IGF2BP2, RPS9, SMARCA5, MCM4, KHDRBS1, PSMD12, SKIV2L2, FN1 and RPL19; for SKCM, UBC is the only one. These genes are among the biggest hubs in the network. This is expected as proxy mutations are imputed in genes with a lot of mutated neighbours, which is more likely to occur for genes that simply have a lot of neighbours. The fact that these proxy genes were selected in the survival models means that they have some prognostic power. In particular for LUAD, the better prediction performance achieved by NetNorM compared to the raw data is largely explained by better predictions made for the half of patients with fewer mutations, and therefore by the proxy mutations that were created in these patients (Fig. 2.5a).
The prognostic power of proxy genes in NetNorM comes from at least two types of information they capture. The first type of information captured by proxy mutations is the total number of mutations in a patient. Patients harbouring proxy mutations in a given proxy gene are significantly less mutated than those without proxy mutations in that gene (Welch's t-test, P ≤ 1 × 10⁻²). This results from the fact that patients with few mutations receive as many proxy mutations as needed to reach the target number of mutations k, and therefore proxy mutations have a higher probability of being set in patients with few mutations. The fact that NetNorM creates proxies for the total number of mutations raises the question of whether the total number of mutations can improve survival predictions made using the raw binary mutation profiles. To answer this question, we trained a model to predict survival from the raw binary mutation profiles concatenated with a feature, scaled to unit variance, which records the total number of mutations in each patient (Fig. A.4). According to our results, taking into account such a feature does not improve survival prediction performances compared to using the raw data alone. We therefore tested another feature which better mimics the proxies created by NetNorM, which we call 'proxies'. This feature is equal to the total number of mutations in a patient for patients with fewer than k mutations, and is equal to 0 otherwise. We trained a survival prediction model on the raw data concatenated with the feature 'proxies', scaled to unit variance, where k is chosen by cross-validation. Interestingly, we find that using such a feature significantly improves the results obtained for OV, KIRC and LUAD compared to the raw data alone. In particular, the performances obtained for LUAD are on par with those obtained with NetNorM, suggesting that the feature 'proxies' summarises well the information leveraged by NetNorM. However, this is not the case for SKCM, since considering the feature 'proxies' does not improve over using the raw data alone. We draw two conclusions from these observations: first, NetNorM creates relevant proxies for the total number of mutations which, in combination with the binary mutation profiles, have predictive power; second, such proxies do not entirely explain the performances of NetNorM, at least for SKCM.
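The 'proxies' feature defined above is simple enough to be written out explicitly. In the sketch below, the function names and the toy mutation counts are hypothetical, and the use of the population standard deviation without centring is an assumption about what "scaled to unit variance" means here.

```python
from statistics import pstdev
from typing import List, Sequence


def proxies_feature(mutation_counts: Sequence[int], k: int) -> List[float]:
    """Total number of mutations for patients with fewer than k mutations, else 0."""
    return [float(c) if c < k else 0.0 for c in mutation_counts]


def scale_to_unit_variance(values: Sequence[float]) -> List[float]:
    """Divide by the population standard deviation (assumed scaling convention)."""
    sd = pstdev(values)
    if sd == 0:
        return list(values)  # a constant feature is left unchanged
    return [v / sd for v in values]


# Hypothetical total mutation counts for four patients, with k = 5.
counts = [1, 3, 5, 10]
feature = proxies_feature(counts, k=5)
scaled = scale_to_unit_variance(feature)
```

With these toy counts the unscaled feature is [1.0, 3.0, 0.0, 0.0]: only the two patients with fewer than k = 5 mutations carry a non-zero value. The scaled feature would then be appended as one extra column to the raw binary mutation matrix before training the survival model.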

Table of contents:

1 Introduction
1.1 Contextual setting
1.2 Cancer Genomics
1.3 GWAS
1.4 Statistical learning
1.5 Learning in high dimension
1.6 Computational challenges
1.7 Contributions
2 NetNorM
2.1 Introduction
2.2 Overview of NetNorM
2.3 Survival prediction
2.4 Patient stratification
2.5 Discussion
2.6 Materials and Methods
3 Supervised Quantile Normalisation
3.1 Introduction
3.2 Quantile normalisation (QN)
3.3 Supervised quantile normalisation (SUQUAN)
3.4 SUQUAN as a matrix regression problem
3.5 Algorithms
3.6 Experiments
3.7 Discussion
4 WHInter
4.1 Introduction
4.2 Preliminaries
4.3 The WHInter algorithm
4.4 Simulation study
4.5 Results on real world data
4.6 Related work
4.7 Discussion
5 Conclusion
A Supplementaries for NetNorM
B Supplementaries for WHInter
Bibliography