
## Learning on transcriptomic data: problem and notations

With the rapid development of transcriptomic technologies, it is now possible to simultaneously track the expression levels of thousands of genes or transcripts (features) during critical biological processes and across collections of related samples. However, the large number of features and the complexity of biological networks make it challenging to understand and interpret the results of such massive data. In this section, we introduce the problem and notations for learning on transcriptomic data; these notations are used consistently throughout the thesis.

Figure 1.1: General flowchart for transcriptome analysis. Both microarray and RNA-seq data go through two parts. A. Data processing: the raw data from microarray experiments are obtained as a set of images. Turning these images into probe-level values involves several steps: image analysis, background subtraction, normalization and summarization. RNA-seq data, in contrast, are usually stored as a list of FASTQ files and are processed through quality control and trimming, mapping, and assembly. B. Application: microarray and RNA-seq data are used in various applications: differential expression analysis, survival analysis and transcriptome classification.

Transcriptome sequencing data is converted into a gene expression measure (see Chapter 2) and stored in an observed expression matrix $x$ with $p$ rows and $n$ columns, where $p$ is the number of features/variables (genes or other relevant features) and $n$ is the number of samples/observations. Features are indexed by $k$ and samples by $i$ ($k = 1, \ldots, p$; $i = 1, \ldots, n$). As a result, $x_i$ is the observed vector of expression for sample $i$ across the $p$ features (a vector of size $p$), while $x^{(k)}$ is the observed vector of expression for feature $k$ across the $n$ samples (a vector of size $n$). Then, $x_i^{(k)}$ is the observed expression of feature $k$ in sample $i$. The random variable modeling the expression of feature $k$ in sample $i$ is denoted $X_i^{(k)}$. Finally, $y$ is the vector of labels for the $n$ samples, and $y_i$ is the label of sample $i$.
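To make the notation concrete, the following minimal sketch stores a small expression matrix as nested lists and extracts a sample vector, a feature vector and a single entry. The values, the labels and the helper names (`sample_vector`, `feature_vector`, `expression`) are hypothetical and only serve as an illustration; indices are 1-based in the helpers to match the notation of the text.

```python
# Hypothetical expression matrix x with p = 3 features (rows) and n = 4 samples (columns).
x = [
    [5.0, 7.0, 6.0, 8.0],   # feature k = 1
    [0.0, 1.0, 0.0, 2.0],   # feature k = 2
    [9.0, 9.5, 8.5, 9.0],   # feature k = 3
]
y = ["control", "control", "treated", "treated"]  # hypothetical labels, one per sample

p = len(x)       # number of features
n = len(x[0])    # number of samples


def sample_vector(x, i):
    """Return x_i: expression of sample i (1-based) across the p features."""
    return [row[i - 1] for row in x]


def feature_vector(x, k):
    """Return x^(k): expression of feature k (1-based) across the n samples."""
    return list(x[k - 1])


def expression(x, k, i):
    """Return x_i^(k): expression of feature k in sample i (both 1-based)."""
    return x[k - 1][i - 1]


x_2 = sample_vector(x, 2)        # [7.0, 1.0, 9.5], a vector of size p
x_feat_1 = feature_vector(x, 1)  # [5.0, 7.0, 6.0, 8.0], a vector of size n
```

With this layout, a row of the matrix is a feature vector and a column is a sample vector, which is the convention used in the rest of the section.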

### Differential gene expression (DGE) analysis

Identifying genes that show differences in expression level between conditions is the most popular use of transcriptome profiling. For example, in order to assess the effect of a drug, we may ask which genes are up-regulated (increased expression) or down-regulated (decreased expression) between treatment and control groups.

**Hypothesis testing**

Assume we want to detect differentially expressed (DE) genes between two conditions (1 and 2) based on their expression table. Statistical tests address this task by providing a mechanism for making quantitative decisions. To make a decision, a statistical test evaluates the evidence that the data provide against a hypothesis, called the null hypothesis (denoted H0). In gene expression analysis, H0 states that the mean expression is equal between the two conditions. The alternative hypothesis (denoted H1 or Ha) states that the mean expression differs between the two conditions. To test whether gene $k$, with observed expression $x^{(k)}$, is differentially expressed between two conditions, we apply the procedure described below:

1. Model the expression of gene $k$ using a random variable $X^{(k)}$, with means $\mu_1$ and $\mu_2$ in condition 1 and condition 2, respectively.

2. Formalize the H0 and H1 hypotheses:

H0: The mean expression between the two conditions is equal ($\mu_1 = \mu_2$)

H1: The mean expression between the two conditions is different ($\mu_1 \neq \mu_2$)

3. Set the significance level $\alpha$, defined as the probability of rejecting H0 given that H0 is true. The significance level is used in step 6 to make the final decision.

4. Fit the model and estimate the parameters of the random variable $X^{(k)}$ for each condition.

5. Compute the value of the test statistic on the observed data $x^{(k)}$, and the probability of obtaining this value or a more extreme one when H0 is true (called the P-value). For example, a P-value of 0.001 means that, when the mean expression is equal between the two conditions (i.e., H0 is true), the probability of observing an outcome as extreme as or more extreme than the observed data is one in 1000.

6. Decide whether or not to reject H0 based on the P-value and $\alpha$. If P-value $\le \alpha$, the evidence against H0 is statistically significant and H0 is rejected; otherwise, H0 is not rejected. Conventionally, the significance level is set to 5%, 1% or 0.1%.
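The decision logic of the steps above can be sketched on a single gene. The text does not prescribe a specific test, so this sketch swaps in an exact permutation test on the difference in means: it needs no distributional model (step 4 is therefore skipped) and only uses the Python standard library. The expression values and the function name `permutation_p_value` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group1, group2):
    """Two-sided exact permutation P-value for a difference in means."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(mean(group1) - mean(group2))

    extreme = 0
    total = 0
    # Enumerate every way of assigning n1 of the pooled values to condition 1.
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[j] for j in chosen]
        g2 = [pooled[j] for j in range(len(pooled)) if j not in chosen]
        stat = abs(mean(g1) - mean(g2))
        # Count relabelings at least as extreme as the observed data
        # (small tolerance to guard against floating-point rounding).
        if stat >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical expression values of one gene k in two conditions.
condition_1 = [1.0, 2.0, 3.0, 4.0]
condition_2 = [11.0, 12.0, 13.0, 14.0]

alpha = 0.05                                              # step 3: significance level
p_value = permutation_p_value(condition_1, condition_2)   # step 5: P-value
reject_h0 = p_value <= alpha                              # step 6: decision
```

Here only 2 of the 70 possible relabelings are as extreme as the observed split, so the P-value is 2/70 (about 0.029) and H0 is rejected at the 5% level. A parametric test such as a t-test follows the same decision rule and differs only in how the P-value is computed.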

In hypothesis testing, one can make two types of errors.

1. Type I error or false positive: the test rejects the null hypothesis although it is actually true. For example, gene $k$ is not differentially expressed but the test declares it differentially expressed. As a result, a type I error introduces a false discovery.

2. Type II error or false negative: conversely, the test fails to reject the null hypothesis although it is actually false.

Although type I and type II errors cannot be entirely avoided, statistical tests control the probability of making a type I error through the significance level $\alpha$.

**Multiple testing**

Conducting a single statistical test for each gene has several limitations; the most important is that a large number of hypothesis tests are performed, potentially producing a substantial number of falsely significant results. For instance, say we have 20 null hypotheses to test simultaneously with $\alpha = 0.05$, i.e., the probability of making a type I error is 5% for each individual test. The chance of generating at least one false positive when performing 20 tests is then calculated as follows:

$$P(\text{at least 1 error in 20 tests}) = 1 - P(\text{no error in 20 tests}) = 1 - (1 - 0.05)^{20} \approx 0.64$$

Thus, if the 20 tests are independent, the chance of generating at least one incorrect rejection (the so-called family-wise error rate, or FWER) is around 64%, even if there are no true differences to detect. This is a serious problem for RNA-seq experiments, where tens to hundreds of thousands of tests are performed.
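The growth of the family-wise error rate with the number of tests can be checked numerically. This is a minimal sketch under the same assumptions as the calculation above (independent tests, all null hypotheses true); the function name `family_wise_error_rate` is hypothetical.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one type I error in m independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


fwer_20 = family_wise_error_rate(0.05, 20)          # about 0.64, as in the text
fwer_10000 = family_wise_error_rate(0.05, 10_000)   # very close to 1
```

At the scale of an RNA-seq experiment, at least one false positive is therefore practically certain if no correction is applied.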

Several methods have been introduced to deal with multiple testing. Their aim is to adjust $\alpha$ so that the probability of obtaining at least one significant result by chance remains lower than the desired significance level.

The Benjamini-Hochberg adjustment controls the False Discovery Rate (FDR). This method, introduced by Benjamini and Hochberg (1995), aims to control the expected proportion of falsely rejected hypotheses among all rejected hypotheses. The Benjamini-Hochberg (BH) procedure proceeds step by step as described below.

1. Conduct all $m$ hypothesis tests and extract the corresponding p-value for each test.

2. Sort these p-values in ascending order and assign a rank to each p-value, starting from 1.

3. Calculate the BH critical value for each individual p-value as $\frac{i}{m} Q$, where $i$ is the rank of the p-value, $m$ is the total number of tests and $Q$ is the desired FDR.

4. Find the largest p-value that is lower than or equal to its BH critical value.

5. Finally, this p-value and all p-values lower than it are considered significant.
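The steps above can be sketched in a few lines of standard-library Python. The function name `benjamini_hochberg` and the p-values are hypothetical, and the comparison uses "lower than or equal to", the usual convention for this procedure.

```python
def benjamini_hochberg(p_values, q):
    """Return a list of booleans, in the input order, flagging significant tests."""
    m = len(p_values)
    # Step 2: indices of the tests sorted by ascending p-value (rank 1 first).
    order = sorted(range(m), key=lambda j: p_values[j])

    # Steps 3 and 4: largest rank whose p-value is <= its critical value (rank / m) * q.
    cutoff_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank / m * q:
            cutoff_rank = rank

    # Step 5: every test ranked at or below the cutoff is significant.
    significant = [False] * m
    for j in order[:cutoff_rank]:
        significant[j] = True
    return significant


# Hypothetical p-values for m = 5 tests, with a desired FDR of Q = 0.1.
p_values = [0.30, 0.005, 0.05, 0.60, 0.045]
flags = benjamini_hochberg(p_values, q=0.1)
# flags == [False, True, True, False, True]
```

In this example the p-value 0.045 has rank 2 and exceeds its own critical value of 0.04, yet it is declared significant because the p-value of rank 3 (0.05) is below its critical value of 0.06. The cutoff is set by the largest qualifying rank, not by each test individually.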

The Benjamini-Hochberg procedure was designed for independent tests, although in practice it also works on dependent tests.

Suppose we conduct 20 hypothesis tests (m = 20) for about 500 genes with our desired False Discovery Rate of 0.2 (Q = 0.2). Table 1.1 below shows the five genes with the lowest p-value. We calculate the BH critical value for each gene as presented in column 4.

The p-value in bold (gene 4) is the highest p-value that is lower than its BH critical value (i.e., 0.036 < 0.04). As a result, all genes with a p-value lower than or equal to 0.036 are considered significant. Note that the p-value of gene 2 is also smaller than its BH critical value. However, it is not the highest p-value that satisfies this criterion.
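As a check on the arithmetic, the BH critical values of the top-ranked genes follow directly from $m = 20$ and $Q = 0.2$. The computation below assumes that gene 4 has rank 4, which is what the critical value of 0.04 quoted above implies; the other p-values of Table 1.1 are not reproduced here.

$$
\frac{i}{m}\, Q = \frac{i}{20} \times 0.2 = 0.01\, i
\quad\Rightarrow\quad
0.01,\ 0.02,\ 0.03,\ 0.04,\ 0.05 \ \text{ for } i = 1, \ldots, 5,
\qquad
p_{(4)} = 0.036 < \frac{4}{20} \times 0.2 = 0.04
$$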

#### Models for differential gene expression data

The first step of a statistical test is the choice of a probabilistic model for the expression data. Microarrays have been used systematically for differential expression analysis for more than two decades, and several well-established methods have been developed for this purpose, such as limma (Smyth, 2004), which is based on the normal distribution. Unfortunately, because of the differences between the data obtained from microarrays and from RNA-seq, these methods cannot be directly applied to RNA-seq data. Expression levels in microarray data are represented as continuous hybridization intensity signals; in contrast, the measurements in RNA-seq data are discrete counts. As a result, microarray data are commonly assumed to follow a normal distribution (see Figure 1.2), while the Poisson and the negative binomial (NB) distributions are the two most commonly used for modeling the non-negative count data of an RNA-seq experiment (Wang et al., 2010; Auer and Doerge, 2011; Di et al., 2011).

However, the Poisson assumption for read counts is too restrictive because it does not capture the biological variation in the data (Robinson and Smyth, 2007; Nagalakshmi et al., 2008). This limitation stems from the simplicity of the Poisson distribution, which assumes that the variance is equal to the mean. When the additional variability between biological replicates is ignored (the so-called over-dispersion problem), the statistical analysis does not control the false-positive rate, because the sampling error is underestimated (Anders and Huber, 2010). To deal with this problem, the NB distribution is used as a replacement for the Poisson distribution in modeling count data. The NB distribution is a two-parameter family, described by its mean and its variance, in which the variance is greater than the mean (Robinson et al., 2010; Anders and Huber, 2010; Love et al., 2014). Another alternative is to transform the RNA-seq data using a simple logarithm, a more complex variance-stabilizing transformation (Anders and Huber, 2010) or the regularized logarithm (Love et al., 2014). The voom and trend transformations, proposed in Law et al. (2014), allow models developed for microarrays to be used on RNA-seq data (Ritchie et al., 2015).

A wide range of software supports statistical tests for detecting differentially expressed genes based on a distributional assumption for RNA-seq count data: DEGseq (Wang et al., 2010), based on the Poisson distribution; DESeq (Anders and Huber, 2010), DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2010), based on the NB distribution; and limma (Law et al., 2014), based on the normal distribution.

Normalization should be considered before performing the statistical analysis. It is an essential procedure designed to identify and correct technical biases introduced by library preparation protocols and sequencing platforms. Normalization has a great impact on differential expression results (Dillies et al., 2013; Bullard et al., 2010), even more than the choice of the test statistic used in the hypothesis tests for DGE analysis. Some classical normalization procedures for RNA-seq will be presented in Section 2.1.3.
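The over-dispersion argument can be illustrated by comparing the sample mean and the sample variance of replicate counts. The sketch below uses hypothetical counts and assumes the common NB parameterization $\mathrm{Var} = \mu + \phi \mu^2$, which the text does not state explicitly. The dispersion $\phi$ is estimated with a simple method of moments, not with the estimators used by the tools cited above.

```python
from statistics import mean, variance

# Hypothetical read counts of one gene across four biological replicates.
counts = [10, 20, 30, 40]

mu = mean(counts)        # sample mean
var = variance(counts)   # unbiased sample variance

# Under a Poisson model the variance would be equal to the mean.
poisson_variance = mu

# Method-of-moments dispersion under the assumed NB form var = mu + phi * mu**2.
# Clipped at 0 because the NB model cannot represent under-dispersion.
phi = max(0.0, (var - mu) / mu ** 2)

is_overdispersed = var > mu
```

For these counts the sample variance (about 167) is far larger than the mean (25), so a Poisson model would understate the variability between replicates, whereas the NB model absorbs it through a positive dispersion.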

### Supervised learning methods

Supervised learning is an algorithmic process that learns a function mapping inputs to outputs from example input-output pairs. The essential goal of supervised learning is therefore to approximate this mapping function as well as possible, so that the model can predict the output variables for new observations. According to the type of output that the model has to predict, supervised learning is typically divided into classification and regression, which are used for predicting categorical and continuous outcomes, respectively.

**Traditional supervised learning models**

We want a model $f$ to predict $y_i$ based on $x_i$: $\hat{y}_i = f(x_i)$. The value $\hat{y}_i$ is the prediction for sample $i$. Among the numerous supervised machine learning (ML) algorithms, we present some classical models, such as linear regression, logistic regression and the Naïve Bayes classifier.
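As a minimal illustration of $\hat{y}_i = f(x_i)$, the sketch below fits a least-squares linear regression on a single feature and predicts the outcome of a new sample. Restricting $x_i$ to one expression value is a simplifying assumption (in the notation above, $x_i$ has $p$ components), and the data and function names are hypothetical.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x_new):
    """Apply the fitted function f to a new observation."""
    intercept, slope = model
    return intercept + slope * x_new


# Hypothetical training pairs (x_i, y_i): one expression value and one
# continuous outcome per sample.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

f = fit_simple_linear_regression(xs, ys)
y_hat = predict(f, 5.0)   # prediction for a new sample
```

The same fit-then-predict pattern applies to classification models such as logistic regression, where the output is a class label or a class probability instead of a continuous value.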

**Table of contents:**

**I Introduction**

**1 Biostatistics for transcriptome analysis**

1.1 History of transcriptome and transcriptomics

1.1.1 Microarrays and RNA-seq

1.1.2 Learning on transcriptomic data: problem and notations

1.2 Differential gene expression (DGE) analysis

1.2.1 Hypothesis testing

1.2.2 Multiple testing

1.2.3 Models for differential gene expression data

1.3 Supervised learning methods

1.3.1 Traditional supervised learning models

1.3.2 Cross-validation

1.3.3 Overfitting and regularization

1.3.4 The curse of dimensionality

1.3.5 Feature selection for supervised learning

1.3.6 Supervised learning for prediction of binary variables

1.3.7 Survival analysis

1.3.8 Interpreting supervised learning models

1.3.9 Supervised learning on gene expression data

1.3.10 Differential analysis versus supervised learning

1.4 Unsupervised learning methods

1.4.1 Traditional clustering techniques

1.4.2 Evaluation of unsupervised clustering

1.4.3 Clustering approaches for gene expression data

**2 Bioinformatics for RNA-seq analysis**

2.1 Conventional RNA-seq analysis

2.1.1 Quality control

2.1.2 Alignment/Mapping

2.1.3 Quantification

2.2 k-mer strategies

2.3 Reference-free approaches for RNA-seq analysis

2.3.1 DE-kupl: exhaustive capture of biological variation using k-mers

2.3.2 GECKO: a genetic algorithm to classify samples using k-mers

2.3.3 MINTIE: reference-free approach combining de novo assembly and differential analysis

2.4 Conclusion

**3 Transcriptomics of Prostate Cancer**

3.1 General introduction to Prostate Cancer

3.2 Diagnostic and Prognostic of Prostate Cancer

3.3 Supervised learning on reference-based approach for PCa

3.4 Conclusion

**4 Challenges and contributions**

4.1 Adapting tools to the dimensionality of datasets generated by gene-free approaches

4.2 Combining k-mer based reference-free approach and predictive models

4.3 Demonstrating the ability of gene-free approaches to discover unreferenced RNA subsequences

4.4 Measuring reference-free signatures across independent RNA-seq datasets

**II Results**

**5 Methods for dimension reduction in k-mer analysis**

5.1 Introduction

5.2 Filtering k-mers based on their counts

5.2.1 Filtering strategies

5.2.2 Metrics to evaluate filtering performance

5.2.3 Experiments and Results

5.2.4 Conclusion on count-based filtering

5.3 Clustering strategies

5.3.1 Strategies to pre-compute distances for DBSCAN clustering

5.3.2 Experiments and Results

5.3.3 Conclusion of the clustering analysis

5.4 Discussion

**6 Reference-free transcriptome exploration reveals novel RNAs for Prostate cancer diagnosis**

6.1 Discovery of DE-kupl contigs associated to Prostate cancer

6.2 Selection of predictive DE-kupl contigs

6.3 Measuring DE-kupl contigs in an independent cohort

6.4 Performance of DE-kupl predictive contigs in an independent cohort

6.5 Comparing the gene-free classifier vs conventional gene-based classifier

6.6 Discussion

**7 A Comparative Analysis of Reference-Free and Conventional Transcriptome Signatures for Prostate Cancer Prognosis**

7.1 Introduction

7.2 Materials and Methods

7.2.1 Data acquisition and outcome labelling

7.2.2 A generic framework to infer reference-based and reference-free signatures

7.2.3 Gene and k-mer count matrices

7.2.4 Reduction of k-mer matrix via contig extension

7.2.5 Count normalization

7.2.6 Univariate features ranking

7.2.7 Feature selection, model fitting and predictor evaluation

7.2.8 Matching signature contigs in the validation cohort

7.3 Results

7.3.1 A reference-free risk signature for prostate cancer

7.3.2 Relapse signatures contain key PCa drivers

7.3.3 Relapse signatures do not accurately classify independent cohorts

7.4 Discussion

7.4.1 Properties of reference-free signatures

7.4.2 Performances and generalization issues

7.5 Conclusion

7.6 Acknowledgements

**III Discussion**

**8 Discussion and perspectives**

8.1 Applying unsupervised filtering methods with contigs extension count data

8.2 Other unsupervised learning algorithms for clustering k-mers

8.3 The characteristics of reference-free signatures

8.4 Performances of reference-free signatures

Résumé en français

Acronyms