Computational methods to unravel tumor evolution from genomic data

Overview of sequencing techniques

A large number of sequencing technologies have been developed over the last two decades and have been extensively reviewed [Goodwin et al., 2016; Heather and Chain, 2016; van Dijk et al., 2018; Mardis, 2017]. The characteristics of a sequencing technology have important implications for the genomic features that can be detected and for the design of the associated computational pipelines, so we briefly describe the landscape of sequencing techniques. Sanger sequencing can handle sequences of up to 1000 base pairs (bp) with an accuracy as high as 99.999% [Shendure and Ji, 2008]. It relies on a polymerization reaction in which DNA elongation is supplied with dideoxynucleotide triphosphates (ddNTPs) alongside regular deoxynucleotide triphosphates (dNTPs). The incorporation of a ddNTP prevents further elongation, and the resulting DNA molecules are then separated by electrophoresis according to their molecular weight (and hence length). Because fragments differing by a single nucleotide must be resolved, the separating power of electrophoresis limits the sequenced length. Increasing sequencing throughput relies on several aspects:
• the sequencing of many identical molecules at the same time (after PCR amplification) for robust signal detection,
• the sequencing of many DNA templates in parallel, typically by resorting to spatial resolution.

Copy number and Structural variants

Copy number alterations or variations (CNA, CNV) were the first genomic alterations to be detected, through the direct observation of karyotypes, and were associated with cancer and dysfunctional phenotypes quite early. The copy number profile of a tumor sample is now accessible at higher resolution using comparative genomic hybridization (CGH) arrays or DNA sequencing approaches. The main idea is the same in both cases: observe the variations of signal intensity (either hybridization intensity or number of reads) along the genome to distinguish amplified or deleted regions. Due to the coverage biases mentioned before (mappability in repeated regions or GC content), a normalization is necessary, using either a matched normal sample from the same patient or a pool of normal samples. The resulting profile represents the total copy number along the genome. With CGH arrays or WGS, the whole genome is covered, while the profile is highly incomplete with WES or targeted sequencing; moreover, the capture step introduces an additional bias, making the data noisier. The total copy number profile can be refined by focusing on allele-specific copy number. Indeed, the human genome is diploid, so each locus on the 22 autosomes is present in two copies, and there exist a number of positions known as single nucleotide polymorphisms (SNPs) where the two copies carry different nucleotides. Each individual has around 3 to 4 million such positions differing from the reference human genome. Those are genetic variations present in the individual's original genome, and are distinct from the somatic SNVs mentioned before, which are additional genomic alterations that occur during the individual's lifetime. Beyond SNPs and SNVs, we can also distinguish a third category of alterations: the germline "private" alterations of an individual, which are not widespread in the population (frequency below 1%) and hence are not SNPs; they are not considered here.
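The normalization against a matched normal sample can be illustrated with a minimal sketch. It is not the procedure of any specific tool: the function name `log2_ratio`, the pseudocount, the scaling by total depth, and the read counts are all assumptions made for illustration. The sketch shows that a bin with low coverage in both samples (for instance because of poor mappability) does not appear as a deletion once the tumor signal is divided by the normal one.

```python
import math

def log2_ratio(tumor_depth, normal_depth, pseudocount=0.5):
    """Per-bin log2(tumor / normal) after scaling each sample by its total depth.

    Hypothetical helper: the pseudocount avoids dividing by zero in empty bins.
    """
    tumor_total = sum(tumor_depth)
    normal_total = sum(normal_depth)
    ratios = []
    for t, n in zip(tumor_depth, normal_depth):
        t_scaled = (t + pseudocount) / tumor_total
        n_scaled = (n + pseudocount) / normal_total
        ratios.append(math.log2(t_scaled / n_scaled))
    return ratios

# Made-up read counts per genomic bin. Bin 1 is poorly covered in BOTH samples
# (a coverage bias), bins 2-3 are gained and bin 4 is lost in the tumor only.
normal = [100, 50, 100, 100, 50, 100]
tumor = [100, 50, 200, 200, 25, 100]

profile = log2_ratio(tumor, normal)
for i, value in enumerate(profile):
    print(f"bin {i}: log2 ratio = {value:+.2f}")
```

In this toy profile the shared bias in bin 1 cancels out, so bins 0 and 1 receive the same value, while the gained and lost bins stand out above and below that baseline. The baseline itself is not zero because the scaling centers the profile on the average tumor copy number, which is one reason why assigning absolute integer copy numbers requires a further step.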
At those SNP positions, one can measure the coverage separately for each allele and detect allelic imbalance, where one of the alleles (denoted the A allele) is amplified compared to the other (denoted the B allele). Considering the B allele frequency (BAF) provides more detailed information about the alterations of cancer genomes and the processes at their origin. To complete the analysis, the signal is segmented, either using only the total copy number or by performing joint segmentation with the BAF signal as implemented in Pierre-Jean et al. [2015], to determine regions of constant copy number and the breakpoints separating those regions. Some methods like Pindel [Ye et al., 2009] or DELLY [Rausch et al., 2012] additionally analyze the split reads covering both ends around a breakpoint to ensure better detection of structural variants. However, WGS is necessary for this step, and long reads have even more power to resolve complex situations that can be incorrectly mapped to the reference genome. As with variant calling, many methods have been developed to uncover the structural variations of tumor genomes, and their precise error rates are hard to evaluate for similar reasons [Pierre-Jean et al., 2015]. Once the genome is segmented, the last stage of CNV calling consists in assigning integer copy number values to each segment, i.e. determining the ploidy of the tumor. This step is highly confounded by sample purity: several values of the pair (purity, ploidy) can explain the same data, i.e. the problem is not identifiable [Zaccaria and Raphael, 2018; Shen and Seshan, 2016; Favero et al., 2014]. Finally, as in the case of variant calling, the problem becomes even more complex when the sample is considered as a mixture of clones with different genomic landscapes; this will be further explored in the next chapter.
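The ambiguity between purity and ploidy can be made concrete with a small numerical sketch. It assumes a deliberately simplified mixture model that is not taken from any of the cited methods: normal cells are diploid and heterozygous at every SNP, all segments have equal length, and the signals are noise-free. The function `expected_signals` and the copy number values are hypothetical. Under these assumptions, a tumor at 50% purity and the same tumor after a whole-genome doubling at one-third purity produce identical relative depth and BAF profiles.

```python
import math

def expected_signals(purity, segments):
    """Expected relative depth and BAF per segment under a simple mixture model.

    segments: list of (total_copies, b_allele_copies) in tumor cells.
    Assumption: normal cells carry 2 copies, one of which is the B allele.
    """
    depth = [purity * total + 2 * (1 - purity) for total, _ in segments]
    mean_depth = sum(depth) / len(depth)  # equal-length segments assumed
    relative_depth = [d / mean_depth for d in depth]
    baf = [
        (purity * b_copies + (1 - purity)) / d
        for (_, b_copies), d in zip(segments, depth)
    ]
    return relative_depth, baf

# Candidate A: purity 0.5, segments with 1, 2 and 3 copies.
depth_a, baf_a = expected_signals(0.5, [(1, 0), (2, 1), (3, 1)])
# Candidate B: every copy number doubled, purity 1/3.
depth_b, baf_b = expected_signals(1 / 3, [(2, 0), (4, 2), (6, 2)])

same_depth = all(math.isclose(x, y) for x, y in zip(depth_a, depth_b))
same_baf = all(math.isclose(x, y) for x, y in zip(baf_a, baf_b))
print("identical relative depth:", same_depth)
print("identical BAF:", same_baf)
```

Both candidates fit the observations equally well in this toy setting, so choosing between them requires additional information or assumptions, for example a preference for the solution with the lowest ploidy.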

Overview of existing methods

Reconstructing the evolutionary history of a tumor from bulk sequencing data is a problem that has attracted considerable interest within the community, and over 80 approaches have been designed to solve a variety of formulations of the question. A large part of those methods, hereafter denoted ITH methods, have been reviewed and are summarized in Supplementary Table D.1 (the list is probably not exhaustive despite our best efforts). Given the number of methods, we have extracted a number of features to better describe the diversity of the developed approaches. This first step has allowed us to distinguish broad categories of methods, which can help the reader or potential user navigate among methods and identify the one(s) best suited to their needs. We then consider the problem of method evaluation, which is a key issue for further performance improvement, and finally, we outline some challenges for future developments.

Table of contents:

Introduction
Preamble
Organization and contributions of the thesis
1 Elements of cancer genomics
1.1 Interpretation of genomic features
1.1.1 Driver alterations
1.1.2 ITH and cancer evolution
1.1.2.1 Origin of ITH
1.1.2.2 A few generalities on ITH inference
1.1.2.3 Clinical implications of ITH
1.1.3 Mutational signatures
1.1.3.1 Relation with mutational processes
1.1.3.2 Approaches for signature deconvolution in cancer genomes
1.1.3.3 Future challenges
1.2 Specificities of sequencing for cancer research
1.2.1 Overview of sequencing techniques
1.2.2 Extraction of relevant features
1.2.2.1 Variant calling
1.2.2.2 Copy number and Structural variants
2 Computational methods to unravel tumor evolution from genomic data
2.1 Overview of existing methods
2.1.1 Selection of ITH methods
2.1.2 ITH method features, and attribution strategies
2.1.2.1 Input description
2.1.2.2 Output description
2.1.2.3 Preliminary algorithmic characterization
2.1.3 Main classes of methods
2.2 Challenges for method evaluation
2.2.1 Different inputs, different outputs, different problems
2.2.2 Choice of a benchmarking dataset
2.2.2.1 Simulated data
2.2.2.2 Real data
2.2.3 Metrics
2.2.4 Previous comparisons of ITH methods
2.3 Open questions for ITH inference
2.3.1 Directions for future developments
2.3.2 Method evaluation
3 Assessing reliability of intra-tumor heterogeneity estimates from single sample whole exome sequencing data
3.1 Introduction
3.2 Materials and methods
3.2.1 Data
3.2.2 Variant calling filtering
3.2.3 ITH methods
3.2.3.1 Published methods
3.2.3.2 Consensus (CSR)
3.2.4 Clinical variables
3.2.5 Survival regression
3.2.5.1 Model
3.2.5.2 Evaluation procedure
3.2.6 Immune signatures
3.2.7 Correlations
3.2.7.1 Comparison metrics
3.2.8 WES and single cell paired dataset
3.2.8.1 Data availability and preprocessing
3.2.8.2 Evaluation metrics
3.3 Results
3.3.1 Assessing ITH on TCGA samples
3.3.2 Methods quantifying ITH exhibit inconsistent results
3.3.3 ITH is a weak and non robust prognosis factor
3.3.4 ITH prognosis signal is redundant with other known factors
3.4 Discussion
3.4.1 Comparison to similar studies
3.4.2 Can we truly measure ITH?
3.4.3 Association with survival, link with other variables
3.4.4 Can we build a gold standard dataset for benchmark?
4 CloneSig: Joint Inference of intra-tumor heterogeneity and signature deconvolution in tumor bulk sequencing data
4.1 Introduction
4.2 Results
4.2.1 Joint estimation of ITH and mutational processes with CloneSig
4.2.2 Performance for subclonal reconstruction
4.2.3 Performance for signature deconvolution
4.2.4 Pan-cancer overview of signature changes
4.2.5 Clinical relevance of ITH and signature changes
4.3 Discussion
4.3.1 Improved ITH and signature detection in WES
4.3.2 Clinical relevance of signature variations
4.3.3 Importance of input signatures and challenges
4.4 Materials and methods
4.4.1 CloneSig model
4.4.2 Parameter estimation
4.4.3 Test of mutational signature changes
4.4.4 Simulations
4.4.4.1 Default simulations
4.4.4.2 Simulations for comparison with other ITH and signature methods
4.4.4.3 Simulations without signature change between clones
4.4.4.4 Simulations to assess the separating power of CloneSig
4.4.4.5 Simulations to assess the sensitivity of the statistical test
4.4.5 Evaluation metrics
4.4.5.1 Metrics evaluating the subclonal decomposition
4.4.5.2 Metrics evaluating the identification of mutational signatures
4.4.6 Implementation
4.4.7 Data
4.4.8 Copy number calling and purity estimation
4.4.9 Variant calling filtering
4.4.10 Construction of a curated list of signatures associated with each cancer type
4.4.11 Survival analysis
5 Closing remarks
5.1 Conclusion
5.2 Perspectives
5.2.1 How relevant is the number of clones to quantify tumor evolution?
5.2.2 The necessity to go beyond the TCGA
References