Data management and alignment issues – Project topics materials

Cancer development

Cancer development is a highly complex process that can differ significantly between cancer types and stages of the disease. Contemporary research has associated approximately two hundred genes with cancer.
These genes are usually classified into three groups:
• Proto-oncogenes are responsible for the synthesis of proteins that stimulate cell division or prevent cell death. Malfunctioning forms of these genes are called oncogenes.
• Tumor suppressor genes are responsible for producing proteins that prevent cell division or accelerate cell death (in particular, trigger apoptosis).
• DNA repair genes help prevent mutations in all genes (including those of the first two groups) and thus help prevent cancer development.
If cell division becomes significantly more intensive than usual, it may contribute to the development of a tumor. The contribution of mutations in the genes of each group to this process is explained below.

Proto-oncogenes

The first group of cancer-related genes is the proto-oncogenes. In normal cells they are responsible for producing proteins that control cell division. Such proteins take part in a process called the signal transduction cascade. The conversion of a proto-oncogene into an oncogene is called activation. It can occur through three mechanisms ([LH00]):
• A mutation of, or a translocation within, the proto-oncogene. Such a mutation can intensify the action of the encoded protein.
• Duplication (gene amplification) of a DNA segment that includes a proto-oncogene, leading to overexpression of the encoded protein.
• Rearrangement of genes within a chromosome or an inter-chromosomal translocation. Such rearrangements can move a proto-oncogene to a new location under the control of a different promoter, causing abnormal gene behaviour.

Tumor suppressor genes

The second group is the tumor suppressor genes. These genes are responsible for the synthesis of proteins that suppress cell growth and division. Such proteins may act in different parts of the cell: the nucleus, the cytoplasm or the membrane. Mutation of these genes results in a loss of function, which contributes to uncontrolled cell growth or division.
Tumor suppressor genes are usually recessive: the disease does not develop until both gene copies are mutated. The first mutation may already be present in a germ line cell, so that all descendant cells inherit it. If a mutation later occurs in the second gene copy, uncontrolled cell division starts. This leads to a higher cancer frequency among individuals who inherit a mutation in a tumor suppressor gene than in the population as a whole.
This illustrates the fact that heredity can be an important factor in cancer. However, both copies of a tumor suppressor gene can also be mutated in a somatic cell, usually as a result of environmental factors.
Several types of cancer are associated with defects in tumor suppressor genes:
• Familial adenomatous polyposis of the colon (FPC) is caused by mutations in both copies of APC;
• Hereditary breast cancer is caused by mutations in both copies of Brca2;
• Hereditary breast and ovarian cancer is caused by mutations in both copies of Brca1.
Another typical example is hereditary retinoblastoma, a cancer of the retina that occurs in childhood. It is caused by a mutation in the RB1 tumor suppressor gene ([AD04]). A mutation in one copy is usually transmitted to the offspring from one of the parents. A mutation in the second copy is highly likely to occur because of the large number of retinoblasts and the rapid division of cells of this type. About ninety percent of children who inherit an RB1 mutation develop retinoblastoma. Only individuals younger than eight years have retinoblasts, so the risk of retinoblastoma exists only in early childhood. However, an RB1 mutation is also dangerous for adults, as it increases the risk of several other cancer types.

DNA repair genes

The third group of cancer-related genes is the DNA repair genes. The proteins encoded by these genes are responsible for correcting malformed nucleotide sequences.
Damage to DNA is very common and can be caused by various factors such as radiation, UV light, chemicals and other environmental influences. Errors in DNA replication can also cause DNA damage. The products of DNA repair genes fix the broken sequences and thus minimise the number of mutations in cells. When such a gene is itself mutated, it may no longer code for a functional protein. A lack of DNA repair significantly increases the frequency of cancer-causing DNA changes.
A well-known example of a disorder caused by defective DNA repair genes is xeroderma pigmentosum (XP). Malfunction of these genes causes an increased sensitivity to UV light. Individuals with such mutations have a thousand-fold increased probability of developing all types of skin cancer. Another example of a disease related to defective DNA repair is Bloom syndrome. It is an inherited disorder caused by a mutation in the BLM gene, which is required to maintain a stable DNA structure. Individuals with this syndrome have a high frequency of DNA alterations, which leads to an increased risk of cancer and diabetes.

Examples of structural variants that can result in cancer development

Various examples of structural variants that are related to cancer are provided below. These examples include SVs that cause mutations in all three types of genes considered in the previous section: oncogenes, tumor suppressor genes and DNA repair genes.
There are several examples of inter-chromosomal translocations that cause the formation of new oncogenes. Such oncogenes are called fusion genes. Fusion genes contribute to tumor formation by producing proteins that are more active (or even constantly active). The result of such a translocation is a modified form of the gene, which contributes to cancer by accelerating cell growth. Most proto-oncogene mutations are dominant: a single mutated gene copy is enough to cause uncontrolled cell growth. The presence of an activated oncogene in germ line cells causes the child to inherit a predisposition to cancer.
One of the best-known examples of such genes is the EWS-FLI1 fusion gene. It encodes a chimeric protein and is formed by a tumor-specific translocation between chromosomes 11 and 22. Such translocations are found in both Ewing's sarcoma and primitive neuroectodermal tumor.
The amino-terminal domain of EWS-FLI1 is a much more potent transcriptional activator than the corresponding amino-terminal domain of FLI-1. Moreover, EWS-FLI1 efficiently transforms NIH 3T3 cells, while FLI-1 does not. Ews/Fli1 [Lee07], functioning as a transcription factor, leads to a phenotype dramatically different from that of cells expressing FLI-1.
Figure 1.2: Fusion of chr11 and chr22 encoding Ews/Fli1 causes Ewing sarcoma. Adapted from [Lee07]
Another example is the Philadelphia chromosome, a fusion of chromosomes 9 and 22. It joins the ABL1 gene on chromosome 9 (region q34) to a part of the breakpoint cluster region (BCR) gene on chromosome 22 (region q11) [RKMT03]. This fusion encodes a new oncogenic protein called BCR/ABL. The detection of this translocation is a highly sensitive test for chronic myelogenous leukemia (CML), since 95% of patients with CML have this abnormality.
Figure 1.3: Philadelphia chromosome fusion of chr9 and chr22. (Thanks to James Griffin for the figure)
Translocations are not the only structural variants that can damage a tumor suppressor gene or create a functional mutation in a proto-oncogene: all types of structural variants can be involved in the process. Below we consider SVs that cause a gain or loss of genetic material. Discovering such SVs helps to predict the aggressiveness of cancer development.
Mycn amplification, which occurs in approximately 22% of primary neuroblastomas, is one of the most powerful prognostic factors identified to date. It is significantly associated with advanced-stage disease, rapid tumor progression, and poor prognosis. Interestingly, it is shown in [BBPL+09] that children with Mycn-amplified, hyperdiploid, favourable-stage tumors had significantly better survival than those with diploid tumors. Figure 1.4 shows the Kaplan-Meier survival curves for 31 Mycn-amplified stage A, B, and Ds patients by ploidy ([Sch]).
Analysis of chromosomal aberrations is used to determine the prognosis of neuroblastomas (NBs) and to aid treatment decisions. It is shown in [CKN+10] that patients with different genomic profiles have different survival rates. The authors studied Mycn gene amplification, 11q deletion and 17q gain, as well as genomes with numerical aberrations (i.e., whole-chromosome gains and losses).
Another example of the importance of amplification detection is given by the ERBB2 gene. Overexpression of the receptor tyrosine kinase ERBB2 (also known as HER2) occurs in around 15% of breast cancers and is driven by an amplification of the ERBB2 gene.

Sequencing technologies

The ability to read a DNA sequence and produce a digital representation of it is the basis of a large part of contemporary biological research. The first approach to meet this challenge was capillary electrophoresis (CE)-based Sanger sequencing. This technology made it possible to extract genomic information from any organism and was therefore widely adopted by scientists around the world. However, it has significant limitations in speed, scalability and resolution, which make it impractical for many studies.
An entirely new technology overcoming these limitations, Next Generation Sequencing (NGS), was created at the beginning of the 2000s. NGS is a fundamentally different approach, which started a revolution in genomic science. It not only made it possible to decipher the whole human genome, but also reduced the cost of whole genome sequencing by three to four orders of magnitude during the last 15 years.
The principal concept of NGS technology is similar to that of Sanger sequencing: the bases of a DNA fragment are sequentially identified from signals emitted as each fragment is resynthesized from a DNA template strand. The crucial difference is that NGS can read millions of fragments simultaneously. This enables the latest instruments to read large stretches of DNA in a massively parallel fashion, producing hundreds of gigabases in a single sequencing run.
NGS technologies can provide three types of short-read data (single-end, mate-pair and paired-end; Illumina, Life Technologies) as well as single-end long-read data (PacBio). Both Illumina and SOLiD paired-end and mate-pair sequencing produce pairs of reads suitable for the detection of large SVs. PacBio is the newest of these sequencing technologies; its first commercial product, PacBio RS, was sold to a limited set of customers in 2010 and commercially released in early 2011. As it still has limited availability and a high price, this work concentrates on mate-pair and paired-end data and does not cover long single-end PacBio reads.

Paired-end data

The key steps of a sequencing project are the same for both mate-pair and paired-end technologies: preparation and amplification of template DNA, distribution of templates on a solid support, sequencing and imaging, base calling and quality control.
The first step in the preparation of the sequencing library is DNA fragmentation. Sequencing adapters are then ligated to both ends of the DNA fragments, and PCR amplification is performed using primers complementary to the adapters.
The same adapters are placed on the flow cell (in Illumina SGS technology); the fragments are then placed on the flow cell, where complementary adapters bind to each other. The solid support can take different forms. For example, for Illumina it is a flat glass plate, whereas 454 uses beads carrying adapters, with a place on the plate for each bead.
Once a fragment is attached to the corresponding adapter, a polymerase synthesises the complement of the whole sequence. Finally, the double-stranded DNA is denatured, the original strand is washed away and the process is repeated.
The typical insert size (the distance between paired reads) for paired-end data is rather small: several hundred bases. The two reads of a paired-end fragment are oriented towards each other. As explained further, paired-end data is less suitable than mate-pair data for complex tasks such as structural variant detection. The main advantage of paired-end sequencing is its simple workflow, which has made it widespread.

Mate-pair data

Mate-pair libraries are created in a slightly different way. DNA is split into fragments that are longer than those used for paired-end data. As in the paired-end protocol, adapter sequences are ligated to both ends of the DNA fragments. The fragment is then circularised, so that the two ends of the original DNA fragment become adjacent to each other.
A special heavy biotin molecule is placed between the two adapters. Once the circular DNA has been fragmented, the fragment that contains the ends of the original linear DNA is selected using biotin capture. Errors can be introduced at this stage, as it is not always possible to select the mate-pair fragments robustly. As a result, mate-pair data is usually contaminated with paired-end fragments that have a different average insert size. Such fragments are called singletons.
The end of the sequencing process is identical to that used for paired-end data, i.e. fragments are placed on the solid support and amplified. Sequencing both ends of the selected fragment yields reads that are separated by the length of the original fragment.
Mate-pair libraries allow larger insert sizes than paired-end libraries, from 2 to 20 kilobases. Large inserts are especially valuable in de novo sequencing projects, where they can substantially improve scaffolding (ordering of assembled contigs). In contrast to paired-end reads, which are oriented towards each other, mate-pair reads are either both oriented outwards from the original fragment (Illumina protocol) or both have the same orientation (SOLiD protocol), which needs to be considered in the data analysis.
The major drawback of mate-pair sequencing is its complicated laboratory protocol. Another problem is that a substantially larger amount of DNA (5 to 120 times more) is required to prepare a mate-pair library.
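The orientation conventions of paired-end and mate-pair reads described in these sections can be captured in a small lookup. The following is a minimal sketch: the library labels and function names are hypothetical, chosen for illustration only, and do not correspond to any particular tool.

```python
# Expected relative orientation of the two reads of a pair, by library type.
# "FR": reads point towards each other; "RF": reads point outwards;
# "FF": both reads have the same orientation.
# The library labels are hypothetical names used only in this sketch.
EXPECTED_ORIENTATION = {
    "paired_end": "FR",
    "mate_pair_illumina": "RF",
    "mate_pair_solid": "FF",
}


def pair_orientation(left_is_reverse: bool, right_is_reverse: bool) -> str:
    """Orientation code built from the strands of the leftmost and rightmost read."""
    left = "R" if left_is_reverse else "F"
    right = "R" if right_is_reverse else "F"
    return left + right


def has_expected_orientation(library: str,
                             left_is_reverse: bool,
                             right_is_reverse: bool) -> bool:
    """True if the observed orientation matches the one expected for the library."""
    observed = pair_orientation(left_is_reverse, right_is_reverse)
    expected = EXPECTED_ORIENTATION[library]
    if expected == "FF":
        # "Same orientation" covers both-forward and both-reverse pairs.
        return observed in ("FF", "RR")
    return observed == expected
```

A pair whose orientation does not match the expectation for its library is a candidate for further analysis rather than proof of a rearrangement, since mate-pair libraries may also contain paired-end contaminants.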

Approaches to structural variants detection

In this section the main SV detection approaches are briefly discussed to give a general idea of existing methods. More in-depth details are provided in later chapters: the most widely used tools implementing these approaches are presented in Chapter 8 and compared in Chapter 9.
Most of the current SV detection methods can be classified into three categories: methods based on paired-end mapping (Pem) signatures, depth of coverage (Doc), and split-read mappings [MSB09]. Each approach has its own limits in terms of the types and sizes of SVs that it is able to detect.

Pem based algorithms

Pem-based algorithms may be based either on read clustering or on fragment length distribution.
Clustering-based methods identify discordant Pems, i.e. Pems with unexpected orientation or insert size, cluster them and apply statistical tests to validate candidate clusters [HAES09, KAM+09, HHD+10, ZBJL+10].
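The clustering idea can be sketched as follows. This is a simplified illustration, not the algorithm of any cited tool: the `ReadPair` record, the three-standard-deviation cut-off and the single-linkage grouping by `max_gap` are all assumptions, and the statistical validation of clusters is omitted.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom: str
    left_pos: int      # leftmost mapped position of the pair
    insert_size: int   # distance spanned by the pair
    orientation: str   # e.g. "FR" for reads pointing towards each other


def is_discordant(pair, mean, sd, expected_orientation="FR", n_sd=3.0):
    """A pair is discordant if its orientation or insert size is unexpected."""
    if pair.orientation != expected_orientation:
        return True
    return abs(pair.insert_size - mean) > n_sd * sd


def cluster_discordant(pairs, max_gap):
    """Group discordant pairs that share chromosome and orientation and whose
    left positions lie within max_gap of the previous pair in the group."""
    clusters = []
    ordered = sorted(pairs, key=lambda p: (p.chrom, p.orientation, p.left_pos))
    for pair in ordered:
        last = clusters[-1][-1] if clusters else None
        if (last is not None
                and last.chrom == pair.chrom
                and last.orientation == pair.orientation
                and pair.left_pos - last.left_pos <= max_gap):
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return clusters
```

In practice a cluster would be kept only if it contains enough pairs to pass a statistical test; the minimum size and the test itself depend on coverage and library properties.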
Distribution-based methods compare the observed insert size distribution of all read pairs in a given window with the expected distribution. Windows with a significant proportion of read pairs having unexpected insert sizes are annotated as containing SVs (Lee et al., 2009).
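A window-level test of this kind might look like the sketch below. It is an illustrative stand-in, not the published method: it assumes that a pair counts as "unexpected" when its insert size lies more than `n_sd` standard deviations from the library mean, and it uses a one-sided binomial test with an assumed background rate `p_unexpected` and threshold `alpha`.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def window_is_suspicious(insert_sizes, mean, sd,
                         n_sd=3.0, p_unexpected=0.003, alpha=1e-3):
    """Flag a window whose share of unexpected insert sizes is significantly
    higher than the assumed background rate p_unexpected."""
    n = len(insert_sizes)
    if n == 0:
        return False
    k = sum(1 for size in insert_sizes if abs(size - mean) > n_sd * sd)
    return binomial_tail(n, k, p_unexpected) < alpha
```

For example, with a library mean of 300 and standard deviation of 30, a window of 50 pairs in which 10 have an insert size of 5000 is flagged, whereas a window in which all 50 pairs have an insert size of 300 is not.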
In some cases the same package, e.g. BreakDancer [CWM+09], provides two complementary methods for SV detection: a clustering-based one (BreakDancerMax) and a distribution-based one (BreakDancerMini), to detect large and small SVs respectively.

Doc based algorithms

Doc-based methods detect regions in the genome where genomic material is gained or lost. They rely on an estimate of the expected Doc, normalised for GC-content bias [YXM+09, BZB+11, VBTPKBPCJCGSIJLOD12]. A deviation from the expected Doc suggests a putative gain or loss of genomic material.
Doc-based methods do not provide information about the adjacency of the DNA regions involved in copy number changes. Thus, such methods are not able to indicate the type of SV (e.g., tandem duplication, fragment reinsertion, translocation) causing the genomic loss or gain. Additionally, the resolution of such methods is rather low for low-Doc datasets: a dataset with 30x coverage allows a resolution of approximately 1 kb for rearrangement breakpoints.
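The Doc principle can be illustrated with a short sketch. It assumes that read counts and GC fractions have already been computed for fixed-size windows, normalises each count by the median count of windows with similar GC content, and applies arbitrary ratio thresholds. The function names, the number of GC bins and the thresholds are assumptions made for illustration; the median-based normalisation also presumes that most windows have a normal copy number.

```python
from statistics import median


def gc_normalise(counts, gc_fractions, n_bins=10):
    """Divide each window's read count by the median count of the windows
    falling into the same GC-content bin."""
    bins = [min(int(gc * n_bins), n_bins - 1) for gc in gc_fractions]
    by_bin = {}
    for b, count in zip(bins, counts):
        by_bin.setdefault(b, []).append(count)
    medians = {b: median(values) for b, values in by_bin.items()}
    return [count / medians[b] if medians[b] > 0 else float("nan")
            for b, count in zip(bins, counts)]


def call_gain_loss(ratios, gain=1.25, loss=0.75):
    """Label each window from its normalised ratio (thresholds are illustrative)."""
    calls = []
    for ratio in ratios:
        if ratio != ratio:          # NaN: no usable expectation for this window
            calls.append("unknown")
        elif ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls
```

As noted above, such calls say that material was gained or lost in a window, but not which kind of rearrangement produced the change.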

Split-read based approach

Split-read based methods use partial read alignments for SV detection [WME+11, SHB+14, TEER14]. Although such methods may be efficient for data with high read coverage, they may fail to identify SVs with breakpoints located in repetitive elements of the genome.
Ideally, this approach should be combined with paired-end signatures; this idea was implemented in SVMerge [WKSA10], Prism [JWB12], Meerkat [YLG+13], Smufin [MGB+14] and Delly [RZS+12].
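One common way to spot partial alignments is to look for clipped ends in the CIGAR string of a SAM alignment record. The sketch below illustrates that idea only and is not taken from any of the cited tools; the function names and the `min_clip` threshold are assumptions.

```python
import re

# CIGAR operations as defined in the SAM format: a length followed by one of
# M, I, D, N, S, H, P, = or X.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_lengths(cigar):
    """Return (left_clip, right_clip): the number of soft- or hard-clipped
    bases at the start and at the end of the alignment."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    left = 0
    for length, op in ops:
        if op not in "SH":
            break
        left += length
    right = 0
    for length, op in reversed(ops):
        if op not in "SH":
            break
        right += length
    return left, right


def is_split_read_candidate(cigar, min_clip=20):
    """A read with a long clipped end may span an SV breakpoint."""
    left, right = clipped_lengths(cigar)
    return max(left, right) >= min_clip
```

The clipped part of a candidate read would then be realigned elsewhere in the genome to locate the other side of the breakpoint, which is where repetitive sequence makes the approach unreliable.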

Combination of different approaches

Combining information about discordant Pems with changes in Doc is a promising solution to the SV detection problem. Probabilistic models integrating both the Doc signal and Pem signatures provide higher specificity together with equal or greater sensitivity than tools that use paired-end signatures alone [QZ11, ORA+12, SOP+12, ETB+13, LCQH14, HKNM11].
However, most of these methods do not take into account two important parameters that affect the read count for both normal and abnormal mappings: GC-content and read mappability. Another general drawback of the majority of these methods is their inability to detect all possible types of SV that can be present in cancer data, including co-amplifications, tandem duplications with inversions, linking insertions, etc.
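To make the idea of integrating the Doc signal with Pem signatures concrete, the toy sketch below scores two hypotheses for a candidate deletion. It is not the model of any cited method: it assumes Poisson-distributed counts, equal prior probabilities, a heterozygous deletion that halves the depth and makes half of the spanning pairs discordant, and a hypothetical `noise` rate of spurious discordant pairs. It ignores GC-content and mappability, the two parameters mentioned above.

```python
from math import lgamma, log


def poisson_logpmf(k, lam):
    """Log-probability of observing k events under Poisson(lam), lam > 0."""
    return k * log(lam) - lam - lgamma(k + 1)


def best_hypothesis(depth_count, discordant_count,
                    normal_depth, spanning_pairs, noise=0.5):
    """Compare 'no_sv' and 'het_deletion' using both signals jointly.

    depth_count      observed number of fragments in the region
    discordant_count observed number of discordant pairs at the junction
    normal_depth     expected fragment count at normal copy number
    spanning_pairs   expected number of pairs spanning the junction
    noise            assumed expected number of spurious discordant pairs
    """
    # Each hypothesis predicts an expected depth and an expected number of
    # discordant pairs.
    hypotheses = {
        "no_sv": (normal_depth, noise),
        "het_deletion": (0.5 * normal_depth, 0.5 * spanning_pairs + noise),
    }
    scores = {
        name: poisson_logpmf(depth_count, depth)
        + poisson_logpmf(discordant_count, discordant)
        for name, (depth, discordant) in hypotheses.items()
    }
    return max(scores, key=scores.get), scores
```

With a normal expectation of 100 fragments and 20 spanning pairs, an observation of 52 fragments and 9 discordant pairs favours the deletion hypothesis, while 98 fragments and no discordant pairs favours the absence of an SV.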

Table of contents :

0.1 Introduction
1 State of the art
1.1 Cancer development
1.1.1 Proto-oncogenes
1.1.2 Tumor suppressor genes
1.1.3 DNA repair genes
1.2 Examples of structural variants that can result in cancer development
1.3 Sequencing technologies
1.3.1 Paired-end data
1.3.2 Mate-pair data
1.4 Approaches to structural variants detection
1.4.1 Pem based algorithms
1.4.2 Doc based algorithms
1.4.3 Split-read based approach
1.4.4 Combination of different approaches
2 Data management and alignment issues
2.1 Mappability of fragments
2.2 Alignment issues
2.2.1 Sequencing data
2.2.2 Reads alignment
2.2.3 Probabilistic approach to mapping position selection
2.2.4 Paired reads alignment
2.3 Alignment algorithms
2.3.1 Aligner choice rationale
2.3.2 BWA drawbacks for structure variants detection
3 Structural variant detection based on paired-end mapping signatures
3.1 Annotation of normal and abnormal read pairs
3.1.1 Detection of normal fragments orientation
3.1.2 Definition of normal insert size
3.1.3 Formal definition of normal and abnormal fragments
3.2 Clustering of abnormal fragments
3.2.1 Cluster definition
3.2.2 Primary clustering algorithm
3.2.3 Splitting algorithm
4 Coverage issues
4.1 GC-content
4.2 Ploidy and copy number changes
4.3 Mappability
4.3.1 Repeats in human genome
4.3.2 Reads mapping and DOC
4.3.3 Mappability of a genome position
5 Abnormal and flanking regions
5.1 Definition of flanking regions and abnormal regions
5.1.1 Flanking regions
5.1.2 Abnormal region
5.2 Expected number of fragments in a genomic region
5.2.1 Expected number of fragments for a flanking region
5.2.2 Estimation of the number of fragments starting on a position considering GC-content
5.2.3 Expected number of fragments for the abnormal region
6 Bayesian models
6.1 Model definition
6.2 Bayesian approach
6.2.1 Conditional probability of a model
6.2.2 A priori probability of a model
6.2.3 Factorization of the conditional probability
6.2.4 Set of models to test
6.2.5 Model choice
6.3 Evaluation of the breakpoint position
7 SV types and assembly workflow
7.1 Structural Variant types
7.1.1 Simple structural variations
7.1.2 Complex structural variants
7.2 SV assembly process
8 Overview of competitive methods for SV detection
8.1 Methods description
8.1.1 GasvPro
8.1.2 BreakDancer
8.1.3 Lumpy
8.1.4 Delly
8.2 Configuration used for considered software
8.3 Main features and scopes
9 Results on simulated and real data
9.1 Simulated data
9.1.1 Simulation of normal genome
9.1.2 Simulation of cancer genome
9.1.3 Simulation of sequencing data
9.1.4 Mate pair and pair-ended datasets statistics
9.2 Comparative performances on simulated data
9.2.1 Precision and recall
9.2.2 Detailed results
9.3 Comparative performances on a neuroblastoma mate-pair dataset
9.3.1 Experimentally validated structural variants
9.3.2 Predicted structural variations
9.3.3 SNP6-experiments
9.4 Discussion of the results
9.4.1 Sequencing technology and coverage
9.4.2 Influence of the presence of copy number variation on the SV prediction accuracy
9.4.3 Breakpoint resolution
9.5 Execution time comparison
10 Conclusion and perspectives
10.1 Conclusion
10.2 Perspectives