Optimal computational scheme for large DNA copy number profiles


Some technologies to study genomic rearrangements

There are many different technologies to study these events, such as Comparative Genomic Hybridization (CGH) arrays and Single Nucleotide Polymorphism (SNP) arrays. Historically, the genome-wide study of DNA copy number changes was performed using the CGH technique, which was developed in the early 1990s. In this technique, total genomic DNA is isolated from tumor and normal control cells, labeled with different fluorochromes and hybridized to normal metaphase chromosomes (Kallioniemi et al., 1992). This technique is therefore called chromosomal CGH. Differences in the tumor fluorescence with respect to the control fluorescence along the metaphase chromosomes are then quantified to reflect changes in the DNA copy number of the tumor genome. Subsequently, array CGH, where arrays of genomic sequences replaced the metaphase chromosomes as hybridization reporters, was established (Solinas-Toldo et al., 1997; Pinkel et al., 1998) and solved many of the technical difficulties and problems caused by working with cytogenetic chromosome preparations. The main advantage of array CGH is its ability to perform copy number analyses with a much higher resolution than chromosomal CGH (a resolution below one megabase, compared to several megabases for chromosomal CGH). Array CGH has already been widely used in oncology for many purposes, such as global analysis of copy number aberrations, identification of putative target genes, tumor classification or assessment of the clinical significance of copy number changes (Kallioniemi, 2008). Pinkel and Albertson (2005) give details in their review about the technology and its application in oncology. Here, we will present only the general outline of the protocol (see Figure 3.4):
1. Total genomic DNA is isolated from a tumor sample (i.e. the test DNA) and from a normal sample (i.e. the reference DNA). Genomic DNA is then digested with a restriction enzyme and the obtained DNA fragments are labeled. The tumoral DNA is usually labeled with a red fluorochrome and the normal DNA with a green fluorochrome.
2. Both the tumoral and normal DNA are hybridized on the same chip. For each spot, there is a competitive hybridization between the tumoral DNA target sequences and the normal DNA target sequences.
3. After hybridization, the chip is scanned and the signal intensity is quantified for both the red and green wavelengths. Image files are created in which each pixel is given a red and a green intensity.
4. Image analysis software then reconstructs the signal intensity for each spot.
In the case of SNP arrays, the protocol is quite similar except that there is no normal DNA reference.

DNA copy number profiles of SNP and CGH arrays

In CGH arrays, the DNA copy number is obtained by comparing the test sample with a normal reference sample. This is often done with the ratio of the measured intensity of the test sample and reference. For example, a ratio of 1 means that the usual 2 copies of DNA are present in the test sample (see Figure 3.5). While doing this, one assumes that the DNA copy number of the reference is 2. However, it is not necessarily the case as copy number polymorphisms are common in the healthy population.
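As a small worked illustration of the ratio reading above, the following sketch converts test and reference intensities into a ratio, a log2 ratio and a naive copy number estimate. It assumes, as the text does, that the reference carries 2 copies, and it additionally assumes a pure tumor sample and an intensity strictly proportional to copy number. The function name `naive_copy_number` and the intensity values are hypothetical.

```python
import math


def naive_copy_number(test_intensity, reference_intensity, reference_copies=2):
    """Return (ratio, log2 ratio, estimated copies) for one probe.

    Assumption: intensity is proportional to copy number and the reference
    really carries `reference_copies` copies, which may not hold in practice.
    """
    ratio = test_intensity / reference_intensity
    return ratio, math.log2(ratio), reference_copies * ratio


# Illustrative intensities only, not real data.
for test, ref in [(1000.0, 1000.0), (1500.0, 1000.0), (500.0, 1000.0)]:
    ratio, log_ratio, copies = naive_copy_number(test, ref)
    print(f"ratio={ratio:.2f} log2={log_ratio:+.2f} copies={copies:.1f}")
```

If the reference itself carries a copy number polymorphism at a locus, the same ratio would map to a different number of copies, which is the caveat raised above.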
The main difference between CGH and SNP arrays is that SNP arrays usually do not use a reference (see Figure 3.6). Moreover, the DNA copy number is not measured directly but rather computed as the sum of the intensities of both alleles. In this way, one observes differences between regions of the genome, but it is not necessarily easy to determine the intensity that corresponds to 2 DNA copies. This is all the more true for triple negative breast cancers (TNBC), as they harbor many rearrangements and one cannot assume that the mean or median intensity of all probes corresponds to the intensity of 2 DNA copies.
Studying SNP arrays also gives information about loss of heterozygosity (LOH). This is valuable information to recover the reference intensity of 2 DNA copies in TNBC (Popova et al., 2009).
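The centering problem described above can be made concrete with a toy example. The sketch below sums the two allele intensities of each probe and then centers the profile on its median. The data are invented (1.0 intensity unit per copy) and the helper names are hypothetical. It only illustrates why the median is a poor proxy for the 2-copy level when most of the genome is altered.

```python
from statistics import median


def total_intensity(allele_a, allele_b):
    """Total signal of one SNP probe: sum of both allele intensities."""
    return allele_a + allele_b


def median_centered(profile):
    """Divide a profile by its median, treating the median as the 2-copy level."""
    center = median(profile)
    return [value / center for value in profile]


# Toy allele intensities (arbitrary units, 1.0 per copy); not real data.
# Hypothetical tumor: 2 probes at 2 copies, 5 probes at 3 copies.
alleles = [(1.0, 1.0)] * 2 + [(2.0, 1.0)] * 5
profile = [total_intensity(a, b) for a, b in alleles]
centered = median_centered(profile)

# The gained majority sets the median, so the true 2-copy probes end up
# below 1 and would wrongly look "lost".
print(centered)
```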
Figure 3.6: Illustration of chromosomes and the corresponding SNP profiles. Using SNP arrays, a test sample is analyzed without any reference. The level corresponding to 2 DNA copies is usually chosen as the mean or median intensity of the whole profile. (Left) If there are no DNA copy number changes, the profile should be a set of points on the same line. (Right) If there are some changes, points in gained regions appear above the others and points in lost regions appear below the others.
For both CGH and SNP arrays, we expect a limited number of possible values for the measured intensity. If we could measure the copy number almost continuously along the genome, we would expect a constant signal, except for a few abrupt changes corresponding to gains and losses. However, there are measurement errors and noise is observed around the signal, which complicates the analyses.
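The expected shape of such a profile, a piecewise constant signal blurred by noise, can be simulated in a few lines. This is only a sketch under simple assumptions: Gaussian noise with a constant standard deviation, and arbitrary segment levels and lengths chosen for illustration. The function `simulate_profile` is hypothetical.

```python
import random


def simulate_profile(segments, noise_sd, seed=0):
    """Piecewise constant signal plus Gaussian noise.

    `segments` is a list of (number_of_probes, level) pairs.
    """
    rng = random.Random(seed)
    signal = [level for n_probes, level in segments for _ in range(n_probes)]
    observed = [s + rng.gauss(0.0, noise_sd) for s in signal]
    return signal, observed


# Illustrative levels: a normal region, a gained region and a lost region.
signal, observed = simulate_profile(
    [(50, 0.0), (20, 0.58), (30, -1.0)], noise_sd=0.2
)

# Positions where the underlying (noise-free) signal changes abruptly.
breakpoints = [i for i in range(1, len(signal)) if signal[i] != signal[i - 1]]
print(breakpoints)
```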

An overview of CGH data analysis

Many methods have been developed specifically to analyze CGH arrays (see the review by van de Wiel et al. (2010)). They can be divided into two categories: pre-processing and downstream methods.
Pre-processing methods are a critical step because their results affect all subsequent analyses and their biological interpretation. Pre-processing usually consists of the following steps:
Control: Assess the quality of the experiment via a number of checkpoints.
Normalization: Remove artifacts that hamper our ability to extract the biological signal.
Segmentation: Divide the genome into regions sharing the same DNA copy number.
Calling: Recover the DNA copy number (0, 1, 2, 3…) or at least distinguish between normal, gained and lost regions.
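To make the calling step listed above more tangible, here is a deliberately naive sketch that labels each segment from its mean log2 ratio using fixed thresholds. The thresholds (±0.25) and the function name `call_segment` are hypothetical choices for illustration. This is not the procedure used by any of the methods cited in this chapter.

```python
def call_segment(mean_log2_ratio, gain_threshold=0.25, loss_threshold=-0.25):
    """Label one segment from its mean log2 ratio (thresholds are hypothetical)."""
    if mean_log2_ratio > gain_threshold:
        return "gain"
    if mean_log2_ratio < loss_threshold:
        return "loss"
    return "normal"


# Segment means as they might come out of a segmentation step (toy values).
segment_means = [0.02, 0.55, -0.9, 0.1]
calls = [call_segment(m) for m in segment_means]
print(calls)
```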
Once these different steps have been performed, many different types of downstream analyses can be performed depending on the biological or clinical questions. Many specific methodologies have been proposed to identify:
recurrently aberrant regions (across tumors).
new subgroups of cancer (unsupervised classification).
markers associated with prognosis, diagnosis or other clinical variables of interest (supervised classification or regression).
In the following chapters, I will highlight some of my contributions to the normalization (Rigaill et al., 2008) and segmentation (Rigaill et al., 2010c; Rigaill, 2010b) of these DNA copy number profiles.

Specificities of tumor DNA copy number profile normalization

For gene expression profiling normalization, it is usually assumed that the majority of genes are not differentially expressed and that the proportions of down-regulated and up-regulated genes are similar. This hypothesis is questionable, but normalization methods relying on this assumption were shown to be quite efficient (Do and Choi, 2006). However, for DNA copy number profiling of tumors, this assumption clearly does not hold. Indeed, some tumor samples, especially TNBC, are genomically unstable and harbor many genomic rearrangements. For these tumors, there is no reason to think that the number of gains equals the number of losses. Moreover, it has been empirically shown that not taking into account DNA copy number alterations in CGH arrays of tumor samples causes problems for conventional normalization methods (Staaf et al., 2007). More specifically, it leads to over-fitting and a decreased signal to noise ratio. We confirmed this result for Affymetrix 50K and 250K SNP arrays (data not shown).
Another specificity of tumor DNA copy number profile normalization is the possibility to assess (without knowing the true DNA copy number) the signal to noise ratio of a given normalization procedure (Neuvial et al., 2006). The idea is that, after using a given normalization procedure, it is possible to identify gained, lost and normal regions of the genome. This is the "calling" step. It is then possible to compute:
the "signal" as the difference between the mean gain intensity and the mean normal intensity;
the "noise" as the residual error of the signal.
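The two quantities defined above can be sketched as follows. The text does not spell out how the residual error is computed, so this example assumes it is the pooled standard deviation of the probe values around the mean of their own class (gain or normal). The function name `signal_to_noise` and the toy values are hypothetical.

```python
import math
from statistics import mean


def signal_to_noise(values, labels):
    """Signal = mean(gain) - mean(normal); noise = pooled residual SD.

    Assumption: the "residual error" is the standard deviation of probe
    values around their own class mean, pooled over both classes.
    """
    gain = [v for v, label in zip(values, labels) if label == "gain"]
    normal = [v for v, label in zip(values, labels) if label == "normal"]
    mean_gain, mean_normal = mean(gain), mean(normal)
    residuals = [v - mean_gain for v in gain] + [v - mean_normal for v in normal]
    # Two class means were estimated, hence n - 2 degrees of freedom.
    noise = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))
    signal = mean_gain - mean_normal
    return signal, noise, signal / noise


# Toy normalized intensities with their calls (not real data).
values = [0.1, -0.1, 0.0, 0.6, 0.4, 0.5]
labels = ["normal", "normal", "normal", "gain", "gain", "gain"]
signal, noise, snr = signal_to_noise(values, labels)
print(signal, noise, snr)
```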
When comparing two different normalization methods, it is important to compute their "signal" and "noise" with the same definition of gained and normal regions. Otherwise, a method detecting more gained regions would be penalized: some of these extra gained regions would necessarily correspond to small differences between normality and gain, resulting in a smaller signal to noise ratio. Therefore, only consensus gained and normal regions should be used (see Figure 4.1). Overall, this is certainly not an unbiased estimate of the signal to noise ratio, as it heavily relies on the calling step. However, for a given calling procedure, it seems a good way to assess the relative advantages of various normalization procedures.
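The consensus idea can be sketched with per-probe calls from two normalization methods. A probe keeps its label only when both methods agree, and only the agreed gained and normal probes are then used to compare the methods. The function name and the call vectors below are hypothetical.

```python
def consensus_calls(calls_a, calls_b):
    """Keep a probe's call only where both methods agree; otherwise None."""
    return [a if a == b else None for a, b in zip(calls_a, calls_b)]


# Calls obtained after two different normalizations of the same array (toy data).
calls_method_1 = ["normal", "normal", "gain", "gain", "normal"]
calls_method_2 = ["normal", "gain", "gain", "gain", "loss"]

consensus = consensus_calls(calls_method_1, calls_method_2)

# Indices of probes usable for a fair signal to noise comparison.
usable = [i for i, call in enumerate(consensus) if call in ("gain", "normal")]
print(consensus, usable)
```

Both methods are then evaluated on the same `usable` probes, so neither is penalized for calling additional borderline gains.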
In conclusion, when normalizing tumor DNA copy number profiles, it is important to take into account both the non-relevant factors and the DNA copy number alterations. Moreover, without knowing the true signal, it is possible to evaluate and compare the signal to noise ratio of two different normalization methods. Keeping all this in mind, we worked on the normalization of Affymetrix Genechip 50K and 250K SNP arrays.


Normalization of Affymetrix Genechip 50K and 250K SNP arrays

In this section, I will give an overview of the ITALICS normalization method that I proposed to normalize Affymetrix Genechip 50K and 250K SNP arrays (Rigaill et al., 2008; the article is provided in Section 4.4). Besides normalization, ITALICS performs the analysis of the DNA copy number profiles using the GLAD methodology (Hupe et al., 2004). GLAD performs both the segmentation and calling steps. The ITALICS method is available as an R package in Bioconductor.
As with any microarray, Affymetrix Genechip 50K and 250K SNP arrays are influenced by non-relevant factors such as the probe GC content, spatial artifacts and others (see Figure 4.2 for an overview of the experimental protocol). To take into account both the non-relevant factors and the DNA copy number, as suggested by Staaf et al. (2007), ITALICS alternates iteratively between segmenting the DNA copy number profile and estimating the influence of the non-relevant factors (see Figure 4.3). Having a rough first estimation of the DNA copy number profile, it is possible to correct for non-relevant factors using a multiple linear regression. Starting from this corrected profile, we then repeat the segmentation and correction steps to improve their quality (see subsection 2.2 and Table 1 on pages 2-3 of the paper for a more detailed description). We have empirically shown that two iterations are required to achieve a good signal to noise ratio (see subsection 3.1 and Figure 2 on page 4 of the ITALICS paper).
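The alternation described above can be illustrated with a toy sketch. It is not the ITALICS implementation. It assumes a single non-relevant covariate (a simulated GC content) instead of a multiple regression, and it replaces the GLAD segmentation with block means between breakpoints that are assumed known. All function names and simulated values are hypothetical.

```python
import random
from statistics import mean


def segment_means(values, breakpoints):
    """Stand-in for segmentation: mean of each block between known breakpoints."""
    edges = [0] + list(breakpoints) + [len(values)]
    fitted = []
    for start, end in zip(edges, edges[1:]):
        fitted += [mean(values[start:end])] * (end - start)
    return fitted


def regression_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def iterative_normalize(observed, covariate, breakpoints, n_iter=2):
    """Alternate between a copy number estimate and a covariate correction."""
    corrected = list(observed)
    cov_mean = mean(covariate)
    slope = 0.0
    for _ in range(n_iter):
        # Segmentation step: rough copy number estimate from the current profile.
        fitted = segment_means(corrected, breakpoints)
        # Correction step: regress what the copy number does not explain.
        residual = [o - f for o, f in zip(observed, fitted)]
        slope = regression_slope(covariate, residual)
        corrected = [o - slope * (c - cov_mean) for o, c in zip(observed, covariate)]
    return corrected, slope


# Simulated data: one gain after probe 100, a GC effect of slope 2.0, small noise.
rng = random.Random(1)
truth = [0.0] * 100 + [0.5] * 100
gc = [rng.uniform(0.3, 0.7) for _ in truth]
gc_mean = mean(gc)
observed = [t + 2.0 * (g - gc_mean) + rng.gauss(0.0, 0.05) for t, g in zip(truth, gc)]

corrected, slope = iterative_normalize(observed, gc, breakpoints=[100])
print(round(slope, 2))
```

In this toy setting the estimated slope should be close to the simulated value of 2.0, and the corrected profile should be closer to the true copy number signal than the raw one. In a real analysis the breakpoints are unknown and must be re-estimated at each iteration, which is the role played by the segmentation method.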

Table of contents:

I Introduction
1 Overview
1.1 Introduction
1.2 Methods for the analysis of DNA copy number profiles
1.3 Biostatistical analysis of the transcriptomic Curie-Servier dataset
1.4 Conclusion
2 A small introduction to Triple Negative Breast Cancers
2.1 Breast cancers
2.2 Triple Negative and Basal-like breast cancers
2.3 Breast tumors of the Curie-Servier cohort
II Genomic Analysis
3 Chromosome aberrations
3.1 Some technologies to study genomic rearrangements
3.2 DNA copy number profiles of SNP and CGH arrays
3.3 An overview of CGH data analysis
4 Normalization of DNA copy number profiles
4.1 Short overview of microarray normalization
4.2 Specificities of tumor DNA copy number profile normalization
4.3 Normalization of Affymetrix Genechip 50K and 250K SNP arrays
4.4 Paper: ITALICS
5 Segmentation of DNA copy number profiles
5.1 A piecewise constant model for the analysis of DNA copy number profiles
5.2 The CGHseg methodology
5.3 Assessing the quality of a given segmentation
5.4 Paper: Exploration of the segmentation space
5.5 Optimal computational scheme for large DNA copy number profiles
5.6 Paper: Pruned dynamic programming for segmentation
6 Analysis of the Curie-Servier Genomic dataset
6.1 Genomic alterations in breast cancers and in TNBC
6.2 Analysis of the genomic Curie-Servier dataset
III Transcriptomic Analysis
7 Introduction
8 Experimental Design
8.1 A small introduction to experimental design
8.2 Design of the transcriptomic experiment
9 Pre-processing
9.1 Probe annotation
9.2 Normalization
10 Exploratory Analysis
10.1 Validation of the pre-processing step
10.2 A robust classification of breast tumors, but no intrinsic gene list?
11 Comparison of TNBC with other tumor types
11.1 Gene by gene differential analysis
11.1.1 Statistical testing
11.1.2 Other filters
11.1.3 Paper: Frequent PTEN genomic alterations
11.1.4 Paper: Formins regulate tumor cell invasion
11.2 Pathway by pathway differential analysis
11.2.1 Paper: Reactive oxygen species (ROS) control myofibroblast and metastases
11.2.2 An overview of the Wnt pathway in breast cancers
11.2.3 Transcriptomic statistical analysis of the Wnt pathway
IV Conclusion
A A few more papers
A.1 DNA Breakpoints to Define True Recurrences Among Ipsilateral Breast Cancers
A.2 Genome Alteration Print (GAP)