ESTIMATE PROPER SIGNAL PERCENTAGE CUTOFF TO CALL CORE REGIONS OF INITIATION ZONES


EdU-seq-HU

The incorporation of halogenated nucleotides is a conventional way to monitor regions undergoing replication during S phase. BrdU, which is frequently used in replication studies, can be detected by an anti-BrdU antibody only after the DNA becomes single-stranded due to resection (Mukherjee et al., 2015). EdU (5-ethynyl-2'-deoxyuridine) is another thymidine analog with some technical advantages over BrdU: EdU is conjugated to a fluorescent azide by a Cu(I)-catalyzed reaction and can be detected in double-stranded DNA (Hua and Kearsey, 2011). Unfortunately, EdU is toxic to cells and activates the rad3-dependent checkpoint, which likely blocks mitosis. Toxic effects of EdU on mammalian cells have also been reported, suggesting that EdU may not be suitable for continuous labeling studies (Hua and Kearsey, 2011). Consequently, confirming the positions marked by EdU over more than one cell cycle often requires repeated experiments (Diermeier-Daucher et al., 2009; Hua and Kearsey, 2011). In addition, HU (hydroxyurea) can arrest fork progression after origin firing. Treating cells synchronized at S-phase entry with HU therefore enriches EdU signals around replication origins. Even with the limited DNA synthesis in hydroxyurea-treated cells, EdU incorporation can be easily detected by fluorescence microscopy. The EdU-seq-HU protocol was thus developed to locate early-replicating origins (Macheret and Halazonetis, 2019).
The problem is that the arrest leaves the cell cycle incomplete, so the method can only detect origins that fire at the beginning of S phase and cannot identify origins that fire at other times, such as mid or late S phase.

The current single-molecule technologies used for origin identification

All the methods discussed so far detect origins or potential origins from population-based data, and there is little agreement among the various genome-wide studies. Regardless of mechanism, the major debate is whether replication origins are located at specific sites or occur stochastically within broad initiation zones. Most methods have their own technical or biological problems, so different population-based methods may identify different “types” of origins. Whatever the method, the main reason for the conflicting results is the cell-to-cell heterogeneity in the choice of replication initiation sites. At present, the best way to address this problem is to detect replication origins at the single-molecule level. Below I introduce several commonly used single-molecule detection methods.

A novel method: ORM (optical replication mapping)

A variety of methods for detecting replication origins have been listed above, and each has biological or technical shortcomings. Among the population-based approaches, SNS-seq may accumulate short nascent strands close to G4 regions because of fork stalling, which produces false-positive origins. EdU-seq-HU is toxic to cells, which makes continuous labeling difficult and may affect the physiological processes of the cells. The biggest current problem for EdU-seq-HU, however, is that it can only detect origins that fire in early S phase after synchronization by HU. Nuclear extraction in ini-seq raises a similar concern about perturbing the replication process: it is unknown whether the in vitro system built for ini-seq fully reproduces DNA replication in vivo. Because ORC has functions beyond replication, ORC-ChIP-seq contains false-positive results that may be related to transcription rather than replication. OK-seq may not have a biological bias, but its major limitation is that, although an upward transition in the RFD profile identifies an initiation zone, not every origin or initiation zone generates an upward transition. For example, in late-replicating regions initiation is more or less random, so the RFD within these regions is close to 0 (replication by leftward and rightward forks is equally probable). Moreover, the results of these methods show only very limited consistency with one another, owing to the low firing efficiency of replication origins and the heterogeneity of origin selection. Highly sensitive single-molecule methods are clearly the most promising way to solve these problems. However, DNA combing does not have enough throughput to support genome-wide detection, and nanopore sequencing cannot be applied to the human genome because of its high cost.
In summary, the main requirements for detecting replication origins today are high throughput, single-molecule resolution, and ultra-long DNA molecules for precise alignment, while also taking cost and coverage into account. A new optical mapping method has therefore emerged that combines all of these advantages.
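The OK-seq limitation described above can be illustrated with a minimal sketch. It assumes the commonly used definition RFD = (R - L) / (R + L), where R and L are the numbers of rightward- and leftward-moving forks observed in a genomic bin; the function name `rfd` and all counts below are hypothetical illustration values, not data from this study.

```python
def rfd(rightward, leftward):
    """Replication fork directionality per bin, assumed as (R - L) / (R + L)."""
    values = []
    for r, l in zip(rightward, leftward):
        total = r + l
        # Bins without any fork counts are reported as 0.0 to avoid dividing by zero.
        values.append((r - l) / total if total else 0.0)
    return values


# Hypothetical fork counts per genomic bin.
# Bins 0-3: an efficient initiation zone (leftward forks on its left side,
#           rightward forks on its right side), giving an upward transition.
# Bins 4-7: a late-replicating region where initiation is roughly random.
rightward = [10, 30, 70, 90, 52, 49, 50, 51]
leftward = [90, 70, 30, 10, 48, 51, 50, 49]

profile = rfd(rightward, leftward)
early_rise = profile[3] - profile[0]             # large positive jump
late_spread = max(abs(v) for v in profile[4:])   # stays close to 0
```

In this toy profile the first four bins rise from -0.8 to 0.8, whereas the last four bins never move far from 0, so a detector that looks only for upward transitions would miss any initiation taking place there.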

Bionano high-throughput DNA fiber mapping

Figure 1.16 Schematic of the Bionano principle (from the introduction material of the Bionano Genomics company). Blue lines are DNA fibers, and the green dots on them are green fluorophores marking a specific motif sequence (Nt.BspQI sites or DLE1 sites). The top photo is the raw picture, and the bottom shows the alignment.
Bionano was originally a technology for de novo genome assembly. Conserved motifs that occur repeatedly in the genome are tagged with fluorophores. Because the reference sequence is known, the positions of these motifs, and hence of the fluorophores, on the genome are also known. When DNA fibers are labeled with the same fluorophore, they can be mapped to the reference according to the relative positions of the fluorophores along each fiber (Fig. 1.16).
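The idea of placing a fiber on the reference from the relative positions of its labels can be sketched as follows. This is a deliberately simplified illustration, not the Bionano aligner: it assumes that every reference label within the fiber is observed, with no missing or extra labels, and it only compares gaps between consecutive labels within a relative tolerance. The function names and all coordinates are hypothetical.

```python
def intervals(positions):
    """Distances between consecutive label positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def map_fiber(ref_labels, fiber_labels, tolerance=0.1):
    """Return the reference coordinate matched to the fiber's first label.

    ref_labels: strictly increasing motif positions on the reference (bp).
    fiber_labels: label positions measured along the fiber (bp from fiber start).
    Returns None when no window of reference gaps fits within the tolerance.
    """
    ref_gaps = intervals(ref_labels)
    fib_gaps = intervals(fiber_labels)
    n = len(fib_gaps)
    if n == 0:
        return None  # a fiber needs at least two labels to define a gap
    best = None
    for start in range(len(ref_gaps) - n + 1):
        window = ref_gaps[start:start + n]
        # Worst relative mismatch between measured and expected gaps.
        err = max(abs(f - r) / r for f, r in zip(fib_gaps, window))
        if err <= tolerance and (best is None or err < best[0]):
            best = (err, start)
    return None if best is None else ref_labels[best[1]]


# Hypothetical reference motif sites and one measured fiber.
ref_labels = [1000, 9000, 14000, 30000, 33000, 52000, 60000]
fiber_labels = [2000, 18100, 21000, 40200]

anchor = map_fiber(ref_labels, fiber_labels)
fiber_start = anchor - fiber_labels[0]  # genomic coordinate of the fiber start
```

Here the measured gaps (16100, 2900, 19200) match the reference gaps (16000, 3000, 19000) only at one place, so the fiber's first label is anchored at reference position 14000. Once a fiber is anchored in this way, any other signal on it can be converted to a genomic coordinate by the same offset.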
The Bionano platform uses electrophoresis to control the movement of DNA through the flowcell. An upstream gradient of micro- and nano-structures gently unwinds the DNA and guides it into the NanoChannels. Only stretched, linear DNA fibers can flow through the NanoChannels, and a high-resolution camera images the DNA molecules once they enter. In addition to the channel for YOYO-1-labeled DNA molecules (shown in blue), the Bionano platform is equipped with two additional channels for detecting two further colors (green and red). The green fluorophore is used for mapping fibers to the reference. Recently, the red channel has been applied to origin detection (De Carli et al., 2018). In a similar manner, in our optical method we used red dUTP signals to label ongoing replication regions or replication origins in synchronized cells.
Figure 1.17 Physical image of ORM methods. The red regions are clusters of sparse dUTP signals that lie so close together that they merge and look like a continuous red area, but each red signal is assigned its own genomic position during data processing. Blue lines and green dots are DNA fibers and the green mapping fluorophores (NLRS sites or DLE1 sites).
The average length of DNA fibers analyzed by Bionano can reach around 300 kb, and coverage can reach 300x of the human genome in a single imaging run on the latest Bionano system. The method therefore not only meets the requirement for high-resolution, ultra-long fibers, as nanopore sequencing does, but also achieves sufficient coverage at the single-molecule level. Like all single-molecule methods, it can detect initiation events with low firing efficiency, which is impossible for bulk methods. It therefore provides more comprehensive origin information for studying the DNA replication process.
Furthermore, for any given position, we can calculate the ORM signal density as the firing efficiency, which describes the probability that initiation occurs at that position. The ability to calculate firing efficiency is a major advantage of Bionano over other approaches for detecting replication origins, because a great deal of information can be mined and analyzed from it. First, by comparing firing efficiency across replicates, we can see whether there are independent, fixed origin sites that show high firing efficiency in all replicates. If not, how do initiation events occur: do they follow a stochastic model or a domino-like model? Next, where are the regions of high firing efficiency located, and how is firing efficiency related to replication timing? How does ORM compare with other methods? What genetic functional annotations and epigenetic modifications characterize initiation zones? These questions are addressed one by one in this Ph.D. study.
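As a rough illustration of signal density used as firing efficiency, the sketch below computes, for each genomic bin, the fraction of fibers covering that bin that carry at least one red signal in it. This is only one plausible reading of the quantity described above, not the normalization used later in this study; the bin size, the rule that partial overlap counts as coverage, and the example fibers are all assumptions made for illustration.

```python
def signal_density(fibers, bin_size, n_bins):
    """Per-bin fraction of covering fibers that carry a signal.

    fibers: list of (start, end, signals), where start and end are genomic
            coordinates of a mapped fiber (end exclusive) and signals is a
            list of genomic positions of red labels on that fiber.
    """
    covered = [0] * n_bins
    labeled = [0] * n_bins
    for start, end, signals in fibers:
        first = max(start // bin_size, 0)
        last = min((end - 1) // bin_size, n_bins - 1)
        # Assumption: a fiber covers every bin it overlaps, even partially.
        for b in range(first, last + 1):
            covered[b] += 1
        # Count each fiber at most once per bin, however many signals it has there.
        for b in {s // bin_size for s in signals if start <= s < end}:
            if 0 <= b < n_bins:
                labeled[b] += 1
    return [l / c if c else 0.0 for l, c in zip(labeled, covered)]


# Hypothetical mapped fibers on a 40 kb region split into 10 kb bins.
fibers = [
    (0, 40000, [12000, 15000]),
    (5000, 35000, [11000]),
    (10000, 30000, []),
    (0, 20000, [3000]),
]

density = signal_density(fibers, bin_size=10000, n_bins=4)
```

In this toy example the second bin has the highest density because two of the four fibers covering it are labeled there. Comparing such per-bin values between replicates is the kind of analysis that population-averaged methods cannot provide directly.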

Table of contents :

INTRODUCTION
1.1 DNA REPLICATION MECHANISM AND THE CORRESPONDING KNOWLEDGE
1.1.1 CELL CYCLE
1.1.2 REPLICATION ORIGINS
1.1.3 REPLICATION UNIT
1.1.4 THE COMPLETE BIOLOGICAL REPLICATION INITIATION PROCESS
1.1.5 REPLICATION TIMING
1.2 REPLICATION REGULATION IN TIMING AND ORIGIN LOCATION
1.2.1 THE GENETIC AND EPIGENETIC MODIFICATIONS AROUND ORIGINS
1.2.2 STOCHASTIC MODEL OF INITIATION-TIMING REGULATION
1.3 THE CURRENT TECHNOLOGIES USED FOR ORIGIN IDENTIFICATION BY BULK DATA
1.3.1 SNS-SEQ
1.3.2 BUBBLE TRACK
1.3.3 MCM / ORC CHIP-SEQ
1.3.4 EDU-SEQ-HU
1.3.5 INI-SEQ
1.3.6 OK-SEQ
1.4 THE CURRENT SINGLE-MOLECULE TECHNOLOGIES USED FOR ORIGIN IDENTIFICATION
1.4.1 DNA COMBING
1.4.2 NANOPORE SEQUENCING
1.5 A NOVEL METHOD: ORM (OPTICAL REPLICATION MAPPING)
1.5.1 BIONANO HIGH-THROUGHPUT DNA FIBER MAPPING
MATERIAL AND METHODS, AND BASIC ORM SIGNAL ANALYSES
2.1 CELL LINES
2.1.1 CELL SYNCHRONIZATION
2.1.2 CELL LABELING
2.2 OPTICAL REPLICATION MAPPING
2.3 DATA FORMAT OF BIONANO
2.3.1 BNX
2.3.2 RCMAP AND QCMAP
2.3.3 XMP
2.4 THE CALCULATION OF GENOMIC POSITIONS FOR THE RED SIGNALS
2.5 DATA INTEGRATION BY JAR PACKAGES AND OUTPUT FORMAT
2.5.1 ALLRAWDATAREFINING.JAR AND ITS OUTPUT FORMAT
2.5.2 GENERATEGTF_BYALLDATAREFINING_REFORMAT.JAR AND ITS OUTPUT FORMAT
2.6 HOT SPOTS FILTERING
2.6.1 HOT SPOTS
2.7 SEGMENTATION FOR ORM LABELING SIGNALS
2.8 THE RELIABILITY TEST FOR ORM SEGMENTATION
2.8.1 TRACK THE TRAJECTORY OF SEPARATED REPLICATION FORKS
2.8.2 THE UNEXPECTED LENGTH DISTRIBUTION IN ALL DATASETS
2.8.3 TWO HYPOTHESES FOR EXPLAINING THE UNEXPECTED LENGTH DISTRIBUTION
2.8.4 VERIFICATION OF POTENTIAL MODEL
2.8.5 REGAINING THE NEGLECTED SIGNALS
2.8.6 THE EXPLANATION FOR SPARSE LABELING
REPLICATION INITIAL ZONE CALLING
3.1 CALCULATION OF NORMALIZED ORM SIGNAL DENSITY
3.2 NORMALIZED SIGNAL DENSITY SMOOTHING
3.3 PEAK AREA RECOGNITION
3.4 CORE REGION REFINING
3.4.1 THE AGGREGATED DENSITY PERCENTAGE
3.4.2 ESTIMATE PROPER SIGNAL PERCENTAGE CUTOFF TO CALL CORE REGIONS OF INITIATION ZONES
3.5 FILTERING AND INITIAL ZONE CALLING
3.5.1 OVERLAPPED REPLICATES NUMBER FILTERING
3.5.2 THE OTHER STANDARD TO ESTIMATE THE QUALITY OF CORE REGION
3.5.3 K-MEANS CLUSTERING FOR IZ LENGTH ADJUSTMENT
FORK DIRECTIONALITY ANALYSIS
4.1 FDI: FORK DIRECTION INDEX
4.2 THE TRIALS FOR IDENTIFICATION OF FORK DIRECTION OF INDIVIDUAL TRACKS
4.2.1 THE MACHINE LEARNING CLASSIFIER
4.2.2 FAILED ATTEMPT TO INTRODUCE THE SECOND LABELING SIGNAL
4.3 GENOME-WIDE REPLICATION KINETICS IN ASYNCHRONOUS CELLS
DEEPER DERIVATIVE DATA MINING FOR ORM IZS
5.1 STOCHASTIC MODEL
5.1.1 EARLY INITIATION EVENTS IN LATE-REPLICATING DOMAINS
5.1.2 LATE-REPLICATING SIGNALS ARE NOT NOISE DATA
5.1.3 FIRING EFFICIENCY IS CORRELATED WITH REPLICATION TIMING
5.1.4 NO SPECIFIC INITIATION SITES
5.1.5 COMPUTATIONAL SIMULATION CONFIRMS THE STOCHASTIC MODEL
5.2 COMPARISON BETWEEN REPLICATION ORIGINS MAPPED BY DIFFERENT
5.2.1 MUTUAL AUTHENTICATION
5.2.2 DIFFERENT FIRE EFFICIENCY AND REPLICATION TIMING COMPARISON
5.3 THE EPIGENETIC MODIFICATION MARKS AROUND INITIATION ZONES
5.3.1 THE EPIGENETIC MODIFICATION MARKS ENRICHED AT ORM INITIAL ZONES
CONCLUSION AND PERSPECTIVES
6.1 MAIN CONCLUSION
6.1.1 ORM – A FUTURE TREND IN INITIATION DETECTION: SINGLE-MOLECULE, CHEAP AND HIGH-THROUGHPUT
6.1.2 DIRECT FIRE EFFICIENCY DETECTION REVEALS THAT INITIATIONS ARE NOT CLUSTERED
6.1.3 ORM DATA SUPPORT A STOCHASTIC MODEL IN REPLICATION TIMING REGULATION
REFERENCE