Long read assembly approaches using methods inspired by de Bruijn graphs

Get Complete Project Material File(s) Now! »

Overlaps and their impact on assembly

As we de ned in the introduction, when two sequences share a common substring, we say that they overlap or that one of them maps on the other see 2.1a. It is possible for sequences to share a common substring just by chance (because of the 4-letter alphabet) but the probability of this event decreases when the length of the common substring increases. Intuitively, this probability gets smaller as common substring gets longer. If the reads does not contain sequencing errors, the only criteria to evaluate whether a common substring is « true » or not could be the length of this substring. However, the number of errors in the sequence of readings breaks this paradigm and forces us to integrate the errors when assessing the quality of an overlap. Figure 2.1b show an overlap with two mismatch. There are two base pairs that do not match between the two sequences – knowing if this overlap is « true » or not isn’t obvious. In this document we distinguish two tasks in similarity search between two sequences: mapping: one tries to nd the position of a read in a larger sequence (e.g. comparing di er-ent datasets, experiments from di erent sequencing generations, or between the reads and an assembly, against a reference genome) overlapping: one tries to nd which reads share a common substring with other reads (e.g. nding common substrings in the same dataset) .Even if mapping and overlapping can be seen as di erent tasks, one could observe that the same tools, and underlying algorithmic, can be used to solve both tasks.
The seed-and-extend strategy. Search of similarity between two or more DNA sequences has many links to plain text search. Seed-and-extend is an approach used by many tools to nd similar sequences between a target (e.g. a reference genome) and a query (e.g. a read). The idea is to nd a high similar subsequence (often exact), namely the seed (or anchor), and then to extend this seed to have a larger common subsequence. Tools that implement this approach usually create an index of the target. This index need to answer to a simple question: is a given subsequence exist in target and at which position. Each query is processed and its substrings are searched in the index. This gives a set of seeds that can be extended through alignment techniques such as dynamic programming. If the alignment score reaches a given threshold, a hit is reported.
Many tools for mapping and overlapping use seed-and-extend strategy. The most popular is Blast [3, 4]. Speci c tools dedicated to sequencing are BWA [52] or blasr [15]. Implementation details of indexes, size and number of anchors change between tools. For example BWA or blasr use a FM-index [26] to perform anchor search.
The seed-only strategy. With NGS technology development more and more data had to be pro-cessed and the seed-and-extend strategy was replaced by a seed-only strategy. Indeed, the extension step is still very time-consuming. With the seed-and-extend strategy, an overlap is scored by its length and the number of errors in the alignment. With the seed-only strategy we don’t have an alignment. The overlap is thus scored using the number of seeds and their positions.
This strategy was used in SGA [98] assembly tools, during overlapping step SGA search exact overlap between low error read, by search a substring at end of read in a FM-index.
Speci city of long-reads (longer reads, high error rate) has relaunched this research eld. Chu et al. produce an interesting review about some of third-generation overlap search in [22], discussed in next section. We can cite Hisea [39], Daligner [73], MHAP [44] and Minimap2 [56, 57] as overlapping tools they use this strategy to found overlap between thrid generation overlap. We will give some details on MHAP and Minimap2 in section 3.

Improve genome assembly e ciency by reducing the quantity of information

Error in third generation reads make it more di cult to found overlaps between reads. Current techniques attempt to optimize results on real data [22]. Actually, a key observation is that within the overlaps found by state-of-the-art tools, not all of them are useful to downstream analysis. For example Miniasm keeps only end-to-end overlaps, and Canu keeps only the two longest end-to-end overlaps for each read (see 3 for more details).
we lter overlapping information with positive (or at least no negative) impact on assembly results ? One may hope to at least decrease the disk space and may be to increase the speed of assembly. In section 2.4 we present fpa (for Filter Pairwise Alignment), our solution to lter out useless overlaps. An overlaper output can be piped directly to fpa. fpa can apply several lters based on length of read, length of overlap, type of overlap, read name. Some simple fpa lters reduce the computation time of assembly without e ect (or a small positive e ect) on assembly.

Read scrubbing: an alternative to read correction

Assembly tools are based on reads. If your reads are bad, your assembly will be bad. To continue on the analogy given in the introduction, you probably cannot reconstruct a book if crazy monks gave you only fragments with half of the letters being erroneous. Correction of reads, with a mix of sequencing technologies or with a single technology, can help to get better reads. But actually, correction tools have an important cost in term of computation time and memory usage. Moreover it’s hard to distinguish mutations (e.g. true SNPs) from sequencing errors, and sometimes interesting mutations are consider as errors and are corrected (thus removed). The pre-processing correction step is particularly important for long-reads data because of high error rates that can lead to more errors and misassemblies. Tools like Mecat [112], CONSENT [68] uses overlap information to pick reads that share same sequences and build a consensus from the alignements induced by overlaps. A similar task, called polishing, is run after assembly, as a post-processing task. Reads are mapped against assembled contigs and contig sequences is corrected using reads, we can cite Racon [104] and CONSENT.
The more a read contains errors, the more the correction step require reads. But the sequencing depth is not homogeneous. Thus the corrector will be more or less e ective depending on regions and the depth of coverage thereof. If the sequencing depth is too low, the correction may discard some reads. To solve this problem it is necessary either to work without correction or to return to raw reads. Correction of reads before assembly can generate some trouble in assembly by remove some important information. At the best of our knowledge the only one reads corrector that tries to keep the heterozygotie during correction is falcon [20]. Heterozygotie is very useful to understand genetic diversity in population or some genetic diseases. Another example concerns genomes that contain approximate repeats. The correction step tends to correct both region in order to make them identical. By the way, correction creates a repetition that cannot be solved by the assembler although regions could be distinguished prior to correction.
Nevertheless, long-reads still contains very low quality region [74] that can lead to fragmented assembly [109]. It is thus necessary to lter out thos regions. An alternative to correction can be scrubbing: one removes only very low quality region and keep all other information. To found and remove this very low quality region and read we created yacrd (for Yet Another Chimeric Read Detector). yacrd uses self overlapping information to compute a coverage curve and identi es regions of low coverage. We hypothesise taht such low coverage regions are of low quality (see section 2.4 for more details on this tool).
Our paper « yacrd and fpa: upstream tools for long-read genome assembly » presents two tools, yacrd, and fpa. yacrd focuses on the detection and elimination of very poor quality regions. fpa focuses on ltering ’useless’ overlaps.

READ INFORMATION ETHICS WITHIN THE CONTEXT OF INFORMATION AND KNOWLEDGE SOCIETIES

Overlap Layout Consensus

An alternative to the Greedy approach is the Overlap Layout Consensus (OLC). We can nd a rst de nition of OLC in [71] in 1995. The most popular assembly pipeline based on OLC is probably Celera [66, 69]. This approach is based on a graph where a read is a node and we build an edge between nodes if reads share an overlap. Figure 3.2, presents the OLC corresponding to the overlap seen in If we reuse analogy we introduce in section 1.2 we can see this graph as an ordering of the chapters of a book provided by a crazy copyist monk. An edge indicates this piece of text was before that piece of text in the original book. As we can see in Figure 3.2 a repetition creates a fork in OLC, a node with two successors. It’s easy to detect this case in the graph and stop this contig construction. The assembly result of this graph is 3 sequences with white nodes, green nodes and red nodes. The assembly is more fragmented than with the Greedy algorithm but does not contain any misassembly. By analyzing the graph, we will be able to detect the paths without branching node and to reconstruct the corresponding sequence by merging the sequences present in the graph. OLC-based tools help to avoid misassemblies but the search for overlaps between reads is still time-expensive. The graph construction consumes a lot of memory, and more cleaning steps and analysis are expensive in computation time compared to a Greedy approach.

Algorithms and heuristics to simplify assembly graphs

The graph structure was useful to get a comprehensive view of all the information provided by reads, but having too much information can create problems, slow down the assembly tools and increase their costs in memory or at worst lead to misassembly or to unnecessary fragmentation of the assembly.

Transitive edge

In Figure 3.2 you can notice the edge from R1 to read R3, this overlap is exact. We can nd an overlap between R1 and R3. But this edge does not provide new information, we know R1 is before R3, this edge is called a transitive edge. We can give a more formal de nition of a transitive edge: in a directed graph, if we have a set of edges (a; b) (b; c) and (a; c), the edge (a; c) is transitive. Myers proposed in [70] another assembly graph, the string graph, which is an overlap graph with no transitive edge. By reducing the number of edges in the graph, the string graph simpli es the traversing of the graph and decreases the memory impact.
With string graphs, we just need to follow a simple path (a path in which each node has only one successor) to build assembly without misassembly.

Table of contents :

1 Introduction
1.1 Sequencing
1.2 The genome assembly task
1.3 Thesis outline
2 Preassembly
2.1 Overlaps and their impact on assembly
2.2 State-of-the-art long reads overlapper-compare
2.2.1 Introduction
2.2.2 Materials & Methods
2.2.3 Results
2.2.4 Conclusion
2.3 Improving assembly by ltering out overlaps and scrubbing
2.3.1 Improve genome assembly eciency by reducing the quantity of information
2.3.2 Read scrubbing: an alternative to read correction
2.4 yacrd and fpa: upstream tools for long-read genome assembly
2.4.1 Abstract
2.4.2 Introduction
2.4.3 Materials & Methods
2.4.4 Result & Discussion
2.5 Chapter conclusion
3 Long reads assembly tools state of the art
3.1 Greedy assembly algorithm
3.2 Overlap Layout Consensus
3.3 Algorithms and heuristics to simplify assembly graphs
3.3.1 Transitive edge
3.3.2 Contained reads
3.3.3 Bubble and tips
3.4 The advantages of long reads
3.5 A Pipeline with correction Canu
3.5.1 Overlapping
3.5.2 Correction
3.5.3 Trimming
3.5.4 Assembly
3.6 Pipeline without correction Miniasm
3.6.1 Minimap2
3.6.2 Miniasm
3.7 Long read assembly approaches using methods inspired by de Bruijn graphs
3.7.1 Flye
3.7.2 wtdbg2
3.8 New long read assembly method
3.8.1 Peregrine
3.8.2 Shasta
3.9 Chapter Conclusion
4 Post Assembly
4.1 Assembly evaluation
4.2 Misassemblies in noisy assemblies
4.2.1 Introduction
4.2.2 Datasets, assembly pipelines, analysis pipelines; versions and parameters
4.2.3 Quast misassemblies denition
4.2.4 Eect of min-identity
4.2.5 Eect of extensive-mis-size on misassemblies count
4.2.6 Conclusion
4.2.7 Take home message
4.2.8 Acknowledgements
4.3 Trouble with heuristic algorithm
4.4 Graph analysis of fragmented long-read bacterial genome assemblies
4.4.1 Abstract
4.4.2 Introduction
4.4.3 Related works
4.4.4 Methods
4.4.5 Results
4.4.6 Discussion
4.5 Conclusion
5 Other contribution
5.1 Labsquare
5.2 CAMI challenge 2
5.2.1 Dataset description
5.2.2 Assembly strategy
5.3 10X linked-read deconvolution
6 Conclusion
Bibliography