Improved compression of DNA sequencing data with Cascading Bloom filters

Get Complete Project Material File(s) Now! »

Dynamic programming based methods

The first appearance of biological sequence alignment problem dates back to the 1960s, when first proteins were sequenced. The original goal for aligning two protein sequences was to discover the evolutionary relationships between them. Assuming that they had a common ancestor, matches correspond to the characters from this common ancestor, while the positions where proteins differ represent mutations. The first algorithm to find an optimal alignment of two sequences was proposed by Needleman and Wunsch in 1970 [74]. It implements the so-called dynamic programming paradigm. In general, the Needleman-Wunsch algorithm works as follows: characters of the sequences are put in the 0-th row and in the 0-th column of the n by m matrix, and every cell (i; j) of the matrix contains the optimal alignment of the prefixes S1 : : : Si and T1 : : : Tj . The values in the grid are computed from left to right and from top to bottom, and the value in a cell is calculated with the use of already computed values. The working time of this method is O(nm).
The Needleman-Wunsch algorithm allows us to compare two sequences in their entirety, but may fail to find local similarities. Smith and Waterman in 1981 [75] adapted this method so that it finds local alignments and also works in O(nm) time. Both Needleman-Wunsch and Smith-Waterman algorithms work reasonably fast only on small sequences, as they need to perform a full comparison of two sequences. They are too costly when two genomes have to be aligned to one another, or when many short reads are to be aligned against a long genome. However, under the Hamming or edit distance, if the number of errors is limited by some k, then Smith-Waterman algorithm works in O((n + m)k) time. Moreover, there are several techniques of Smith-Waterman algorithm parallelization. Some of them utilizes properties of modern CPUs and uses SIMD (single instruction – multiple data) instructions such as SSE2. Another algorithms, such as famous Myers bit-parallel algorithm [76], can speed-up alignment under a simple scoring system.

Methods based on seed-and-extend strategy

The seed-and-extend approach consists of two steps: first, all exact occurrences of some set of seeds are found, and then these occurrences are extended using algorithms like the Smith-Waterman one. A seed is characterized by a template which is a sequence of zeros and ones, where 1 corresponds to the position that should be matched, and 0 indicate the position that may or may not match. The number of ones in the seed template is called its weight. The seed “hits” the reference at some position if two subsequences extracted from the reference at this position and from the read according to the seed template are equal. For example, if seed template is 101, the reference is ACAGT and the read is CTGA, then we can extract subsequence C-G (“-” corresponds to 0 from the seed template) from the read and the same subsequence from the reference, and obtain a “hit”. It is assumed that seeds match the reference without errors. If they are comparatively long and only several matching positions are found in the reference, then the second step (extension) can be performed quickly. If the seeding step is also fast, then the overall performance of the algorithm will be good.
The simplest variant of a seed is a seed containing only 1s, and we extract and compare only contiguous k-mers (strings over of length k). Thus, the general seed-and-extend algorithm works in three steps:
1. extract the set of k-mers from the read (all k-mers, or only “informative” ones, may be extracted).
2. these k-mers are queried against a data structure (usually, hash table) that stores all occurrences of all k-mers from the reference genome.
3. perform the extension step for the regions of the reference which contain one or several k-mers from the read.
The first algorithms that followed the seed-and-extend approach were FASTA [77] and BLAST [78]. Interestingly, being one of the first tools implementing seed-and-extend strategy, FASTA package gave rise to the well-known format of the same name. FASTA was followed by BLAST, which became one of the leading tools for sequence comparison. It uses k equal to 11 for DNA data by default, which permits to reduce the search space significantly, while being small enough to fit into memory. The basic BLAST algorithm was improved to be able to find alignments of different types, for example, alignments with gaps. Besides the alignment itself, BLAST provides a very useful feature: it can estimate the probability that each mapping position was found by chance. It was one of the most popular aligners for more than twenty years and has tens of thousands citations nowadays. Many other alignment algorithms follow the seed-and-extend strategy. Some of them, like SOAP [79, 80], Novoalign1, Mosaik [81], BFAST [82], index the reference genome. Other approaches, like MAQ [83], RMAP [84], ZOOM [85], SHRiMP [86], build the index for the reads’ set. The methods based on hash tables usually provide high sensitivity, but occupy a lot of memory.

Estimating the efficiency of a search scheme

To measure the efficiency of a search scheme, we estimate the number of strings enumerated by all the searches of S. We assume that performing single steps of forward, backward, or bidirectional searches takes the same amount of time. It is fairly straightforward to extend the method of this section to the case when these times are not equal. Note that the bidirectional index of Lam et al. [11] reportedly spends slightly more time (order of 10%) on forward search than on backward search. For the analysis, we assume that characters of T and P are randomly drawn uniformly and independently from the alphabet. We note that it is possible to extend the method of this section to a non-uniform distribution. For more complex distributions, a Monte Carlo simulation can be applied which, however, requires much more time than the method of this section.

Table of contents :

I Introduction
1 Motivation and overview
2 Biological context
3 Basic data structures
II Efficient approximate string search
4 Algorithmic methods for read alignment
5 Approximate string matching using a bidirectional index
III Efficient representation of large genomic data with Cascading Bloom filters
6 Algorithmic methods for genome assembly
7 De Bruijn graph representation using Cascading Bloom filters
8 Improved compression of DNA sequencing data with Cascading Bloom filters
IV Metagenomic classification
9 Algorithmic methods for metagenomic classification
10 Data structures for BWT-index-based metagenomic classification
V Conclusions