Nucleic acids preparation
Plant viruses, like many other viruses, are characterized by two properties: (i) they are highly variable and, as a group, do not share universally conserved sequences that might be used for barcoding approaches and (ii) they are rarely accessible outside of their hosts or vectors (the presence of some plant viruses in surface waters is a counterexample to this second property; Mehle et al., 2014; Mehle et al., 2018; Ravnikar et al., 2018). A practical consequence of these features is that HTS-based metagenomics studies of plant viruses generally use very complex nucleic acid mixtures that contain both host and viral nucleic acids. A range of nucleic acid populations can be targeted and have in practice been used in virus discovery efforts. These include total RNA (totRNA) with or without ribosomal RNA depletion, polyadenylated RNA (poly(A) RNA), double-stranded RNA (dsRNA), virion-associated nucleic acids (VANA), virus-derived small interfering RNA (sRNA) and RNA after subtractive hybridization with healthy plant RNA (Adams and Fox, 2016; Roossinck et al., 2015; Wu et al., 2015). These methods differ in their efficiency at capturing viruses with different genome types (as listed in the Baltimore classification) and in the enrichment in viral sequences they offer. Their advantages and disadvantages have been reviewed in detail (Adams and Fox, 2016; Roossinck et al., 2015). A brief summary of these approaches is provided here.
Total RNA: this is one of the most direct approaches. It does not enrich for viral sequences but can detect a large spectrum of RNA viruses, DNA viruses and viroids. Its main disadvantage is that large amounts of non-viral sequences are generated, including host ribosomal RNA sequences. As a consequence, high sequencing depth is needed, in particular for low-titer viruses, making this approach more costly and more demanding in the bioinformatics analysis phase.
Ribo-depleted total RNA: a modification of the total RNA approach in which the plant ribosomal RNAs are removed from the total RNA before sequencing, resulting in a ca. ten-fold enrichment in viral sequences (Adams and Fox, 2016). Similar to total RNA, it allows the detection of all types of viral agents. The cost of this approach remains significant because of the extra cost imposed by the ribo-depletion step.
Poly(A) RNA: similar to ribo-depletion, the purification of messenger RNAs through the selection of polyadenylated molecules counter-selects the host ribosomal RNAs (and other noncoding RNAs), allowing some level of enrichment of viral sequences. However, viruses whose genomes lack a poly(A) tail are also counter-selected (Wu et al., 2015).
Small interfering RNA: this approach focuses on the small RNAs of 21-24 nucleotides (nt) that are produced by cleavage of viral RNAs by the host Dicer enzymes as a consequence of the antiviral silencing defense reaction (Hamilton and Baulcombe, 1999; Lu et al., 2003). The advantage of this approach is the generality of the silencing defense and, therefore, the ability to detect RNA viruses, DNA viruses and viroids (Pooggin, 2018). As for total RNA, many host-derived sequences are generated in parallel with viral ones and the proportion of viral reads may be quite low, in particular in woody species (Massart et al., 2018). In addition, assembly of viral genomes from the short siRNA reads is often not as efficient and straightforward as from the longer reads produced by other approaches (Massart et al., 2018).
Double-stranded RNA (dsRNA): this approach is based on the purification of double-stranded RNAs from the analyzed plant sample (Marais et al., 2018). This type of nucleic acid is generally absent from non-infected hosts and is produced by all types of RNA viruses during their replication (Weber et al., 2006). Double-stranded RNAs are also sometimes observed for some DNA viruses (possibly as a consequence of incomplete bi-directional transcription termination), but this is not a general feature, so DNA viruses are largely counter-selected by this approach (Roossinck et al., 2015). The dsRNA-based approach has also been used for the discovery of fungal viruses (Roossinck, 2015). Double-stranded RNA purification may provide a high level of enrichment of viral sequences (Roossinck et al., 2010), thus reducing the sequencing effort (and associated cost) needed as compared to approaches with no or low enrichment.
Virion-associated nucleic acids (VANA): this is undoubtedly the most widely used technique in viral metagenomics (Bernardo et al., 2018; Filloux et al., 2018a; Thapa et al., 2015), in part because it is particularly well suited to the analysis of viruses present in environmental water samples (Rosario et al., 2009). It is somewhat less direct when host samples are to be used. It relies on the (semi)purification of viral particles by differential centrifugation (Filloux et al., 2015). Non-encapsidated nucleic acids are then removed by a nuclease digestion step, and the protected viral nucleic acids are finally recovered following the disruption of the viral particles. It effectively enriches the nucleic acids of encapsidated viruses but requires rather complex sample processing. In addition, it remains unclear how its performance is affected for viruses with unstable particles or by hosts rich in components that interfere with purification.
Nucleic acids selected by subtractive hybridization: it is possible to enrich viral sequences by first performing a subtractive hybridization step against healthy host(s) nucleic acids. This approach requires access to healthy host(s) and involves time-consuming and complex processing; it is therefore considered poorly suited to high-throughput diagnostic settings but can be useful for etiology studies (Adams et al., 2009).
Sequence-independent sequencing: pointing to.
Amplicon sequencing: it is also possible to sequence amplification products. These can be rolling circle amplification (RCA) products, which have proved useful for the detection or characterization of DNA viruses with circular genomes such as Geminiviridae and Nanoviridae, or of viruses with pseudocircular genomes such as Caulimoviridae (Idris et al., 2014; Jeske, 2018; Ng et al., 2011; Rosario et al., 2013). They can also be PCR products obtained using polyvalent, genus- or family-specific primers targeting conserved genomic regions. This approach is then very close to the barcoding approaches used in fungal or bacterial metagenomics but with a narrower taxonomic breadth. Given the upstream PCR amplification, this strategy offers higher resolution for the parallel detection of both high- and low-titer viruses. The amplicon sequencing strategy can also be tuned to study viral intra-specific diversity, as in a study of the diversity of prunus necrotic ringspot virus (PNRSV) in Prunus trees (Kinoti et al., 2017).
Overall, the main differences between the above approaches, and their respective advantages and disadvantages, concern the spectrum of detectable viruses and the enrichment achieved (with consequences for sequencing depth and cost). Applicability to a wide range of host species may also need to be considered. As a consequence, the choice of approach may vary depending on the study objective(s), on the number and complexity of the samples to be analyzed or on the available budget. Given that there have so far been few side-by-side comparisons, it may not be easy to determine the best choice or even whether such a best choice exists. More comparative analyses are needed to gain a clearer picture and to make a reasoned choice of the target nucleic acid population in a given context. A few such comparisons have so far been performed. For example, a comparison of virus-derived small interfering RNAs (siRNA) and virion-associated nucleic acids (VANA) for the discovery of a new DNA virus was reported by Candresse et al. (2014). In this case, higher genome coverage and longer contigs were generated using VANA than using siRNAs. To test whether the same representation of within-host viral population structure could be obtained, siRNA and VANA-RNA were compared by Kutnjak et al. (2015). The results revealed that both approaches provided highly similar viral mutational landscapes but also indicated that VANA-derived sequences performed better in complete viral genome reconstruction and allowed recombinant genomes to be detected more readily (Kutnjak et al., 2015). The comparison of siRNA and ribosomal RNA-depleted total RNA for the characterization of citrus tristeza virus [(+)ssRNA, Closteroviridae] and citrus dwarfing viroid (Pospiviroidae) in grapefruit showed that rRNA-depleted total RNA is superior to sRNA in de novo genome assembly and coverage for the closterovirus but not for the viroid (Visser et al., 2016).
For the detection of viroids and of plant viruses with different genome types in nine different plant samples, the performance of these two approaches was virus-dependent, but longer contigs and higher genome coverage were generated using rRNA-depleted total RNA (Pecman et al., 2017). In the sole study to date that incidentally compared dsRNA and VANA for wide-scale metagenomics, which described viral diversity in six native plant species from the Nature Conservancy’s Tallgrass Prairie Preserve in northeastern Oklahoma, more operational viral taxonomic units (OTUs) were discovered by the dsRNA approach (29 against seven for VANA). In addition, 86% of VANA-OTUs were also detected by dsRNA. The two approaches also showed different performance when analyzing the effects of sites on virome compositions (Thapa et al., 2015). Overall, all approaches have proven feasible and have yielded interesting results in virus discovery studies, in which a limited number of simple samples are generally analyzed. Two of them, dsRNA and VANA, have however been consistently chosen for wider-scale metagenomics studies because the enrichment of viral sequences they offer translates directly into lower sequencing costs when a larger number of samples or more complex samples need to be analyzed. Yet, while these two approaches have been shown to perform well in a range of plants and for a range of viruses, there is still very limited information on which to base such a methodological choice in plant virus metagenomics studies.
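The link between enrichment and sequencing cost can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the helper name `reads_needed`, the viral read fractions (1 in 1,000 without enrichment, 1 in 100 after a ten-fold enrichment) and the target of 1,000 viral reads are assumptions chosen for the example, not values taken from the studies cited above. It simply estimates how many total reads must be sequenced to expect a given number of viral reads.

```python
import math
from fractions import Fraction


def reads_needed(target_viral_reads, viral_fraction):
    """Total reads required to expect `target_viral_reads` viral reads.

    `viral_fraction` is the share of the library that is viral, passed as a
    Fraction so that the division is exact (hypothetical helper).
    """
    if not 0 < viral_fraction <= 1:
        raise ValueError("viral_fraction must be in (0, 1]")
    return math.ceil(target_viral_reads / viral_fraction)


# Assumed, illustrative viral fractions (not measured values).
without_enrichment = reads_needed(1000, Fraction(1, 1000))
ten_fold_enriched = reads_needed(1000, Fraction(1, 100))
print(without_enrichment, ten_fold_enriched)
```

Under these assumptions, a ten-fold enrichment divides the required sequencing depth per sample by ten, which is why enrichment matters most when many samples are pooled in a single run.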
The first HTS platform, Roche 454, was originally released in 2005. This platform captures a template molecule on a bead that is then loaded into a well of a picotiter plate, amplified using emulsion PCR and finally sequenced using pyrosequencing (Rothberg and Leamon, 2008). The Illumina sequencer, which largely displaced it, is based on sequencing by synthesis using fluorescently labeled dye-terminators and on bridge amplification of adaptor-ligated DNA fragments on the glass surface of a flow cell (Bentley et al., 2008). The Illumina platform has been and still is the most widely used technology, as it provides the highest throughput and the lowest error rate and is the most cost-effective among currently available HTS platforms (Villamor et al., 2019). SOLiD is a system that uses a sequencing-by-ligation method based on a DNA ligase (Valouev et al., 2008): it provides the second highest throughput after Illumina but only accommodates 75 bp (100 bp for paired-end reads) as the longest read length. The Ion Torrent platform can produce reads of 400 bp; however, its throughput is still lower than that of the Illumina and SOLiD systems (Rothberg et al., 2011), while its error rate is higher and comparable to that of 454 pyrosequencing. Unlike the above-mentioned second-generation technologies, the third-generation sequencing platforms require no template amplification prior to sequencing since individual RNA/DNA molecules are used as templates (Rhoads and Au, 2015; Wang et al., 2015). For example, PacBio is the most popular third-generation platform, and uses hairpin adaptors to form a closed ssDNA template called SMRTbell (Rhoads and Au, 2015). This platform can generate very long reads (20 kilobases (kb) and more) but has a high error rate. The other third-generation sequencing platform, proposed by Oxford Nanopore, generates similarly long reads but with a higher error rate and a lower throughput.
On the other hand, it has the advantage of being highly portable in its MinION format (Deamer et al., 2016; Jain et al., 2016). Despite the high error rate, >99% consensus sequence accuracy has been achieved with the MinION. Given the low set-up cost and portability of this platform, it has already generated interest in the plant virus field, for example for the detection of maize streak virus, maize yellow mosaic virus and maize totivirus in maize plants (Adams et al., 2017), of plum pox virus in plum plants (Bronzato Badial et al., 2018) or of viruses affecting water yam plants (Filloux et al., 2018b). The latter study also compared the Illumina and MinION platforms for the quality of the genomic sequences obtained, demonstrating that high-quality sequences (>99.8% accuracy), very close to Illumina ones, can be obtained with the MinION despite its high error rate (Filloux et al., 2018b). Since this technology may provide excellent genome reconstruction together with high consensus sequence accuracy, it might represent the future of viral metagenomics because it could solve the problems linked to short read length, such as incomplete or chimeric genome assemblies (Filloux et al., 2018b).
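The apparent paradox of error-prone reads yielding a highly accurate consensus can be illustrated with a toy majority vote. This is a minimal sketch, assuming reads that are already aligned and of equal length. The function name `majority_consensus` and the example reads are hypothetical, and the approach is deliberately simpler than the alignment- and quality-aware methods used by real consensus tools. Because errors fall at different positions in different reads, each column is still dominated by the correct base.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the most frequent base in this column
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Toy data: three of the five reads carry one error each.
reads = [
    "ACGTACGT",
    "ACGAACGT",  # error at position 3
    "ACGTACCT",  # error at position 6
    "TCGTACGT",  # error at position 0
    "ACGTACGT",
]
print(majority_consensus(reads))  # ACGTACGT
```

In this toy case the consensus is error-free although most individual reads are not. The benefit shrinks when errors are systematic rather than random, since the same wrong base then recurs in the same column.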
Critical methodological points for the implementation of the “dry lab” bioinformatics part of HTS-based plant viral metagenomics
Reads demultiplexing, cleaning, assembly and annotation
Generally, during the library preparation step, individual “barcode” sequences, called Multiplex Identifiers (MIDs), are added to each DNA fragment. They allow many libraries to be pooled and sequenced simultaneously in a multiplexed format during a single run. While it effectively reduces the cost of HTS, multiplexing introduces other problems for the downstream analysis, such as mistagging (Esling et al., 2015) or index hopping (Illumina, 2017; van der Valk et al., 2019), which may result in a low background of inter-sample cross-talk.
A typical HTS dataset is originally stored in a proprietary format or as FASTQ files, and sequence quality can be evaluated with the FastQC program (Andrews, 2010). The generated reports can be used to guide the subsequent trimming of low-quality reads. The trimmed sequences are then demultiplexed using available software (Blawid et al., 2017). After this pre-processing, the most widespread approach is de novo assembly into contigs using a range of pipelines (Villamor et al., 2019) or commercial software such as CLC Genomics Workbench (https://www.qiagenbioinformatics.com/products/clc-main-workbench/). This assembly step is known in particular to improve the efficiency of identification of viral sequences and to reduce the volume of unannotated “dark matter” (Francois et al., 2018). The annotation of sequences and the search for viral ones are conventionally performed by homology searches using BLAST (Altschul et al., 1990) or similar programs. An alternative option is to rely on the targeted search for specific conserved motifs using RPS-BLAST (Reverse Position-Specific BLAST; Marchler-Bauer et al., 2009) for comparison with motif databases such as PFAM (El-Gebali et al., 2018; Punta et al., 2011), NCBIfams (Haft et al., 2018) or SMART (Letunic and Bork, 2017). On the other hand, if the identification of known viruses is the objective, the pre-processed reads can be directly mapped onto reference viral genomes using a range of available tools (Fonseca et al., 2012).
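The principle of barcode-based demultiplexing can be sketched in a few lines. The example below is a simplified illustration, not the method of any particular software: it assumes an inline barcode of fixed length at the start of each read and exact matching, and the function name, barcodes and sample names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Assign each read to a sample by exact match of its leading barcode.

    Matched reads are stored without their barcode; reads with an unknown
    barcode are kept whole in the 'unassigned' bin.
    """
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(insert)
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
pooled = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTTTTGGG", "NNNNGGGCCC"]
for sample, sample_reads in demultiplex(pooled, barcodes, 4).items():
    print(sample, sample_reads)
```

Exact matching of this kind cannot by itself detect the cross-talk described above, because a read carrying a valid barcode from the wrong sample is assigned with the same confidence as a correctly tagged one.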
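As an illustration of the quality trimming step, the sketch below decodes a FASTQ quality string and removes low-quality bases from the 3' end of a read. It assumes the common Phred+33 encoding and a simple cutoff of Q20. The function names and the threshold are choices made for this example, and dedicated trimming tools typically apply more elaborate criteria.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove trailing bases whose Phred score is below `min_quality`."""
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string)
    end = len(scores)
    # Walk back from the 3' end until a base passes the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33.
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(seq, qual)  # ACGTAC IIIIII
```

A read whose bases all fall below the threshold is trimmed to an empty string and would normally be discarded before assembly or mapping.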
Table of contents :
A BRIEF OVERVIEW OF VIRUSES
PLANT VIRUSES: DIVERSITY WAS MUCH UNDERESTIMATED
APPLICATION OF HIGH THROUGHPUT SEQUENCING (HTS) IN PLANT VIROLOGY
CRITICAL METHODOLOGICAL POINTS FOR THE IMPLEMENTATION OF THE “WET LAB” PART OF HTS-BASED PLANT VIRUS METAGENOMICS
CRITICAL METHODOLOGICAL POINTS FOR THE IMPLEMENTATION OF THE “DRY LAB” BIOINFORMATICS PART OF HTS-BASED PLANT VIRAL METAGENOMICS
GENERAL FEATURES OF PLANT-ASSOCIATED VIROMES
DIFFERENT VIRUS INFECTION PATTERNS IN CROPS AND IN WILD PLANTS?
SCIENTIFIC QUESTIONS ADDRESSED IN THE PRESENT THESIS
CHAPTER Ι MANUSCRIPT “CROP AND WILD PLANTS/WEED SPECIES-ASSOCIATED VIROMES IN A HORTICULTURAL CONTEXT: DIVERSITY, PREVALENCE AND STABILITY OVER A TWO-YEAR PERIOD”
MATERIALS AND METHODS
CHAPTER Ⅱ MANUSCRIPT “PHYTOVIROME ANALYSIS OF WILD PLANT POPULATIONS: COMPARISON OF DOUBLE-STRANDED RNA (DSRNA) AND VIRION-ASSOCIATED NUCLEIC ACIDS (VANA) METAGENOMIC APPROACHES”
MATERIALS AND METHODS
CHAPTER Ⅲ SUMMARY
MATERIALS AND METHODS
ANNEX A – ADDITIONAL DATA ON REPRODUCIBILITY OF PHYTOVIROME COMPOSITION ANALYSIS USING RANDOM WHOLE GENOME AMPLIFICATION
ANNEX B – ADDITIONAL DATA ON THE COMPARISON OF TWO DNA EXTRACTION KITS AND OF ITS1 AND ITS2 AMPLICONS FOR THE ANALYSIS OF FUNGAL COMMUNITIES
CHAPTER Ⅳ MANUSCRIPT “METAGENOMIC ANALYSIS OF VIROME CROSS-TALK BETWEEN CULTIVATED SOLANUM LYCOPERSICUM AND WILD SOLANUM NIGRUM”
MATERIALS AND METHODS
DISCUSSION AND PERSPECTIVES
KEY METHODOLOGICAL ASPECTS
KEY FINDINGS OF ECOLOGICAL RELEVANCE