Mass spectra interpretation in the context of the ProteoCardis cohort 


Shotgun metagenomics

Shotgun metagenomics consists of sequencing all the DNA from all organisms in a sample. Since it is not limited to a particular organism, this sequencing method captures bacterial and archaeal genomes (as targeted metagenomics does) but also host, fungal and viral genomes. It thus provides a finer overview of the microorganisms composing the ecosystem.
Moreover, since no part of the genome is preferentially sequenced, this method makes it possible to identify new genomes that have never been observed before. Indeed, in targeted metagenomics, species identification relies on 16S rRNA genes already present in databases, since the primers are designed from known 16S rRNA gene sequences. Although the most abundant organisms are the most represented in shotgun metagenomics results, the random nature of shotgun sequencing ensures that low-abundance organisms of the gut microbiota are still represented.
Lastly, shotgun metagenomics gives direct information on the potential functions encoded by the metagenome. This makes it possible to identify metabolic pathways potentially used by the gut microorganisms.
However, shotgun metagenomics requires a much greater sequencing depth (number of reads) in order to capture the genomes of low-abundance microorganisms. Reads are fragments of sequenced DNA; their generation is detailed in Section . While targeted metagenomics requires 50,000–100,000 reads to identify the bacteria in a sample, shotgun metagenomics requires several million reads. This results in much higher sequencing costs, as well as much heavier downstream bioinformatic processing. In addition, variability in genome coverage may be a barrier to taxonomic assignment, which can be done at the species level only if the genome coverage is sufficient. Nevertheless, shotgun metagenomics is the method offering the greatest potential for the identification of bacterial species and their functional potential [16].
In the context of this thesis work, our aim was to study the metaproteome functionalities of the gut microbiota, and reference metagenomic databases are a prerequisite for the study of the whole protein composition of a microbial community (Section 1.3). We therefore used shotgun metagenomics to sequence the metagenomes of the patients in the ProteoCardis cohort, a project presented in Section 1.6. Several sequencing tools, relying on different technologies and measurement methods, can be used to sequence a metagenome. We present here the two main sequencing approaches: Illumina and Ion Torrent.

Sequencing technologies

Illumina technology

The Illumina sequencing technology has the particularity of amplifying the extracted DNA through a bridge technique, and of sequencing it through the detection of photons during polymerisation.
First, the double-stranded DNA is fragmented into pieces of about 200 base pairs (bp) by transposomes, which also attach primers to the fragment ends. Then, short amplification cycles attach adapters carrying two distinct oligonucleotide sequences. Amplification is performed by PCR.
Both types of adapters are attached by covalent bonds to a flowcell. The DNA strands (denatured to become single-stranded) then randomly hybridise to the adapters by complementarity. The reverse strand is synthesised by a polymerase, including the adapter located at the other end of the DNA. The denatured complementary DNA is bound to the flowcell, and bridges are created between the adapters attached to the flowcell. These bridges are then amplified, which creates dense clusters of amplified sequences. This type of amplification is called "bridge amplification".
After amplification, the reverse strands are cleaved from their adapter, leaving on the flowcell only the forward strands, which are sequenced. The four types of nucleotides, each marked with a different colour, are added to the flowcell. Incorporation of the complementary nucleotide into the sequenced strand is determined by the colour emitted by the cluster after excitation by a laser. Incorporation of the four nucleotides at each cycle and detection of the emitted colour allow the strand sequence to be computed. The nucleotide incorporated at each position is detected in parallel on all the clusters of the flowcell, which allows extremely fast sequencing (Figure 1.3).
FIGURE 1.3 – Bridge amplification and Illumina sequencing. (A) DNA is fragmented into 200 bp fragments and primers are attached by transposomes. The amplification adds two types of adapters at each end. (B) The DNA fragments are loaded onto a flowcell and elongated by DNA polymerase. (C) The unattached strands are washed away by denaturation. (D) The strands form bridges at the surface of the flowcell and are amplified by cycles of polymerisation/denaturation. (E) The antisense strands are cut and washed away, leaving only sense strands. (F) The sense strands form clusters of the same DNA sequence, here a cluster of grey DNA and a cluster of black DNA. (G) At each cycle, the four types of nucleotides are introduced into the flowcell and are incorporated by polymerisation. After washing, a laser excites the last incorporated nucleotide, which emits a distinctive colour. (H) At each cycle, the clusters are sequenced in parallel thanks to the colour they emit. The succession of colours at each cycle determines the DNA sequence of each cluster.
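The per-cycle colour read-out described above can be sketched as a toy model. The colour-to-base mapping below is purely illustrative (it is not Illumina's actual dye chemistry); the point is that one colour per cycle per cluster yields one base per cycle, for all clusters in parallel:

```python
# Toy model of Illumina base calling: at each cycle, every cluster emits one
# colour, and a colour-to-base mapping recovers one nucleotide per cycle.
# The four-colour scheme used here is illustrative, not Illumina's actual dyes.
COLOUR_TO_BASE = {"red": "A", "green": "C", "blue": "G", "yellow": "T"}

def call_bases(cycle_colours):
    """Translate the colour emitted by one cluster at each cycle into a read."""
    return "".join(COLOUR_TO_BASE[c] for c in cycle_colours)

# Two clusters sequenced in parallel over four cycles:
clusters = {
    "cluster_1": ["red", "red", "blue", "yellow"],
    "cluster_2": ["green", "blue", "green", "red"],
}
reads = {name: call_bases(colours) for name, colours in clusters.items()}
print(reads)  # {'cluster_1': 'AAGT', 'cluster_2': 'CGCA'}
```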


Ion Torrent™ technology

DNA fragmentation can be done by physical methods (acoustic shearing or sonication) or enzymatic methods (non-specific endonuclease cocktails). DNA fragments are then selected according to their size. The desired fragment size is determined by the NGS instrument's limitations and by the specific sequencing application. Selected fragments are ligated with primers P1 and A at their ends and amplified by four or five PCR cycles. The resulting DNA fragments with adapters constitute the library (Figure 1.4 A-B).
In order to obtain enough DNA to reach the signal detection threshold necessary for sequencing, the library undergoes an emulsion PCR: this is the clonal amplification step. For this PCR, primer A coupled to biotin and primer P1 coupled to adapter B are used. These primers are introduced into microreactors in the presence of a sphere and a single DNA fragment. PCR in these microreactors amplifies the introduced DNA fragment, thus generating a monoclonal population of this fragment. Adapter B binds the fragments to the sphere. At the end of the PCR, the sphere is covered with clones of the fragment initially introduced. Under ideal conditions, each microreactor initially contains one DNA fragment and one sphere. To eliminate spheres on which no clonal amplification occurred, for example because no fragment was initially introduced into the microreactor, streptavidin is used: magnetic beads coated with streptavidin bind to the biotin coupled with primer A, and then, thanks to a magnet, only spheres bound to biotin, and thus to DNA fragments, are recovered. The set of spheres covered with clones of DNA fragments constitutes the sequencing matrix, which is used in the Ion Torrent sequencer (Figure 1.4 C-D).
The sequencing matrix is introduced into microwells on a semiconductor chip, so that a single sphere linked to numerous copies of a single DNA fragment is introduced into a single well, in the presence of DNA polymerase. The well is then flooded alternately with a solution containing one of the four deoxyribonucleotides (dNTPs) at a pH of 7.8. When the correct dNTP is incorporated by the DNA polymerase to synthesise the strand complementary to the fragment of interest, the formation of the new phosphodiester bond releases an H+ ion, as shown in Figure 1.5.
FIGURE 1.5 – Incorporation of a dNTP into DNA by DNA polymerase at the 3' end. The phosphodiester bond creation releases a pyrophosphate (PPi) and an H+ ion, which is used to determine the sequence of the fragment.
The resulting pH change is detected thanks to a hypersensitive pH meter placed under each well, from which the sequencer deduces the sequence of the fragment. The quality of the base incorporation signal gives a quality score to each incorporated dNTP, corresponding to a probability of base miscall (sequencing error). This probability of sequencing error is expressed as a Phred score and is encoded with ASCII symbols in the sequencing output. Sequencing with the Ion Torrent technology is illustrated in Figure 1.6.
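The ASCII encoding of Phred scores can be illustrated with a short sketch, assuming the standard Phred+33 (Sanger/Illumina 1.8+) offset, where a score Q corresponds to a miscall probability P = 10^(−Q/10):

```python
# Decoding Phred quality scores from their ASCII encoding (Phred+33
# convention: ASCII code minus 33). Since Q = -10 * log10(P), the miscall
# probability is recovered as P = 10 ** (-Q / 10).
def phred_to_error_prob(quality_string, offset=33):
    """Return the per-base miscall probabilities encoded in a quality string."""
    return [10 ** (-(ord(c) - offset) / 10) for c in quality_string]

# 'I' encodes Q40 (ord('I') - 33 = 40), i.e. a 1-in-10,000 miscall probability;
# '#' encodes Q2, a very unreliable base call.
probs = phred_to_error_prob("I#")
print(probs)  # [0.0001, 0.6309573444801932]
```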
FIGURE 1.6 – Sequencing with Ion Torrent technology. A single sphere coated with clonal DNAs is introduced into each well. Each dNTP is introduced sequentially. If the dNTP is incorporated by poly-merisation, it releases H+ ions which are detected by pH-meter under each well.
If several dNTPs are incorporated, i.e. if the fragment contains several identical bases next to each other, the pH change is greater, which is also detected by the pH meter, which then deduces the number of incorporated dNTPs. The DNA strands introduced into the wells are thus sequenced simultaneously in several million wells. The resulting sequencing output is therefore millions of sequenced fragments, called reads. Although this is a fast and efficient sequencing method, one limitation is the reading of homopolymers (a large number of identical bases next to each other). Indeed, since the release of H+ ions is proportional to the number of incorporated bases, it is difficult to accurately measure the number of bases incorporated when they are numerous. The sequencing errors of this technology thus lie mainly in the counting of dNTPs in homopolymers.
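The homopolymer counting problem can be sketched with a toy flow caller; the signal values and the simple rounding rule are illustrative assumptions, not Ion Torrent's actual signal processing:

```python
# Toy model of Ion Torrent flow calling: the signal for one nucleotide flow is
# roughly proportional to the homopolymer length, and the caller rounds it to
# an integer count. For long homopolymers the absolute noise grows with the
# signal, so rounding becomes unreliable: this is the main source of Ion
# Torrent sequencing errors.
def call_homopolymer_length(signal, unit=1.0):
    """Round a flow signal to a number of incorporated nucleotides."""
    return max(0, round(signal / unit))

# A slightly noisy 1-mer signal is called correctly, but a noisy 8-mer signal
# (true length 8, measured 8.6) is overcalled as a 9-mer:
print(call_homopolymer_length(1.05))  # 1  (correct)
print(call_homopolymer_length(8.6))   # 9  (overcall)
```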



Assembly

The assembly of the metagenome is a step that leads, from the sequencing reads, to a catalogue of genes that is crucial for metaproteomics data analysis (Section ). Since the microbiota is a complex ecosystem whose components are all poorly known, its metagenome is assembled de novo, which means that it is built without any prior knowledge of its bacterial species. De novo assembly differs from reference-guided assembly, in which the reads are aligned to reference genomes of the studied ecosystem and contig reconstruction is performed by inference from the reference genome. A contig is a set of reads whose sequences overlap, thus defining a long consensus DNA sequence. Reference-guided assembly requires reference genomes representing the ecosystem, and alignment to the reference genome is highly time-consuming. This method of assembly is therefore particularly suited to genomics, where few genomes are studied and the number of reads is limited. In metagenomics, where reads are counted in tens of millions, the time devoted to alignment is a first obstacle to the use of this method. In addition, the number of reference genomes is limited in the context of the gut microbiota. De novo assembly is therefore preferred, as it makes it possible to capture the genomes of still unknown microbial species.
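The overlap idea behind contigs can be sketched with a toy greedy assembler; real assemblers use de Bruijn graphs and tolerate sequencing errors, whereas this sketch only merges exact suffix/prefix overlaps above a minimum length:

```python
# Minimal sketch of de novo assembly by overlap: reads whose ends overlap are
# merged into a longer consensus sequence (a contig).
def merge(a, b, min_overlap=3):
    """Merge b onto a if a suffix of a exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

def assemble(reads, min_overlap=3):
    """Greedily extend the first read with any read that overlaps its end."""
    contig, remaining = reads[0], list(reads[1:])
    extended = True
    while extended:
        extended = False
        for r in remaining:
            merged = merge(contig, r, min_overlap)
            if merged:
                contig, extended = merged, True
                remaining.remove(r)
                break
    return contig

# Three overlapping reads assemble into one contig:
reads = ["ATGGCGT", "GCGTACA", "TACAGGA"]
print(assemble(reads))  # ATGGCGTACAGGA
```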

Table of contents:

1 The gut microbiota: a challenging complexity 
1.1 The gut microbiota in humans
1.2 Metagenomics
1.2.1 Different approaches
      Targeted metagenomics
      Shotgun metagenomics
1.2.2 Sequencing technologies
      Illumina technology
      Ion Torrent™ technology
1.2.3 Assembly
1.3 Metaproteomics
1.3.1 Metaproteomics experimental workflow
1.3.2 LC-MS/MS analyses
1.3.3 Interpretation of LC-MS/MS data
      Peptide-spectrum matching
      Protein identification and grouping
      Spectral Counting quantification
      eXtracted Ion Chromatogram quantification
1.3.4 The rise of metaproteomics in the last decade
1.3.5 The challenges of metaproteomics
      Sample preparation
      Bioinformatics analyses
1.4 Cardiovascular artery diseases
1.5 Evidence for a relationship between gut microbiota and CAD
1.6 The ProteoCardis project
1.7 Thesis objectives
2 Mass spectra interpretation without individual metagenomes 
2.1 The ObOmics study
2.2 Scientific questions
2.3 Methods
2.3.1 Samples preparation and injection
2.3.2 Interrogated databases
2.3.3 Interrogation strategies
      Classical identification
      Iterative identification in two steps
      Iterative identification in three steps
      Iterative identification used in the experiments
2.3.4 Construction of the datasets
2.3.5 Peptide and subgroup quantification
2.3.6 Evaluation criteria
2.4 Results
2.4.1 Gain of identification with MetaHIT 9.9
2.4.2 Identifications specific to each database
2.4.3 Reproducibility of the identifications with MetaHIT 3.3 and MetaHIT 9.9
2.4.4 Gain of identification with the iterative strategy
2.4.5 Identifications specific to each interrogation strategy
2.4.6 Reproducibility of the identifications with the classical and the iterative strategy
2.5 Conclusion
3 Two examples of clinical data interpretation with MetaHIT databases 
3.1 Metaproteomic features related to weight loss
3.1.1 Scientific context
3.1.2 Methods
3.1.3 Results
3.1.4 Conclusion
3.2 Metaproteomic features related to intestinal bowel diseases
3.2.1 Scientific context
3.2.2 Methods
3.2.3 Results
      Metaproteomic profiling of stool samples
      Search for IBD signatures in stool samples
      Search for signatures between IBD phenotypes
3.2.4 Conclusion
4 Mass spectra interpretation in the context of the ProteoCardis cohort 
4.1 Methods
4.1.1 Metagenomics sequencing
4.1.2 Metaproteomic analyses
4.2 Development of MetaRaptor
4.3 Assembly of the individual metagenomes of the ProteoCardis cohort
4.4 Performance of the individual catalogues
4.5 Metaproteome landscape in cardiovascular diseases
5 XIC quantification 
5.1 Challenges of XIC quantification in metaproteomics
5.2 Chromatographic alignment
5.3 Correction of XIC
5.4 Imputation of missing data
5.4.1 Imputation of missing values in classic proteomics
5.4.2 Imputation of missing values in metaproteomics
5.4.3 Imputation implemented in the ProteoCardis study
6 Can the ProteoCardis data be improved by normalization methods? 
6.1 Ascertainment of the batch effect
6.2 Normalization methods
6.2.1 Methods for SC normalization
6.2.2 Methods for XIC normalization
6.3 Evaluation of the normalizations
6.3.1 Normalization of SC
6.3.2 Normalization of XIC
6.4 Conclusion on the correction of technical variability
7 Exploration of statistical approaches for biomarker discovery 
7.1 Methods
7.1.1 Multiple testing approach
      Resampled FDR
      Modelling of SC
      Modelling of XIC
7.1.2 Random forests approach
      Principle
      Typical preprocessings
      Implemented parameters and validation scheme
7.2 Results
7.2.1 Preliminary: evaluation of the batch effect
7.2.2 Results with multiple testing
7.2.3 Results with random forests
7.2.4 Relationship between the two approaches
7.3 Perspectives on statistical analysis
8 Conclusion and perspectives

