an overview of the human adaptive immune system
The adaptive immune response relies on two major types of immune cells: T cells and B cells. These cells, also called lymphocytes, carry cell-surface antigen receptors, called T cell receptors (TCR) and B cell receptors (BCR) respectively, that can recognize and respond to a virtually unlimited number of pathogens.
generation and maturation of lymphocytes
Both B and T lymphocytes originate in the bone marrow, but only B lymphocytes mature there; T lymphocytes migrate to the thymus to complete their maturation. B and T lymphocytes that have matured but have not yet encountered antigen are known as naive lymphocytes. These cells circulate continually between the blood and the peripheral lymphoid tissues. If an infection occurs, mature naive lymphocytes whose receptors recognize the infectious agent are retained in the lymphoid tissues. These cells are activated and start to divide, giving rise to clones of antigen-specific cells that mediate adaptive immunity and fight the infection. Some of the proliferating B cells differentiate into effector cells that produce antibodies, the soluble form of the BCR, and some develop into memory B cells, capable of mounting an enhanced response upon reinfection. Antibodies help eliminate pathogens and their toxins through various mechanisms. Any substance capable of eliciting an adaptive immune response is referred to as an antigen. Since the work presented in this dissertation deals exclusively with B cells and their receptors, they are discussed in more detail in the following section.
basic structure of b cell receptor
BCR sequences determine the antigen-binding properties of B cells. To react to a wide variety of pathogens, the immune system needs to generate an equivalent variety of BCRs. However, if each distinct BCR were encoded by its own gene, the entire human genome would have to be dedicated to lymphocyte receptor generation. Instead, part of the required BCR diversity is created by recombining preexisting genes.
The BCR consists of two types of components: the recognition unit, formed by a membrane immunoglobulin (Ig) protein, and the transmembrane signaling unit, formed by the CD79a and CD79b molecules. An Ig is a heterotetramer composed of two immunoglobulin heavy chains (IgH) and two immunoglobulin light chains (IgL) bound by disulfide bridges (Figure 2-A). Each chain has two distinct parts: the variable domain on the N-terminal side, responsible for antigen recognition, and the constant region on the C-terminal side, attached to the cell surface. Three gene groups encode the IgH variable domain: Variable (V), Diversity (D), and Joining (J). They are clustered in a locus on human chromosome 14q32. During early B-cell ontogeny, one gene from each group is randomly selected, and the three are joined together by two successive rearrangement events. This leads to the formation of a complete variable domain encoded by a VDJ-REGION. Joining is imprecise, as nucleotides are randomly deleted and inserted at the V-D (N1) and D-J (N2) junctions (Figure 2-B). Altogether, this process is known as VDJ recombination, and it is responsible for the production of a highly diversified "naive" BCR repertoire.
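The recombination steps described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the germline segments are hypothetical toy strings (not real IMGT sequences), the trimming and insertion bounds `max_trim` and `max_insert` are arbitrary, and all choices are drawn uniformly, which real rearrangements are not known to follow.

```python
import random

# Hypothetical toy germline segments (NOT real IMGT sequences).
V_GENES = {"V1": "CAGGTGCAGCTGGTGCAG", "V2": "GAGGTGCAGCTGTTGGAG"}
D_GENES = {"D1": "GGTATAGTGGGAGC", "D2": "GTATTACTATGG"}
J_GENES = {"J1": "ACTACTTTGACTACTGG", "J2": "TGCTTTTGATATCTGG"}


def trim(seq, n, from_end):
    """Remove n nucleotides from the 3' end (from_end=True) or the 5' end."""
    if n == 0:
        return seq
    return seq[:-n] if from_end else seq[n:]


def random_insert(rng, max_len):
    """Draw a random non-templated insertion of length 0..max_len."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ACGT") for _ in range(length))


def recombine(rng, max_trim=3, max_insert=4):
    """Simulate one rearrangement: pick one V, D and J, trim the ends, add N1/N2."""
    v_name = rng.choice(sorted(V_GENES))
    d_name = rng.choice(sorted(D_GENES))
    j_name = rng.choice(sorted(J_GENES))
    # Imprecise joining: nucleotides are deleted at the segment ends.
    v = trim(V_GENES[v_name], rng.randint(0, max_trim), from_end=True)
    d = trim(D_GENES[d_name], rng.randint(0, max_trim), from_end=False)
    d = trim(d, rng.randint(0, max_trim), from_end=True)
    j = trim(J_GENES[j_name], rng.randint(0, max_trim), from_end=False)
    # Random insertions at the D-J (N2) and V-D (N1) junctions.
    n2 = random_insert(rng, max_insert)
    n1 = random_insert(rng, max_insert)
    return {"v": v_name, "d": d_name, "j": j_name, "n1": n1, "n2": n2,
            "sequence": v + n1 + d + n2 + j}


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):
        print(recombine(rng))
```

Even with two toy genes per group, repeated calls yield many distinct sequences, because the junctional deletions and insertions add diversity on top of the combinatorial choice of V, D, and J.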
As shown in Figure 2-C, the variable domain obtained after VDJ rearrangement contains beta-sheet framework regions (FR) that maintain the structure of the Ig molecule. The FR are relatively conserved and support three hypervariable stretches that lie spatially close to each other and form loops interacting directly with antigens. For this reason, these stretches are called complementarity determining regions (CDR). The CDR3 lies at the junction of the IGHV, IGHD, and IGHJ genes; it has the highest variability and plays a crucial role in determining antigen-binding properties. The IMGT unique numbering for V-REGION [29, 30] has allowed the limits of the CDR and FR to be redefined as CDR-IMGT and FR-IMGT. We have used these definitions in this work.
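To make the FR-IMGT and CDR-IMGT definitions concrete, the sketch below maps an amino-acid position to its region. The boundaries are the standard delimitations of the IMGT unique numbering for a rearranged variable domain as commonly stated; they should be checked against the IMGT reference before use. The function names `region_of` and `is_cdr` are illustrative, and the additional positions used for CDR3 longer than 13 amino acids (such as 111.1 and 112.1) are not handled.

```python
# Region boundaries in IMGT unique numbering (amino-acid positions, inclusive).
# Assumption: standard V-DOMAIN delimitations; extra positions for long CDR3
# (e.g. 111.1, 112.1) are ignored in this sketch.
IMGT_REGIONS = [
    ("FR1-IMGT", 1, 26),
    ("CDR1-IMGT", 27, 38),
    ("FR2-IMGT", 39, 55),
    ("CDR2-IMGT", 56, 65),
    ("FR3-IMGT", 66, 104),
    ("CDR3-IMGT", 105, 117),
    ("FR4-IMGT", 118, 128),
]


def region_of(position):
    """Return the IMGT region name containing an integer IMGT position."""
    for name, start, end in IMGT_REGIONS:
        if start <= position <= end:
            return name
    raise ValueError(f"position {position} is outside the range 1-128")


def is_cdr(position):
    """True if the position falls in one of the three CDR-IMGT."""
    return region_of(position).startswith("CDR")


if __name__ == "__main__":
    for pos in (26, 27, 104, 105, 118):
        print(pos, region_of(pos), is_cdr(pos))
```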
Antigen-activated naive B cells undergo rapid proliferation (clonal expansion) and further diversify their BCR by Somatic HyperMutation (SHM), an enzymatically driven process that introduces mainly point substitutions into the Ig locus. In the normal process of SHM, the variable domain of the expressed heavy chains is mutated, whereas the constant region is not. The mutation rate is estimated to be of the order of 10^-3 to 10^-4 per base pair per cell generation, and hotspots and coldspots of SHM have been described [31, 32].
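The order of magnitude of this rate can be made tangible with a short calculation. Taking a rate of 10^-3 per base pair per cell generation and assuming a variable domain of about 360 nucleotides (an illustrative length, not a value from this work), the expected number of substitutions is about 0.36 per cell generation. The sketch below computes this expectation and applies substitutions under a uniform model, which deliberately ignores the hotspots and coldspots mentioned above.

```python
import random


def expected_mutations(rate_per_bp, length_bp, generations=1):
    """Expected number of point substitutions under a uniform per-base rate."""
    return rate_per_bp * length_bp * generations


def mutate(seq, rate_per_bp, rng):
    """Apply independent point substitutions, one Bernoulli trial per base.

    Simplification: every position mutates with the same probability, so
    SHM hotspots and coldspots are not modeled.
    """
    out = []
    for base in seq:
        if rng.random() < rate_per_bp:
            # Substitute with one of the other nucleotides.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    print(expected_mutations(1e-3, 360))       # per cell generation
    print(expected_mutations(1e-3, 360, 10))   # after ten generations
    rng = random.Random(1)
    print(mutate("CAGGTGCAGCTGGTGCAGTCT", 0.1, rng))
```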
Library construction and technical strategies
In the published literature, most sequencing libraries have been generated from unpaired Ig heavy chains and light chains. The proximity of the rearranged V(D)J genes allows the entire V(D)J-REGION to be amplified by polymerase chain reaction (PCR) using various gene-specific primer strategies. Most amplification strategies are designed to include the CDR3, which contains sufficient information to examine many of the antigen-recognizing features of the receptor molecules.
Libraries can be generated from sorted lymphocyte populations, peripheral blood mononuclear cells (PBMCs), or lymphocyte-containing tissues. An essential technical decision to make before starting a Rep-Seq analysis is the choice of template. Two starting materials can serve as the initial template to sequence Ig repertoires: genomic DNA (gDNA) and mRNA.
1. Using gDNA has the advantage of sampling both productive and unproductive V(D)J rearrangements. An unproductive rearrangement occurs when the IGHV and IGHJ genes are not in the same reading frame. Even though they do not give rise to functional proteins, unproductive rearrangements provide helpful information on features such as gene rearrangement frequencies, base deletion and non-templated base addition in junction regions, receptor diversity, and selection during lymphocyte development. In addition, the copy number of the gDNA template per cell is consistent (only one productively rearranged heavy chain locus and one light chain locus per cell), so gDNA can be used to evaluate and quantify clonal frequencies and expansions.
2. Using mRNA as the initial template requires an additional reverse transcription step to convert mRNA into DNA. The number of Ig mRNA transcripts can vary widely among B cell subpopulations, which prevents reliable quantification of expanded clonal lymphocyte populations when mRNA is used as the template. One way of overcoming this limitation is to split the cells into replicate aliquots before isolating the mRNA. On the other hand, because each cell contains multiple Ig transcript copies, using mRNA increases the likelihood of capturing a more exhaustive representation of rare clones.
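The frame criterion for productive rearrangements can be sketched as a simple check on the junction. This is a simplified illustration, not the procedure used in this work: it assumes the nucleotide junction starts on a codon boundary in the IGHV reading frame and ends on a codon boundary in the IGHJ reading frame, so that the two genes are in frame exactly when the length is a multiple of three. Rejecting junctions with a stop codon is an additional, commonly applied criterion that is not stated in the text above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(junction_nt):
    """Return True if the junction is in frame and contains no stop codon.

    Assumption: junction_nt begins at a codon start in the V frame and ends
    at a codon end in the J frame.
    """
    seq = junction_nt.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False  # V and J are not in the same reading frame
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return not any(codon in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    print(is_productive("TGTGCGAGAGGGTACTGG"))  # in frame, no stop codon
    print(is_productive("TGTGCGAGAGGTACTGG"))   # one base missing, out of frame
    print(is_productive("TGTGCGTAAGGGTACTGG"))  # in frame, but contains TAA
```

With a gDNA template, sequences failing this check would be kept and analyzed separately rather than discarded, since they carry the information on rearrangement frequencies and junctional editing described in the first item.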
Several approaches can be taken to sequence lymphocyte receptor repertoires, depending on the research questions of a particular experiment. The data used in this work were collected during routine diagnostic procedures at Pitié-Salpêtrière hospital in Paris. Sequences were obtained from peripheral blood lymphocytes by polymerase chain reaction (PCR) amplification of IGH-VDJ rearrangements on genomic DNA, followed by NGS paired-end sequencing on an Illumina MiSeq platform. Typically, 10^5 sequences were obtained per sample. In the following chapter, we will detail how to analyze the BCR repertoire starting from the raw output of sequencing platforms.
sequence analysis and clustering clonally related sequences
The two initial steps of B-cell population structure inference, namely VDJ assignment and clonal grouping (or clonal expansion prediction), have a tremendous impact on the success of the subsequent phases. VDJ assignment consists in detecting the IGHV, IGHD, and IGHJ germline genes used in the VDJ recombination process, whereas clonal grouping finds clusters of BCR sequences that are likely to derive from the same precursor.
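As an illustration of clonal grouping, the sketch below implements a widely used heuristic: sequences are first partitioned by V gene, J gene, and CDR3 length, and then linked within each partition when their CDR3 differ by at most a given fraction of positions (single linkage). This is not presented as the method developed in this work; the record layout, the function names, and the threshold of 0.15 are assumptions chosen for the example.

```python
from collections import defaultdict


def hamming_fraction(a, b):
    """Fraction of mismatching positions between two equal-length strings."""
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def clonal_groups(records, threshold=0.15):
    """Group records into putative clones.

    records: iterable of (seq_id, v_gene, j_gene, cdr3) tuples.
    Returns a sorted list of sorted lists of sequence identifiers.
    """
    # Step 1: partition by V gene, J gene and CDR3 length.
    buckets = defaultdict(list)
    for rec in records:
        _, v_gene, j_gene, cdr3 = rec
        buckets[(v_gene, j_gene, len(cdr3))].append(rec)

    clones = []
    for bucket in buckets.values():
        # Step 2: single-linkage clustering on CDR3 within the partition.
        clusters = []
        for rec in bucket:
            linked, rest = [], []
            for cluster in clusters:
                close = any(hamming_fraction(rec[3], other[3]) <= threshold
                            for other in cluster)
                (linked if close else rest).append(cluster)
            merged = [rec] + [r for cluster in linked for r in cluster]
            clusters = rest + [merged]
        clones.extend(sorted(r[0] for r in cluster) for cluster in clusters)
    return sorted(clones)


if __name__ == "__main__":
    toy = [
        ("s1", "IGHV1", "IGHJ4", "TGTGCGAGAGATTACTGG"),
        ("s2", "IGHV1", "IGHJ4", "TGTGCGAGAGATTATTGG"),
        ("s3", "IGHV1", "IGHJ4", "TGTAAAAAAAAATACTGG"),
        ("s4", "IGHV3", "IGHJ4", "TGTGCGAGAGATTACTGG"),
    ]
    print(clonal_groups(toy))
```

In this toy input, `s1` and `s2` differ at a single CDR3 position and fall in the same group, `s3` shares the V and J genes but has a divergent CDR3, and `s4` uses a different V gene, so the last two form separate groups.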
VDJ germline assignment
The V(D)J germline assignment is one of the most critical steps when treating Rep-Seq data. This step aims to infer the correct V, D, and J germline genes and alleles that were recombined to produce each BCR sequence. A germline inference is required to correctly identify the somatic hypermutations of each sequence, cluster sequences into clonal groups, and carry out an appropriate diversity estimation. Typically, germline inference applies an algorithm that chooses the best match among a set of candidate germline genes from a database of known genes and alleles. The current public database for Ig germline genes, the international ImMunoGeneTics information system (IMGT), is the most widely used reference for accurate VDJ assignment. The inference of D genes is particularly challenging because they tend to be short and are modified during rearrangement.
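The best-match principle can be sketched as follows. This is a deliberately simplified illustration: the germline entries are hypothetical toy sequences with made-up names, the comparison is an ungapped position-by-position identity that assumes the query and the germline start at the same position, and ties are broken alphabetically. Dedicated tools rely on proper sequence alignment against the reference database instead.

```python
def identity(query, germline):
    """Ungapped identity over the overlapping prefix of the two sequences."""
    n = min(len(query), len(germline))
    if n == 0:
        return 0.0
    matches = sum(q == g for q, g in zip(query, germline))
    return matches / n


def assign_germline(query, germlines):
    """Return (best_name, identity, mismatch_positions) for a query.

    germlines: dict mapping an allele name to its nucleotide sequence.
    Mismatch positions (0-based) against the best match are candidate
    somatic hypermutations.
    """
    best = max(sorted(germlines), key=lambda name: identity(query, germlines[name]))
    reference = germlines[best]
    mismatches = [i for i, (q, g) in enumerate(zip(query, reference)) if q != g]
    return best, identity(query, reference), mismatches


if __name__ == "__main__":
    # Hypothetical toy alleles, not taken from any reference database.
    toy_germlines = {
        "V_toy*01": "CAGGTGCAGCTGGTGCAGTCT",
        "V_toy*02": "CAGGTGCAGCTGGTGGAGTCT",
        "V_other*01": "GAGGTTCAGCTGTTGGAGTCT",
    }
    print(assign_germline("CAGGTGCAGCTGGTGGAGTCA", toy_germlines))
```

The toy alleles `V_toy*01` and `V_toy*02` differ at a single position, which shows why allele-level assignment matters: choosing the wrong allele would turn a germline polymorphism into an apparent somatic mutation.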
Table of contents:
1.1 Overview of the study
ii background and problem statement
2 studying immune repertoires
2.1 An overview of the human adaptive immune system
2.2 Generation and maturation of lymphocytes
2.3 Basic structure of B cell receptor
2.4 The practical aspects of measuring BCR repertoire’s diversity
2.4.1 The sample size
2.4.2 The capacity of current sequencing instruments
3 bioinformatics pipelines and repertoire analysis
3.2 Sequence analysis and clustering clonally related sequences
3.2.1 VDJ germline assignment
3.2.2 Clonal grouping
3.3 Repertoire characterization and analysis
3.3.1 Diversity Profiles
3.3.2 Mutation analysis
3.3.3 Clonal Evolution / Evolution of repertoire / clonal dynamic
4 the definition of clone
5 a communication model for optimizing rep-seq clinical use
6 the problem statement
iii proposed solutions
7 agreeable; a bcr repertoire clonal grouping method with an application for intra-clonal analysis in clinical settings
7.2 Material and Methods
7.2.1 The algorithm
7.2.2 Data sets
7.2.3 Performance evaluation
7.3.1 Reconstruction simulated repertoire’s clonal architecture
7.3.2 Parameter optimization
7.3.4 Outputs’ interpretability
8 performance evaluation of bcr clonal grouping algorithms
8.2 Material and Methods
8.2.1 Clonal grouping methods
8.2.2 BCR high throughput sequencing data
8.2.3 Performance evaluation
8.3.1 Simulated repertoires
8.3.2 Artificial monoclonal repertoires
8.3.3 Experimental benchmarks
9 reconstructing the evolutionary history of a bcr lineage using minimum spanning tree and clonotype abundances
9.2 Material and methods
9.2.1 Problem statement
9.2.2 Minimum spanning Tree
9.2.3 A modified Prim’s algorithm
9.2.4 Editing the reconstructed lineage tree
9.2.5 Tools used in the comparisons
9.2.6 Data sets
9.2.7 Tree comparison and evaluation
9.3.1 Reconstructing BCR lineage trees from simulated data
9.3.2 Biological validation using BCR sequencing data
10 viclod, a tool for visualizing b cell repertoire’s clonal and intra-clonal diversities
10.2 Description of functionalities
10.2.1 Clonal analysis
10.2.2 Intra-clonal diversity analysis
10.2.3 Pruning trees for a better interpretation
10.2.4 Intra-clonal diversity analysis
10.3 Use case
iv conclusion and perspectives
11 conclusion and perspectives
11.0.2 Direction for future work
a airr file’s required fields for viclod pipeline
b comparison of bcr clonal grouping tools’ performance on simulated repertoires