Genotype, phenotype and the landscape metaphor
In the following paragraphs, I will adopt the molecular perspective. The genotype is here defined as a particular DNA sequence and the phenotype as any observable biochemical property displayed by the protein (stability, catalytic properties, etc.). The term function can refer either to the phenotype or to a combination of biochemical properties which confers a role to the protein within the cell (e.g. solubility combined with catalytic activity).
Molecular Darwinian evolution via natural selection consists of repeated cycles of variation and selection. A parental genotype is replicated into offspring genotypes bearing variations compared to the parental genotype. Each genotype is expressed as a protein with a phenotype that confers a replication rate to the genotype. If the phenotype is adapted to the environment, the replication rate is high, and vice versa. Generation after generation, the frequency of the genotypes that give the most adapted phenotypes increases in the pool of genotypes. In an environment where the resources for genotype replication are limited, the least adapted genotype-phenotype couples will eventually disappear from the population.
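The selection half of this cycle can be sketched numerically. The example below is a minimal illustration, not a model taken from this work: it assumes a fixed pool of three hypothetical genotypes with constant replication rates, no new variation, and a population kept at constant size (limited resources), so that each genotype frequency is multiplied by its rate and renormalized at every generation. The names `select`, `rates` and `start` are illustrative.

```python
def select(frequencies, rates, generations):
    """Iterate f_i <- f_i * r_i / sum_j(f_j * r_j) for a number of generations."""
    freqs = dict(frequencies)
    for _ in range(generations):
        # Mean replication rate of the pool; dividing by it keeps sum(freqs) == 1.
        mean_rate = sum(freqs[g] * rates[g] for g in freqs)
        freqs = {g: freqs[g] * rates[g] / mean_rate for g in freqs}
    return freqs


# Hypothetical replication rates conferred by three phenotypes.
rates = {"A": 1.0, "B": 1.1, "C": 0.9}
start = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

final = select(start, rates, generations=100)
for genotype, freq in sorted(final.items()):
    print(genotype, f"{freq:.6f}")
```

Under these assumptions, the genotype with the highest replication rate takes over the pool while the least adapted one is progressively diluted out.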
For evolution to occur, a physical linkage must exist between the genotype and the phenotype so that the selection of the phenotype implies the co-selection of the genotype. In living organisms, this linkage is provided by the cell membrane. The first visualization of a possible relationship between the genotype and the phenotype was depicted by Wright. Wright imagined this relationship as a 3D « landscape », where the horizontal axes represent the genotype space (or « sequence space »), and the vertical axis represents the phenotype¹ (figure 1.3): « low » and « high » phenotypes being respectively less adapted and more adapted phenotypes. Evolution by natural selection can be seen as a process by which a population explores the sequence space and is attracted towards the regions of more adapted phenotypes through genotype variation and phenotype selection.
This « landscape metaphor » is highly questionable. First, the landscape is not static but dynamic: the fluctuating environment might select for different functions over time, in magnitude (e.g. different binding affinities) and in nature (e.g. binding another ligand). Second, the sequence space is represented as a two-dimensional continuous space whereas it is high-dimensional and discrete: particular topological properties might emerge from its high dimensionality. Nevertheless, this metaphor is widely used to represent evolution and mutational effects.
The high dimensionality of genotype space or sequence space
The diversity of sequences offered by the sequence space is virtually infinite. The dimension of the sequence space is equal to the number of residues composing the protein, L. With L = 300 (close to the average length of proteins in E. coli) and 20 natural amino-acids, the sequence space is made of $20^{300} \approx 10^{390}$ possible combinations.
Even from a « local » point of view, the diversity is overwhelming as mutations accumulate. Let us compute the number of neighbors that are one mutation away from a reference sequence of 300 amino-acids: $(20 - 1) \times 300 = 5700$ neighbors. Let us now consider the neighbors two mutations away: $((20 - 1)^2 \times 300 \times 299)/2 \approx 10^{7}$ neighbors. In general, the number n of neighbors k mutations away from the reference is:
$n = \binom{L}{k} \times 19^{k}$
Within one family of homologous proteins, the amino-acid sequence identity between two different sequences can be less than 25%. A 20% sequence identity corresponds to $0.8 \times 300 = 240$ mutations. There are $\binom{300}{240} \approx 10^{64}$ ways to choose the 240 mutated positions alone; counting the 19 possible substitutions at each of them, the number of neighbors that are 240 mutations away from the reference is $\approx 10^{371}$.
¹ To be precise, Wright intended to depict the relationship between the genotype and the reproductive success of organisms, sometimes called « fitness ». The « fitness landscape » metaphor was then extended to the molecular perspective via the concepts of « genotype-phenotype landscape », « function landscape » or « mutational landscape ».
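These counts can be checked with exact integer arithmetic. The sketch below assumes that any of the 19 alternative amino-acids may appear at each mutated position, so that the number of neighbors at distance k is the number of ways to choose k positions among L, times 19 to the power k; the function name `neighbors` is illustrative.

```python
from math import comb, log10


def neighbors(length, k, alphabet=20):
    """Number of sequences differing from a reference at exactly k positions."""
    return comb(length, k) * (alphabet - 1) ** k


L = 300
print(neighbors(L, 1))                    # one mutation away
print(neighbors(L, 2))                    # two mutations away
print(round(log10(comb(L, 240)), 1))      # log10 of the choices of 240 positions
print(round(log10(neighbors(L, 240)), 1)) # log10 of the neighbors 240 mutations away
print(round(log10(20 ** L), 1))           # log10 of the whole sequence space
```

Summing `neighbors(L, k)` over all k from 0 to L recovers the full sequence space, $20^{L}$, which is a convenient sanity check of the formula.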
Thanks to this available sequence diversity and owing to the diverse physical and chemical properties of the amino-acids, proteins display a multitude of functions: they catalyze reactions (polymerization, condensation) or the folding of other proteins (chaperonins), bind other objects (DNA, proteins, atoms, ions, lipids, polymers), play the role of molecular motors, assemble into large structures that constitute the cytoplasmic scaffold of cells (actin filaments, microtubules), etc. How is protein function distributed in the vast and high-dimensional genotype-phenotype landscape?
Questioning the properties of the genotype-phenotype landscape
Given the individual mutational effects already measured, concepts have emerged to qualify proteins’ properties upon mutation:
• Robustness: the capacity for a protein to accumulate mutations while keeping a constant function.
• Epistasis: the context dependence of mutational effects (i.e. mutational effects are different in the presence of other mutations).
• Evolvability: the protein potential to evolve towards an improved function or a new function.
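As an illustration of the epistasis concept listed above, one common convention (assumed here, not taken from the text) quantifies pairwise epistasis as the deviation of the double-mutant effect from the sum of the two single-mutant effects, with the phenotype expressed on an additive scale such as a log activity or a free energy. The function name and the numerical values are hypothetical.

```python
def pairwise_epistasis(wild_type, mutant_a, mutant_b, mutant_ab):
    """Deviation of the double mutant from additivity of the single-mutant effects.

    All arguments are phenotypes on an additive scale (e.g. log10 activity).
    Zero means the two mutations act independently.
    """
    effect_a = mutant_a - wild_type
    effect_b = mutant_b - wild_type
    effect_ab = mutant_ab - wild_type
    return effect_ab - (effect_a + effect_b)


# Hypothetical log10 activities relative to a wild type set at 0.
additive = pairwise_epistasis(0.0, -1.0, -0.5, -1.5)  # no epistasis
negative = pairwise_epistasis(0.0, -1.0, -0.5, -3.0)  # double mutant worse than expected
print(additive, negative)
```

Higher-order epistasis would be defined analogously, as the part of a triple-mutant effect that is explained neither by the single effects nor by the pairwise terms.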
The genotype-phenotype landscape is the complex product of at least millions (if only two-body interactions are considered) of physical and chemical interactions. Those concepts need to be questioned statistically, with high-throughput experimental genotype-phenotype mapping approaches:
• What are the parameters of protein « robustness »? How is it distributed in the local sequence space?
• What is the distribution of epistasis? How does it influence the shape of the genotype-phenotype landscape? If its influence is important, is considering epistatic effects between pairs of residues enough to model the landscape? In other words, which epistatic order is the most relevant to model the landscape?
• What are the parameters of evolvability?
Are those concepts even relevant to describe the genotype-phenotype landscape? If yes, how are they related? If no, what parameters should we use to reduce the high dimensionality of the landscape? Answering those questions should provide insights into how a protein sequence encodes the protein's fold and function and how proteins evolve in the sequence space.
Knowing the genotype-phenotype landscape: opportunities and potential applications
Knowing the properties of the genotype-phenotype relationship thanks to genotype-phenotype mapping would considerably improve our understanding of molecular evolution. Population and human genetics, protein engineering and protein design would greatly benefit from this knowledge.
Population genetics and human genetic variation
The emergence and rise of antibiotic resistance among bacteria is of great concern. Knowing the genotype-phenotype relationship of proteins involved in the resistance against antibiotics would reveal the mechanisms of resistance at the molecular level. It would also highlight the evolutionary paths available towards improved or new resistance. Such insights could be used to steer the evolution of resistant bacteria towards evolutionary dead ends. The same ideas could be applied to prevent the propagation of viruses and to predict their evolution in order to design effective vaccines against emerging strains.
In the field of human diseases, understanding the relationship between protein sequences and functions would make it possible to predict the impact of rare mutations in the context of complex diseases.
Protein engineering and design
Most of the proteins used in industry (from biotechnology to textile manufacturing) are engineered proteins [19, 20]. Many were evolved using directed evolution, a laboratory technique which mimics evolution by natural selection, to enhance the properties of natural or artificial proteins, such as their stability, activity or solubility. Directed evolution was notably applied to:
• Antibodies evolved towards higher affinity or different specificities to bind factors involved in inflammatory reactions.
• Enzymes evolved towards higher stability (to remain active at high temperatures), or higher catalytic capacities.
Directed evolution consists in generating genotype variation from a natural protein, assaying the variants for a given function and selecting the best of them for a new variation-selection cycle. Proteins can also be designed computationally, but this technique yields proteins with low activities compared to natural proteins, and directed evolution has to be used to optimize them [21, 22] (see figure 1.4).
In some cases, directed evolution is limited: if the starting point of the directed evolution process is already « trapped » on a phenotype peak and the diversity generated is not large enough to escape it, the improvements will be limited. Also, protein properties are often coupled: mutations that increase the activity might be destabilizing, and mutations that increase the solubility might impair the protein's function [23, 24]. This phenomenon, termed pleiotropy, is not well understood and limits the optimization of enzymes. As a result, high-throughput methods to measure the phenotype of millions of proteins are required to screen the largest possible libraries in order to increase the chances of discovering suitable genotypes. Also, a better knowledge of the genotype-phenotype landscape would make it possible to design better strategies to improve protein properties without impairing their function.
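The « trapped » situation described above can be illustrated on a toy landscape. The sketch below is purely hypothetical: sequences are two letters long over a two-letter alphabet, the phenotype values in `PHENOTYPE` are invented, and `directed_evolution` is a simplified greedy loop (build a library around the current sequence, keep the best variant if it improves the phenotype). It is not the protocol used in this work.

```python
from itertools import combinations, product

ALPHABET = "AB"  # toy two-letter alphabet

# Invented landscape: both single mutants of "AA" are worse, the double mutant is better.
PHENOTYPE = {"AA": 1.0, "AB": 0.5, "BA": 0.5, "BB": 2.0}


def variants(seq, max_mutations):
    """All sequences carrying 1..max_mutations substitutions relative to seq."""
    library = set()
    for k in range(1, max_mutations + 1):
        for positions in combinations(range(len(seq)), k):
            choices = [[c for c in ALPHABET if c != seq[p]] for p in positions]
            for letters in product(*choices):
                mutant = list(seq)
                for p, c in zip(positions, letters):
                    mutant[p] = c
                library.add("".join(mutant))
    return library


def directed_evolution(start, phenotype, max_mutations, rounds=10):
    """Greedy variation-selection cycles; stops when no variant improves."""
    current = start
    for _ in range(rounds):
        library = sorted(variants(current, max_mutations))
        best = max(library, key=phenotype.get)
        if phenotype[best] <= phenotype[current]:
            break  # trapped: no variant in the library is better
        current = best
    return current


print(directed_evolution("AA", PHENOTYPE, max_mutations=1))  # stays on the local peak
print(directed_evolution("AA", PHENOTYPE, max_mutations=2))  # larger diversity escapes
```

With single mutations only, the walk stays at "AA"; allowing double mutants in the library lets it reach "BB", which is why larger libraries and higher screening throughput matter.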
Current high-throughput genotype-phenotype mapping approaches: deep mutational scanning experiments
Carrying out an experimental genotype-phenotype mapping consists in:
• (i) Generating a library of variant sequences where a chosen set of residues of a protein of interest are mutated.
• (ii) Measuring the phenotype of each variant through a protein assay.
• (iii) Sequencing its genotype.
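The three steps above can be sketched in code. This is a minimal illustration under stated assumptions: the library contains every single amino-acid substitution at a chosen set of positions, the wild-type sequence is made up, and `toy_assay` is a placeholder standing in for a real protein assay. In the sketch the genotype of each variant is known by construction, whereas in an experiment it is recovered by sequencing.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino-acids, one-letter codes


def single_mutant_library(wild_type, positions):
    """Step (i): all single substitutions at the given 0-based positions.

    Returns a dict mapping a mutation name such as 'K2A' (1-based) to the sequence.
    """
    library = {}
    for pos in positions:
        for aa in AMINO_ACIDS:
            if aa != wild_type[pos]:
                name = f"{wild_type[pos]}{pos + 1}{aa}"
                library[name] = wild_type[:pos] + aa + wild_type[pos + 1:]
    return library


def toy_assay(sequence):
    """Step (ii): placeholder phenotype (here, simply the number of alanines)."""
    return sequence.count("A")


def genotype_phenotype_map(library, assay):
    """Step (iii): associate each identified genotype with its measured phenotype."""
    return {name: assay(seq) for name, seq in library.items()}


wild_type = "MKTAYIAK"  # hypothetical sequence
library = single_mutant_library(wild_type, positions=[0, 1, 2])
gp_map = genotype_phenotype_map(library, toy_assay)
print(len(library), gp_map["K2A"])
```

An alanine scan is the special case where only the substitutions towards alanine are kept, which reduces the library to at most one variant per position.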
The scope of genotype-phenotype mapping experiments can be limited by the capacity to mutate sequences, the throughput of the protein assay and the throughput of sequencing. A classical genotype-phenotype mapping experiment consists in measuring the mutational effect of each residue of a protein being independently mutated to alanine. In 1989, Cunningham et al. measured the effect of 62 such mutations on the human growth hormone to map the binding epitope of the human growth hormone receptor. The scope of this experiment was limited to 62 residues mutated to alanine because each mutant had to be expressed, assayed and sequenced in separate experiments. Therefore, it was not possible to study the subtle and heterogeneous perturbations offered by the diversity of available mutations towards amino-acids other than alanine, on all the residues constituting the human growth hormone.
This is now within reach. Thanks to developments in mutagenesis and sequencing that can be interfaced with high-throughput protein assays, hundreds of thousands to millions of protein variants can be studied simultaneously in what has been termed « deep mutational scanning » experiments.
Generate sequence variants by mutagenesis
To produce a library of variant sequences, two methods based on the Polymerase Chain Reaction (PCR) are available: random mutagenesis and site-directed mutagenesis.
Random mutagenesis
As in a classical PCR, primers flanking the 5' and 3' regions are designed to amplify the sequence of interest. But contrary to a faithful PCR, the polymerase incorporates errors into the newly synthesized DNA strands. This can be achieved by using a modified polymerase, by adding destabilizing ions to the reaction buffer (manganese instead of magnesium, for instance) or by using unbalanced concentrations of the four deoxyribonucleotide triphosphates.
• Advantages: Many mutants can be generated in one single reaction. All of them are cloned at once, simultaneously.
• Disadvantages: It is a random process in which the number of mutations per sequence follows a Poisson distribution. The average number of mutations can be controlled by varying the amount of template DNA, the number of cycles, the polymerase or the buffer conditions, but the number of mutations cannot be controlled at the level of the individual sequence. Amino-acid substitutions requiring two or three nucleotide changes in the same codon are very unlikely to occur. Hence, this technique is limited to amino-acid substitutions reachable by a single nucleotide mutation. The process is also biased towards certain nucleotide mutations. Finally, this approach generates nonsense codons, which can be selected and enriched in the library if the protein is toxic for the host used for cloning.
Site-directed mutagenesis
In site-directed mutagenesis, primers which contain the desired mutations are used to amplify the sequence of interest. The mutated amplicons are then cloned into the desired vector.
• Advantages: The mutations are designed. Nonsense codons should not appear unless designed. The final library contains an arbitrary ratio of mutations, independently of the number of nucleotide changes they require.
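Two of the limitations of random mutagenesis mentioned above can be made concrete: the Poisson spread of the number of mutations per sequence, and the restricted set of amino-acids reachable from a codon by a single nucleotide change. The sketch below assumes the standard genetic code and an ideal, unbiased polymerase; the function names are illustrative, and the mean of 2 mutations per sequence is an arbitrary example value.

```python
from math import exp, factorial

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def poisson_pmf(k, mean):
    """Probability that a sequence carries exactly k mutations."""
    return exp(-mean) * mean ** k / factorial(k)


def single_nucleotide_neighbors(codon):
    """Non-synonymous outcomes ('*' = stop) reachable by one nucleotide change."""
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                reachable.add(CODON_TABLE[mutant])
    reachable.discard(CODON_TABLE[codon])
    return reachable


# With a mean of 2 mutations per sequence, many sequences carry 0, 1 or 3+ mutations.
for k in range(5):
    print(k, round(poisson_pmf(k, 2.0), 3))

# An alanine codon reaches only a handful of the 19 other amino-acids in one step.
print(sorted(single_nucleotide_neighbors("GCT")))
# A tryptophan codon can also mutate into stop (nonsense) codons.
print(sorted(single_nucleotide_neighbors("TGG")))
```

Site-directed mutagenesis avoids both issues because the mutated codon is written directly into the primer.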
Table of contents:
1.1 The genotype-phenotype relationship
1.1.1 Genotype, phenotype and the landscape metaphor
1.1.2 The high dimensionality of genotype space or sequence space
1.1.3 Questioning the properties of the genotype-phenotype landscape
1.1.4 Knowing the genotype-phenotype landscape: opportunities and potential applications
1.1.5 Approaches to understand the genotype-phenotype landscape
Numerical approaches from the first principle of physics and protein structural information
Statistical inference on protein sequence data
Experimental approaches: the need for large-scale genotype-phenotype
1.2 Current high-throughput genotype-phenotype mapping approaches: deep mutational scanning experiments
1.2.1 Generate sequence variants by mutagenesis
1.2.2 Opportunities and challenges of next generation sequencing
1.2.3 Mapping the genotype to the phenotype in deep mutational scanning experiments
General considerations about deep mutational scanning
Limitations of those approaches
Overcoming those limitations for a quantitative genotype-phenotype mapping
1.2.4 Properties of the genotype-phenotype landscape as seen through deep mutational scanning
Distribution of robustness and agreement with the phylogeny
Distribution of beneficial mutations
Distribution of epistatic effects
Influence of robustness on molecular evolution
Influence of epistasis on molecular evolution
The roles of promiscuity and evolvability in molecular
Trade-offs and couplings between proteins’ properties
1.2.5 Performing genotype-phenotype mappings of enzymes is limited with current high-throughput approaches
1.3 Droplet-based microfluidics to perform high-throughput enzymatic assays
1.3.1 Droplet-based microfluidics: high-throughput manipulation of micro-metric reaction vessels
The opportunities offered by the picoliter droplet format in biology
The development of bio-compatible materials dedicated to microfluidics
Microfluidic devices to perform high-throughput operations
1.3.2 Constraints of the droplet format on performing quantitative enzymatic assays
General droplet-based microfluidics constraints
Constraints specific to enzymatic assays
1.3.3 Current droplet-based microfluidic work-flows
1.3.4 Conclusion: requirements for a quantitative genotype-phenotype mapping of model enzymes in droplets
1.4 Studying model enzymes to better understand the link between genotype and phenotype
1.4.1 Streptomyces griseus Aminopeptidase (SGAP) to study allostery and promiscuity
1.4.2 Rattus norvegicus trypsin (rat trypsin). Are trypsin sectors independent functional units? What is the relation between the trypsin sectors and epistasis?
2 Material and methods: droplet-based microfluidics and fluorogenic substrates
2.1 Droplet-based microfluidics
2.1.1 Microfluidic devices fabrication
2.1.2 Fabrication and preparation of disposable devices to handle aqueous phases and emulsions
2.1.3 Microfluidic devices operation
Optical and electrical setup
Preparing the microfluidic devices
Making and manipulating droplets
Measuring droplet fluorescence
2.2 Fluorogenic substrates
2.2.1 SGAP fluorogenic substrates
2.2.2 Trypsin fluorogenic substrates
3 Development of a cell-free microfluidic work-flow for the genotype-phenotype mapping of Streptomyces griseus Aminopeptidase (SGAP)
3.1 Previous work on a cell-free microfluidic work-flow to assay SGAP in droplets
3.2 Development of the in vitro workflow 1: pico-injecting the substrate in droplet containing the enzyme
3.2.1 The PCR reagents inhibit cell-free expression in bulk
3.2.2 Cell-free expression is successful in bulk and droplets
3.2.3 Cell-free expression reagents inhibit SGAP enzymatic activity in bulk: picoinjection is incompatible with activity detection
3.2.4 Cell-free expression droplets have to be diluted in the assay droplets
3.3 Development of the in vitro workflow 2: diluting the enzyme containing droplet into the substrate containing droplet
3.3.1 Fusing 2 pL expression and 20 pL assay droplets allows aminopeptidase activity detection with high contrast
3.3.2 Synthesizing a new non leaky substrate to improve SGAP enzymatic assay in droplets
3.3.3 SGAP PCR amplification is successful in 0.2 pL droplets
3.3.4 0.2-2pL droplet electro-coalescence development
3.4 Discussion, Conclusion and Perspectives
4 Development of an in vivo microfluidic workflow for the genotype-phenotype mapping of Rattus norvegicus trypsin
4.1 The in vivo workflow 1: rat trypsin periplasmic expression in E. coli
4.1.1 Osmotic shock of E. coli cells in hypotonic buffer allows trypsin activity detection in bulk
4.1.2 MUGB inhibits trypsin activity but cannot be used as a reporter for its concentration
4.1.3 Normalizing rat trypsin activity using mCherry as a reporter of its expression level
4.1.4 Difficulties with the E. coli expression system
4.2 The in vivo workflow 2: trypsin secretion by B. subtilis
4.2.1 The rat trypsin – mCherry fusion protein is secreted by WB800N as a full protein in the supernatant
4.2.2 The rat trypsin – mCherry fusion protein is fluorescent and enzymatically active in the culture medium
4.2.3 The mCherry fluorescence can be used as a reporter of the trypsin expression level
4.2.4 Optimizing incubation time for the trypsin-mCherry expression in droplets
4.2.5 Shaking the emulsion during incubation for expression reduces emulsion size polydispersity
4.2.6 Measuring the catalytic efficiency of trypsin variants in bulk
4.2.7 Measuring the catalytic efficiency of trypsin variants in droplets
4.3 Towards a library of all rat trypsin single point mutants
4.3.1 « Around the horn » site directed mutagenesis principle
4.3.2 Designing the mutagenic primers
4.3.3 Performing saturated mutagenesis on the rat trypsin protein
4.3.4 Analyzing the first rat trypsin library by deep sequencing
First deep sequencing run results
Second deep sequencing run results, with a fraction of the mutagenic primers redesigned
Coverage of all single point mutants
4.4 Discussion, Conclusion and Perspectives
5.1 Conclusion and further work
A Appendix: Development of a cell-free microfluidic work-flow for the genotype-phenotype mapping of Streptomyces griseus Aminopeptidase (SGAP)
A.1 Experimental details
A.1.1 Plasmid maps and primer sequences
A.1.2 General comments about the experiments
A.1.3 SGAP PCR protocol with Phusion polymerase
A.1.4 The PCR reagents inhibit cell-free expression in bulk
A.1.5 Cell-free expression is successful in bulk and droplets
A.1.6 Cell-free expression reagents inhibit SGAP enzymatic activity in bulk: pico-injection is incompatible with activity detection 3.2.3
A.1.7 Fusing 2 pL expression and 20 pL assay droplets allows aminopeptidase activity detection with high contrast
A.1.8 SGAP PCR amplification is successful in 0.2 pL droplets
A.1.9 Random mutagenesis on the SGAP gene
A.2 Synthesis of New Hydrophilic Rhodamine Based Enzymatic Substrates Compatible with Droplet-Based Microfluidic Assays
B Appendix: Development of an in-vivo microfluidic work-flow for the genotype-phenotype mapping of Rattus norvegicus trypsin
B.1 Rattus norvegicus trypsin expression in E. coli
B.1.1 Plasmid maps
B.1.2 General comments about the experiments
B.1.3 Osmotic shock of E. coli cells in hypotonic buffer allows trypsin activity detection in bulk 4.1.1, 4.1.3
B.1.4 Lysozyme and sucrose only marginally improve trypsin activity detection 4.1.1
B.1.5 MUGB inhibits trypsin activity but cannot be used as a reporter for its concentration 4.1.2
B.1.6 Development of a non leaking substrate based on the FRET pair EDANS-Dabcyl 4.1.4
B.1.7 Rat trypsin in B. subtilis: plasmid maps
B.2 Rattus norvegicus trypsin expression in B. subtilis
B.2.1 General comments about the experiments
B.2.2 B. subtilis transformation protocol
B.2.3 Induction protocol
B.2.4 Droplet induction protocol
B.2.5 Bulk assay protocol
B.2.6 Droplet assay protocol
B.2.7 The rat trypsin – mCherry fusion protein is secreted by WB800N as a full protein in the supernatant 4.2.1
B.2.8 Measuring the catalytic efficiency of trypsin variants in droplets 4.2.1
B.3 Towards a library of all rat trypsin single point mutants
B.3.1 Designing the mutagenic primers 4.3.2
B.3.2 Performing saturated mutagenesis on the rat trypsin protein 4.3.3
B.3.3 Library preparation 4.3.3
B.3.4 Redesigning part of the mutagenic primers
B.3.5 Sequencing data analysis