High throughput design of protein-ligand binding

Get Complete Project Material File(s) Now! »

Computational design of PDZ-peptide binding

The protein design problem can be defined as looking for the amino acid sequences that fit a specific structure according to a scoring function which depends on the system energy. Random mutagenesis is used to produce a set of candidates that are subjected to selective pressure. Since CPD must be eﬃcient, combinatorial approaches are preferred. A simplified and discre-tized conformational space, a simple scoring function and a fast exploration algorithm are key ingredients of all CPD software.
The starting point of a typical CPD simulation is the three-dimensional structure of a given protein. The wildtype sequence associated to this native conformation can then be altered by modification of side chain atoms belonging to one or more residues. This is done eﬃciently if the conformational space is discretized, usually keeping the protein backbone fixed. An energy function is used to score diﬀerent sequences generated by the sampling algorithm. The conceptual diﬀerence between energy function and sequence score is important: the first is the formula used to compute interactions between groups of atoms in a specific configuration. The latter depends on the system energy but can vary according to the quantity to optimize. For example, optimizing protein stability one uses a scoring function which is based on the unfol-ding free energy diﬀerence upon mutation. On the other hand, to optimize ligand binding one has a scoring function which is based on the change in binding aﬃnity. To optimize protonation states, one has to use a scoring function based on protonation free energy diﬀerences, and so on.
In this chapter we start by describing the most popular energy functions in CPD, how the conformational space is usually discretized and some widely-used exploration routines. Then we will give more detailed information about the Proteus software and how sequences are scored according to diﬀerent quantities, with a special interest in ligand binding.

Energy functions for CPD

CPD energy functions are based on molecular mechanics plus an implicit solvent model. The first is borrowed from an existing force field for molecular simulation (for example Amber, Charmm) while the latter is often based on simple dielectric shielding. One widely-used im-plicit solvent model is Generalized Born (GB) combined with solvent accessible surface area (SA), also called GB/SA.
Several CPD softwares use optimized, statistical potentials, where the physical-based energy function terms are weighted or scaled by empirical parameters in order to fit experimental data. However in the present discussion we will focus on physical potentials, in the spirit of our in-house CPD software Proteus. We start by describing molecular-mechanics for internal degrees of freedom, then nonbonded interaction terms.

Implicit solvent models

Implicit solvent models are based on the decomposition of the system solvation free energy ΔGsolv by integration over the solvent degrees of freedom. In general, the total ΔGsolv is split into two parts: the first counts electrostatic interactions between charged atoms and the second counts the nonpolar, non-electrostatic interactions:
ΔGsolv = ΔGelecsolv + ΔGapolsolv (1.6)
The solvation process is decomposed in 3 steps: 1) formation of a solvent cavity which depends on the solute volume/shape, without solute-solvent interactions 2) turn on solute-solvent vdW dispersion interactions, 3) turn on electrostatic interactions. Steps 1 and 2 define ΔGsolvapol while step 3 defines ΔGsolvelec . We now describe how popular implicit models used in CPD express the diﬀerent terms of equation 1.6.
Surface-accessible Surface Area In SASA models, the non-polar solvation free energy is written as a linear combination of contributions from solute atoms:
For each atom i the solvent accessible surface Si [fig. 1.3] is multiplied by a coeﬃcient σi. It should be noted that equation 1.7 is based on a strong assumption of additivity.

Structural models for CPD

Discretization of the protein conformational space Using an implicit solvent model, the cost to compute interactions is aggressively reduced. Protein degrees of freedom must also be simplified in order to explore large ensembles of protein variants in an acceptable computing time. Most CPD implementations use a fixed backbone while side chains assume a discrete set of conformations, or rotamers. These are defined by a rotamer library, which can be backbone dependent (Dunbrack & Karplus [1993]) or not (Tuﬀery et al. [1991]). The use of rotamer li-braries is justified by the fact that amino acid side chains in proteins assume a limited number of configurations, as confirmed by several studies (Finkelstein & Ptitsyn [1977] ; Janin et al. [1978] ; Ponder & Richards [1987]). CPD models can also use native rotamers, extending exis-ting libraries by introduction of side chain configurations extracted from the crystallographic structure of the system of interest.
Pairwise-additive energy function To explore vast ensembles of structures and sequences, CPD software needs to rapidly compute interactions between couples of atoms in a discretized conformational space. In many situations it is convenient to follow a combinatorial approach. The protocol is based on a pre-calculated energy matrix, where pairwise-additive interactions between groups of atoms (in general between couples of residues) are stored and can be acces-sed during the sequence/structure exploration (Dahiyat & Mayo [1997] ; Gaillard & Simonson [2014]). Additional information will be given below, with more details about our in-house soft-ware Proteus.
We describe now the general case, where such an energy function would be Etot = Eii + Eij (1.28)
The sum runs over residues i and j which occupy one of their possible rotamers. Using equa-tion 1.28, a problem arises when one wants to define a pairwise-additive energy function based on molecular-mechanics, where the solvent is described implicitly with one of the previously discussed models, for example GB. Indeed, pairwise diagonal and oﬀ-diagonal terms Eii and Eij should depend only on atomic coordinates of residues i and j. However, the GB interaction between couples of atoms belonging to two residues [eqn. 1.14] has an implicit dependency on all solute atom positions through the atomic solvation radii bi [eqn. 1.15]. Further ap-proximations are needed to make Etot residue-pairwise. One possible solution is to calculate GB interactions between pairs of residues using solvation radii obtained in the protein native conformation. This solution is called the Native Environment Approximation (NEA). Another strategy (Archontis & Simonson [2005]) was recently implemented in Proteus and makes use of a Fluctuating Dielectric Boundary (FDB, Villa et al. [2017]). Another problem arises conside-ring the calculation of the solvent accessible surface (important when using CASA or GBSA). Indeed, considering a couple of residues i and j the intersection between their surfaces can also include their intersection with a third residue, as well as all other residues in the system [fig. 1.6]. To avoid double counting of buried surfaces, Street & Mayo [1998] proposed to apply a correction factor to exclude overlapping volume defined by surfaces of nearby buried residues.

Exploration of the structure-sequence space

Most CPD programs generate a limited number of protein variants, solving problems of re-duced complexity in an acceptable computing time. Indeed, one could be interested in the lowest energy state or a group of sequences close to this Global Minimum Energy Configu-ration (GMEC). Several approaches can be found in literature: some of them are heuristic,

READ Are dissolved and colloidal P species major components of diffuse P losses in agricultural landscapes?

CPD softwares others are stochastic.

Heuristic methods Heuristic methods are useful to determine a group of low energy configu-rations starting from one or more random states. A popular method was introduced by Wer-nisch et al. [2000]. They proposed a simple sampling method, where starting from a random protein sequence, a single position was picked and and its rotamer optimized. The procedure was repeated for many iterations, until a local minimum was found. The method successfully reproduced side chain conformations for surface residues and stability changes for mutations applied in the protein core or at its surface. However, this method generally produces only a few variants. For this reason it is not particularly adapted to perform high-throughput design, where one wants to generate a distribution of many protein variants with a single simulation.
Monte-Carlo methods Monte Carlo (MC) sampling can be used to obtain a set of pro-tein sequences generated from a stochastic process. One popular Monte Carlo method suitable for protein design is based on the Metropolis algorithm. With MC, it is possible to generate a distribution of protein sequences that are distribuited according to a particular probability density function. The system energy is used to accept or refuse new system configurations that populate the desired distribution. For MC, convergence is assured only for a very long simu-lation, and the sampling can be stuck in local minima. However, several advanced sampling techniques can be employed to avoid too long simulations and to jump free energy barriers. More technical information will be given below, with a detailed description about implemen-tation in our software Proteus.
The main advance of Monte Carlo is that is usually simple to implement and can be easily adapted to many diﬀerent problems. Sampling Boltzmann probability, is also possible to extract statistical-mechanics properties (for example free energy diﬀerences) which can be easily related to experimental data or to results of molecular dynamics simulations.
Before introducing our in-house software Proteus, we briefly describe two CPD programs used in the lab during the last few years.
Rosetta is a collection of programs suitable for molecular modelling developed by a large community of researchers (RosettaCommons is an organization that counts 150 developers around the world). RosettaDesign is a utility used for CPD of protein stability; other tools like RosettaDock or RosettaAbInitio are used to predict conformations of protein-ligand com-plexes or de novo protein structure prediction. Several Rosetta energy functions are inspired by molecular mechanics but make extensive use of statistical terms used to fit experimental data. The first versions were based on a fixed backbone and a backbone-dependent rotamer library, while models with limited backbone flexibility were introduced more recently. Despite its remarkable success in the protein design field, the Rosetta energies are usually expressed in unphysical, rosetta units. Even if some eﬀort has been made to translate them to kcal/mol (Alford et al. [2017]), results are still diﬃcult to interpret, especially when compared to those obtained with experiments or state-of-the-art molecular simulations. Moreover, for the design procedure the developers preferred fast algorithms (for example Monte Carlo with Simula-ted Annealing) which are able to generate a few variants in a limited computer time, rather than generating many protein variants with a single run. The user interested in generating an ensemble of sequences often needs to run several independent Rosetta simulations and then discard repeated variants.
Toulbar2 is a C++ solver for cost function networks (CFN) developed at INRA in Toulouse (Allouche et al. [2014]). The program uses a cost function to find the GMEC from a matrix containing interactions between couples of residues (Traoré et al. [2013]). These interactions, however, must be computed using an external program like Proteus or Rosetta-fixbb (fixed backbone design).

Table of contents :

1 Computational design of PDZ-peptide binding
1 Energy functions for CPD
1.1 Bonded and nonbonded interactions
1.2 Implicit solvent models
1.3 Structural models for CPD
2 Exploration of the structure-sequence space
3 CPD softwares
4 The Proteus CPD framework
4.1 Folded and unfolded states
4.2 Proteus Energy Function
4.3 Pairwise residue GB interaction
4.4 Monte Carlo exploration
5 Constant-activity and constant-pH Monte Carlo
2 High throughput design of protein-ligand binding
1 PB/LIE analysis
1.1 Semi-empirical Free Energy Function
1.2 Structural models and simulations setup
1.3 PB calculations
1.4 Results
2 Bias convergence
3 Perspectives: advanced sampling
3 Polarizable free energy simulations
1 Introduction
2 Induced dipole model
2.1 Fields and dipoles
2.2 Electrostatic energy
2.3 Induced dipole force fields
3 Point charge models
3.1 Fluctuating charge model
3.2 Drude pseudo-particle model
3.3 Drude polarizable water models
3.4 Simulating the Drude force field
3.5 MD implementation
4 Standard and relative binding free energies
5 Free energy perturbation
5.1 Thermodynamic integration
5.2 Bennet acceptance ratio
5.3 BAR and TI with Drude
5.4 Alchemical transformation
6 Artefacts in charging free energy calculations
6.1 PME with tinfoil boundary conditions: neutralizing gellium
6.2 Solvent polarization artefacts in a periodic system
4 PDZ-peptide binding specificity with polarizable free energy simulations
1 Introduction
2 Methods
2.1 Structural models and simulation setup
2.2 Alchemical MD simulations
2.3 Drude dual topology
2.4 Spurious interactions with Drude and dual topology
2.5 PME correction
3 Results
4 Conclusions
5 Classical Drude Model for methyl phosphate and phosphotyrosine
1 Introduction
1.1 Phosphotyrosine in Tiam1 binding
1.2 Phosphates in biology
1.3 Earlier additive results, need for polarizability
1.4 Methyl phosphate as a model
2 Methods: overview
2.1 QM quantities and software
2.2 The GAAMP and DGENFF tools
2.3 Parametrization strategy
2.4 Phosphate:Mg2+ binding model
3 Methods: QM calculations
3.1 QM electrostatic potential maps
3.2 QM polarizability and dipole moment
3.3 QM solute–water and solute–ion interaction energy scans
3.4 QM dihedral scans
3.5 QM vibration frequencies
4 Methods: fitting the QM quantities
4.1 Fitting QM potential maps
4.2 Fitting the molecular polarizability
4.3 Fitting solute–water interaction energies and the molecular dipole
5 Methods: phosphate:Mg2+ binding free energies
5.1 Free energy perturbation protocol
5.2 Simulations setup
5.3 Alchemical free energy simulations
6 Results
6.1 Initial MP model and atom types
6.2 Fitting the MP− potential maps
6.3 Fitting the MP− polarizability
6.4 Fitting the MP−–water radial interaction energy scans
6.5 Electrostatic parameters optimization for MP2− and P2−
6.6 Fitting the dihedral energy scans
6.7 Conformational energies of MP− and MP2−
6.8 Fitting QM interactions with Mg and Na ions
6.9 Mg2+:phosphate binding free energies
6.10 Final MP and P2−i topology and parameters
7 Classical Drude model for dianionic phosphotyrosine
7.1 QM quantities
7.2 Electrostatic parameters optimization
7.3 Dihedral parameters and fit
7.4 Final pCRES and pTyr−2 topology and parameters
7.5 Simulating Drude phosphotyrosine in the Tiam1:pSdc1 complex
8 Conclusion
Bibliography