Engineering methionyl-tRNA synthetase for ligand:substrate binding and catalytic power

Get Complete Project Material File(s) Now! »

A physics‑based energy function allows the computational redesign of a pDZ domain Vaitea opuu1,3, Young Joo Sun2,3, Titus Hou2, Nicolas Panel1, Ernesto J. Fuentes2 & thomas Simonson1

Computational protein design (CPD) can address the inverse folding problem, exploring a large space of sequences and selecting ones predicted to fold. CPD was used previously to redesign several proteins, employing a knowledge-based energy function for both the folded and unfolded states. We show that a pDZ domain can be entirely redesigned using a “physics‑based” energy for the folded state and a knowledge-based energy for the unfolded state. Thousands of sequences were generated by Monte Carlo simulation. Three were chosen for experimental testing, based on their low energies and several empirical criteria. All three could be overexpressed and had native-like circular dichroism spectra and 1D-NMR spectra typical of folded structures. Two had upshifted thermal denaturation curves when a peptide ligand was present, indicating binding and suggesting folding to a correct, PDZ structure. Evidently, the physical principles that govern folded proteins, with a dash of empirical post-filtering, can allow successful whole-protein redesign.
Protein sequences have been selected by evolution to fold into specific structures, stabilized by a subtle balance of interactions involving protein and solvent1,2. In contrast, random polymers of amino acids are very unlikely to adopt a specific, folded structure3,4, and exhibit instead a more disordered structure 5. A powerful approach to understand the evolution of proteins and the basis of folding is to perform computer simulations that mimic evolution. This can be done with computational protein design (CPD), which explores a large set of sequences and selects ones predicted to adopt a given fold6–8. A typical simulation imposes a specific geometry for the pro-tein backbone, corresponding to the experimental conformation of a natural protein. Side chains are mutated randomly. Variants with a favorable predicted folding free energy are saved. The folded state energy function can be physics-based or knowledge-based9–11 while the unfolded state energy is knowledge-based. The protein is considered “redesigned” if most of the protein side chains are allowed to mutate during the simulation.
The successful redesign of complete proteins was reported in 20037,12 and small miniproteins were redesigned even earlier6,13. Several other successes were obtained14– 17, including a study where 15000 miniproteins (40–43 amino acids) were redesigned18. 6% of the designs were shown to be successful; i.e., the designed miniproteins folded into the correct tertiary structure. The others either could not be overexpressed and purified, or did not fold as predicted. All of the applications to proteins described the folded structure with an energy function that was at least partly knowledge-based, or statistical. Statistical energy terms included terms derived from experimental amino acid propensities and evolutionary covariances17, terms derived from inter-residue distance distributions in crystal structures16, and terms derived from torsion angle and hydrogen-bond distance distribu-tions in crystal structures11,14,15. All of the applications described the unfolded structure with a fully statistical, knowledge-based model.
Energy functions for the folded state can also be non-empirical, or physics-based, and taken from molecular mechanics19. There are then only two energy terms for nonbonded interactions between protein atoms, which correspond to the elementary Coulomb and Lennard-Jones effects. Their parameterization relies mainly on fitting quantum chemical calculations performed on small model compounds in the gas phase. The solvent is described implicitly, using varying levels of approximation20. The most rigorous model used so far is a dielectric continuum model21. This requires solving a differential equation, which is technically impractical in a protein design framework. Therefore, a Generalized Born (GB) approximation is more common. GB contains much of the same physics but provides a simpler, analytical energy expression20. GB models have been studied extensively in the context of protein design but also molecular dynamics, free energy simulations, acid/base calculations, ligand binding and protein folding22–25. They reproduce the behavior of the dielectric continuum model rather accurately. Therefore, an energy function that combines molecular mechanics for the protein with a GB solvent can be considered “physics-based”, even though it is not entirely constructed from first principles. A molecular mechanics energy, combined with a very simple solvent model, was used to design two artificial proteins that each consisted of a four-helix bundle, where an elementary unit of 34 amino acids was replicated four times26,27. However, until now, there has not been a complete, experimentally-verified redesign of a natural protein using a physics-based energy function for the folded protein.
Here, we report the successful use of a physics-based energy function to completely redesign a PDZ domain of 83 amino acids. PDZ domains (“Postsynaptic density-95/Discs large/Zonula occludens-1”) are globular domains that establish protein-protein interaction networks28. They interact specifically with target proteins, usually by recognizing a few amino acids at the target C-terminus. They have been extensively studied and used to eluci-date principles of protein evolution and folding29,30. Our design started from the PDZ domain of the Calcium/ calmodulin-dependent serine kinase (CASK) protein. It used the backbone conformation from a new, high-resolution X-ray structure of apo CASK reported here. Several other CASK X-ray structures are also known, with bound peptides. The CASK melting temperature is about 10 °C higher than that of the Tiam1 PDZ domain, which we attempted to redesign earlier33. This increased thermostability could aid in retrieving folded CASK designs. Design was performed by running long Monte Carlo (MC) simulations where most positions were allowed to mutate and all positions could explore a library or conformers, or rotamers. Positions occupied by glycine (seven) or proline (two) were not allowed to mutate. 13 positions that directly contact a peptide ligand in CASK:peptide complex structures (such as PDB 6NID) also kept their wild-type identity. All 61 of the other side chains (73.5% of the sequence) were allowed to mutate freely into any amino acid type except Gly or Pro, for a total of 3.7 × 1076 possible sequences. To describe the folded state, we used a physics-based energy function that combined the Amber molecular mechanics force field31 and a GB solvent32. To describe the unfolded state, we used a knowledge-based energy function33. The Proteus software was used34. Three sampled sequences, or designs were chosen for experimental testing, based on their low energies and several empirical criteria. All three were shown to fold, with good evidence the folded structure was the target, native PDZ fold. In particular, secondary structure content was native-like and binding to one or two peptides that are known CASK ligands was demonstrated for two of the three designs. Therefore, the redesign is considered a success. Evidently, the physical principles that govern folded proteins, as captured by molecular mechanics and continuum electrostatics are sufficient to allow whole-protein design, at least when assisted by a moderate empirical post-filtering. This is encouraging, since these methods give physical insights, can be systematically improved, and are transferable to nucleic acids, sugars, noncanonical amino acids, and ligands of biotechnological interest.

Results

MC simulations were done using the CASK backbone conformation (Fig. 1). The method is detailed in Sup-plementary Material. 61 of 83 residues were allowed to mutate into all types except Gly and Pro. 13 residues known to be directly involved in peptide binding were not allowed to mutate (but could explore rotamers). The exploration did not use any bias towards natural sequences or any limit on the number of mutations. The 2,000 sequences with the lowest folding energies were kept for analysis. Below, we describe their computational char-acterization and the selection of three representative sequences for experimental characterization.
Computational characterization and sequence selection. The top 2,000 sequences spanned a fold-ing energy interval of 1.5 kcal/mol. They were analyzed by the Superfamily fold recognition tool35, which assigns sequences to SCOP36 structural families. None of the top 2,000 Proteus sequences were assigned by Superfam-ily to any other fold in SCOP; all were recognized as belonging to the PDZ family. Blosum40 similarity scores between the designed sequences and natural sequences from the Pfam database were also computed (Fig. 2). The scores were high, and comparable to those of natural PDZ domains. The peaks in the Proteus histograms are narrow, indicating that the 2,000 lowest-energy sequences are similar to each other. Similarities to CASK are in Supp. Material (Fig. S1).
To narrow down the number of sequences for experimental testing, we excluded those with isoelectric points estimated to be close to the physiological pH, between 6.5 and 8.5, which might be subject to aggregation and difficult to express. This reduced the number of sequences from 2,000 to 1,268. Next, we used a criterion of negative design, by considering the confidence levels for the Superfamily assignments to the PDZ family, instead of another SCOP family. Of the 1,268 sequences left, we only retained those that had Superfamily match lengths above the mean value (over the 1,268) and E-values above the mean (log10 E < − 31). This left us with 692 sequences. We reduced the number further using four empirical criteria. (1) We excluded sequences with similarity scores versus Pfam below the mean (over the 692 remaining sequences). This eliminated a window of candidate sequences about 10 points wide, to the left of the mean,plus a few sequences in the lefthand tail of the distribution. We were left with 215 sequences. (2) We excluded sequences that had a cavity buried in the predicted 3D structure. (3) We required a total unsigned protein charge of less than 6. (4) We allowed no more than 15 mutations that drastically changed the amino acid type (defined by a Blosum62 similarity score between the two amino acid types of − 2 or less).
We were left with 16 candidate sequences, shown in Fig.3. They were separated into four groups by visual inspection. Group 2 was eliminated based on its Arg494 residue, absent from CASK homologs. One candidate was selected from each of the other groups (highlighted in Fig. 3), with a preference for native or homologous residue types at positions 492 (candidate 1350), 494 (candidate 1555), and 548 (candidate 1669)—positions that are close to the peptide binding interface. The three candidates were simulated by molecular dynamics with explicit solvent for one microsecond each, and their stabilities and flexibilities appeared comparable to the wild-type (Supplementary Material, Figs. S2–S3). Therefore, the three sequences were retained for experimental testing. The number of mutations, compared to wild-type CASK, were 50 (candidate 1350), and 51 (candidates 1555 and 1669), representing just over 60% of the sequence.
Experimental characterization of selected sequences. Earlier designs based on the Tiam1 tem-plate. Computational redesign of Tiam1 was described earlier33. It used the Tiam1 PDZ domain structure (PDB code 4GVD; Supplementary Material, Fig. S4). The GB electrostatics model included an additional “Native Environment Approximation” (NEA)37, where each atom experienced a constant dielectric environment that corresponded to the native sequence and conformation (see Computational Methods in Supplementary Material). This removed the many-body character of the GB model and made the calculations very efficient. Eight designs were expressed and purified. Their yields were low. CD gave spectra typical of random coil polymers, suggesting the proteins were misfolded (Supplementary Material, Fig. S5). 1D-NMR spectra of the amide region of the NEA designs had limited dispersion and broad resonances compared to the native Tiam1 PDZ domain, corroborating the CD data. An example is shown below; others are in Fig. S6. Differential scanning fluorimetry (DSF) in the presence of known Tiam1 ligands did not show any binding by the Tiam1 NEA designs, while the Tiam1 PDZ domain showed robust binding (Supplementary Material, Fig. S7). Together, these data indicate that the NEA-based designs of the Tiam1 PDZ domain could be overexpressed but adopted unfolded structures, un-able to bind known Tiam1 peptide ligands. Designs based on the CASK template. Next, we characterized the three designs selected above, which we refer to as FDB-1350, FDB-1555, and FDB-1669. They were obtained using as template a new apo CASK PDZ domain structure (PDB code 6NH9, reported here). The Tiam1 and CASK backbone conformations have a small rms deviation of 1.0 Å, despite a low sequence identity of 20.5%. CASK has a ~ 10 °C higher melting temperature, which could facilitate its redesign. The new calculations used a more rigorous GB electrostatics model (Supple-mentary Material), termed the “Fluctuating Dielectric Boundary” model (FDB)38. With this model, the dielectric environment of each atom was updated on-the-fly during the simulation, instead of being represented by a mean environment. The expression yields in E. coli were improved over the NEA Tiam1 designs, though not to the level typically seen with native PDZ domains. In contrast to the NEA Tiam1 designs, CD spectra were similar to native PDZ domains, suggesting these designs were structured (Fig. 4). 1D-proton NMR of the amide region showed good dispersion and sharp lines, consistent with a folded protein (Fig. 5B) and in contrast to the earlier, Tiam1 redesigns (Figs. 5A and S6). The designed proteins’ spectra had noisier baselines, due to a seven- to ten-fold lower concentration, compared to CASK.
We tested the ability of the designs to bind CASK ligands, using DSF experiments. The CASK PDZ domain showed binding to SDC1, Caspr4 and NRXN (Fig 6 and Table 1), as expected. Strikingly, two of the three CASK FDB designs characterized also showed binding to some of the peptides. Thus, FDB-1350 had a significant ther-mal shift in the presence of NRXN and SDC1. FDB-1669 showed a 1.0 °C change in T 1/2 in the presence of the peptide. From these data, we conclude that the three CASK FDB designs were folded and two were capable of interacting with peptide ligands. In principle, the CD and NMR spectra could be obtained with an alternative protein fold, distinct from the target PDZ fold. However, the structural data clearly indicate that the designs are well-ordered and have a secondary structure content similar to the CASK target. Importantly, the ordered character, the secondary structure content, the ability to bind CASK ligands, the structural stability during microsecond MD runs, and the Superfamily classification as a PDZ domain strongly suggest that the designed proteins adopt the target PDZ fold.

Discussion

Protein folding is thought to be induced by protein–solvent and solvent–solvent interactions39, since folding buries nonpolar groups and allows waters to interact with polar amino acid side chains and other waters. In this picture, the protein dielectric properties play a role, with the low-dielectric interior pushing polar protein groups out towards high-dielectric solvent. The protein nonpolar surface also plays a role, with exposed surface leading to fewer water–water interactions40. Thus, it is common to discuss protein solvation in terms of nonpolar and electrostatic components, and most implicit solvent models rely on this separation20. Small proteins have been found to fold correctly in MD simulations with both explicit solvent and accurate implicit solvent models22,41, which can all be considered physics-based. The inverse folding problem is even more complex, since it explores an enormous space of sequences, albeit with a reduced conformation set. Modeling the solvent is a key step to solve this problem, and a key ingredient of our procedure.
The first solvation component in our model is nonpolar and uses accessible surface areas and atomic surface tensions. Nonpolar solvation of a large collection of small molecules correlated well with surface area42, support-ing this treatment. The surface tension parametrization was updated recently, compared to our earlier Tiam1 designs43. Surface interactions in proteins are complex and have a many-body character6, 32, since three or more groups can have surfaces that all overlap. Our model explicitly includes backbone-side chain triple overlaps, while others are accounted for implicitly43.

Table of contents :

1 Introduction
1.1 PDZ domains
1.1.1 Three-dimensional structure of a PDZ domain
1.1.2 Computational studies of PDZ domains
1.1.3 The PDZ domain of the protein Cask
1.2 Aminoacyl-tRNA synthetases
1.2.1 Three-dimensional structures of aminoacyl tRNA synthetases
1.2.2 aaRS engineering for genetic code extension
1.2.3 Computational approaches to genetic code extension
1.3 The Proteus framework
1.3.1 Folded and unfolded models
1.3.2 The energy function
1.3.3 Sampling of sequence space
1.4 Other approaches
1.4.1 A knowledged-based energy function paired with a stochastic search
1.4.2 Combinatorial exploration for an exact solution
2 A physics-based energy function allows the computational redesign of a PDZ domain
3 Engineering methionyl-tRNA synthetase for ligand:substrate binding and catalytic power
3.1 The effect of native rotamers
3.1.1 Results
3.1.2 Conclusions
4 Engineering methionyl-tRNA synthetase for β amino acid activity: background and methods
4.1 Enzyme kinetics and standard free energy
4.1.1 Protein ligand binding
4.1.2 Michaelis-Menten model
4.2 Biological context
4.2.1 Methionine aminoacylation reaction
4.2.2 β amino acids
4.3 Theoretical methods
4.3.1 Design of proteins with a Monte Carlo approach
4.4 Structural models
4.4.1 KMSKS loop conformations
4.4.2 Ligand: force field and catalytic pose
4.4.3 Backbone relaxation
4.4.4 Unfolded state
4.4.5 Catalytic efficiency estimation
4.5 Numerical methods
4.5.1 Energy function
4.5.2 Parameters of MC simulations
4.5.3 Selection of mutable positions with binding site screening
5 Engineering methionyl-tRNA synthetase for β amino acid activity: results
5.1 First search for active variants
5.1.1 Affinity design for β-MetAMP and β-ValAMP
5.1.2 Residence time in molecular dynamic simulations
5.1.3 Selection using catalytic efficiency
5.2 Screening of pairs, formation of β-Met quadruplets
5.3 Design of β-Met quadruplets
5.4 Screening of pairs, formation of β-Val quadruplets
5.5 Design of β-Val quadruplets
5.6 Concluding discussion
6 Design of PDZ pairs with overlapping coding
6.1 Biological context
6.1.1 Natural examples of overlapping codings
6.1.2 Biotechnological applications
6.1.3 Evolutionary hypothesis
6.2 Material and methods
6.2.1 Selected proteins
6.2.2 Overlapping pairs design algorithm
6.2.3 Designed sequence characterization
6.2.4 Molecular dynamics protocol
6.2.5 Ab initio structure prediction
6.3 Results
6.3.1 Overlapping pair designs
6.3.2 Pairs selected for MD
6.3.3 Molecular dynamic validation
6.3.4 Ab initio structure prediction for the C3 pair
6.4 Concluding discussions
6.5 Design perspectives
6.5.1 Quintuplets of nucleotides
6.5.2 From Smith & Watermann to overlapping designs
Conclusion