Physics and functional representation: the potential energy function

Monooxygenases: Existing computational studies

The wealth of experimentally solved structures of wild-type and mutant PHBH (and other flavoprotein monooxygenases) in complex with many substrate and cofactor variants has produced a large base of experimental atomic coordinates for computational studies. These studies indicate that an atomic resolution structure of the Coq6 enzyme can help us understand substrate binding and catalysis at a similar level of detail. Here we briefly review five examples from the literature of the interplay between modeling and experiment in the characterization and rational design of flavoprotein monooxygenases (FPMOs). First, we will see an example of the rational redesign of ligand binding in phenylacetone monooxygenase based on homology models of the enzyme.85 In the second example we will see molecular dynamics applied to the investigation of the proton transfer network in PHBH.88 The third example describes accessible volume calculations performed on the PHBH structure.89 The fourth example describes the combined use of homology modeling, molecular dynamics, and docking to compare computed substrate affinities with experimental dissociation constants.91 Finally, we include an example of quantum mechanical modeling applied to the PHBH system to show the utility of atomic resolution protein structures.93
These examples highlight four types of calculations important for the study of Coq6: computational redesign of ligand binding, molecular dynamics, accessible volume calculation, and substrate docking. They establish a precedent for the modeling strategy and techniques we will apply to Coq6.

Computational redesign of ligand binding based on homology models

PHBH is a flavoprotein monooxygenase (FPMO), a member of a class of enzymes important in industrial chemistry because they allow the stereoselective monooxygenation of substrates. The wild-type enzymes are excellent starting points for creating enzymes that process industrial substrates, either through directed evolution or through structure-based rational design. Structural knowledge of these enzymes has been used to alter substrate specificity and product stereochemistry through rationally designed mutations.
An example of this is the rational redesign of the thermostable phenylacetone monooxygenase (PAMO) by Pazmino et al.85 PAMO’s thermostability makes it a good candidate for industrial biocatalysis, but it accepts only a small number of mainly aromatic substrates: phenylacetone, benzylacetone, alpha-methylphenylacetone, 4-hydroxyacetophenone, 2-dodecanone, bicyclohept-2-en-6-one, and methyl-4-tolylsulfide.86 In order to expand the enzyme’s substrate scope, particularly towards aliphatics, the authors turned to cyclopentanone monooxygenase (CPMO), a homologous enzyme with greater substrate scope but lower stability. The authors identified key residues in the PAMO active site which were not conserved in CPMO, reasoning that the corresponding residues in CPMO are the molecular basis for accepting more diverse substrates. These positions in PAMO were mutated to their CPMO counterparts in various combinations and tested for activity, revealing a single point mutation which allowed the binding of a novel substrate.
A key feature of the PAMO work is that the identification of substrate binding residues was done through the comparison of an experimentally solved PAMO structure (determined by crystallography) and a computationally predicted CPMO structure (created by homology modeling). This is an example of the practical utility of an experimentally validated homology model in identifying substrate binding residues and designing mutations. A similar but more detailed strategy for modeling Coq6 enzyme-substrate interactions is developed in Chapter 2 (Computational strategy and methods).

Molecular dynamics studies

Inspection of early PHBH crystal structures identified a network of titratable residues (H72, Y385, and Y201) and crystallographic water molecules connecting the active site to the protein surface and bulk solvent.87 The purpose of this network is to transfer a proton away from the substrate to the solvent while preventing direct contact between the solvent and the active site. This transfer was proposed to occur by proton hopping between the titratable residues in the network, a proposal corroborated by the reduced reactivities of substrates which cannot be deprotonated and of mutants in which the proton transfer network is disrupted. The direction of the proton flow depends primarily on the orientation of the side chain of H72. While crystal structures can give us the coordinates of the proton transfer network, they cannot tell us about the dynamic behavior of the residues involved, which is particularly important for determining the rotameric state (and therefore orientation) of H72. Molecular dynamics is well suited to exploring these conformational states and transitions at atomic resolution. In the case of PHBH, standard molecular dynamics simulations of the enzyme in different titration states enabled investigators88 to sample conformations accessible from the crystal structure. Because the functionally relevant protein movements in this proton transfer network are governed primarily by side-chain rotations, simulations on relatively short (sub-microsecond) timescales were sufficient to sample many relevant conformations and to provide structural explanations for the differing enzyme reactivities caused by different substrates and different mutations to the enzyme.
This is a good example of molecular dynamics (MD) used to simulate functional behavior in this class of enzymes. Similar calculations will be used in the structural analysis and conformational sampling of Coq6 molecular models, as described in Chapter 3 (Construction of Coq6 homology models and stability screening through molecular dynamics).
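To make the notion of a rotameric state concrete, the following minimal Python sketch computes the dihedral angle defined by four atoms, which is the quantity one would typically follow along a trajectory to track a side-chain rotation. The function name and the test coordinates are hypothetical and are not taken from the cited study; applied to the N, CA, CB, and CG atoms of a histidine it would give the chi1 angle, and applied to CA, CB, CG, and ND1 the chi2 angle.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond for four 3D points."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = _dot(b2, _cross(n1, n2)) / math.sqrt(_dot(b2, b2))
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: an eclipsed (cis) and a perpendicular arrangement.
cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
perpendicular = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Evaluating such an angle for every saved frame of a simulation yields a time series from which the populated rotamers and the transitions between them can be read off.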

Accessible volume calculation

Molecular modeling of PHBH began shortly after the resolution of the 1PHH crystal structure, with “Analysis of the active site of the flavoprotein p-hydroxybenzoate hydroxylase and some ideas with respect to its reaction mechanism” by Schreuder et al. (1990).89 In this work, the authors used molecular modeling to explore the possible positions of the distal oxygen of the flavin-dioxygen adduct through rotation about the O-C4a bond. They found three sterically favorable positions, one of which could correspond to a catalytic positioning, and another which could be compatible with reduction by NADPH. The third position demonstrated an accessible volume on the re face side of the isoalloxazine ring. NADPH is likely to appose its hydride-bearing nicotinamide ring to the re face of the isoalloxazine. The accessible volume also makes it likely that the dioxygen adduct forms on the re face of the flavin, since adduct formation converts the planar sp2 hybridized C4a carbon to a tetrahedral sp3 hybridized form. This tetrahedral geometry is bulkier than the planar aromatic geometry of the FAD in its resting state and requires more accessible volume, which can only be found on the re face.
This computationally developed hypothesis for the PHBH peroxo-flavin geometry was crystallographically confirmed with the resolution of the choline oxidase structure 2JBV.90 While not a Class A flavoenzyme like PHBH, choline oxidase also uses a flavin cofactor to perform its reaction. The 2JBV90 crystallization construct formed an oxygenated adduct on the re side of the flavin C4a atom under X-ray illumination, providing a first structure of a peroxo-flavin species co-crystallized in a protein.
This first peroxo-flavin modeling work on PHBH structure 1PHH dates from 1990 and is a prototypical example of the importance of accessible volume calculations. More modern accessible volume calculations will be used in characterizing the Coq6 active site as described in Chapter 2 (Computational Strategy and Methods) and as applied in Chapter 3 (Developing the hypothesis of a substrate access channel).

Challenges of studying the Coq system and the value added of molecular modeling

While continuing work in the field is likely to provide better enzymatic and structural characterization of the pathway enzymes, there are four fundamental challenges that may limit the isolated study of individual Q biosynthesis enzymes. These are: i) enzyme solubility, ii) substrate solubility, iii) the enzymes’ redox system, and iv) the functional interdependence of the Coq proteins in the CoQ synthome.
Together, these challenges suggest that creating in vitro constructs for Coq protein crystallization (and activity assays) will be more difficult than for the case of cytosolic proteins that do not form obligate multi-protein complexes. That is to say, the in vivo study of the Coq enzymes, including Coq6, may progress much faster than their experimental coordinates can be acquired. In this context of challenging structural characterization, molecular modeling has significant value added in developing residue- or atomic-resolution structure-function hypotheses. In this section we will briefly describe the four fundamental challenges of the Coq system and how molecular modeling can contribute to addressing them.

Enzyme solubility

The isolation and purification of enzymes for in vitro characterization requires good control of enzyme solubility, particularly at the high concentrations used for crystallization. However, because Coq proteins form part of a larger protein complex in direct contact with the inner mitochondrial membrane, their surfaces have likely evolved to form specific protein-protein and protein-membrane contacts. That is to say, it is likely that many of the Coq proteins have evolved not to be soluble in aqueous solution, a property which has hindered the structural resolution of both Coq6 and its bacterial homolog UbiI. This makes molecular modeling of the enzymes relevant both to improving the experimental enzyme purification process and to providing predictive molecular models. Molecular modeling of protein surface properties, such as electrostatic and aromatic surfaces, can be used to rationalize and modify the solution behavior observed for these proteins. This contribution is described in greater detail in Chapter 6 (Research perspectives).

Substrate solubility

In vitro characterization of catalysis requires a substrate, and poor aqueous solubility is a property shared by all Q biosynthesis intermediates. This is because attachment of the polyprenyl tail is one of the first steps in the pathway. Therefore, even if it is possible to create a soluble enzyme construct, providing it with an appropriate substrate in an aqueous in vitro assay may prove difficult. This makes molecular modeling of enzyme-substrate interactions a valuable technique which can generate structure-function hypotheses before structural resolution of the enzyme-substrate complexes. Molecular modeling of these interactions can also inform the design of substrate analogs which can strike a balance between solubility and reactivity. This contribution is described in greater detail in Chapter 4 (Selection of models by substrate docking).
Indeed, the poor aqueous solubility of Q biosynthesis intermediates may have been a contributing factor to the evolution of the CoQ synthome as a membrane-associated protein complex, since the attachment of the polyprenyl tail at the beginning of the pathway is likely to make desorption from the membrane energetically unfavorable for all Q biosynthesis intermediates.

Sequence search by partial pairwise methods: BLAST

The simplest methods of finding templates use exhaustive pairwise comparisons between the target sequence and potential templates in a protein sequence database (such as NCBI5 or UniProt6). The comparison relies on computing sequence alignments between the target and the sequences in the database. Therefore we will review the basics of sequence alignment, which will also apply to the more detailed alignments to be performed after the initial database search has yielded its results.
There are two classes of sequence alignment methods in common use today: dynamic programming and word-based methods.7 Dynamic programming methods operate on the complete protein sequences and are theoretically able to find a globally optimal alignment between two sequences or, indeed, any number of sequences. However, the computational expense of this approach makes it impractical to apply to more than 8 sequences at once,8 and a sequence database contains many more than this. Word-based methods are not guaranteed to generate optimal alignments (and therefore to find the best results in a database), but they are much faster to compute and therefore better suited to searching large databases. The best known examples of this class of methods are FASTA9 and BLAST10.
As the name implies, word-based methods select a series of short sub-sequences (words) of residues from the query sequence and look for matching words among the sequences in the database. The presence and relative positions of these “words” are used as sequence-specific features to recognize similarities between protein sequences without actually performing a globally optimized alignment against each. However, because these methods rely on detecting conserved “words”, they will only be able to find matches if the sequences are already quite similar. These methods perform well when sequence identity between the target and potential templates in the database is high (above 30%). However, when a given database contains only lower sequence identity templates (in the range of 20-30%), word-based search methods typically find only half of the possible templates. This is because the reliance on exact residue identity over relatively short words makes the search more sensitive to “noise” (mutations incurred through evolution) than to signal (identical residues). When two sequences have less than 30% global sequence identity, there is more noise than signal. That is to say, there are more differences than there are similarities, so recognizing templates by literal residue identity over short stretches is (understandably) likely to miss matches that may exist in the database.
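As an illustration of the dynamic programming approach, the following minimal Python sketch computes the score of the optimal global alignment of two sequences (the Needleman-Wunsch recurrence). The function name and the scoring values (a flat match/mismatch score and a linear gap penalty) are simplifying assumptions made for this sketch; practical alignments use residue substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the optimal global alignment of sequences a and b."""
    # prev[j] is the best score for aligning the residues of a processed
    # so far against the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i residues of a aligned against gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # residue a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # residue b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Every cell of the table is filled exactly once, so the cost grows with the product of the two sequence lengths. Extending the same table to N sequences adds one dimension per sequence, which is why the exact approach quickly becomes impractical.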
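The word-matching idea can be sketched in a few lines of Python. This is a toy illustration and not the FASTA or BLAST algorithm: the function names, the word length k=3, and the sequences are hypothetical. Every overlapping window of length k is taken as a word, and only exact word matches are counted.

```python
def words(seq, k=3):
    """Set of all overlapping length-k words (k-mers) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_word_hits(query, database, k=3):
    """Rank database sequences by the number of exact words shared with query."""
    query_words = words(query, k)
    hits = {name: len(query_words & words(seq, k))
            for name, seq in database.items()}
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy sequences: one differs from the query at a single
# position, the other carries substitutions scattered along its length.
query = "MKTAYIAKQR"
database = {"close": "MKTAYIAKQK", "distant": "MRSAYLAKHR"}
ranking = shared_word_hits(query, database)
```

In this toy case the sequence with scattered substitutions shares no exact three-residue word with the query even though several individual positions are identical. This is the loss of signal described above.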
What is needed to overcome this weakness is a way to represent each position in a pairwise alignment more generically, to make the representation and comparison of protein sequences less sensitive to their differences and more sensitive to their similarities. As more advanced methods described in the following sections will show, Coq6 has a sequence identity of 15-20% with respect to its closest templates. We are operating in a zone of low sequence ID where structural divergence between a candidate template structure and the target’s “true” structure is very likely. Therefore, we decided to turn to a class of more sensitive methods to find appropriate templates for Coq6: hidden Markov models. However, we will first need to introduce alignment scoring matrices in more detail, as performing operations on protein sequences with matrices is a common component to both the search and alignment procedures we will use in this project.

Sequence search by complete-sequence methods: PSSMs

A protein sequence serves to define a protein structure, and protein structure is more conserved than sequence.4 This fundamental and interesting result of the field means that a single protein sequence is a direct but intrinsically limited way of describing a protein structure. Analysis of structural databases such as SCOP11 and CATH12 reveals that about 1300 unique folds have been catalogued among about 65 000 experimentally solved structures in the Protein Data Bank13 (PDB).
The structural implication of this is that a single protein sequence is always actually a member of a larger family of structural homologs. For the purposes of searching for structural homologs it becomes valuable to represent a particular protein sequence (whether it is of the target or a template) in a way that is more general than a single explicit sequence, yet specific enough to be associated to only a single class of protein shape, that is to say, limited to a single global fold. The more general representation we are looking for can be constructed as a position specific scoring matrix14 (PSSM), also known as a profile. The matrix is a very useful data structure for describing multiple protein sequences, and we will see it again. To construct this PSSM for any protein sequence, we first find a set of close relatives with a simpler method, such as one of the pairwise sequence searches. We then create a multiple sequence alignment (MSA) using more accurate methods for this set of closely related proteins with each sequence occupying a row, and each column containing homologous residues at a generically numbered alignment position. The frequency of occurrence of each amino acid type at each position can be calculated from this MSA.
This allows us to create a new representation of the sequence. It has the general form of a matrix. Each row represents one of the 20 amino acids, and each column represents a numbered position in the MSA. Within each row (which corresponds to one of the 20 amino acids), the value at each column position is the frequency of occurrence of that amino acid among the sequences in the MSA. Thus we have constructed a matrix which describes a protein sequence as a series of residue frequency scores at specific positions: a position specific scoring matrix (PSSM).
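The construction just described can be sketched in Python as follows. The function name and the toy alignment are hypothetical, and the sketch assumes a gap-free MSA containing only the 20 standard amino acids.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_matrix(msa):
    """Return {amino_acid: [frequency at each MSA column]}.

    Assumes a gap-free alignment: every sequence has the same length and
    contains only the 20 standard one-letter residue codes.
    """
    n_seqs = len(msa)
    n_cols = len(msa[0])
    if any(len(seq) != n_cols for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    counts = {aa: [0] * n_cols for aa in AMINO_ACIDS}
    for seq in msa:
        for col, aa in enumerate(seq):
            counts[aa][col] += 1
    # One row per amino acid, one column per alignment position.
    return {aa: [c / n_seqs for c in row] for aa, row in counts.items()}

# Hypothetical four-sequence, three-column alignment.
pssm = frequency_matrix(["ACD", "ACE", "AGE", "ACE"])
```

For this toy alignment the row for A holds a frequency of 1.0 at the first position, while the rows for C and G hold 0.75 and 0.25 at the second position.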
For example, we can use a PSSM constructed for a set of templates to compute a compatibility score against a target sequence and evaluate the score to determine if the target really could be a member of the structural family used to construct the PSSM. Alternately, we could construct a PSSM for the target sequence as well, and compute the similarity between template PSSMs and the target PSSM.
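One possible way to compute such a compatibility score is sketched below in Python. The log-odds form, the uniform background frequency, and the pseudocount are assumptions chosen for illustration, not necessarily the scoring scheme used by any particular program. The function name and the toy matrix are hypothetical, and the target is assumed to be already aligned to the matrix columns without gaps.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_score(target, matrix, background=1.0 / 20, pseudocount=0.01):
    """Sum over positions of log2(observed frequency / background frequency).

    matrix maps each amino acid to its list of per-column frequencies.
    The pseudocount keeps residues never seen in a column from producing
    log(0); its value here is an arbitrary assumption.
    """
    n_cols = len(matrix[AMINO_ACIDS[0]])
    if len(target) != n_cols:
        raise ValueError("target must have one residue per matrix column")
    score = 0.0
    for col, aa in enumerate(target):
        p = (matrix[aa][col] + pseudocount) / (1.0 + 20 * pseudocount)
        score += math.log2(p / background)
    return score

# Hypothetical three-column frequency matrix.
toy = {aa: [0.0, 0.0, 0.0] for aa in AMINO_ACIDS}
toy["A"][0] = 1.0
toy["C"][1], toy["G"][1] = 0.75, 0.25
toy["D"][2], toy["E"][2] = 0.25, 0.75

family_like = log_odds_score("ACE", toy)  # residues frequent in the family
unrelated = log_odds_score("WWW", toy)    # residues never observed
```

A positive score indicates that the target's residues are more common at their positions than expected by chance, and a negative score indicates the opposite.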
The main limitation of this type of PSSM is its inability to represent insertions or deletions among the aligned protein sequences. Insertions and deletions are highly likely to occur during evolutionary divergence, particularly when target and template have less than 30% sequence identity. Therefore, while PSSMs give a useful framework for describing sequence variation among closely related sequences, we need a way to include insertions and deletions so we can use PSSMs to describe and detect distantly related sequences – specifically those with similar structures to Coq6.

Table of contents :

Acknowledgements
Chapter 0 General introduction
1. General introduction
2. Modeling strategy
3. Document structure
Chapter 1 Introduction to ubiquinone biosynthesis
1. General introduction to ubiquinone and its role in cellular metabolism
1. What is ubiquinone?
2. Structure of ubiquinone
3. Functions of ubiquinone
1. A lipid soluble redox agent in the electron transport chain
2. An antioxidant for membrane lipids, proteins, and DNA
3. A structural membrane lipid
2. The ubiquinone biosynthesis pathway in S. cerevisiae
1. Overview
2. Individual Coq proteins
3. Known structures of Q biosynthesis proteins
4. Coq6: Existing experimental data
1. Coq6 amino acid sequence
2. Chemical reactivity: hydroxylation and deamination
3. Protein-protein interactions
4. Clinical relevance of Coq6
3. Structures of Q biosynthesis monooxygenases
1. Introduction
2. PHBH: Holotype of Class A flavoprotein monooxygenases
1. PHBH: Global fold and FAD
2. PHBH: Catalytic cycle
3. Monooxygenases: Existing computational studies
1. Computational redesign of ligand binding based on homology models
2. Molecular dynamics studies
3. Accessible volume calculation
4. Substrate docking
5. QM/MM modeling
4. Discussion
1. Challenges of studying the Coq system and the value added of molecular modeling
1. Enzyme solubility
2. Substrate solubility
3. Enzyme redox systems
4. Enzyme interdependence
5. Conclusion: Questions addressed by the present work
Chapter 2 Computational strategy and methods
1. Introduction
2. Strategy and methods
1. From questions to techniques
2. From techniques to strategy
3. Overview of homology modeling
1. Template searching and alignment
1. The importance of finding good templates
2. Sequence search by partial pairwise methods: BLAST
3. Sequence search by complete-sequence methods: PSSMs
4. Hidden Markov model methods: Phyre2
5. Structure based searching: DALI
2. Sequence alignment
1. Pairwise alignment methods
2. Multiple sequence alignment methods
3. Progressive MSA: ClustalO
4. Iterative MSA: MAFFT-L-INS-I
3. Model building
1. MODELLER
2. I-TASSER and ROBETTA
4. Molecular dynamics
1. Molecular simulation
1. From the macroscopic to the microscopic
2. From particles of matter to systems in phase space
3. Algorithmic implementation of ensemble constraints
4. From cold crystals to warm bodies
2. Molecules: atomic structures and interatomic forces
1. Physics and functional representation: the potential energy function
2. Force-field selection: AMBER99-SB-ILDN
3. Molecular dynamics simulation code: GROMACS
4. Molecular dynamics protocols
5. Accessible volume calculation
1. Voronoi meshes: CAVER
6. Docking
1. Representing binding through docking simulations: AutoDock VINA
7. Computing resources
Chapter 3 Construction of Coq6 homology models and stability screening through molecular dynamics
1. Introduction
2. Template search
1. Sequence based search: Phyre2
2. Structure based search: DALI
3. Top templates: a structural review
1. 4K22
2. 4N9X
3. 2X3N
4. 1PBE
4. The Coq6 global fold can be divided into two regions for homology modeling: N-terminus and C-terminus
5. Coq6 contains an additional subdomain not present in known structural homologs
6. A Coq6-family MSA helps define the insert sequence
3. Model building
1. Modeling strategy: construction of a combinatorial set of multiple template models
1. Generation 1: 4K22 as the Coq6 N-terminal template; no FAD or Coq6-family insert
2. Generation 2: 2X3N as the Coq6 N-terminal template; no FAD or Coq6-family insert
2. Homology models including the insert are used to design constructs for in vivo testing
1. Generation 3: 2X3N as the Coq6 N-terminal template; with FAD and Coq6-family insert
2. Generation 3: Homology models from I-TASSER and ROBETTA
4. Molecular dynamics simulation of Generation 3 constructs
1. I-TASSER Coq6 model (2X3N based)
2. ROBETTA Coq6 model (1PBE based)
3. RATIONAL Coq6 model (2X3N, 4N9X, and 4K22 based)
4. Comparative regional RMSD summary plots
5. Conclusion
Chapter 4 Selection of Coq6 models through molecular dynamics and substrate docking
1. Introduction
2. Selection of Coq6 models by substrate docking
1. Receptor-ligand binding: induced fit vs. conformational selection
2. Receptor-ligand binding as approximated by ensemble docking
3. Preliminary study on substrate models and the Coq6 active site
4. Blind docking of 4-HP (polyprenyl length = 0)
5. Blind docking of 4-HP6 (polyprenyl length = 0)
6. Variations of tail length for computational and experimental approximations
7. Site directed docking of 4-HP with tail lengths of 1-6 isoprene units
8. Docking survey conclusion
3. Enzyme model analysis
1. Active site identification
2. Evolutionary residue conservation
3. Accessible volume calculation: CAVER
4. Molecular dynamics simulations: effective diameter of the tunnels and substrate
1. Substrate model selection
2. Tunnel diameter estimation
3. Atom selection
4. van der Waals radius corrections
5. re face tunnel 1
6. re face tunnel 2
7. si face tunnel 1
8. Conclusion: comparison of the 3 tunnel types
5. Substrate access channel characterization
1. Round 1 of docking: Channel traversability screening
1. Substrate docking into the I-TASSER Coq6 model
2. Substrate docking into the ROBETTA Coq6 model
3. Substrate docking into the RATIONAL Coq6 model
4. Conclusion of Round 1 of substrate docking
2. Round 2 of docking: the RATIONAL model and an active site geometry descriptor
6. Conclusion
Chapter 5 Testing the hypothesis of a Coq6 substrate access channel
1. Introduction
2. Review of known Coq6 mutants
1. The H. sapiens clinical mutation Coq6 G255R mutation corresponds to S. cerevisiae Coq6 G248R
3. MD and substrate docking of the G248R mutant
4. Rational design of novel mutants blocking the substrate access channel
1. MD and substrate docking of the L382E mutant
2. MD and substrate docking of the G248R-L382E double mutant
5. Experimental results
1. In vivo activity assays for Coq6 WT, G248R, L382E, and G248R-L382E
6. Conclusion
perspectives
1. Conclusion of the current work
2. Research perspectives
1. Molecular dynamics with substrate
2. Protein-protein interactions: binary pairs and protein complex architectures
3. Protein-membrane interactions
4. A phylogenetic study of the evolution of the Coq6-family insert
5. Modeling of C-terminal truncation mutants: a role in the deamination of 3-hexaprenyl-4-aminobenzoate?
6. Molecular dynamics over longer timescales
7. Substrate-enzyme assignment through systematic molecular modeling