Protein Molecules and their 3D Structures
Why Study Proteins?
Proteins are one of the major groups of macromolecules essential to all living organisms. Proteins perform many biological functions and participate in virtually all processes within biological cells. For example, proteins participate in cell signaling, molecular transport, and cellular regulation, and they also act as structural elements and components of the immune system.
Building Blocks and Architecture of Proteins
Amino Acids are the Building Blocks of Proteins. Proteins are made up of polypeptides. A polypeptide is formed when amino acids covalently join to each other in a sequential manner, releasing one water molecule for each peptide bond formed (Figure 2.1). A polypeptide is thus composed of a chain of amino acid residues (or simply “residues”). This sequence defines the “primary” structure of a protein. All of the 20 common amino acids have a central carbon atom (Cα) to which are attached a hydrogen atom, an amino group, a carboxyl group and a side chain. What distinguishes one amino acid from another is the side chain (often known as the R group) attached to the Cα (Figure 2.1). This side chain varies from a single hydrogen atom in glycine to a large aromatic group of atoms in tryptophan. A typical protein contains 200–300 amino acids, but some are much smaller and some are much bigger.
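The condensation described above can be illustrated with a small mass calculation: a chain of n residues contains n − 1 peptide bonds, so its mass is the sum of the free amino acid masses minus n − 1 water molecules. The following is a minimal sketch only; the function names are hypothetical, and the mass table covers just three amino acids (average masses computed from standard atomic weights) for the sake of the example.

```python
# Average molecular masses (g/mol) of a few free amino acids, keyed by
# one-letter code. Only three entries are listed to keep the sketch short.
FREE_AMINO_ACID_MASS = {
    "G": 75.067,   # glycine, C2H5NO2 (side chain: a single hydrogen atom)
    "A": 89.094,   # alanine, C3H7NO2
    "W": 204.229,  # tryptophan, C11H12N2O2 (large aromatic side chain)
}
WATER_MASS = 18.015  # g/mol, released once per peptide bond


def peptide_bond_count(sequence: str) -> int:
    """A linear chain of n residues is held together by n - 1 peptide bonds."""
    if not sequence:
        raise ValueError("sequence must contain at least one residue")
    return len(sequence) - 1


def polypeptide_mass(sequence: str) -> float:
    """Sum the free amino acid masses, then subtract one water per bond."""
    total = sum(FREE_AMINO_ACID_MASS[residue] for residue in sequence)
    return total - peptide_bond_count(sequence) * WATER_MASS


if __name__ == "__main__":
    for seq in ("G", "GA", "GAW"):
        print(seq, peptide_bond_count(seq), round(polypeptide_mass(seq), 3))
```

The one-letter sequence string used as input here is the same “primary” structure referred to in the text.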
Levels of Protein Architecture. When a protein is being made by the ribosome, its polypeptide chain is linear and non-functional. To become functional, the polypeptide has to fold and coil (Figure 2.2) into some unique stable 3D structure (often called its “tertiary” structure). This occurs through intermediate forms, when regular segments of the polypeptide fold locally into stable 3D structures (secondary structures) called α-helices and β-strands. Regions with no specific secondary structure are called loop or irregular regions. Some examples of secondary structures are illustrated in Figure 2.3. Many proteins are formed by the association of more than one folded polypeptide chain. The resulting structure is often called the “quaternary” structure of a protein. According to Anfinsen’s dogma (Anfinsen et al., 1961; Kresge et al., 2006), the primary sequence of a protein determines its tertiary structure. More generally, the central dogma of molecular biology states that the sequence of amino acids is determined by the sequence of nucleotides in the gene encoding it. Thus, a protein’s amino acid sequence determines its 3D structure, which in turn determines its biological function.
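To make the notion of secondary structure segments concrete, the sketch below splits a per-residue state string into consecutive helix, strand and loop regions. It assumes a simplified three-letter alphabet (H for α-helix, E for β-strand, C for loop or irregular regions), which is an illustrative convention chosen here rather than something defined in this chapter; the function name and the example string are likewise hypothetical.

```python
from itertools import groupby

# Assumed three-state alphabet for this illustration.
STATE_NAMES = {"H": "alpha-helix", "E": "beta-strand", "C": "loop"}


def segment_secondary_structure(states: str) -> list[tuple[str, int, int]]:
    """Group identical consecutive states into (name, start, end) segments.

    Residue positions are 1-based and inclusive.
    """
    segments = []
    position = 1
    for state, run in groupby(states):
        length = len(list(run))
        segments.append((STATE_NAMES[state], position, position + length - 1))
        position += length
    return segments


if __name__ == "__main__":
    # Made-up state string for an 11-residue chain.
    for name, start, end in segment_secondary_structure("CCHHHHCEEEC"):
        print(f"{name:12s} residues {start}-{end}")
```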
Because protein molecules can contain many thousands of atoms, which are too many to visualise simultaneously, the 3D structures of proteins are often drawn using simplified “cartoon” or “ribbon” representations that highlight stable secondary structures. Other graphical representations also exist, e.g. atoms only, atoms with connecting bonds, and the molecular surface of the protein (Figure 2.4).
Databases of Protein Sequences. Databases of protein sequences are valuable resources for the study of protein biological function. The Universal Protein Resource (UniProt; Apweiler et al., 2004b) is the main publicly available protein sequence resource. UniProt is a multi-database resource. For example, UniProt’s Swiss-Prot contains non-redundant high-quality annotated protein sequences, whereas UniProt’s TrEMBL contains redundant automatically annotated sequences. The annotation includes the function of the protein, the processes the protein is involved in, the type of cell the protein is located in, and the post-translational modifications (review by Apweiler et al., 2004a). These annotations are often described using a controlled vocabulary of terms called the Gene Ontology (Ashburner et al., 2000) to allow consistent descriptions of proteins and thus to facilitate database queries.
Protein Domains and their Classifications
What is a Protein Domain? Proteins are often composed of one or more structural subunits called domains. A domain is a compact region of protein structure that is generally made up of a continuous segment of amino acids, and is often capable of folding sufficiently stably to exist on its own. For example, Figure 2.5 shows the 3D structure of an osteonectin protein which consists of three domains, namely FOLN (PF09289; blue), Kazal_1 (PF00050; red) and SPARC_Ca_bdg (PF10591; green). Domains vary in size, but most are around 200 amino acids or less. On average, a protein is folded into approximately two domains (Sali et al., 2003). In the evolution of proteins, different combinations of domains give rise to the diverse range of proteins found in nature. In structural classifications of proteins and domains, a “domain family” is a group of domains sharing similar structural folds, whereas a “protein family” is a group of single-domain proteins or multi-domain proteins (Copley et al., 2002). In this thesis, the term “protein domain family” is used to refer to either a protein family or a domain family. A domain is often associated with a given function. Therefore, identifying domains within proteins may provide insights into their function (Copley et al., 2002 and references therein). For these reasons, protein sequence databases often classify and organize their sequences into protein domain families and superfamilies.
Sequence-Based Domain Definitions. There exist computational methods to identify domains in proteins. Because sequence data is more abundant than structural data and since protein sequence determines protein structure, most domain definitions are based on the identification of conserved sequences (Copley et al., 2002). Examples of sequence-based domain definitions include Pfam (Finn et al., 2010) and SMART (Letunic et al., 2009). These approaches usually involve collecting and aligning similar sequences automatically, manually editing the alignment to improve quality, and performing an iterative search to identify other related sequences using a hidden Markov model (HMM) sequence profile. For example, the current version of Pfam contains 13,672 protein domain families. Figure 2.5 illustrates the Pfam domain assignments for a given sequence.
Structure-Based Domain Definitions. The two widely used structure-based domain classifications are SCOP (Murzin et al., 1995) and CATH (Orengo et al., 1997). Both SCOP and CATH have four-level hierarchical classifications. The four levels are: class (secondary structure content), fold/architecture (arrangement of secondary structures), superfamilies/topology (connectivity between secondary structures), and families/homology (sequence, structure and function similarity). Figure 2.6 illustrates the CATH top level classification. Since it is well known that protein folds are often more evolutionarily conserved than their sequences (Chothia and Lesk, 1986), structure-based classifications are able to identify evolutionary relationships not detected by sequence analysis and hence may provide better insights into function. For these reasons, several groups have calculated structure-based sequence alignments of SCOP or CATH domain families. PALI (Gowri et al., 2003) and DALI (Holm and Rosenstrom, 2010) are two examples of databases of structure-based sequence alignments.
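A four-level hierarchy of this kind maps naturally onto a four-field record. The sketch below assumes the common CATH convention of writing a classification as four dot-separated integers (class, architecture, topology, homologous superfamily) and counts how many leading levels two domains share. The type and function names are hypothetical, and the code strings in the usage lines serve only as sample input.

```python
from dataclasses import astuple, dataclass

# Names of the four main CATH classes (top level of the hierarchy).
CATH_CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


@dataclass(frozen=True)
class CathCode:
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Parse a dot-separated four-level code such as '3.40.50.300'."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {len(parts)}: {code!r}")
    return CathCode(*(int(part) for part in parts))


def shared_levels(a: CathCode, b: CathCode) -> int:
    """Count how many levels two codes share, starting from the top."""
    count = 0
    for level_a, level_b in zip(astuple(a), astuple(b)):
        if level_a != level_b:
            break
        count += 1
    return count


if __name__ == "__main__":
    first = parse_cath_code("3.40.50.300")
    second = parse_cath_code("3.40.50.720")
    print(CATH_CLASS_NAMES[first.cath_class])  # top-level class name
    print(shared_levels(first, second))        # same class, architecture, topology
```

Comparing codes from the top down reflects the hierarchy: two domains that differ at the class level share nothing below it, whatever the remaining numbers are.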
Integrated Resource of Domain Classifications. Given the growing number of computational methods to identify domains in protein sequences and structures, some integrated resources have been developed to provide a unified framework for domain analysis. These include CDD (Marchler-Bauer et al., 2009) and InterPro (Hunter et al., 2012). CDD combines domain definitions from several sources such as Pfam (Finn et al., 2010), SMART (Letunic et al., 2009), and TIGRFAM (Selengut et al., 2007) with 3D structure information from the PDB to define domain boundaries and to guide multiple sequence alignments. InterPro is currently the largest integrated database of domain definitions and functional annotations. It includes Pfam, SMART, TIGRFAM, ProDom (Bru et al., 2005), PIRSF (Nikolskaya et al., 2006), HAMAP (Lima et al., 2009), SUPERFAMILY (de Lima Morais et al., 2011), CATH-Gene3D (Lees et al., 2010), PANTHER (Mi et al., 2010), and PROSITE (Sigrist et al., 2010).
Coverage of Protein 3D Structures
Why are Protein 3D Structures Important? Since the biological functions of proteins are determined by their 3D structures, it is essential to know their 3D structures to understand how they function at the molecular level. A 3D structure provides details about atomic contacts which may be useful in designing drugs to disrupt an interaction. Currently, the principal experimental techniques used to obtain 3D structures are X-ray crystallography, nuclear magnetic resonance (NMR) and cryo-electron microscopy (cryo-EM).
3D Structure Repository and Coverage. The Protein Data Bank (PDB; Berman et al., 2002; http://www.rcsb.org/pdb/) is the main worldwide repository of 3D protein structures. Currently, the PDB contains some 80,000 protein structures. X-ray and NMR techniques account for over 99% (88% X-ray and 11% NMR) of the 3D structures deposited in the PDB. Compared to UniRef100, which has some 18,000,000 distinct sequences, the PDB has only 47,000 distinct sequences (as of September 2012). This means that there are many proteins for which there are no known 3D structures. Furthermore, due to limitations in current experimental techniques, such as the difficulty in obtaining protein crystals, it is unlikely that all proteins will have their 3D structures solved in the foreseeable future. For this reason, important efforts have been made to develop computational approaches to predict the 3D structures of proteins from their amino acid sequence (Section 2.1.5; reviews by Wallner and Elofsson, 2005 and Zhang, 2008).
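The size of this gap can be made explicit with a short calculation using the approximate September 2012 figures quoted above. This is a rough sketch only: the function name is hypothetical, and it treats the ratio of distinct PDB sequences to distinct UniRef100 sequences as a simple measure of coverage, ignoring homologous proteins whose structures could be inferred by similarity.

```python
# Approximate counts quoted in the text (September 2012).
PDB_DISTINCT_SEQUENCES = 47_000
UNIREF100_DISTINCT_SEQUENCES = 18_000_000


def structural_coverage(with_structure: int, total_sequences: int) -> float:
    """Fraction of known sequences that have an experimentally solved structure."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    return with_structure / total_sequences


if __name__ == "__main__":
    coverage = structural_coverage(PDB_DISTINCT_SEQUENCES, UNIREF100_DISTINCT_SEQUENCES)
    without = UNIREF100_DISTINCT_SEQUENCES - PDB_DISTINCT_SEQUENCES
    print(f"coverage: {coverage:.2%}")
    print(f"sequences without a solved structure: {without:,}")
```

With these figures, fewer than three in every thousand distinct sequences have an experimentally determined structure.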
Conserved Protein Folds. From the principle of homology, evolutionarily related (homologous) protein sequences are generally assumed to share a similar 3D structure. One of the earliest studies of protein structures estimated that the large majority of proteins belong to about one thousand fold families (Chothia, 1992), suggesting that protein folds are often more evolutionarily conserved than their sequences (Chothia and Lesk, 1986). The current versions of the main protein structural classifications SCOP and CATH report 1,195 and 1,282 protein folds, respectively. Although Chothia’s estimate has stood for 20 years, it is difficult to say if nature is indeed restricted to these one thousand or so fold families. For example, other estimates range up to a few thousand (Govindarajan et al., 1999). However, statistics from the PDB (Figure 2.7) show that there has been no significant growth in the number of distinct folds for either SCOP or CATH during the last five years.
Table of contents:
1.1 The Protein Interactome
1.2 Modelling 3D Structures of Protein-Protein Complexes
1.3 Structural PPI Resources
1.4 Knowledge Discovery in Databases and Data Mining
1.5 Thesis Aims and Objectives
1.6 Overview of Thesis
2 Biological Context – Modelling 3D Protein-Protein Interactions
2.1 Protein Molecules and their 3D Structures
2.1.1 Why Study Proteins?
2.1.2 Building Blocks and Architecture of Proteins
2.1.3 Protein Domains and their Classifications
2.1.4 Coverage of Protein 3D Structures
2.1.5 Computational Methods to Predict 3D Structures of Proteins
2.2 Protein-Protein Interactions and their 3D Structures
2.2.1 Why Study Protein-Protein Interactions?
2.2.2 Databases of Experimentally-Detected and Predicted PPIs
2.2.3 Different Types of Protein-Protein Interactions
2.2.4 Coverage of 3D Protein-Protein Interactions
2.2.5 Previous Analyses of Protein-Protein Complexes
2.2.6 Current Protein-Protein Interface Prediction Algorithms
2.3 Modelling 3D Structures of Protein-Protein Complexes
2.3.1 Template-Based Modelling of Protein Complexes
2.3.2 Ab-Initio Docking
2.3.3 The CAPRI Blind Docking Experiment
2.4 Existing Structural PPI Resources
2.4.1 Classifications of 3D Structures of Protein-Protein Complexes
2.4.2 Characterisations of Protein Functional Sites
2.4.3 Classifications of Protein-Protein Interfaces
2.4.4 Structural Databases of Protein-Protein Complexes
2.4.5 Integrated Databases, APIs and Libraries
2.4.6 Docking Benchmark Datasets
3 Introducing KBDOCK – An Integrated Database of 3D Protein Domain Interactions
3.2 The Three Selected Data Sources
3.2.1 The Pfam Protein Domain Family Database
3.2.2 The 3DID Domain-Domain Interaction Database
3.2.3 The Protein Data Bank
3.3 Representing and Querying Pfam and 3DID Data Using Prolog
3.4 Collecting Representative Biological Hetero Structural PPIs
3.4.1 Classifying DDIs as Intra, Homo and Hetero
3.4.2 Distinguishing Between Crystallographic and Biological Contacts
3.4.3 Obtaining a Non-Redundant Set of DDIs
3.5 Annotating DDIs with Sequence and Structural Information
3.5.1 Identifying Conserved PDB Residues Using Pfam Consensus Sequences
3.5.2 Classifying Interface Residues as Core or Rim
3.5.3 Adding Secondary Structure Information Using DSSP
3.6 Superposing DDIs in 3D Space Using ProFit
3.7 Summary of the KBDOCK Data Processing Steps
3.8 The KBDOCK Data Model
3.9 Exploring DDIs in Protein Domain Families with KBDOCK
3.9.1 Querying KBDOCK
3.9.2 Exploring Pfam Domain Family Superpositions
4 Spatial Clustering of Protein Domain Family Binding Sites
4.1 Previous Protein-Protein Interface Classifications
4.1.1 The PIBASE Domain-Domain Interface Classification
4.1.2 The SCOPPI Domain-Domain Interface Classification
4.1.3 The 3DID Database of Domain-Domain Interfaces
4.1.4 The I2I-SiteEngine Protein-Protein Interface Classification
4.1.5 Keskin’s Classification of Protein-Protein Interfaces
4.1.6 The PPiClust Approach for Clustering Protein-Protein Interfaces
4.2 Previous Studies of Protein-Protein Interaction Modes
4.2.1 Aloy’s Analysis of Interaction Modes Between Domain Families
4.2.2 Korkin’s Analysis of Binding Sites Within SCOP Families
4.2.3 Shoemaker’s Analysis of Interaction Modes Between Domain Family Pairs
4.3 How Large is the Space of Interface Types?
4.4 Reusing Protein Interface or Binding Site Information
4.5 Classifying Domain Binding Sites in KBDOCK
4.5.1 Defining a Domain Binding Site Vector
4.5.2 Spatial Clustering of Domain Binding Site Vectors
4.6 Defining Domain Family Binding Sites
4.7 Distribution of DFBS in Pfam Domain Families
5 Classifying and Analysing Domain Family Binding Sites
5.1 Related Work on Protein-Protein Interface Analysis
5.1.1 Various Ways of Dissecting Protein Binding Sites
5.1.2 Hot Spot Residues
5.1.3 Hydrogen Bonds and Salt Bridges Across Interfaces
5.1.4 Interface Residue Composition
5.1.5 Interface Residue-Residue Contacts
5.1.6 Conservation of Amino Acid Residues at Interfaces
5.1.7 Non-Homologous Interactions With Structurally-Similar Faces
5.1.8 Secondary Structure Preferences at Interfaces
5.1.9 Structural Analyses of Hub Proteins
5.2 Large-Scale Analysis of Protein Domain Family Binding Sites
5.3 KBDOCK Provides a Large Dataset for Statistical Analyses
5.4 Annotating DFBSs with Secondary Structure Information
5.5 Classifying and Analysing DFBSs
5.6 Secondary Structure-Based Classification of DFBSs
5.7 Do DFIs Have SSE Pairing Preferences?
5.8 Are Binding Site Surfaces Special?
5.9 Are Multi-Partner Binding Sites Special?
5.10 Discussion and Conclusion
6 Protein-Protein Docking Using Case-Based Reasoning
6.2 Overview of Case-Based Reasoning
6.3 A Formal CBR Approach to Docking By Homology
6.4 The KBDOCK Case Representation
6.5 The KBDOCK Case Retrieval
6.5.1 Pfam-based Case Retrieval
6.5.2 The Single-Domain Docking Test Set
6.5.3 Coverage of FH, SH-two and SH-one Cases
6.6 The KBDOCK Case Adaptation
6.6.1 Modelling FH Problems Using Substitution Adaptation
6.6.2 Modelling SH Problems Using Transformation Adaptation
6.6.3 Evaluating the FH and SH Cases
6.6.4 Summary of KBDOCK Case Retrieval Results
6.7 The KBDOCK Case Refinement
6.7.1 The Extended Docking Test Set
6.7.2 Docking Refinement Results for Single-Domain Targets
6.8 Modelling Multi-Domain Docking Problems
6.8.1 Aggregating Multiple DDIs
6.8.2 KBDOCK Modelling Results for Multi-Domain Targets
6.9 Discussion and Conclusion
7.1 Summary of the Main Contributions
7.1.1 The KBDOCK Database of 3D Non-Redundant Hetero DDIs
7.1.2 The Domain Family Binding Site Concept
7.1.3 Structural Classification and Study of Domain Family Binding Sites
7.1.4 Case-Based Protein Docking
7.1.5 The KBDOCK Web Server
7.2 Timeliness and Novelty
7.3 Future Extensions to KBDOCK
7.4 Future Prospects