Fundamental review of proteins
This chapter provides the reader with the fundamental notions of biology that are mentioned throughout the manuscript and are necessary for understanding the practical motivation of our work. The content is inspired by the École Polytechnique textbook of molecular and cellular biology by Yves Gaudin, Arnaud Echard and Sandrine Etienne-Manneville, the book on membrane structural biology by Mary Luckey, and Jérôme Waldispühl's PhD thesis.
We briefly present the amino acids, the constituents of proteins, before describing the properties and structures of the proteins themselves. Then, we focus on the class of transmembrane proteins, especially the β-barrels, which are the subject of this work. We finally describe the problem of protein structure prediction and present the methods that have been developed to solve it.
Amino acids contain an amine group NH2, a carboxylic group COOH and an organic substituent R, attached to a central carbon atom called the α-carbon. In aqueous solution at neutral pH, amino acids exist in the zwitterionic form, where the amine functional group is protonated (NH3+) and the carboxylic functional group is deprotonated (COO−). The substituent R, also called the side chain, varies among the 20 different standard amino acids. The four groups attached to the α-carbon are distinct (except for glycine, in which the side chain R is a hydrogen atom). Therefore, there exist two mirror-image isomers, L and D (see Figure 1.1), of which only the L isomers are present in proteins.
Properties of amino acids
The individual properties of the constituent amino acids play a major role in determining the conformation and function of a protein. These properties are determined by the amino acid side chains. In this work, we make use of several properties that can be quantified, such as electric charge, polarity and hydrophobicity.
Among these, hydrophobicity is the most important factor. It measures the capacity of the amino acid to interact with water molecules or, more generally, its behavior in the solvent. Several hydrophobicity scales have been developed [31, 36, 37, 61, 72, 107, 108, 131, 133, 134] (see Table 1.1). They differ markedly because of the various methods used for measuring hydrophobicity. Some methods examine proteins with known three-dimensional structures and define the hydrophobic character as the tendency of a residue to be found inside the protein rather than on its surface. Others are derived from the physicochemical properties of the amino acid side chains. The widely used Kyte-Doolittle scale helps detect hydrophobic regions in proteins: regions with a positive value are considered hydrophobic. This scale can be used for predicting surface-exposed regions as well as for finding transmembrane domains. The Engelman scale, or GES scale, is useful for predicting transmembrane regions in proteins. Eisenberg et al. proposed a normalized consensus scale which shares many features with other hydrophobicity scales. The Hopp-Woods scale can be used to identify putative antigenic sites in proteins. Cornette et al. compared thirty-eight published hydrophobicity scales for their ability to identify amphipathic α-helices and proposed an optimized scale using the eigenvector method. The Janin scale and the Rose scale evaluate the accessible and buried amino acid residues of globular proteins. Certain scales are calculated for specific classes of proteins: for instance, the White & Wimley scale evaluates the ability of amino acids to penetrate the hydrophobic membrane environment.
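To illustrate how such a scale can be used to detect hydrophobic regions, the following minimal sketch averages the Kyte-Doolittle values over a sliding window and reports the windows whose mean is positive. The function names, the window size and the example sequence are assumptions made for illustration only; they are not taken from the methods cited above.

```python
# Kyte-Doolittle hydropathy values, indexed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence.

    `window` is an illustrative parameter; an empty list is returned when
    the sequence is shorter than the window.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if window < 1 or window > len(values):
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def hydrophobic_windows(sequence, window=9, threshold=0.0):
    """Return the 0-based start positions of windows scoring above threshold.

    The default threshold of 0 follows the convention that positive
    values are considered hydrophobic.
    """
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score > threshold]


if __name__ == "__main__":
    # Made-up sequence used only to exercise the functions.
    example = "DKRELLIVVALLIVGDKRE"
    for start, score in enumerate(hydropathy_profile(example, window=5)):
        print(start, round(score, 2))
```

The same code works with any other scale by substituting the dictionary of values; only the interpretation of the threshold changes.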
The 20 amino acids are classified into different categories according to the properties of their side chains. The most common classification is the following.
Glycine is the simplest amino acid, with a single hydrogen atom as its side chain.
Alanine, valine, leucine and isoleucine possess an aliphatic side chain that makes them hydrophobic.
Serine and threonine have an aliphatic side chain with a polar hydroxyl group.
Phenylalanine, tyrosine and tryptophan contain an aromatic group. The hydroxyl function of tyrosine is a weak acid with pKa ∼ 10. Tyrosine is therefore ionizable but not ionized in physiological conditions.
Lysine, arginine and histidine are basic. Lysine and arginine have a high pKa in solution (10.5 and 12.5, respectively), and are thus positively charged in physiological conditions. The low pKa of histidine (∼ 6) makes it neutral or protonated depending on the pH of the solution.
Aspartate and glutamate (also named aspartic acid and glutamic acid) are acidic, with low pKa of about 3.9 and 4.3, respectively, and are negatively charged at neutral pH.
Asparagine and glutamine are the amidated products of aspartate and glutamate, and are thus not ionizable.
Cysteine and methionine possess a sulfur atom in their side chain. The sulfhydryl group in cysteine is a highly potent nucleophile and also a weak acid. It can be easily oxidized to form a disulfide bond with another cysteine, which stabilizes the three-dimensional conformation of proteins.
Proline has a structure that differs from that of the other amino acids. Its cyclic secondary amine function gives it a specific role in the establishment of the three-dimensional structure of proteins.
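The relation between the pKa values quoted above and the charge state at a given pH can be made explicit with the Henderson-Hasselbalch equation. The sketch below is only an approximation under stated assumptions: it uses the approximate side-chain pKa values given in this section, ignores the N- and C-terminal groups, omits cysteine (whose pKa is not given here), and neglects any shift of pKa caused by the protein environment. The function and dictionary names are hypothetical.

```python
# Approximate side-chain pKa values as quoted in the text above.
BASIC_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}   # protonated form carries +1
ACIDIC_PKA = {"D": 3.9, "E": 4.3, "Y": 10.0}   # deprotonated form carries -1


def side_chain_charge(residue, ph=7.0):
    """Return the average side-chain charge of one residue at the given pH.

    Henderson-Hasselbalch: the protonated fraction of a group is
    1 / (1 + 10**(pH - pKa)).
    """
    residue = residue.upper()
    if residue in BASIC_PKA:
        # Basic groups are charged (+1) when protonated.
        return 1.0 / (1.0 + 10 ** (ph - BASIC_PKA[residue]))
    if residue in ACIDIC_PKA:
        # Acidic groups are charged (-1) when deprotonated.
        return -1.0 / (1.0 + 10 ** (ACIDIC_PKA[residue] - ph))
    return 0.0  # Side chains treated as non-ionizable in this sketch.


def net_side_chain_charge(sequence, ph=7.0):
    """Sum the average side-chain charges over a sequence (termini ignored)."""
    return sum(side_chain_charge(aa, ph) for aa in sequence)


if __name__ == "__main__":
    for aa in "KRHDEY":
        print(aa, round(side_chain_charge(aa, ph=7.0), 3))
```

At a pH equal to its pKa, a group is half ionized, which is why histidine (pKa ∼ 6) switches between its neutral and protonated forms around physiological pH, while lysine, arginine, aspartate and glutamate are almost fully charged.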
A peptide bond is a covalent bond formed between the α-carboxylic group of one amino acid and the α-amine group of another. This reaction combines two amino acids into an amide (dipeptide) and releases a molecule of water (H2O). It is thus called a dehydration reaction or a condensation reaction. Amino acids in a protein are covalently linked together by peptide bonds to form an unbranched polypeptide chain. Each amino acid unit in the chain is called a residue. A polypeptide possesses an amino-terminal extremity (N-terminus) and a carboxy-terminal extremity (C-terminus). The synthesis of a polypeptide is carried out in a process called "translation", in which residues are added one after another starting from the N-terminus. The N-terminus is therefore considered the beginning of the chain.
A polypeptide chain thus consists of a regularly repeating backbone and a variable part made of the amino acid side chains Ri, where i denotes the residue position counting from the N-terminus. These side chains determine the specific properties and functions of each protein. The sequence of amino acids of a polypeptide chain is known as its primary structure.
The peptide bond has a partial double-bond character due to the mesomeric (resonance) effect. As a result, the six atoms of the peptide group (the two α-carbons and the C, O, N and H atoms between them) are coplanar and form a peptide plane.
The trans configuration is energetically favored, as it causes less repulsion between non-bonded atoms. Crystallographic studies have shown almost constant values of the distances and angles of the peptide bond in every polypeptide chain (see Figure 1.3).
As the geometry of a peptide plane is fixed, the torsion angles φ and ψ are the two degrees of freedom that determine the conformation of the polypeptide chain. φ is the dihedral angle around the N–Cα bond, determined by the two carbonyl carbons C on either side of it. ψ is the dihedral angle around the Cα–C bond, determined by the two nitrogens N on either side of it (see Figure 1.4). There are strong constraints on the angles φ and ψ: certain combinations are sterically impossible, while others are energetically unfavorable. Ramachandran et al. [100, 101] introduced the Ramachandran diagram to visualize the backbone dihedral angles φ and ψ in the polypeptide chain of proteins. Each residue of the protein is represented by a point with coordinates (φ, ψ), each angle lying in the range [−180°, 180°]. The Ramachandran diagram of the constituent amino acids of the outer membrane protein A (PDB:1BXW) is presented in Figure 1.5. The limited regions in which the points (φ, ψ) are distributed reflect the restricted flexibility of the polypeptide chain.
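As an illustration of how φ and ψ are obtained from atomic coordinates, the sketch below computes a signed dihedral angle from four points and applies it to the backbone atoms of one residue: φ from C(i−1), N(i), Cα(i), C(i), and ψ from N(i), Cα(i), C(i), N(i+1). The function names are hypothetical, and the coordinates in the example are arbitrary points chosen to give simple angles, not real protein data.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _scale(a, s):
    return tuple(x * s for x in a)


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees, in [-180, 180], around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Return (phi, psi) for one residue from five backbone atom positions."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)


if __name__ == "__main__":
    # Arbitrary points, not real atomic coordinates.
    print(phi_psi((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)))
```

Applying `phi_psi` to every residue of a structure and plotting the resulting pairs gives the points of a Ramachandran diagram; the first and last residues of a chain lack one of the two angles because the preceding or following residue is missing.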
Proteins are macromolecules made up of a large number of amino acids, from a few dozen to several hundred. They are one of the four major classes of organic macromolecules in living organisms, along with nucleic acids, carbohydrates and lipids. Many proteins are composed of only one polypeptide chain (monomers). Others are formed of more than one chain and are thus called oligomers (e.g., dimers, trimers, tetramers). If these chains are identical, the protein is called a homo-oligomer; otherwise, it is a hetero-oligomer. Each constituent chain is a subunit, also known as a protomer.
Proteins are essential in organisms and take part in almost every process in the cell. They are usually classified into three major classes according to their overall three-dimensional structures and their functional roles: fibrous, globular and membrane proteins.
Fibrous proteins (or scleroproteins), which tend to form elongated fibers, are generally inert and insoluble. These proteins are usually built from repetitive amino acid sequences. These characteristics make them suitable for structural roles in organisms, where they provide support and protection. For example, keratin makes up hair, nails and skin; collagen is abundant in connective tissues such as cartilage and tendons; elastin is important in ligaments and blood vessels. An example of collagen is given in Figure 1.6.
Globular proteins, which comprise a large variety of proteins, are soluble and exist in an aqueous environment. Hence, these proteins generally have compact structures with polar residues on the surface and hydrophobic residues in the core. They are the most represented in the Protein Data Bank (PDB), since their structures are usually stable and thus easy to determine experimentally. Two of the best-known globular proteins, myoglobin and hemoglobin, were the first two proteins whose structures were determined experimentally, by John Cowdery Kendrew and Max Ferdinand Perutz, who received the Nobel Prize in Chemistry in 1962 for this work. The structure of myoglobin is presented in Figure 1.7.
Membrane proteins reside in cell membranes, which are phospholipid bilayers with a hydrophobic core. They typically have exposed hydrophobic regions in order to be stable in such an environment. Some proteins adhere loosely to the membrane, while others are embedded in the lipid bilayer. Among the latter, transmembrane proteins span the entire biological membrane one or several times (proteins that cross it several times are called polytopic). Figure 1.8 illustrates the structure of the insulin receptor, a well-known transmembrane protein that helps induce glucose uptake; its insensitivity to insulin can therefore lead to diabetes.
Table of contents:
1 Fundamental review of proteins
1.2.1 Amino acids
1.2.2 Properties of amino acids
1.2.3 Peptide bond
1.2.5 Protein structure
1.3 Transmembrane proteins
1.3.1 Biological membrane
1.3.2 Transmembrane proteins
1.4 Folding energy
1.4.1 Partial charges
1.4.2 Electrostatic interaction
1.4.3 Hydrogen bond
1.4.4 Van der Waals forces and steric repulsion
1.4.5 Hydrophobic effect and interaction with the environment
1.4.6 Torsion energy around peptide bonds
1.4.7 Other interactions
1.5 Protein structure determination
1.5.1 Experimental methods
1.5.2 In silico prediction
2 Folding β-barrels
2.2 Geometric framework for β-barrels
2.3 Physicochemical constraints
2.4 Classification filtering
2.5 Folding problem definition
2.5.3 Energy attributes
2.5.4 Protein folding problem
2.6 Dynamic programming approach
2.6.1 Solving as the longest path problem
2.6.2 Solving as the longest closed path problem
2.7 Complexity on permuted structures
3 Tree-decomposition based algorithm
3.2 Graph-theory background
3.2.1 Tree decomposition
3.2.2 Modular decomposition
3.4 Algorithm for finding barrel structures of minimum energy
3.5 About Greek key motifs in β-barrels
4 Evaluation of performance of BBP
4.2 Experimental setup
4.3 Implementation details
4.4 Method of evaluation
4.4.1 Concepts on predicted secondary structures
4.4.2 Measures of performance
4.5 Experimental results
4.5.2 Evaluation of the shear numbers
4.5.3 Influence of the filtering threshold
4.5.4 Evaluation on mutated sequences
4.5.5 Permuted structures
Conclusion and perspectives