Get Complete Project Material File(s) Now! »

## Grid-Based Methods

Grid-based methods are well known in protein docking due to their easy implemen-tation and high computational performance, since they take advantage of the fast Fourier transform (FFT) to score in one run all the translations corresponding to each rotation. Often, the receptor is fixed, while the ligand is rotated around.

In general, the surface and the core of the molecules are represented by 3D grids of N x N x N voxels characterized according to the parameters needed by the scoring function. Thus, two 3D grids are obtained, one representing the receptor and another the ligand, see figure 2.3. The basic idea is to fit surface regions, in such way that the interfaces must be complementary so that the proteins fit together well. Therefore, the accuracy of the docking is greatly based on the proper representation of the molecules surface. The size of the grid must be enough to contain the receptor and ligand in all the possible configurations to build feasible solutions. The number of voxels in the grid depends on the resolution level or detail the protein representation requires. To each voxel of the grid corresponds a special value to define the area belonging to the protein’s core, the protein’s surface and empty space, see figure 2.3. Usually, van der Waals atomic radii are used to decide which voxels are inside the region of surface Rl;m;n = 8 receptor core (2.1).

Here R and L are receptor and ligand respectively and l,m,n are the indices of the grid voxel. The protein region in the grid is represented by two kind of values to distinguish the core and surface. In the function above, represents the surface and the core receptor voxels, whereas the ligand surface and core voxels are represented by the and , respectively. Often the empty grid voxels are assigned with a zero value. The special values used to represent the voxels vary according the approach. For each shift of ligand L from receptor R on the grid, a score is obtained from the correlation between the equivalent grid voxels of R and L, which is calculated in an accelerated way by using FFTs. The values assigned to surface and core vary from method to method. For instance, in GRAMM (Global RAnge Molecular Matching) (Katchalski-Katzir et al., 1992) the surface value for and is one, zero for empty space, and small positive values for and large negative values for . This is done by using two parameters from the energy potential, R and U. R is the width of the negative energy well and is taken as the grid step value. U is the energy of repulsion. Everywhere beyond the 2R distance of an atom, the ernergy is 0. If the distance between the center of an atom and a given voxel on the the grid is shorter or equal to R then the value of the voxel is increased by U, otherwise if the distance is shorter than two times R then the value of -1 is added to the value of the voxel, observe example in Figure 2.4. In this way, when the contact is only between the surfaces the correlation value will be positive, whereas if it is a contact between the cores the correlation value will be negative. A more negative correlation correspond to a larger overlapping region. GRAMM has been widely used and extended to new versions by using a new grid projection Lennard-Jones potential function (Tovchigrechko and Vakser, 2005, 2006; Vakser, 1996).

Other example is PIPER, where the authors present other grid functions to map the proteins onto the 3D grid according to the terms they need to compute their energy function such as shape complementarity, electrostatic interactions and atom pairwise potentials (Kozakov et al., 2006). In ZDOCK 2.3/2.3.2, two complex functions were used to project the proteins onto the 3D grids, thus each voxel in the grid was described by two components one real and one imaginary. The imaginary part of each 3D grid voxel of the receptor and ligand can be described by either three, nine or zero. Zero represents empty space, nine the protein core and three the surface of the protein. The real part of each voxel in receptor 3D grid is represented by the counting of the atoms that are at a distance less than the atom radius plus some cutoﬀ. Thus, the strategy of this methodology is to favor the score according to the number of atom pair interactions that exist within a certain cutoﬀ, in such a way that models with higher number of atom interactions will obtain the best scores. On the other hand, the overlapping degree is evaluated by the correlation of the imaginary part of the voxels (Chen and Weng, 2003).

In Hex, density functions are used to compute a list of spherical harmonics coef-ficients to describe the proteins at the beginning of the docking process. Then, the two protein surface layers are projected on the 3D grids. The most external “skin” or layer represents the molecular region solvent-accessible, whereas the internal layer represents the internal atoms. The correlation is computed between these surface “skins” to find complementary interfaces (Ritchie and Kemp, 2000).

**Atomistic Scoring Functions**

Atomistic or “all-atom” scoring functions are designed to be used on high resolution (i.e. atomistic representations) 3D structures. The result given by an all-atom func-tion corresponds to the binding energy of the 3D structure being evaluated. The parameters used by these functions are often derived from data provided by experi-mental processes, such as X-ray diﬀraction and spectroscopy, simulations or quantum mechanics calculations. For every type of atom pairs in a system, a set of parameters is specified to evaluate van der Waals and electrostatics interactions by the all-atom function. The set often includes the optimal energy corresponding to the atom pair and the inter-atomic distance to obtain it, the partial atomic charges, and so on. The all-atom functions to compute the total potential energy of a complex, in general, are made up of a sum of terms as follows Etotal = Ehydrogen bonds + Edesolvation + Eelectrostatics + Evan der W aals: (2.3). The actual choice of terms varies from approach to approach. Ehydrogen bonds repre-sents the contribution of the hydrogen bonds to the total energy Etotal, Edesolvation the desolvation energies, Eelectrostatics the electrostatic potential often treated as a Coulumbic term and Evan der W aals the van der Waals interaction energy often mod-eled as a Lennard-Jones type of function. Since each atom pair in the molecule represents an arithmetic calculation, the computational cost of all-atom scoring methods can be expressed as O(N2), where

N represents the number of atoms in each protein. Hence, usually the computational cost of all-atom force fields is expensive, so that they are often only used as one of the last stages of refinement to improve the accuracy. In general, refinement strategies allow the movement of mainly the side chains (sometimes the backbone) to optimize their conformation aiming to minimize a scoring function (Dauzhenka et al., 2018; Mashiach et al., 2008; Li et al., 2003).

One example of a well known approach that uses an all-atom scoring function is Rosetta Dock (Gray et al., 2003), where after low-resolution searches, the rigid body position and the side chain conformations are optimized iteratively. At each step, one of the proteins is moved in small steps (random rotations of mean 0:05 and translations of mean 0.1 Å), whereas an optimal combination of rotamers for the side chains is searched, such search is often called side-chain packing. This is followed by a series of rigid-body minimizations, and at the end a final score is computed for each solution predicted (Gray et al., 2003). In SwarmDock and ClusPro, the last step consists in minimizing using CHARMM the possible solutions generated (Moal et al., 2018; Kozakov et al., 2013). Some all-atom functions were especially designed to refine or re-score docking solutions. Usually, they are implemented by adding some degree of flexibility in order to increase the quality of the solutions. For instance, the FireDock refines docking solutions from PatchDock in two steps. The first step consist of allowing flexibility to residues on the interface with the highest energy and minimizing the binding score function. The models obtained are refined again in a second step, but now all the interface residues are flexible, producing models to be scored again to acquire the final list of solutions (Andrusier et al., 2007). Other well known all-atom force fields used in simulations like AMBER, CHARMM and GROMOS are widely used to derive the force fields used in protein docking (Brooks et al., 1983; Oostenbrink et al., 2004; Weiner et al., 1984).

**Coarse-Grained Energy Functions**

Coarse-grained (CG) force fields aim to discriminate near-native solutions with the same accuracy as all-atom scoring functions but in a more eﬃcient way (Tozzini, 2005). The main advantages of using CG functions are the reduced computational cost O(n2) where n << N, and the ability to tolerate small clashes between side chains.

In order to avoid the high computational cost of the all-atom representations, CG models replace atoms in a residue with a small number (typically 3 or 4) of so-called CG “beads”. For example, two beads might be used to represent the backbone atoms and another bead would represent the side chain atoms. Residues having large side chains like ARG or LYS might use two beads to represent the side chain atoms. The beads representing side chains are usually located at the geometric centroid of the side chain atom positions.

Since side chains in nature are flexible, often when two molecules bind each other, the residues on the binding site suﬀer conformational changes. CG functions account for these conformational changes, and for the fact that beads are centroids of several atoms, by tolerating small molecular surface overlappings.

The challenge of designing a CG function involves proposing a very simplified model of the 3D structures with a highly accurate function. It is usual that CG functions are highly dependent from the CG representation for which they were built. Therefore they are not transferable among diﬀerent CG models or representa-tions. Often CG functions are derived from all-atom molecular dynamics simulations applied to some set of training models, from which the parameters can be computed. One example of a CG function is the PyDockCG, which is based on a previous model called UNRES CG, and is modified by the inclusion of terms for the electro-statics and the solvation energy. PyDockCG represents each residue by two beads, one for the peptide group and the other one for the side chain, as in UNRES CG, and a pair of dummy beads between receptor and ligand to improve the flexibility through minimizations (Solernou and Fernandez-Recio, 2011). In SCORPION each amino acid is represented using one bead for the backbone and one or two for side chains. The scoring function is composed by one part for the van der Waals energy and another part for the electrostatics, which values depend on the inter beads dis-tances (Basdevant et al., 2012). In this work, we use the ATTRACT CG function and model (Zacharias, 2003) explained in the Section 2.3.6.

**Statistical or Knowledge-based Energy Functions**

Statistical energy functions use information about known protein structures, plus heuristic terms in some cases, in order to build interaction potentials. The most used statistical data include distances, angular descriptions and frequency of contacts between atom pairs, CG beads or residues. The data sets used to train these functions must be big enough to include a wide variety of diﬀerent protein configurations. In this way, the function will be suﬃciently robust to score eﬃciently solutions whose configuration is considerably diﬀerent from those in the training data set.

One statistical scoring function called OPUS-PSP, for example, was built from statistics about the orientation dependence of pairs of blocks using a structural database plus a repulsive energy term to prevent steric clashes (Lu et al., 2008).

SPIDER is another example of a statistical energy function. SPIDER uses CG to represent the residues. Thus, a data set of protein complex interfaces is represented by graphs whose nodes correspond to residues connected by edges. Then, using geo-metrical parameters such as RMSD and the frequency of occurrence of the patterns found in the data set, the scoring function is developed. Frequent subgraph mining is employed to search matches between a decoy interface and at least one pattern in the set of natives patterns. From the matches, the parameters needed by the scoring function are obtained, such as the number of residues that match the native pattern, the fraction of interfacial residues that match the patterns, the number of patterns matched, and so on (Khashan et al., 2012).

**Pure Shape-Based Scoring**

Shape-based scoring functions are based on the important role of the surface com-plementarity on the formation of protein complexes.

Such complementarity is scored by the computation of the correlation between the two grids obtained as explained in the Sub-section 2.2.2. Each proteins involved in the complex is represented by a grid. The correlation is expressed as the addition of the product of all the equivalent voxels over the grids (Katchalski-Katzir et al., 1992; Vakser, 1995, 1996). This can be expressed mathematically as follows N N N ca;b;c = Rl;m;n Ll+a;m+b;n+c; Xl X X (2.4).

where R and L are the grids whose equivalent voxel values (see Fig. 2.3) are being multiplied and added. Thus, ca;b;c represents the correlation value for a voxel or cell in the grid that corresponds to a translation step (a, b, c). Since correlation can be computed by FFT (Fast Fourier Transform), often shape-based scoring methods take advantage of it due to the high computational speed it can provide. The original scoring function based on grid representations and FFT computations was presented in (Katchalski-Katzir et al., 1992) as part of an algorithm called GRAMM. Such proposed algorithm to obtain the correlation between two grids applying FFT is commonly used until now, and it is briefly can be described as follows (Katchalski-Katzir et al., 1992):

(i) Compute the complex conjugate of the FFT of the grid where R is projected, noted FFT(R)* (F F T (R) in Figure 2.5 ) and the FFT of the grid where L is projected, noted FFT(L), as in the 2D example of Figure 2.5.

(ii) Multiply FFT(R)* by FFT(L) to obtain the correlation function c.

(iii) Obtain correlation C by using the Inverse Fourier Transform of c as schematized in Figure 2.5.

(iv) The process is repeated for each orientation of L with respect to R.

### Mixed Shape Plus Potential Scoring functions

In PIPER another shape-base function is proposed as a sum of three terms repre-senting shape complementarity, electrostatic contribution and desolvation obtained by a pairwise potential (Kozakov et al., 2006). Each term is obtained by correlation functions using FFT of the 3D grids of the proteins projected as explained at the Subsection 2.2.2.

Another shape-based function is implemented in ZDOCK algorithm, where the proteins are mapped to a 3D grid identifying their corresponding core and surface, as well as the not occupied points, by assigning strategical values to each point of the grid. Then, using FFT, the solutions are scored, favoring atoms pairs between the ligand and receptor within a distance cutoﬀ and penalizing overlapping contacts (Chen and Weng, 2003).

#### The ATTRACT Scoring Function

Because of the advantages the CG force fields oﬀer, such as tolerance to small clashes and faster execution time in comparison to atomic resolution, as detailed in Subsec-tion 2.3.2, the ATTRACT CG force field is used in this work (Zacharias, 2003; Fiorucci and Zacharias, 2010). In this approach, the backbone is represented by two pseudo-atoms located at the nitrogen and the oxygen atoms. Small amino acid side chains are represented by one pseudo atom and larger side chains by two pseudo atoms (or “beads”), see left side of the figure 2.6. At the right side of the figure, the graphic representation of the force field can be appreciated, where represents the pseudo atom radius, and Rmin the separation distance for two beads to obtain the minimum energy emin.

**Table of contents :**

**Part I Introduction **

**Chapter 1 Context**

1.1 Protein Structures and Complexes

1.2 The Biological Importance of Protein Interactions

1.3 Experimental Determination of Protein Structures and Protein Complexes

1.3.1 Experimental Techniques

1.3.2 The Protein Data Bank and The EM Data Bank

1.4 Protein Docking Algorithms

1.5 The Structure of the Thesis

**Chapter 2 State of the Art in Protein-Protein Docking**

2.1 Rigid Body Docking

2.2 Sampling Methods

2.2.1 Random Methods

2.2.2 Grid-Based Methods

2.3 Scoring Methods

2.3.1 Atomistic Scoring Functions

2.3.2 Coarse-Grained Energy Functions

2.3.3 Statistical or Knowledge-based Energy Functions

2.3.4 Pure Shape-Based Scoring

2.3.5 Mixed Shape Plus Potential Scoring functions

2.3.6 The ATTRACT Scoring Function

2.4 Using of Distance Restraints to Drive Docking

2.5 Multi-Body Docking Algorithms

2.6 Conformational Changes Upon Binding and The Challenges of Flexible Docking

2.7 Axis-Angle and Quaternion Representation of Transformation Matrices

2.7.1 3D Rigid Transformations

2.7.2 Axis-Angle Representation of a 3D Rotation

2.7.3 A Unit Quaternion to Represent a 3D Rotation

2.8 Branch-and-Bound Search Algorithms

2.9 Solutions Assessment

**Part II Contribution **

**Chapter 1 Pairwise Docking Using Branch-and-Bound 3D Rotational Searches**

1.1 Introduction

1.2 The Branch-and-Bound 3D Rotational Search Approach

1.2.1 The Initial Docking Poses

1.2.2 The Rotational Search Space Represented as -Ball

1.2.3 Pruning Rotational Searches Using Bead-Radius Cone Angles

1.2.4 Coloring the 3D Rotational Space Represented as a Tree Structure

1.2.5 Energy Computation and Clustering Solutions

1.3 Results Using the Protein Docking Benchmark and Discussion .

1.4 Summary

1.4.1 EROS-DOCK Algorithm pseudo-code

1.4.2 EROS-DOCK Algorithm Flowchart

1.5 Conclusions and Perspectives

**Chapter 2 Pairwise Docking Using Branch-and-Bound Rotational Searches and Distance Restraints**

2.1 Introduction

2.2 Docking Using Distance Restraints

2.2.1 Restraints Specification

2.2.2 The Initial Docking Poses According to The Restraint Specification

2.2.3 Branch-and-Bound Rotational Searches Using Distance- Restraint Cone Angles

2.2.4 Coloring the 3D Rotational Space

2.2.5 Energy Computation and Clustering Solutions

2.3 Results Using Benchmark and Discussion

2.4 Conclusions and Perspectives

**Chapter 3 Multibody Docking Using Branch-and-Bound Rotational Searches and Distance Restraints**

3.1 Introduction

3.2 EROS-DOCK Extension for Multi-Body Docking

3.2.1 A pairwise strategy for trimeric complexes

3.2.2 Deriving pairwise transformation matrices from EROSDOCK solutions

3.2.3 Coloring the 3D Rotational Space

3.2.4 Computing compatible combinations of pairwise solutions to form trimeric complexes

3.3 Test and Results On Trimers

3.3.1 Benchmark

3.3.2 Results

3.4 Conclusions and Perspectives

Conclusions and Perspectives

Appendices

**Bibliography **