This chapter introduces some basic concepts about proteins, which are the objects of this study. In section 2.1, we discuss the importance of proteins for living organisms and present some concepts about their structures and functions. Proteins in different organisms can share a common ancestor, in which case they are called homologous proteins. We discuss their evolutionary relationships in section 2.2, and in section 2.2.1 we describe the remote homology detection problem, which is the central problem addressed in this thesis. We close the chapter by presenting, in section 2.3, some key concepts needed to understand the rest of the thesis.
Structure and function
Deoxyribonucleic acid (DNA) carries the genetic information of the cells of living organisms; this information is encoded within thousands of genes. Each gene serves as a recipe for building a protein molecule. Two cellular processes are involved in the synthesis of a protein: transcription and translation. During transcription, the genetic information stored in a gene is transferred to a molecule called messenger ribonucleic acid (mRNA). During translation, this information is decoded by the ribosome (a cellular component) to produce a specific amino acid chain that will fold into an active protein. The information encoded in the mRNA is “read” according to the genetic code, which relates the DNA sequence to the amino acid sequence of the protein. Each group of three nucleotides in the mRNA constitutes a codon, and each codon specifies a particular amino acid, with the exception of stop codons, which signal the end of translation.
Amino acids are the building blocks of proteins. An amino acid is a molecule containing an amine group (NH2), a carboxylic acid group (CO2H), and a side chain that is specific to each amino acid. There are 20 different amino acids commonly found in proteins, and they can be grouped according to their physico-chemical properties. The most important properties are charge, hydrophobicity, hydrophilicity, size, and side-chain specificity; Figure 2.1 shows, for each amino acid, its chemical formula (side chains are highlighted in red) and its physico-chemical properties. These 20 amino acids can be arranged in many different orders to create a very large number of different proteins. Proteins typically fold into a particular three-dimensional structure that is related to their biological function. The physico-chemical properties of amino acids play a major role in protein folding. For instance, water-soluble proteins tend to have their hydrophobic residues buried in the middle of the protein, whereas hydrophilic side chains are exposed to the aqueous solvent.
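The transcription and translation steps described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code, a coding-strand DNA input, and a reading frame that starts at the first nucleotide; the function names `transcribe` and `translate` are ours and purely illustrative, and real translation also involves locating the start codon, which is ignored here.

```python
# Minimal sketch of transcription and translation (standard genetic code).
# Assumption: the input DNA is the coding strand, read from its first base.

BASES = "UCAG"
# Amino acids (one-letter codes) for the 64 codons, ordered by first, second
# and third base following BASES; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA corresponding to a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon and stop at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTGA"))` reads the codons AUG, GCC and UGA, which code for methionine, alanine and a stop signal, respectively.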
Proteins are involved in a huge number of activities within a cell, such as gene regulation, RNA transcription, protein translation, the transport of materials, the catalysis of biochemical reactions (enzymes), and the reception of hormone signals. Frequently, the function of a protein is determined by its structure. During protein synthesis, that is, the cellular process in which proteins are produced, the protein is in its primary form. This form, also called the primary structure, refers to the amino acid sequence of the protein. Amino acids are held together by chemical bonds, which are formed during protein synthesis. As a next step, regular local sub-structures, known as secondary structure, are formed. There are two main types of secondary structure: the alpha helix and the beta strand. These regular structures are connected by “loops”, which are uncoiled regions of variable size. Next, the alpha helices, beta strands and loops fold into a compact globule, forming the three-dimensional (tertiary) structure. Many proteins are formed by a larger assembly of several protein molecules, usually called subunits. These subunits form complexes called quaternary structure. Figure 2.2 illustrates the four structural descriptions of a protein. Note that they describe different structural levels of the protein and do not illustrate intermediate steps of the folding process. No precise knowledge of the folding process is yet available.
Homologous proteins
Similarities among species suggest that all living organisms originate from the same ancestor. Evolutionary processes, such as gene duplications and mutations, give rise to diversity at every level of biological organization, including species and molecules such as proteins. For instance, when a gene is essential for a given species, it can be duplicated, producing two copies of the gene. In Figure 2.3-A, gene A was duplicated and two identical copies, A1 and A2, were created; such homologous genes are called paralogs. After duplication, genes A1 and A2 evolve independently and can undergo mutation events, producing the paralogous genes A3 and A4, respectively (see Figure 2.3-B). Through speciation, new species arise, as shown in Figure 2.3-C, where species II and III arose from species I. Homologous genes in different species are called orthologs (see Figure 2.3-D). Since proteins are produced from genes, homologous genes produce homologous proteins.
A common way to study the evolutionary relationships among homologous proteins is to perform a Multiple Sequence Alignment (MSA), as shown in Figure 4.1. Aligning homologous proteins consists of trying to place in the same position the amino acids that derive from a common ancestral amino acid. To do so, we need to introduce gaps, which represent insertions or deletions in the sequences. Thus, an alignment is a hypothetical model of the mutations (substitutions, insertions, and deletions) that occurred during sequence evolution.
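To make the reading of an alignment as a model of mutations concrete, the following minimal sketch classifies each column of a pairwise alignment. It assumes that the two sequences are already aligned, have the same length, and use "-" as the gap symbol; the function name `classify_columns` is ours. Without the ancestral sequence, an insertion in one sequence cannot be distinguished from a deletion in the other, so both are counted together as "indel".

```python
from collections import Counter


def classify_columns(aligned_a: str, aligned_b: str, gap: str = "-") -> Counter:
    """Count matches, substitutions and indels in a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    events = Counter()
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # column made only of gaps carries no information
        if x == gap or y == gap:
            events["indel"] += 1  # insertion or deletion
        elif x == y:
            events["match"] += 1  # residue conserved in both sequences
        else:
            events["substitution"] += 1  # residue replaced by another one
    return events
```

For the toy alignment of "MKT-AYIA" with "MRTLAY-A", this yields five matches, one substitution (K/R) and two indels.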
Remote homologous proteins
Homology can be detected easily if strong sequence similarity is observed among proteins. One way to measure this similarity is to determine the sequence identity among homologous proteins, that is, the percentage of identical amino acids in a protein alignment. If sequence identity is greater than 30%, homology can be asserted with confidence. Otherwise, we say that these proteins are in the “twilight zone”, where homology signals get blurred and more evidence is needed to confirm homology. However, proteins can be homologs even when sequence identity is low. Such proteins are known as remote homologous proteins: they have a common ancestor, but their primary sequences have diverged significantly during their evolutionary history. To illustrate the concept, Figure 2.4-A shows two homologous proteins: at the top, their sequence alignment, where identical amino acids are indicated by the symbol ∗, and at the bottom, their structural alignment. Note that sequence identity in the sequence alignment is low and that the hydrophobic blocks (highlighted in grey), known to play an important role in protein structural stability, are not aligned. Based on these observations alone, we cannot assert that these proteins are homologs; only an analysis of their structural similarities allows us to conclude it. On the other hand, non-homologous proteins, such as those shown in Figure 2.4-B, can present physico-chemical similarities, like the conservation of hydrophobic blocks, but their structures show clearly that they are not homologs. This shows that remote homology detection is hard when using only sequence properties. To make it possible, we should mine properties from homologous protein sequences that allow us to identify homologous proteins and, at the same time, avoid false predictions. For this, we propose two methods, presented in chapters 4 and 5. Remote homology detection is a challenge for computational biology.
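The sequence identity criterion can be sketched as follows. This is a minimal illustration, assuming two already aligned sequences of equal length with "-" as the gap symbol; the identity is computed here over all alignment columns that are not made only of gaps, which is one of several conventions in use (others divide by the length of the shorter sequence or by the number of gap-free columns). The names `percent_identity` and `above_identity_threshold` are ours; the 30% threshold is the one quoted in the text.

```python
IDENTITY_THRESHOLD = 30.0  # percent, as quoted in the text


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding the same amino acid."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns in which both sequences have a gap.
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b)
        if not (x == gap and y == gap)
    ]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * identical / len(columns)


def above_identity_threshold(aligned_a: str, aligned_b: str) -> bool:
    """True if identity alone supports homology; False means twilight zone."""
    return percent_identity(aligned_a, aligned_b) > IDENTITY_THRESHOLD
```

A result of `False` does not mean that the two proteins are unrelated: as discussed above, remote homologs fall below the threshold, and further evidence is needed to decide.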
There are still a number of proteins with unknown function. Although structural properties can help decrease this number, structural information is not available for most proteins and, consequently, cannot be used in large-scale approaches. In this scenario, it is necessary to develop intelligent strategies to address the remote homology detection problem, such as those presented in chapters 4 and 5.
Protein families, domains and motifs
Homologous proteins can be organized into protein families. Proteins in a family typically have similar three-dimensional structures and functions, and sometimes significant sequence similarity. Organizing homologous proteins into families can help to extract important rules and to provide rich automatic functional annotation. For example, sequences within a protein family can be aligned to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Also, the evolution of these proteins can be studied by reconstructing a phylogenetic tree that shows how the proteins in the family have evolved.
At the functional level, proteins can be organized in domains or motifs. A domain is a part of the protein sequence that can fold into a stable structure independently of the rest of the sequence. Proteins considered as related often share the same domain(s). Domains are considered as building blocks, and they may be recombined in different arrangements to create proteins with different functions. Many proteins consist of several structural domains; they are called multi-domain proteins. A motif is a short stretch of amino acid sequence that potentially encodes the function of a protein. Frequently, motifs are located inside protein domains.
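The view of domains as recombinable building blocks can be sketched with a small data structure. This is only an illustration: the class names, the field names and the domain names used in the usage note are hypothetical and do not correspond to any database format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Domain:
    name: str   # hypothetical domain family name
    start: int  # first residue of the domain (1-based, inclusive)
    end: int    # last residue of the domain (inclusive)


@dataclass
class Protein:
    identifier: str
    length: int
    domains: List[Domain] = field(default_factory=list)

    def architecture(self) -> Tuple[str, ...]:
        """Domain names ordered by their position along the sequence."""
        ordered = sorted(self.domains, key=lambda d: d.start)
        return tuple(d.name for d in ordered)

    def is_multi_domain(self) -> bool:
        return len(self.domains) > 1


def share_domain(p: Protein, q: Protein) -> bool:
    """True if the two proteins have at least one domain name in common."""
    return bool({d.name for d in p.domains} & {d.name for d in q.domains})
```

For instance, a protein carrying the hypothetical domains DomA (residues 10 to 120) and DomB (residues 150 to 280) has the architecture `("DomA", "DomB")` and is a multi-domain protein; it shares a domain with any other protein that contains DomB.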
There are a number of different classification systems for organizing proteins. They are based on different classification categories: (1) hierarchical protein families, such as PIR-PSD and ProtoMap; (2) families of protein domains, such as Pfam, TIGRFAMs and ProDom; (3) sequence motifs, such as PROSITE and PRINTS; (4) structural classes, such as SCOP and CATH; and (5) integrations of various family classifications, such as iProClass and InterPro. These resources can be queried to provide the probable function of a query sequence (that is, a protein with unknown function). For this, the computational approaches discussed in the next chapter are employed.
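As an illustration of how a sequence motif can be searched for in a query sequence, the sketch below converts a pattern written in PROSITE-style notation into a Python regular expression. It is a simplified sketch that assumes only a subset of the notation: elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repetitions, and "<" and ">" for the sequence ends. The function name `prosite_to_regex` is ours, and the pattern and sequence in the usage note are illustrative rather than taken from a database entry.

```python
import re

# One pattern element: optional "<", a residue specification, an optional
# repetition "(n)" or "(n,m)", and optional ">".
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)
```

For example, `prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")` gives `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`, which can then be passed to `re.search` together with a query sequence.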
Table of contents:
1.2 Challenges of detecting remote protein homology
1.3 Significance and contribution of the thesis
1.5 Organization of the thesis
2.1 Structure and function
2.2 Homologous proteins
2.2.1 Remote homologous proteins
2.3 Protein families, domains and motifs
3 Methods for remote homology detection
3.1 Sequence Similarity Searching
3.2 Generative methods
3.2.1 Position-Specific Iterative BLAST
3.2.2 Profile Hidden Markov Models
3.2.3 Methods based on domain co-occurrence
3.3 Discriminative methods
3.3.1 Support vector machine
3.3.2 Inductive logic programming
3.4 Comparing different methods
4 ILP-SVM Homology
4.2.1 Dataset description
4.2.2 Logical representations
4.2.3 Construction of propositional classifiers
4.2.4 Comparison between different methods
4.2.5 Parameter settings and tools used
4.3 Results and Discussion
5 CASH – Combination of Annotations by Species and pHMMs
5.2.2 Selection of representative species from the eukaryotic tree of life
5.2.3 Pfam methodology
5.2.4 Phylogenetic Models
5.2.5 Combining Models Predictions
5.2.6 Resolving protein domain architectures
5.2.7 Prediction analysis
5.2.8 Comparison with earlier results
5.2.9 Visualizing our results
5.2.10 Parameter settings and tools used
5.3 Results and Discussion
6 General conclusions and future work
A List of representative species used in CASH system
B List of Pfam Ribosomal families
C Domains and architectures over-represented in P. falciparum