Similarities among species suggest that all living organisms have origin in the same ances-tor. Evolutionary processes, such as gene duplications  and mutations  give rise to diversity at every level of biological organization, including species and molecules such as proteins. For instance, when a gene is essential for a given species it can be duplicated, and two copies of this gene are produced. For example, in Figure 2.3-A the gene A was duplicated, and two identical copies A1 and A2 were created, these homologous genes are called paralogs. After duplication, genes A1 and A2 evolve independently, and they can suﬀer mutation events by producing paralogous genes A3 and A4, respectively, see Figure 2.3-B. Through speciation  new species arise, as shown in Figure 2.3-C, where species II and III were created from species I. Homologous genes in diﬀerent species are called orthologs, see Figure 2.3-D. Since proteins are produced from genes, homologous genes produce homologous proteins.
A common way for studying the evolutionary relationships in homologous proteins is to perform a Multiple Sequence Alignment (MSA), as shown in Figure 4.1. The alignment of homologous proteins consist of trying to place amino acids in positions that derive from a common ancestral amino acid. To do so, we need to introduce gaps, which represent insertion or deletion into sequences. Thus, an alignment is a hypothetical model of muta-tions (substitutions, insertions, and deletions) that occurred during sequence evolution.
Remote homologous proteins
Homology can be detected easily if a strong sequence similarity is observed among proteins. A possible way to measure this similarity is to determine sequence identity among homol-ogous proteins, that is, the percentage of identical amino acids in a protein alignment. If sequence identity is greater than 30%, homology can be asserted  with confidence. Otherwise, we say that these proteins are in the “Twilight zone”, where homology signals get blurred, and more evidences are needed to confirm the homology. However, proteins can be homologs even in the case of low sequence identity. They are known as remote homologous proteins, that is, they have a common ancestor but they have diverged sig-nificantly in their primary sequence during their evolutionary history. To illustrate theA3 concept, observe Figure 2.4-A that shows two homologous proteins: on the top, see their sequence alignment, where identical amino acids are indicated by the symbol ∗, and on the bottom, see their structural alignment. Note that, sequence identity in the sequence alignment is low and that the hydrophobic blocks (highlighted in grey), known to play an important role in protein structural stability  are not aligned. Based only on these observations, we cannot assert that these proteins are homologs and only an analysis of their structural similarities can conclude it. On the other hand, non-homologous pro-teins such as those show in figure 2.4-B can present physico-chemical similarities like the conservation of hydrophobic blocks, but their structures show clearly that they are not homologs. This shows that remote homology detection is hard when using only sequence properties. To make it possible, we should mine valuable properties from homologous protein sequences that allow us to identify homologous proteins, and in the same time, avoid false predictions. For this, we propose two methods presented in chapters 4 and 5.
Protein families, domains and motifs
Homologous proteins can be organized into protein families. Proteins in a family typically have similar three-dimensional structures, functions, and some times significant sequence similarity. To organize homologous proteins in families can serve to extract important rules and to provide rich automatic functional annotation. For example, sequences within a protein family can be aligned to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Also, the evolution of these proteins can be studied by reconstructing a phylogenetic tree  that shows how proteins in this family have evolved.
At the functional level, proteins can be organized in domains or motifs. A domain is a part of the protein sequence which can fold into a stable structure independently on the rest of the sequence. Proteins considered as related often share the same domain(s). Domains are considered as building blocks and they may be recombined in diﬀerent arrangements to create proteins with diﬀerent functions. Many proteins consist of several structural domains, and they are called multi-domain proteins. Motif is a short stretch of amino acid sequence that potentially encode the function of proteins. Frequently, they are located inside protein domains.
There is a number of diﬀerent classification systems to organize protein. They are based on diﬀerent classification categories: (1) hierarchical protein families, such as: PIR-PSD  and ProtoMap , (2) families of protein domains such as Pfam , TIGRFAMs  and ProDom , (3) sequence motifs such as PROSITE  and PRINTS , (4) structural classes, such as SCOP  and CATH , and (5) integrations of various family classifications, such as iProClass  and InterPro . These tools can be interrogated to provide the probable function for a query sequence (that is, a protein with unknown function). For this, computational approaches discussed in the next chapter are employed.
An alternative to SSS are generative methods. First, they train a probabilistic model to represent a group of homologous sequences (a protein family), and then they match a query sequence against the model to evaluate the similarity of the query sequence to the group/family. Position-Specific Iterative BLAST (PSI-BLAST) , explained in sec-tion 3.2.1, and profile Hidden Markov Models (pHMMs) , described in section 3.2.2 are examples of such approaches, also known as family-based or sequence-profile based ap-proaches. Finally, in the section 3.2.3 we present how the performance of generative meth-ods can be improved by using domain co-occurrence information [20, 21, 22, 23, 24, 25].
Position-Specific Iterative BLAST
Position-Specific Iterative BLAST (PSI-BLAST)  is an iterative method that builds a probabilistic model based on Position Specific Score Matrix (PSSM) , and uses it to search for new matches in a subsequent iteration. In each iteration PSSM is updated by using sequences obtained in the previous iteration. The first PSSM is built from the multiple alignment of sequences extracted from BLAST results, while subsequent PSSMs are refined by using new matches. Figure 3.2 summarizes the PSI-BLAST flowchart. Note that, PSI-BLAST can be executed for a fixed number of iterations N or until convergence (when no new matches are found).
Methods based on domain co-occurrence
The majority of proteins are multidomains, domains do not form random combinations, and we observe fewer combinations than the statistically expected ones. This suggests functional cooperation, that is, two or more domains can interact to determine the pro-tein function. Recently, this cooperation, also called domain co-occurrence, has been used to improve the performance of domain recognition methods [20, 21, 22, 23, 24, 25]. In this section, we reviewed two of these methods that were applied to P. falciparum an-notation: CODD (Co-Occurrence Domain Discovery)  and dPUC (domain Prediction using context) . The performance of these methods is compared to the CASH method in chapter 5.
Both methods proposed to improve pHMM performance, and they were implemented on Pfam database , a collection of protein domains largely used for annotation task. They detect a set of potential Pfam domains Pt, for a given protein sequence st, by setting a permissive Pfam threshold and by allowing overlaps. They pre-compute a list L, containing domain pairs that present strong co-occurrence, directly from the list of domain architectures in Uniprot database . CODD ranks Pt by ordering Pfam scores. It starts by assigning to st a set of domains Qt, that were obtained by Pfam. Then, it iteratively tries to add to Qt a domain di ∈ Pt if di does not overlap with domains in Qt, and if it co-occurs with some domain in Qt according to L. On the other hand, dPUC represents Pt as nodes in a graph, where edges connect non-overlapping domains in st. It weights each node with normalized Pfam scores, and it weights edges with a special context score that captures the propensity of pairwise domains combinations in L. Then, dPUC finds a set of domains for st by looking for the maximum-weighted clique in the graph.
Table of contents :
1.2 Challenges of detecting remote protein homology
1.3 Significance and contribution of the thesis
1.5 Organization of the thesis
List of Figures
List of Tables
2.1 Structure and function
2.2 Homologous proteins
2.2.1 Remote homologous proteins
2.3 Protein families, domains and motifs
3 Methods for remote homology detection
3.1 Sequence Similarity Searching
3.2 Generative methods
3.2.1 Position-Specific Iterative BLAST
3.2.2 Profile Hidden Markov Models
3.2.3 Methods based on domain co-occurrence
3.3 Discriminative methods
3.3.1 Support vector machine
3.3.2 Inductive logic programming
3.4 Comparing different methods
4 ILP-SVM Homology
4.2.1 Dataset description
4.2.2 Logical representations
4.2.3 Construction of propositional classifiers
4.2.4 Comparison between different methods
4.2.5 Parameter settings and tools used
4.3 Results and Discussion
5 CASH – Combination of Annotations by Species and pHMMs
5.2.2 Selection of representative species from the eukaryotic tree of life
5.2.3 Pfam methodology
5.2.4 Phylogenetic Models
5.2.5 Combining Models Predictions
5.2.6 Resolving protein domain architectures
5.2.7 Prediction analysis
5.2.8 Comparison with earlier results
5.2.9 Visualizing our results
5.2.10 Parameter settings and tools used
5.3 Results and Discussion
6 General conclusions and future work
A List of representative species used in CASH system
B List of Pfam Ribosomal families
C Domains and architectures over-represented in P. falciparum