Knowledge Extraction for Biology and A. thaliana


Knowledge models, domains and IE

Data, Information, Knowledge, and Wisdom are the transformation steps that take us from raw facts or signals to understanding. The DIKW pyramid (Fig. 1.4b) has often been used to depict the relationships between them. Bellinger et al. [Bellinger et al., 2004] expanded on the definitions proposed by Ackoff [Ackoff, 1989] and proposed the diagram in Figure 1.4a to explain the transformation. Whereas wisdom falls outside the scope of knowledge extraction, the other three concepts are fundamental notions for the domain.
Raw data simply exists, devoid of any significance. Information is data that has been given meaning by relational connections. Knowledge, finally, requires context, organization and structure and even though it is generally considered hard to define [Rowley and Hartley, 2008], it is often tied to a notion of application, in the sense that knowledge is intended to be useful for a given task.
In order to illustrate these nuances in the case of knowledge extraction from text, consider the following sentence: “LEC1 and LEC2 are specifically expressed in seeds”. Starting with the raw data seen as strings of characters and following their transformation, even after the individual words and the occurrences of the genes LEC1 and LEC2 have been detected, the content is still data, since no relation has yet been detected. Once a human reader or a computer program has understood that there exists a relation of expression between these genes and seeds, we can talk about information. But it is only when this information is put in context, through experience or an appropriate knowledge model, that we can consider it knowledge. A knowledge model is a formal, consistent representation of knowledge. It can be described using logic, tabular representations, a diagram, a graph or any other structured representation of concepts or pieces of knowledge and the relationships between them, with a formal semantics attached. Knowledge models have played an important role for decades in the field of Artificial Intelligence, where they have been used for knowledge acquisition and engineering applications, decision support, expert systems and a number of other tasks.
The purpose of a knowledge model is to adequately represent the knowledge of the domain or subdomain it describes, and at the same time to provide a representation allowing reasoning and simulation. Explanatory and predictive models of knowledge allow scientists to summarize and explain, share knowledge, formally verify hypotheses and formulate new ones. Knowledge models are generally task-oriented because the representation choices must be driven by the future use. Various types of models exist, each serving different purposes and necessitating different levels of detail, hierarchy and formality. In Figure 1.5, some typical examples of models are listed by order of complexity and logical formalism. More formal models allow us to calculate the truth value of an assertion, to derive new rules and facts and guarantee formal properties, such as consistency, completeness and minimality.
In knowledge models, knowledge is generally represented as concepts, groupings, relations between concepts and, optionally, rules and instances. Concepts and their relations define types of information and the valid relationships between them. They are the abstraction layer which provides the structure and organization of the information. When a knowledge model is used to annotate data, the occurrences of the defined concepts and relations are added to the structure as instances of these abstract types.
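To make the distinction between abstract types and instances concrete, the example sentence from above can be annotated against a toy model. This is only an illustrative sketch: the concept names, the gazetteer and the naive pairing rule below are invented for the example and are not part of any published model.

```python
import re

# Toy knowledge model: concept and relation types (the abstraction layer).
CONCEPTS = {"Gene", "Tissue"}
RELATIONS = {"ExpressedIn": ("Gene", "Tissue")}

# A minimal gazetteer standing in for entity recognition.
GENES = {"LEC1", "LEC2"}
TISSUES = {"seeds"}

def annotate(sentence):
    """Return (entities, relations) instantiated from the sentence."""
    tokens = re.findall(r"\w+", sentence)
    entities = [(t, "Gene") for t in tokens if t in GENES]
    entities += [(t, "Tissue") for t in tokens if t in TISSUES]
    # Naive relation instantiation: pair every Gene with every Tissue.
    relations = [("ExpressedIn", g, t)
                 for g, cg in entities if cg == "Gene"
                 for t, ct in entities if ct == "Tissue"]
    return entities, relations

entities, relations = annotate("LEC1 and LEC2 are specifically expressed in seeds")
# entities correspond to the "information" level; relations are instances
# of the model's abstract relation type.
```

The entity occurrences are instances of the concepts Gene and Tissue, and each extracted tuple is an instance of the ExpressedIn relation type defined by the model.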
Ontologies are probably the most famous type of knowledge model, as they have been used for decades in a number of different domains [Ashburner et al., 2000, Navigli and Ponzetto, 2010, Miller et al., 1990, Kim et al., 2013b]. Ontologies can be defined as «A formal, explicit specification of a shared conceptualisation» [Studer et al., 1998]. In reality, in addition to formal ontologies, the term ontology is used for a number of other representations with varying degrees of formality [Guarino and Welty, 2000]. A minimal definition that covers all these scenarios is the following: «An ontology is a specification of a conceptualization» [Gruber, 1993].
In the context of computer and information sciences, an ontology defines a set of representational primitives with which to model a domain of knowledge [Liu and Tamer Özsu, 2009]. Knowledge is represented and organized in classes (or sets), attributes (or properties), and relationships (or relations among class members). These representations exist on a semantic level and they are intuitive for the human mind. Ontologies are written in formal languages such as RDF-S or OWL with expressiveness close to first-order and modal logics. These languages allow abstraction away from specific structures and tools and integration of heterogeneous data sources.
Even though their purpose is not limited to information extraction applications, knowledge models, and ontologies in particular, are an essential part of IE, as they are the foundation on which an IE task is built [Nédellec et al., 2009]. They define the domain and scope of the task, as well as the necessary and sufficient set of entity and relation types. They are both the guide and the product of the collaboration with the domain expert, and they provide a reference for the prediction model as well as for any later transformation of the extracted data. In the definition of an IE task, the description of the corresponding knowledge model is always necessary. It is sometimes given explicitly, as in the example of the GENIES system, for which the model was also published [Rzhetsky et al., 2000, Friedman et al., 2001]. In other cases, it is implicitly described by the definition of the IE task, as is the case for the historical MUC challenges.

A short historical survey of projects and applications

The potential of text mining as an alternative method of accessing knowledge was first explored in the contexts of database curation and scientific information retrieval [Craven and Kumlien, 1999, Eilbeck et al., 1999, Pulavarthi et al., 2000, Tamames et al., 1998, Jenssen et al., 2001, Müller et al., 2004, Hoffmann and Valencia, 2004]. In systems biology [Ananiadou et al., 2006], text has helped parameter learning for models [Hakenberg et al., 2004], it has often been used to make connections between seemingly dissociated arguments [Weeber et al., 2003, Swanson, 1988, Smalheiser and Swanson, 1994, Srinivasan and Libbus, 2004], and in order to add context and interpretation to experimental microarray data [Krallinger et al., 2005, Oliveros et al., 2000, Blaschke et al., 2001, Shatkay et al., 2000, Raychaudhuri and Altman, 2003, Imoto et al., 2011, Faro et al., 2012].
Most of the early BioNLP projects focused on simple interactions between genes and proteins [Blaschke et al., 1999, Nédellec, 2005a, Yeh et al., 2002, Yeh et al., 2003, Hersh and William, 2004]. More recently, the community has been exploring more ambitious goals with more complex extraction tasks, such as the extraction of more intricate biological events [Kim et al., 2012, Kim et al., 2009a, Kim et al., 2003, Kim et al., 2011, Kim et al., 2004], the extraction and reconstruction of networks [Bossy et al., 2013a, Li et al., 2013, Ramani et al., 2005] and pathway curation tasks [Ohta et al., 2013a].
Historically, the first information extraction projects concerned literature on human, mouse and fly biology [Hirschman et al., 2005, Hersh and William, 2004, Hersh et al., 2006, Hersh et al., 2008, Kim et al., 2003, Ohta et al., 2013a, Ramani et al., 2005]. The LLL challenge [Nédellec, 2005a] was the first to introduce bacterial biology, followed by the BioNLP Bacteria Biotope task [Bossy et al., 2013b, Bossy et al., 2011a]. Plant biology has so far been relatively underrepresented as a topic in the BioNLP community. Arabidopsis thaliana has recently seen some initiatives in the field of Information Extraction, such as the KnownLeaf literature curation system [Van Landeghem et al., 2013, Szakonyi et al., 2015]. It is worth noting that there have been other text mining applications on A. thaliana in the past, but they were mostly focused on information retrieval [Krallinger et al., 2009, Van Auken et al., 2012].

Kernels versus Features: Representation versus Algorithm

A common representation for many machine learning algorithms is the feature map, or feature vector representation. These vectors are n-dimensional vectors of numerical values, where each dimension (or feature) represents a measurable property or observation. The process of creating such features, then selecting and combining them in order to improve an ML system, is called feature engineering.
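As a minimal sketch of such a feature map, a candidate relation can be turned into a numerical vector as follows. The three features below (token distance, a trigger-word indicator, sentence length) are invented for illustration and do not correspond to any particular published feature set.

```python
def feature_vector(sentence, e1, e2):
    """Map a candidate relation (sentence, entity pair) to a feature vector.

    Each dimension is one measurable property of the instance.
    """
    tokens = sentence.split()
    i, j = tokens.index(e1), tokens.index(e2)
    return [
        float(abs(j - i)),                      # token distance between entities
        1.0 if "expressed" in tokens else 0.0,  # trigger-word indicator
        float(len(tokens)),                     # sentence length
    ]

v = feature_vector("LEC1 and LEC2 are specifically expressed in seeds",
                   "LEC1", "seeds")
# v == [7.0, 1.0, 8.0]
```

A feature-based classifier then operates only on such vectors, which is precisely why the choice and quality of the features dominate system performance.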
Feature-based approaches are very popular for RE, with some recent examples being [Özgür and Radev, 2009, Fayruzov et al., 2009, Reza et al., 2011, Liu et al., 2012, Kambhatla, 2004]. Feature engineering is most often done manually. Kambhatla [Kambhatla, 2004], for example, manually constructs a limited set of features combining various syntactic sources for use with a Maximum Entropy classifier. Crafting such features can be a tedious process, and evaluating and selecting the most useful features is also a difficult task. Fayruzov et al. [Fayruzov et al., 2009] study the linguistic features used for Protein-Protein Interaction (PPI) extraction and find that only a small subset of the features typically used are actually necessary.
A different family of machine learning methods, called kernel methods, do not require feature engineering, as they are based on similarity functions. These functions calculate the pairwise similarity between two instances and, thanks to a method called the “kernel trick”, they do not require a feature vector representation. (Web and Open IE, which have different algorithmic constraints, are not covered in this section.) The kernel trick takes its name from kernel functions. These functions allow operating in an implicit feature space without ever computing the exact coordinates of the data in that space, simply by computing the inner products between the images of all pairs of data points in the feature space.
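The kernel trick can be illustrated with the classic degree-2 polynomial kernel: the kernel value equals an inner product in an explicit quadratic feature space, but it is computed without ever constructing that space. The sketch below verifies this equivalence numerically for 2-dimensional inputs.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: an inner product in an implicit
    feature space, computed without building that space."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def explicit_map(x):
    """The quadratic feature space the kernel implicitly works in
    (for 2-D input): (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, y = [1.0, 2.0], [3.0, 0.5]
implicit = poly_kernel(x, y)
explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(y)))
# Both routes give (1*3 + 2*0.5)^2 = 16.0
```

For richer structures such as parse trees, the same principle applies: a tree kernel counts shared substructures directly, where the corresponding explicit feature space (one dimension per possible substructure) would be far too large to enumerate.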
A shift towards using kernel methods can be observed in recent years; for example, all of the approaches, and notably the best-ranking ones, in the Drug-Drug Interaction (DDI) Extraction SemEval 2013 Challenge used kernel methods [Segura Bedmar et al., 2013]. A representative example of this category will be presented below, with regard to the types of linguistic information it uses, since kernel methods are most often used with syntactic graphs. Additionally, readers are invited to consult Tikk et al., who performed a benchmark of kernel methods for PPI extraction in 2010 [Tikk et al., 2010].


Syntactic Information

The link between syntactic relations and semantic ones is intuitive, and I consider the use of the former to predict the latter to be the best approach. This hypothesis was also explored by Bunescu and Mooney, who state that extraction accuracy increases with the amount of syntactic information used [Bunescu and Mooney, 2005].
However, while deeper representations promise better generalization and semantic relevance, they inevitably suffer from errors in computing these representations [Zhao and Grishman, 2005]. The direction to take in developing RE systems is therefore to optimize the use of syntactic information while still taking advantage of shallow information, so that information is not missed because of parsing errors.
The first learning approaches were meant to facilitate the production of extraction patterns [Huffman, 1995, Agichtein and Gravano, 2000, Brin et al., 1998], and they rapidly evolved to take advantage of syntactic information [Park et al., 2001, Yakushiji et al., 2001, McDonald et al., 2004]. Approaches that avoid using any syntactic information include treating relation extraction as a sequence labeling task [Culotta et al., 2006].
Approaches using minimal syntactic analysis include the HMM system of Ray and Craven [Ray and Craven, 2001], using just Part-Of-Speech tagging. Going a bit further, shallow parsing (or chunking) has been a popular choice in the kernel-based approaches.
Going further in the same direction, Pustejovsky et al. [Pustejovsky et al., 2002] used shallow parsing and sophisticated anaphora resolution. Zelenko et al. [Zelenko et al., 2003] and Mooney et al. [Mooney et al., 2006] proposed kernel-based approaches on shallow parse trees which gathered a lot of attention, with the latter using the trees as a sequence, in an approach reminiscent of the system of Culotta et al. [Culotta et al., 2006]. Shallow parse tree kernels have seen continued use [Claveau, 2013, Segura-Bedmar et al., 2011] and have been shown to outperform fuller parses in some cases [Giuliano et al., 2006], confirming the hypothesis that parsing errors, which occur more frequently in fuller parsers, can have a significant impact on the results of RE.

Linguistic Pre-processing

AlvisRE requires that the following steps take place in a pre-processing phase. A detailed account of this linguistic analysis and the associated choices can be found in [Ratkovic, 2014]; an outline of the necessary steps is given here for context.

Tokenization and segmentation

Tokenization and segmentation play an important role in the performance of relation extraction in the biomedical domain [Jiang and Zhai, 2007], as they affect both the POS tagging and parsing steps.
Tokenization is based upon the notion of tokens. A token is an instance of a sequence of characters in a particular document that are grouped together as a useful semantic unit for processing. In this work a token is considered not only as an atomic unit, but also as a more linguistically motivated basic unit of meaning (or word). Cases where tokens are not words include names of proteins or species containing whitespace (e.g. “Bacillus subtilis”), names of proteins and genes containing punctuation or numbers (e.g. “sigma (A)” and “A. thaliana”), and other special cases such as Latin names, numbers, DNA sequences, etc. (see [Ratkovic, 2014] for more details).
The AlvisNLP/ML pipeline contains two tokenization/segmentation modules: WoSMIG and SeSMIG, for word and sentence segmentation respectively. They rely on a combination of regular expression rules, domain dictionaries, heuristics and semantic annotation rules.
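The internals of WoSMIG are not reproduced here, but the general idea of dictionary-aided tokenization, where multiword domain names are kept as single tokens before falling back to regular-expression rules, can be sketched as follows. The dictionary entries and the fallback rules are illustrative only, not the actual module's resources.

```python
import re

# Illustrative domain dictionary of multiword / punctuated names that
# must survive as single tokens.
DOMAIN_DICT = ["Bacillus subtilis", "A. thaliana", "sigma (A)"]

def tokenize(text):
    """Dictionary hits become single tokens; everything else is split
    by a simple word/punctuation regular expression."""
    pattern = "|".join(re.escape(name) for name in
                       sorted(DOMAIN_DICT, key=len, reverse=True))
    tokens = []
    for piece in re.split(f"({pattern})", text):
        if piece in DOMAIN_DICT:
            tokens.append(piece)  # dictionary hit: keep as one token
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", piece))
    return tokens

tokens = tokenize("The sigma (A) factor of Bacillus subtilis.")
# → ["The", "sigma (A)", "factor", "of", "Bacillus subtilis", "."]
```

Sorting the dictionary by decreasing length ensures that longer names take precedence when entries overlap.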

Lemmatization and normalization

Both lemmatization and (token) normalization refer to processes that group together different forms of the same linguistic object, so that they can be analyzed as identical afterwards. Lemmatization does so by grouping the different inflected forms of a word and returning its base or dictionary form, known as the lemma. Stemming is a simplistic alternative to lemmatization, as it consists of heuristically chopping off the ends of words.
Normalization groups tokens which match despite superficial differences in their character sequences and returns a canonical form (e.g. mapping “A. thaliana” and “Arabidopsis Thaliana” to the canonical form “Arabidopsis thaliana”).
While lemmatization and (surface) normalization are not obligatory, they can greatly influence the performance of AlvisRE, as it is directly dependent on the ability to correctly calculate the similarity between tokens, as will be shown later in this section.

Syntactic Parsing

Syntactic parsing is optional in some representation alternatives, as will be illustrated below, but AlvisRE was built with dependency-based parsing in mind. Zorana Ratkovic [Ratkovic, 2014] tested three parsers and chose to integrate the CCG parser [Rimell and Clark, 2008] into AlvisNLP/ML. In that work, AlvisRE was tested with CCG (standard and transformed by the Alvis Grammar) as well as Enju [Miyao and Tsujii, 2005]. These tests showed that CCG (standard or optimized) produces the best input for AlvisRE. Constituent-based parsers and other dependency-based parsers have not been tested at this time.
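While the parsers themselves are external tools, the reason dependency structures matter for relation extraction can be sketched with a hand-built parse: the shortest path between two entities in the dependency graph often carries the relation. The edges and labels below are written by hand for the example sentence and are not real parser output.

```python
from collections import deque

# Hand-built toy dependency parse of
# "LEC1 [is] specifically expressed in seeds": (head, dependent, label).
EDGES = [
    ("expressed", "LEC1", "nsubj:pass"),
    ("expressed", "seeds", "obl"),
    ("expressed", "specifically", "advmod"),
    ("seeds", "in", "case"),
]

def shortest_path(edges, start, goal):
    """Breadth-first search over the dependency graph, treated as undirected."""
    graph = {}
    for head, dep, _ in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path(EDGES, "LEC1", "seeds")
# → ["LEC1", "expressed", "seeds"]
```

The path passes through the trigger word “expressed”, which is exactly the kind of evidence a relation classifier exploits, and which motivates building AlvisRE on dependency parses.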

Table of contents:

List of Figures
List of Tables
Abstract
Introduction
1 Problem Statement
1.1 Background of the Problem
1.2 Purpose and Significance of the Study
2 Research Questions
3 Thesis Contributions
3.1 Research Design and Limitations
4 Thesis Outline
1 Background & Related Work 
1.1 Introduction
1.2 Biological Background
1.2.1 Why the seed development of Arabidopsis thaliana?
1.2.2 A. thaliana regulatory network basics
1.3 The seed development network in A. thaliana
1.4 Knowledge Extraction
1.4.1 Knowledge models, domains and IE
1.4.2 Knowledge Extraction for Biology and A. thaliana
1.5 Knowledge Expressed in Text: the Corpus
1.5.1 The building blocks: entities and semantic relations
1.5.2 Corpus
1.6 Information Extraction
1.6.1 Defining IE
1.6.2 Evaluating IE systems
1.6.3 IE Systems
1.7 The Alvis ecosystem
1.8 Conclusion
2 Data 
2.1 Introduction
2.1.1 Tasks & Development Phases
2.1.2 People
2.2 Tools
2.2.1 Collaborative Tools
2.2.2 Documentation
2.2.3 Annotation
2.3 Model
2.3.1 Conceptual Model
2.3.2 Annotation Model
2.3.3 Model Transformations
2.4 Corpus
2.4.1 Source
2.4.2 Corpus Annotation
2.5 Results and Discussion
2.6 Conclusion
3 Relation Extraction 
3.1 Introduction
3.1.1 Information Extraction
3.2 Linguistic Pre-processing
3.2.1 Tokenization and segmentation
3.2.2 Lemmatization and normalization
3.2.3 Syntactic Parsing
3.3 Representation
3.3.1 Introduction
3.3.2 From text files to complex sentence objects
3.3.3 From sentences to candidate relations
3.3.4 From candidate relations to paths
3.3.5 From paths to a machine-learning ready representation
3.4 Classification
3.4.1 Introduction
3.4.2 Support Vector Machines
3.5 Adding Semantic Information
3.5.1 Introduction
3.5.2 Manual Classes
3.5.3 WordNet
3.5.4 Textual similarity
3.5.5 Distributional Semantics
3.6 Results
3.6.1 Experimental setup
3.6.2 Experimental validation
3.7 Discussion
3.8 Conclusion
4 Relation Extraction on A. thaliana 
4.1 Introduction
4.2 SeeDev in BioNLP-ST ’16
4.3 AlvisRE on Arabidopsis thaliana
4.3.1 Optimizing the SVM margin parameter
4.3.2 Exploring different relation types
4.3.3 The impact of model transformations on Relation Extraction
4.3.4 AlvisRE compared to the SeeDev participants
4.4 Discussion
Conclusion & Future Work 
1 Conclusion
2 Future Work
2.1 Data
2.2 Information Extraction
Appendices 
A Manual Classes for LLL 
B Trigger words for Bacteria Biotopes 
C A complete AlvisRE input file 
D The list of the articles in the Arabidopsis thaliana corpus.
E The Arabidopsis thaliana Seed Development conceptual model in detail.
F The Arabidopsis thaliana Seed Development annotation model in detail.
Bibliography 
