This part presents and explains the main concepts and mechanisms that form the theoretical basis of the thesis work. The following notions are presented in the subsequent sub-sections:
• Ontology (section 2.1)
• Automatic ontology construction methods (section 2.2)
• Protégé environment (section 2.3)
• Information extraction (section 2.4)
• String matching (section 2.5)
• Ontology design patterns (section 2.6)
• Software requirement and design description (section 2.7)
Many definitions of the term ontology can be found in the computer science literature. This section therefore gives an overview of those definitions and also presents some ontology representation languages.
Definitions of ontology vary in precision. According to  an ontology is “an explicit specification of a conceptualization”, where a conceptualization is a simplified representation of an area of the real world. For instance, an ontology of a cinema would include information about the number of rooms in the cinema, the number of seats in each room, the size of the screen in each room, etc. This definition does not indicate the importance of the relations between the objects used in the conceptualization.
A more precise definition can be found in  and , which define an ontology as “A hierarchically structured set of concepts describing a specific domain of knowledge that can be used to create a knowledge base. Ontology contains concepts, a subsumption hierarchy, arbitrary relations between concepts, and axioms. It may also contain other constraints and functions”.
Figure 2-1 shows an example of an ontology. In this ontology, hierarchy links are made between the pizza names (MargueritaPizza, AmericanaPizza, etc.) and the “NamedPizza” concept. The pizzas are also divided into two categories, “CheesyPizza” and “VegetarianPizza”.
In recent years, ontologies have been widely used in the Knowledge Management area to build applications that are based on common knowledge for a specific domain (e.g. the Gene Ontology), or knowledge-based services that are able to use the Internet, as described in . This is because an ontology aims at reusing and sharing knowledge across systems and the users of those systems. For instance, in  an application is used on the one hand to automatically query a knowledge base and on the other hand to generate biographies of artists.
Ontology representation languages
According to the previous definitions, an ontology helps to describe a domain of knowledge. This domain of knowledge therefore needs to be represented in a machine-understandable language in order to perform basic operations such as querying or storage. Several classes of languages can be used to represent ontologies: frame-based languages, Description Logics-based languages, XML-related languages, etc.
Frame-based languages: In  a frame is defined as “a data structure that provides a representation of an object or a general concept”. Frames can be considered as classes in object-oriented languages, but without methods. So-called slots are used to represent frame attributes and associations between frames. Examples of frame-based languages are Ontolingua (also the name of the system compatible with the language) and OKBC (Open Knowledge Base Connectivity).
Description Logics-based languages: Thanks to formal logic-based semantics, these languages make it possible to represent the knowledge of a specific domain in a well-structured way. The main idea of description logics is to use basic sets of concepts and binary relations to create complex concept and relation expressions. Examples of Description Logics-based languages are ALC, DAML+OIL and OWL (Web Ontology Language).
XML-related languages: In addition to validating XML (eXtensible Markup Language) documents, XML-related languages can be used to represent and perform operations on information (metadata) contained in Web documents. The languages used for ontology representation are RDF (Resource Description Framework) and RDFS (RDF Schema). Both RDF and RDFS can be written in XML syntax. The main idea of RDF is to use resources (e.g. a web page) and properties (a specific attribute of a resource) to create statements in the form of “subject-predicate-object expressions”. Here a subject is considered as a resource. RDFS enriches RDF by providing mechanisms to structure RDF resources, such as defining restrictions on resources, defining classes and subclasses, etc.
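The subject-predicate-object idea can be illustrated with a minimal sketch. It uses plain Python tuples rather than an RDF library, and all URIs, property names and the helper `objects_of` are hypothetical names chosen only for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# All URIs and property names below are made up for illustration.
PAGE = "http://example.org/index.html"
CREATOR = "http://example.org/terms/creator"
LANGUAGE = "http://example.org/terms/language"

triples = [
    (PAGE, CREATOR, "Alice"),
    (PAGE, LANGUAGE, "en"),
    ("http://example.org/about.html", CREATOR, "Bob"),
]


def objects_of(statements, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


creators = objects_of(triples, PAGE, CREATOR)
print(creators)
```

Here the web page plays the role of the resource (subject), the creator and language are its properties (predicates), and the strings are the property values (objects).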
Automatic ontology construction method
Since several ontology construction methods are available in the literature, this section gives an overview of existing automatic ontology construction (AOC) approaches and identifies differences and similarities among them. Finally, the ontology construction framework of the prototype system is presented.
Existing automatic ontology construction approaches
As previously said, the purpose of the thesis is to implement a prototype based on the AOC process described in . Nevertheless, other AOC processes have been used in other areas and for different purposes, as in .
In  the methodology used for automatically building ontologies consists of applying information extraction tools to online web pages and then combining this information with an ontology and the WordNet lexicon to populate a knowledge base. This knowledge base is finally queried to automatically construct biographies of artists. One difficulty faced in this experiment was duplicate information in the documents, which created redundant explanations. This difficulty is also mentioned in  as a problem in the AOC approach used by the semantic agent InfoSleuth. A solution proposed in  for the problem of different sentences that refer to the same concept is to “differentiate them via the co-occurrence frequency”, that is, to take into account how often the same sentence appears in the text. Although the process we intend to implement is different, there are similarities in the knowledge extraction step, since we need to extract terms and relations or associations from a text corpus.
In  ontologies are automatically built from a statistical treatment of biological literature. The aim of the method used in  is to extract terms based on their frequency of appearance in the documents and the group of gene products. Key terms are also extracted for the associated genes. The method described in  has produced satisfying results: after the automatic classification of the genes, it is easy to identify the genes that share common information in the literature. Nevertheless, this method is not suitable for our purpose, since the ontologies are built by grouping concepts that have similar information and functions in the literature in order to enrich the Gene Ontology (GO).
In  a different method for AOC is proposed, based on three main sources: a technical text corpus, a plant dictionary and a multilingual thesaurus. A different term extraction approach  and  is used in this case, which relies on a shallow parser. The methodology used in  for ontology construction can be summarized in three steps: i) extraction of terms from the text corpus, ii) dictionary-based ontology extraction to extract relational information with other plants, since the ontology domain is plants, and iii) thesaurus translation to ontology terms. One main advantage of this methodology is its relatively high reliability: the system reaches 87% accuracy, and the remaining 13% of errors are due to term extraction errors.
In  ontologies are automatically constructed by reusing existing knowledge. The method aims at improving the reliability of automatically generated ontologies. In  the process for building ontologies can be summarized as: i) constructing a frame ontology for a specific domain from the WordNet lexicon, and ii) combining knowledge from a domain expert with the frame ontology previously built. One disadvantage of the process suggested in  is that the knowledge from the domain expert is not collected automatically.
Although the AOC process suggested in  is also based on reuse, the approach differs from the one in . In  the goal is to generate ontologies from existing ontologies by using an ontology search engine to find different ontologies of the same specific domain and then combining fragments of those ontologies to construct a more complete ontology. This approach is efficient, since well-structured ontologies that have already been checked by domain experts are reused. Another advantage is that more and more ontologies are available via ontology search engines. On the other hand, the approach suggested in  is inefficient when it comes to building ontologies for a new domain for which few ontologies are available through search engines, or when the available ontologies are not reliable because of a lack of domain experts.
Automatic ontology construction method for the prototype
The prototype system is intended to follow, in a first stage, the general framework for automatic ontology construction developed by the Information Engineering research group of the School of Engineering, Jönköping University, as presented in . A next stage for a future update of the prototype could be to add a different methodology for AOC.
The main idea of the ontology construction approach presented in  is to extract terms and concepts from a text corpus (a set of text files), match those extracted terms against the terms and concepts contained in a set of ontology design patterns (ODPs), and afterwards select the patterns that best match the extracted terms and associations to construct the ontology. The steps followed to construct an ontology automatically are:
1. Construct ontology design patterns.
2. Extract terms from a text corpus.
3. Match extracted terms against concepts in patterns.
4. Extract associations from a text corpus.
5. Match the extracted associations against associations in patterns.
6. Calculate a matching score that reflects how well the extracted terms and associations match the ODPs.
7. Select the successfully matched ODPs, that is to say the ones that have the most concepts and associations that match the extracted terms and associations.
8. Construct the ontology with the selected patterns and the extracted terms and associations.
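Steps 3 to 7 above can be sketched as follows. This is only an illustration under our own assumptions, not the scoring used in the referenced framework: matching is done by exact comparison after lower-casing (a real implementation would use the string matching techniques of section 2.5), the score is the fraction of a pattern's concepts and associations that were matched, and the `Pattern` type, the sample patterns and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical ODP: a name, its concepts and its associations."""
    name: str
    concepts: frozenset
    associations: frozenset  # (source, label, target) tuples


def normalise(text):
    return text.strip().lower()


def matching_score(pattern, terms, associations):
    """Fraction of the pattern's concepts and associations found in the extraction."""
    term_set = {normalise(t) for t in terms}
    assoc_set = {tuple(normalise(x) for x in a) for a in associations}
    matched = sum(1 for c in pattern.concepts if normalise(c) in term_set)
    matched += sum(
        1 for a in pattern.associations
        if tuple(normalise(x) for x in a) in assoc_set
    )
    total = len(pattern.concepts) + len(pattern.associations)
    return matched / total if total else 0.0


def select_patterns(patterns, terms, associations, threshold=0.5):
    """Keep the names of the patterns whose score reaches the (assumed) threshold."""
    return [
        p.name for p in patterns
        if matching_score(p, terms, associations) >= threshold
    ]


patterns = [
    Pattern("Employment",
            frozenset({"Person", "Company"}),
            frozenset({("Person", "works for", "Company")})),
    Pattern("Supply",
            frozenset({"Product", "Price", "Supplier"}),
            frozenset({("Supplier", "delivers", "Product")})),
]
extracted_terms = ["person", "Company", "product"]
extracted_associations = [("person", "works for", "company")]

selected = select_patterns(patterns, extracted_terms, extracted_associations)
print(selected)
```

In this toy run the first pattern is fully matched, whereas only one of the four elements of the second pattern is found, so only the first pattern would be passed on to the construction step.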
A step common to the ontology construction framework presented in  and the others presented in  is that all of them use term extraction. On the other hand, only the approach in  uses ODPs and a threshold for pattern selection.
A comparative study presented in  has shown that it is not yet possible to measure the difference between manual construction approaches and this ODP-based method. In manual construction, the main concepts are given priority for inclusion in the ontology because they are used by the enterprise. With the approach in , however, the main concepts are not given priority, since the method includes only concepts that are in the ODPs.
Protégé is a freely available environment for ontology construction, developed in Java at Stanford University. In addition to being open source, it offers various plugins for extending ontology construction, constraint axioms and integration functions. Protégé is available in two different frameworks, Protégé-Frames and Protégé-OWL. For our purpose, we will use Protégé-OWL, since it was a requirement that the prototype system support this framework for both ontology and ODP construction.
Ontology building with Protégé
As previously said, Protégé is an environment for ontology building. It has been used in several projects, such as  and , during the process of automatic ontology building. In , Protégé is not used to build the ontology but to link a knowledge base and an ontology server. In , Protégé is used to build an e-learning knowledge base ontology; this knowledge base is then combined with web services in order to provide dynamic course construction.
Protégé is an extensible environment thanks to many freely available plugins; the interested reader is referred to the Protégé web site for more information. For instance, the Query Export Tab permits querying Protégé knowledge bases, and the Oracle RDF Data Model plugin deals with OWL ontologies and the Oracle RDF (Resource Description Framework) format. Other plugins have been developed for other purposes: in  a plugin is used to permit Protégé to support storage of RDF queries through the external application Sesame, and in  a plugin has been developed to enable the Protégé environment to create ontologies in the Web Ontology Language (OWL). Six types of Protégé plugins can be identified: application, backend (Knowledge Base Factory), import/export plugin, project plugin, slot widget and tab widget.
As described in , there are many advantages to using the Protégé environment: for example, its user interface and output file format are highly customisable, and its extensible architecture permits the integration of external applications. Due to those advantages and the requirements of the thesis work, our prototype will be implemented as a tab widget plugin for the Protégé environment. The ontology creation functionalities of the Protégé environment will be reused as a basis for constructing ODPs.
Ontology representation in Protégé
The Protégé-OWL framework uses different terms for the entities that compose an ontology than those used in the ontology definition presented in section 2.1. This section presents the vocabulary used in the Protégé-OWL framework. A complete presentation of the OWL ontology components can be found in .
• Individuals: they are equivalent to concept instances. Examples of individuals for the concept colour are red, green, yellow, etc.
• Properties: they are equivalent to concept associations. They have cardinalities, and they can be transitive or symmetric. In Protégé-OWL, the components of a property are called the property domain and the property range. An example of a property for the concepts “person” and “car” is Drive: the property domain is “person” and the property range is “car”.
• Classes: they are equivalent to ontology concepts. In Protégé-OWL all classes are considered subclasses of the class owl:Thing.
• Class hierarchy: equivalent to taxonomy.
• Disjoint classes: they make it possible to specify that several classes do not overlap, so that an individual cannot be an instance of more than one of them.
For the purpose of the prototype system, the previously introduced definitions will be used for representing the concepts, associations and instances in both the ontology design patterns and the generated ontology.
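The vocabulary above can be illustrated with a small sketch in plain Python. It does not use the Protégé-OWL API; all names (`SUBCLASS_OF`, `assertion_is_valid`, the individuals "anna" and "beetle") are hypothetical. Note also that the sketch treats domain and range as validity checks, which is a simplification: in OWL they are axioms used for inference rather than constraints.

```python
# Class hierarchy (taxonomy): child class -> parent class.
# "Thing" stands in for owl:Thing, the root of every class.
SUBCLASS_OF = {"Person": "Thing", "Car": "Thing", "Colour": "Thing"}

# Pairs of classes declared disjoint.
DISJOINT = {frozenset({"Person", "Car"})}

# Individuals and the class each one is an instance of.
INDIVIDUALS = {"anna": "Person", "beetle": "Car", "red": "Colour"}

# Properties with their domain and range.
PROPERTIES = {"drive": {"domain": "Person", "range": "Car"}}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def assertion_is_valid(subject, prop, obj):
    """Check a property assertion against the property's domain and range."""
    spec = PROPERTIES[prop]
    return (is_a(INDIVIDUALS[subject], spec["domain"])
            and is_a(INDIVIDUALS[obj], spec["range"]))


def are_disjoint(class_a, class_b):
    return frozenset({class_a, class_b}) in DISJOINT


print(assertion_is_valid("anna", "drive", "beetle"))
print(assertion_is_valid("beetle", "drive", "anna"))
```

The first assertion respects the domain “person” and the range “car” of the property, whereas the second one reverses them and is rejected.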
Term extraction is a subfield of information extraction. Information extraction aims at extracting the most valuable information from either structured documents, such as HTML pages, or unstructured documents, such as natural language documents. As shown in section 2.2.1, it is one of the main preliminary steps of several AOC approaches. Information extraction is required in two steps of the ontology construction framework presented in : firstly for extracting terms from a text corpus, and secondly for extracting associations from a text corpus. In this part we focus on different term extraction methods and on tools that implement those approaches.
Information extraction methods
Dictionary-based extraction methods
A definition of dictionary-based extraction is presented in : the method “uses existing terminological resources in order to locate terms occurrences in a text”. In other words, a set of concepts is stored in the dictionary, and this dictionary is afterwards reused together with information learning methods to extract terms. An example of dictionary-based extraction is presented in .
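The core of the idea, locating occurrences of known terms in a text, can be sketched as follows. This is a minimal illustration under our own assumptions: it only performs case-insensitive whole-word lookup and leaves out the learning component mentioned above; the function `find_terms`, the sample text and the dictionary are hypothetical.

```python
import re


def find_terms(text, dictionary):
    """Return (term, start_offset) for each whole-word occurrence of a dictionary term."""
    hits = []
    for term in dictionary:
        # re.escape protects terms that contain regex metacharacters.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, match.start()))
    # Report the occurrences in reading order.
    return sorted(hits, key=lambda hit: hit[1])


text = "A car has an engine. The engine drives the wheels."
dictionary = ["engine", "wheel", "car"]

hits = find_terms(text, dictionary)
print(hits)
```

The plural “wheels” is not reported for the dictionary entry “wheel”, which shows why such a lookup is usually combined with stemming or other linguistic processing.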
Shallow text processing
Shallow parsers make it possible to extract linguistic structures from texts and represent them in compact data structures. They are founded on natural language components and generic linguistic knowledge, and they permit the efficient identification of relations among a set of concepts. Even for a set of concepts of average size, combining concepts without considering natural language rules can generate a large set of relations; shallow parsers therefore allow restrictions to be added to the relations so that nonsensical ones are avoided.
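The growth of the candidate relation set can be made concrete with a small sketch. It is not a shallow parser: as a crude stand-in for linguistic restrictions, it keeps only the ordered pairs of concepts that appear, in that order, within one sentence. The concepts, the sample text and the function `cooccurring_pairs` are hypothetical.

```python
import re
from itertools import permutations

concepts = ["driver", "car", "engine", "wheel"]
text = "The driver starts the car. The engine turns each wheel."

# Without any restriction, every ordered pair of distinct concepts
# is a candidate relation: n * (n - 1) pairs for n concepts.
all_pairs = list(permutations(concepts, 2))


def cooccurring_pairs(text, concepts):
    """Keep ordered pairs whose concepts occur, in that order, in one sentence."""
    kept = set()
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        positions = {c: words.index(c) for c in concepts if c in words}
        for first, second in permutations(positions, 2):
            if positions[first] < positions[second]:
                kept.add((first, second))
    return kept


restricted = cooccurring_pairs(text, concepts)
print(len(all_pairs), len(restricted))
```

Even this simple restriction removes most of the candidate pairs; a shallow parser would apply richer restrictions based on the linguistic structure of each sentence.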