Mining definitions from RDF annotations using Formal Concept Analysis


Knowledge Discovery in Databases

A large amount of data is produced in all fields, such as biomedical data, customer data, financial data, etc. It is therefore important to design new algorithms and processes to effectively extract knowledge from these data. Knowledge Discovery in Databases (KDD) is mainly concerned with providing a number of techniques for making sense of such data. The main goal of the KDD process is to map low-level data, which are usually large in volume, into compact representations that are easily interpretable, such as patterns.
Multiple algorithms for discovering unknown hidden patterns are used for moving from data to knowledge, which may then be used for understanding or prediction. The phrase KDD was first introduced in Fayyad et al. (1996a,b) to emphasize the fact that knowledge is the end product of data-driven discovery. Figure 1 depicts the KDD process step by step. The following is a description of each phase of the KDD process:
Selection. The selection step develops a clear understanding of the application domain and the relevant prior knowledge. Identification of the goal of the KDD process from the user's point of view is also an important task. Based on this understanding and the overall plan, the dataset or a subset of the dataset is selected, on which the subsequent KDD tasks are performed.
Preprocessing. During the preprocessing step the data is cleaned and processed. This includes noise reduction if needed, collecting the information through which noise can be taken into account and modeled, and designing a way to deal with missing data fields.
Transformation. This step includes data reduction and projection, which focus on finding useful features representing the data depending on the goal. Dimensionality reduction or transformation methods are used to reduce the effective number of variables under consideration or to find invariant representations of the data. After the transformation of the data, the goal of the KDD process should be identified, i.e., whether it is summarization, classification, clustering or prediction. This step also includes selecting the methods to be used for searching for patterns in the data and deciding which models and parameters may be appropriate.
Finally, the chosen data mining method is matched with the overall criteria of the KDD process.
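To make the sequence of phases concrete, here is a minimal Python sketch that chains selection, preprocessing, transformation and a trivial mining step on toy tabular data; all function names and the data are hypothetical and only illustrate the flow of the KDD process, not any particular implementation.

```python
# Illustrative KDD pipeline on invented data (all names are hypothetical)
raw_data = [
    {"id": 1, "age": 34, "income": 52000, "label": "yes"},
    {"id": 2, "age": None, "income": 61000, "label": "no"},
    {"id": 3, "age": 29, "income": 48000, "label": "yes"},
]

def select(records):
    # Selection: keep only the fields relevant to the goal
    return [{k: r[k] for k in ("age", "income", "label")} for r in records]

def preprocess(records):
    # Preprocessing: deal with missing fields (here, drop incomplete records)
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    # Transformation: project onto the features used by the mining step
    return [({"age": r["age"], "income": r["income"]}, r["label"]) for r in records]

def mine(samples):
    # Data mining: a trivial threshold "model" standing in for a real algorithm
    threshold = sum(x["income"] for x, _ in samples) / len(samples)
    return lambda x: "yes" if x["income"] >= threshold else "no"

model = mine(transform(preprocess(select(raw_data))))
print(model({"age": 40, "income": 55000}))  # -> "yes"
```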

Exploratory Data Mining and Web of Data

Recently, there has been a substantial growth of interest in publishing data on-line in an entity-relationship format, i.e., as Resource Description Framework (RDF) data, and in the form of ontologies. There has been an explosion of data sources published in RDF and then linked to other data sources, called Linked Data (LD) Bizer et al. (2009a), which provides a set of principles to be followed for publishing data in this format. These data are then accessible in many ways, i.e., as RDF dumps, by crawling from one entity to another by following edges, or through querying using the standard query language SPARQL. This increased interest in the publication of data in RDF format gives rise to many interesting challenges.
One of the challenges faced by web search engines is that they retrieve a long list of answers, and the user has to sift through all of them to find the relevant ones. To address this problem, Web Clustering Engines (WCE) Carpineto et al. (2009) were introduced. A WCE lets the user send a query to a search engine and clusters the ranked results (documents) before displaying them to the user. Such a clustering helps the user select only those web pages which are most related to her needs. As mentioned previously, the Web of Documents has turned into a Web of Data which is accessible through SPARQL queries. The answers generated by a SPARQL query take the form of a list. Even if the user gets hundreds of answers, she is unable to learn the structure of these answers or find the hidden patterns; she faces the same problem of sifting through thousands of irrelevant answers to get the relevant ones. Moreover, navigating through this sea of answers creates problems of interpretation. In this thesis we target this problem and provide the user with navigational capabilities over these answers by following the KDD process with FCA as the clustering algorithm. We also discuss the implementation of a visualization tool which allows guided navigation and user interaction.
One of the major issues with crowd-sourced resources on Linked Data such as DBpedia is that they contain missing information. DBpedia is the RDF version of Wikipedia infoboxes. It also uses Wiki-categories, i.e., the hierarchy of categories introduced by Wikipedia for organizing Wikipedia articles. In the current thesis we use association rule mining to complete the RDF-based descriptions of DBpedia using Wiki-categories.
In the past there have been several studies which apply FCA to RDF graphs, such as Kirchberg et al. (2012) and d'Aquin and Motta (2011). Both studies directly apply standard FCA to such a structure and do not explicitly target the heterogeneity present in these data, which arises from the tree structure of the RDF Schema or ontology (we call it background knowledge). Following this paradigm, RDF triples can be clustered w.r.t. this background knowledge. There are several approaches which provide document clustering with the help of Formal Concept Analysis using background knowledge. We use these methodologies for providing navigation and exploration over the parts of the RDF graph which are relevant to the user. This approach constitutes a way of clustering SPARQL query answers based on background knowledge, where the background knowledge is contained in the Knowledge Base over which the SPARQL query is posed.

Iceberg Concept Lattice

Concept lattices sometimes contain a huge number of concepts. In order to restrict the number of concepts, facilitate interpretation and obtain only frequent concepts, iceberg concept lattices were introduced in Stumme et al. (2001). An iceberg concept lattice contains only the topmost part of the lattice. For a given concept (A, B), the support of B is the cardinality of A, denoted by |A|. The relative support is given by |A|/|G| and belongs to the interval [0, 1], where |G| is the total number of objects in the context K.
Definition 8. Let B ⊆ M and let the minimum support threshold be denoted by minsupp ∈ [0, 1]. The support of the attribute set (also called itemset) B in K is supp(B) = |B′|/|G|. An itemset B is said to be frequent if supp(B) ≥ minsupp.
A concept is called a frequent concept if its intent is frequent. The set of all frequent concepts of K, for a given threshold, is called an iceberg concept lattice of K Stumme et al. (2001). Because the support function is monotonically decreasing, i.e., B1 ⊆ B2 ⇒ supp(B1) ≥ supp(B2), the iceberg concept lattice is an order filter of the concept lattice and thus a join semi-lattice. Consider the complete concept lattice in Figure 1.1 and suppose that minsupp = 2 for computing the iceberg concept lattice; then the concepts whose extent has size less than 2 are not generated. The resulting concept lattice is shown in Figure 1.2.
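As an illustration of the support-based filtering, the following Python sketch enumerates the concepts of a toy formal context by brute force and keeps only those whose relative support reaches a threshold. The context and all names are hypothetical, and the enumeration is deliberately naive rather than an efficient iceberg-lattice algorithm.

```python
from itertools import combinations

# A toy formal context: objects mapped to their sets of attributes (invented data)
context = {
    "g1": {"a", "b"},
    "g2": {"a", "c"},
    "g3": {"a", "b", "c"},
    "g4": {"b"},
}
objects = set(context)
attributes = set().union(*context.values())

def extent(B):
    """Objects having all attributes in B (the derivation B')."""
    return {g for g in objects if B <= context[g]}

def intent(A):
    """Attributes shared by all objects in A (the derivation A')."""
    return set.intersection(*(context[g] for g in A)) if A else set(attributes)

# Enumerate all concepts by closing every attribute subset (fine for toy data only)
concepts = set()
for r in range(len(attributes) + 1):
    for B in combinations(sorted(attributes), r):
        A = extent(set(B))
        concepts.add((frozenset(A), frozenset(intent(A))))

# Iceberg: keep concepts whose relative support |A| / |G| reaches minsupp
minsupp = 0.5
iceberg = [(A, B) for (A, B) in concepts if len(A) / len(objects) >= minsupp]
for A, B in sorted(iceberg, key=lambda c: -len(c[0])):
    print(sorted(A), sorted(B))
```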

Relational Concept Analysis

Besides the class hierarchies provided by concept lattices, an integrated class model must include the relations available between classes and, possibly, abstractions of these relations. The abstraction of relations requires an encoding of roles into a formal context together with their attributes. The Relational Concept Analysis framework addresses these concerns, allowing FCA to take relational data into account effectively and efficiently. Relational Concept Analysis (RCA) Hacene et al. (2007), Rouane-Hacene et al. (2013) is an extension of FCA which is close to the Entity-Relationship model for relational databases. The datasets provided as input to RCA are the same as for FCA, i.e., formal contexts containing object-attribute relations. In addition to formal contexts, it contains relational contexts consisting of relations between the object sets of two formal contexts. This relational context family is defined as follows:
Definition 12 (Relational Context Family (RCF)). An RCF is a pair (K, R) where K = {K_i}_{i=1,...,n} is a set of contexts K_i = (G_i, M_i, I_i) and R is a set of relational contexts, each relation being defined between the object sets of two contexts of K.
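As a rough illustration of the relational step, the following Python fragment shows how, under existential relational scaling, objects of one context could receive relational attributes derived from a relation and from concepts of the target context. The toy data (papers, topics, the relation addresses, and the concepts C_ai and C_db) are invented for the example and are not taken from the thesis.

```python
# Two formal contexts: papers and topics (invented data)
papers = {"p1": {"journal"}, "p2": {"conference"}}
topics = {"t1": {"ai"}, "t2": {"db"}}

# Relational context: relation "addresses" from papers to topics
addresses = {"p1": {"t1"}, "p2": {"t1", "t2"}}

# Assume the lattice of the topic context yields concepts with these extents
concept_extents = {"C_ai": {"t1"}, "C_db": {"t2"}}

# Existential scaling: a paper g receives the relational attribute
# "exists addresses:C" if addresses(g) intersects the extent of concept C
for g, attrs in papers.items():
    for cname, ext in concept_extents.items():
        if addresses.get(g, set()) & ext:
            attrs.add(f"exists addresses:{cname}")

print(papers)
# p1 gains "exists addresses:C_ai"; p2 gains both relational attributes
```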


Formal Concept Analysis for Web Clustering Engines

Formal Concept Analysis is a conceptual clustering technique which simultaneously generates clusters and their labels. Several FCA-based web clustering engines have been introduced. These studies follow the fundamental architecture of Web Clustering Engines, as detailed in the previous section. Section 1.5.1 briefly details how FCA is used in the scenario of web clustering engines and how user interaction is supported in some of the FCA-based web clustering engines. Section 1.5.2 discusses the studies which take into account background knowledge, in the form of a taxonomy, for clustering documents.

Clustering Web Search Results

CREDO, which stands for Conceptual REorganization of DOcuments, was one of the first FCA-based web clustering engines, introduced in Carpineto and Romano (2004b). It takes a user query as input. The query is forwarded to an external Web search engine and only the first 100 pages returned by the search engine are used for further processing. CREDO parses only the snippet of each retrieved document to extract a set of index terms using standard natural language processing tasks. These tasks involve text segmentation, word stemming, stop-word removal and word weighting. Afterwards a concept lattice is built from the document-term relation. The concept lattice of the retrieved documents may contain many irrelevant concepts resulting from spurious combinations of the document terms. CREDO therefore takes a hybrid approach, in which the lower levels of the CREDO hierarchy are built using a larger set of terms than those used to build the top level. CREDO follows a two-step classification procedure, in which the first layer identifies the main topics and the other layers contain the subtopics of each main topic.
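A rough sketch of the snippet-to-index-terms step described above is given below; the snippets, the stop-word list and the crude suffix stripping stand in for real text segmentation, stop wording and stemming, and are not CREDO's actual implementation.

```python
import re

# Hypothetical snippets returned by a search engine for the query "jaguar"
snippets = {
    "doc1": "Jaguar cars are built in Coventry",
    "doc2": "The jaguar is a big cat found in South America",
    "doc3": "Jaguar car prices and reviews",
}

stopwords = {"the", "is", "a", "in", "and", "are", "of"}

def index_terms(text):
    """Very rough stand-in for segmentation, stop-word removal and stemming."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w.rstrip("s") for w in words if w not in stopwords}

# Document-term formal context: documents as objects, index terms as attributes
context = {doc: index_terms(text) for doc, text in snippets.items()}
for doc, terms in context.items():
    print(doc, sorted(terms))
```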
Finally, the obtained CREDO hierarchy is visualized using a conventional tree-folder display, where the top contains all the documents. On a click it shows the children of the top, which classify the documents based on their terms. The user can click on one concept and see its children, thus narrowing down the scope of the search. This operation can be repeated on the newly displayed concepts to further narrow the scope of the search, or it can be performed on other top concepts to browse unexplored branches. Figure 1.11 shows a snapshot of CREDO for the query Jaguar.
CreChainDo. Nauer and Toussaint (2007) improved the CREDO system by allowing the user to provide feedback instead of only navigating the concept lattice. CreChainDo allows the user to mark concepts as relevant or irrelevant. If the user marks a concept as irrelevant then all the sub-concepts of this concept are also marked irrelevant. After the user has given this feedback, a new lattice is computed by removing the irrelevant objects or attributes. The iterations continue in this way until a fixed point is reached where the concept lattice contains only the relevant documents. Figure 1.12 shows a snapshot of the CreChainDo system for the query Carpineto Romano.
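The feedback loop of CreChainDo can be sketched as follows: once a concept is marked irrelevant, the documents in its extent are removed from the context before the lattice is recomputed. The context and the marked extent below are invented purely for illustration.

```python
# Hypothetical feedback step in the spirit of CreChainDo
context = {
    "doc1": {"jaguar", "car", "price"},
    "doc2": {"jaguar", "cat", "america"},
    "doc3": {"jaguar", "car", "review"},
}

irrelevant_extent = {"doc2"}  # extent of the concept the user marked irrelevant

# Remove the irrelevant documents; the lattice would then be rebuilt
context = {d: t for d, t in context.items() if d not in irrelevant_extent}
print(sorted(context))  # ['doc1', 'doc3']
```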

SPARQL Queries with Classification Capabilities

The idea of introducing a VIEW BY clause is to provide a classification of the results and add a knowledge discovery aspect to the results w.r.t. the variables appearing in the VIEW BY clause. Let Q be a SPARQL query of the form Q = SELECT ?X ?Y ?Z WHERE {pattern P} VIEW BY ?X; then the set of variables is V = {?X, ?Y, ?Z}. According to Definition 16, the answer to the tuple (V, P) is represented as [[({?X, ?Y, ?Z}, P)]] = {μ_i} where i ∈ {1, ..., k} and k is the number of mappings obtained for the query Q. For the sake of simplicity, the restriction μ_i|_W is written as μ_i. Here, dom(μ_i) = {?X, ?Y, ?Z}, which means that μ_i(?X) = X_i, μ_i(?Y) = Y_i and μ_i(?Z) = Z_i. Finally, a complete set of mappings can be given as {{?X → X_i, ?Y → Y_i, ?Z → Z_i}}.
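A minimal sketch of how such answer mappings could be turned into a formal context keyed by the VIEW BY variable is given below; the mappings are invented, and the encoding of the remaining variables as (variable, value) attributes is one simple possibility rather than the exact construction used in the thesis.

```python
# Hypothetical answer mappings mu_i for Q = SELECT ?X ?Y ?Z ... VIEW BY ?X
mappings = [
    {"?X": "Berlin", "?Y": "Germany", "?Z": "museum"},
    {"?X": "Berlin", "?Y": "Germany", "?Z": "river"},
    {"?X": "Paris",  "?Y": "France",  "?Z": "museum"},
]

# Formal context: objects are the values of the VIEW BY variable ?X,
# attributes are (variable, value) pairs of the remaining variables
view_by = "?X"
context = {}
for mu in mappings:
    obj = mu[view_by]
    attrs = {(v, mu[v]) for v in mu if v != view_by}
    context.setdefault(obj, set()).update(attrs)

for obj, attrs in context.items():
    print(obj, sorted(attrs))
```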

Table of contents:

Chapter 1 Formal Concept Analysis and Web Clustering Engines 
1.1 Formal Concept Analysis
1.1.1 Many-Valued Context
1.1.2 Iceberg Concept Lattice
1.1.3 Stability Index
1.1.4 Association Rule Mining
1.2 Pattern Structures
1.3 Relational Concept Analysis
1.4 Interactivity over Web Clustering Engines
1.4.1 Exploratory Data Mining
1.4.2 Web Clustering Engines
1.4.3 Interactive Exploration over Web Search Results
1.5 Formal Concept Analysis for Web Clustering Engines
1.5.1 Clustering Web Search Results
1.5.2 Adding Background Knowledge to Concept Lattices
1.6 User Interfaces for Navigating Concept Lattices
1.6.1 Tree-Folder Display
1.6.2 Systems using Hasse Diagram
1.6.3 User Friendly Interface
1.6.4 Querying
1.6.5 Categorization of Lattice Visualisation Tools
1.7 Discussion
Chapter 2 Semantic Web
2.1 From Web of Documents to Semantic Web
2.1.1 Schema.org
2.1.2 Linked Data
2.2 Principles of Linked Data
2.2.1 Resource Description Framework (RDF)
2.2.2 SPARQL
2.2.3 Accessing and Consuming Linked Data
2.3 Clustering SPARQL Query Answers
2.4 RDF Graphs and Formal Concept Analysis
2.5 Discussion
Chapter 3 Lattice-Based View Access
3.1 Introduction
3.1.1 Motivation
3.2 Lattice-Based View Access
3.2.1 SPARQL Queries with Classification Capabilities
3.2.2 Designing a Formal Context of Answer Tuples
3.2.3 Building a Concept Lattice
3.2.4 Interpretation Operations over Lattice-Based Views
3.3 Tool for Interaction with SPARQL Query Answers
3.3.1 Motivating Example
3.4 The RV-Xplorer
3.4.1 Local View
3.4.2 Spy
3.4.3 Statistics about the next level
3.5 Navigation Operations
3.5.1 Guided Downward (Drill-down) / Upward Navigation (Roll-up)
3.5.2 Direct Navigation
3.5.3 Navigating Across Point-of-Views
3.5.4 Altering Navigation Space
3.5.5 Area Expansion
3.5.6 Hiding Non-Interesting Parts of the View
3.5.7 Other Functionalities
3.6 Experimentation
3.6.1 YAGO
3.6.2 DBpedia
3.6.3 Evaluation
3.6.4 Application to Biomedical Data
3.7 Discussion
Chapter 4 Mining definitions from RDF annotations using Formal Concept Analysis
4.1 Introduction
4.2 Improving DBpedia with FCA
4.2.1 Problem context
4.2.2 The completion of DBpedia data
4.2.3 Pattern structures for the completion process
4.2.4 Heterogeneous pattern structures
4.3 Experimentation
4.3.1 Evaluation
4.4 Discussion
Chapter 5 Revisiting Pattern Structures for Structured Attribute Sets
5.1 Introduction
5.2 Pattern Structures for Structured Attributes
5.2.1 Two original propositions on structured attribute sets
5.2.2 From Structured Attributes to Tree-shaped Attributes
5.2.3 On Computing Intersection of Antichains in a Tree
5.3 Experiments and Discussion
5.4 Discussion
Chapter 6 Exploratory Data Analysis of RDF Resources using Formal Concept Analysis
6.1 Introduction
6.1.1 Motivating Example
6.2 Towards RDF-Pattern Structures
6.2.1 From RDF Triples to RDF-Pattern Structures
6.2.2 Similarity Operation Over Classes
6.2.3 Building the RDF Pattern Concept Lattice
6.3 Navigation and Interactive Exploration over Pattern Concept Lattice
6.3.1 Navigation Operations
6.3.2 Interactive Data Exploration over Navigation Space
6.4 Experimentation
6.4.1 Drug Search
6.4.2 DBLP
6.4.3 Visualization
6.5 Related Work
6.6 Discussion
Chapter 7 Conclusion and Perspectives
7.1 Summary of Contributions
7.1.1 Lattice-Based View Access (LBVA)
7.1.2 Mining definitions from RDF annotations using Formal Concept Analysis
7.1.3 Pattern Structures for Structured Attribute Sets
7.2 Perspectives
7.3 List of Publications
Appendix
Appendix A
A Study on the Correspondence between FCA and ELI Ontologies
A.1 Introduction
A.2 Preliminaries
A.3 Transforming ELI Ontologies into Formal Contexts
A.3.1 Motivation
A.3.2 Proposal
A.4 Querying Concept Lattice
A.5 SPARQL query answering over ontologies vs LQL query answering over concept lattices
A.6 Related work
A.7 Conclusion
Appendix B Enriching Transcriptomic Data with Linked Open Data
B.1 Introduction
B.2 Enriching Transcriptomic Data with Hierarchical Information from Linked Data
B.3 Complex Biological Data Integration
B.3.1 Molecular Signature Database (MSigDB)
B.3.2 Domain Knowledge
B.4 From Data to Knowledge
B.4.1 Test Data Sets
B.4.2 Using FCA for Analyzing Genes
B.5 Results
B.6 Conclusion

