A survey on Formal Concept Analysis based Information Retrieval approaches

Get Complete Project Material File(s) Now! »

Calculating a concept lattice

Several algorithms have been proposed to calculate, from a given formal context, the set of all formal concepts with its associated concept lattice structure [87]. Furthermore, dierent implementations are freely available online for most of these algorithms (e.g. Conexp-ng 9, ToscanaJ10, Coron11, Galicia12, etc.). For all the experiments in this thesis we have developed our own system called Sephirot13 which implements the AddIntent algorithm for concept lattice calculation [139].

The Boolean Retrieval Model

The Boolean retrieval model is considered as the rst and one of the simplest techniques to index and retrieve documents [6, 93]. Given a collection of documents G, we consider each document g as represented by a conjunction of Boolean descriptors g0 M, where M is the set of all descriptors (sometimes called repertory or dictionary). A query (request, or prescription) is dened as a set of descriptors connected by a logical operator AND;OR;NOT. Traditionally, descriptors in the Boolean IR model are terms within documents.
It is worth noticing that in this description we have jumped through many well-dened procedures in IR. For example, the operation to obtain the descriptors of a given document that we naturally associate with the derivation operator dened in FCA (g0), in IR is modelled through an inverted index which associates dictionary entries (descriptors) with postings (documents). Furthermore, in our setting we consider documents as already described by a set of conjunctive descriptors. In reality, documents are built from unstructured text and several techniques are available to obtain a reduced set of descriptors from it (e.g. parsing, tokenizing, lemmatization, disambiguation, etc. [93]). All these procedures are by no means irrelevant. In fact, they can be considered critical to the eectiveness of the IR tool they support. Nevertheless, they are not a part of our model and we consider them as given, i.e. we consider that document corpora are already in a document-descriptor format. This is actually true for many of the test datasets available and for all we use through this document. We will provide a further description for
some of them in Section 2.4, Chapter 2.

A survey on Formal Concept Analysis based Information Retrieval approaches

Surveying the intersection of Formal Concept Analysis (FCA) [63] and Information Retrieval [6] is not an easy task. The main complexity is that both domains have an application range so wide that just getting a relevant set of articles to report about is a knowledge discovery process in itself. This is clearly exemplied by the survey presented by Poelmans et al. in 2012 [118] where FCA is used to report about 103 articles related to topics of FCA and IR in a period of only six years (2003-2009) crawled from the Web. In this chapter we intend to approach the surveying in a more general and integral manner. We try to answer a very simple question. How FCA and concept lattices have been used in the context of IR applications? We answer this in a chronological narration, trying to cover the last 25 years of research since the rst inception of the use of lattice structures to model the space of possible queries (or prescriptions, as they were called) to the last approaches, supporting le systems and semantic technologies. As we can observe, most of the approaches presented here rest over a limited pool of ideas and techniques associated with FCA/IR but applied to a myriad of domains and applications. These ideas are:
1. Using a concept lattice as a model of the description and document spaces.
2. Enriching the description space through external knowledge sources.
3. Enabling Relevance Feedback.
Mixing querying and browsing.
Query-by-navigation.
Query-by-example.
4. Using a concept lattice as a support for automatic retrieval.
Our goal in this survey is two-fold. Firstly, we want to catalogue these ideas so future endeavours may have an easier way reaching further domains while developing new dierent and more interesting techniques. Secondly and perhaps more important we set the theoretical background in which the rest of this thesis is rooted.

Pre-FCA history – A lattice to model the description and document spaces

Lattice structures were early adopted by information scientists as a model of document indexing [54, 105]. As early as 1956, Robert A. Fairthorne [54] discussed how to model a library classication system by producing all possible requests (queries) as combinations of categories (descriptors) and logical connectors (AND;OR;NOT) and how this model could be compared to a free distributive lattice. Some years later, Calvin Mooers [105] would consider two spaces for this model, namely the space of prescriptions P (descriptors) and the space of all possible documents subsets as L = }(G).
He realised that L with the set inclusion operator was naturally a partially-ordered set (or poset) and that, under certain circumstances (actually when P = }(M)), P could also be modelled as such. With this, a retrieval system consists in a transformation T : P ! L that is able to take a prescription (query) into the largest subset of documents that satises it (see Figure 1.5).
It is important to note that Mooers did not describe an actual IR system, but a model for retrieval systems that would enable the comparison of dierent approaches. Then, we can observe that FCA is an instance of this model, where the transformation T is naturally represented by a Galois connection dened between }(G) and }(M) and where the concept lattice is an elegant solution for the spaces P and L as it represents them in an integrated manner. Particularly, when this Galois connection is dened in terms of the derivation operator (()0), FCA becomes an implementation of the Boolean IR model.

Enriching the description space through external knowledge sources

In the FCA-IR model explained in Section 1.4.1, attributes are descriptors obtained from the set of documents. As previously explained, this space (P) can be very large but other than that it can suer from other problems. For example, it can be non-representative of the document set by dierent reasons (poor document description, poor vocabulary, incompleteness, etc.). Regarding these issues, it may be useful to use an external knowledge source to complement document descriptions. For example, if we are interested in considering synonymia for indexing (e.g. relating documents referring to concept lattices and Galois lattices) we may use a thesaurus. In case we are interested in considering hierarchical relations (e.g. relating documents referring to monkeys with those referring to primates) we may use a taxonomy. On the other hand, if we are interested in considering logical implications (e.g. relating documents written by a French author to those written by a German author using the label European literature) we may use an ontology. With these concerns, in 1996 Carpineto and Romano proposed a modied version of the GALOIS system to include background information in the form of a thesaurus for document indexing using FCA [21]. The modication was made in the order relation between formal concepts (K) using the order between document descriptors (T ) induced by a thesaurus as follows: (A1;B1) K (A2;B2) () 8m2 2 B2; 9m1 2 B1 s:t: m1 T m2.

READ Eye adaptations to the environment in Parhyale

Information content as a semantic relation measure

We can measure the semantic relation between any two terms by using the measure known as Lin similarity [89]. This is a measure related to the amount of information carried by a term or a word within a context (piece of text, document or corpus). Consider for example that we have a medical document d1 indexed by two terms: arthroscopy26 and complication. In this scenario, our job is to nd similar documents to d1. In order to do so, we take one of the terms within d1 (let us pick the term arthroscopy) and look for other documents that contain that term. Consider that we have two of such documents, the rst containing terms arthroscopy and practice, and the second containing arthroscopy and infection. The question now is how to identify which of these documents is more similar to the original one. Intuitively, we may choose the one with the term infection which supposes a kind of complication in the context of a surgery such as an arthroscopy (indeed, a very serious complication), while the term practice is much more general, making the document with that term less similar to the original. This notion of information correlation between two terms, or between the information content shared by them in a given context, can be measured with Lin similarity which takes in consideration the actual frequency correlation in a text corpus as well as the commonalities those terms have in a lexical hierarchy (such as a dictionary). To formalize, given two terms m1 and m2, the Lin similarity between m1 and m2 is dened as: lin(m1;m2) = 2 log p(ms) log p(m1) + log p(m2).

The principles of Concept Lattice-based Ranking

Section 1.4.4 presents the state-of-the-art w.r.t. to FCA-based IR approaches focusing in the automated retrieval of documents using ranking (i.e. not based on relevance feedback). In this section we provide further details on the technical aspects behind these works. In what follows, we will re-use the example provided in Section 1.4.1.
Consider the formal context in Table 1.8 and the user query qi containing terms arthroscopy and complication (hereafter we refer to the terms within a query as keywords). As discussed in Section 1.4.4, the typical FCA-based IR model considers the representation of the query as a virtual formal concept within the concept lattice. For this purpose, let us dene a virtual object that can be included in the formal context as any other object. Thus, the original formal context is redened to include the query q = (qe; qi), where qe is the virtual object and qi M contains its keywords (i.e. the constraints associated to the query). The new formal context is denoted as Kq = (G [ fqeg; M; I [ f(qe; m)8m2qig) and its associated concept lattice is computed using a FCA algorithm.
The concept lattice computed for the formal context of Table 1.8 (including the query) is illustrated in Figure 2.228. After constructing the lattice, the standard procedure in the CLR family approaches is to nd the object concept of the virtual object qe. This concept is usually called the query concept and it is the starting point to nd documents satisfying a query in the lattice [97, 100, 24].

Table of contents :

Chapter 1 Theoretical Background & State-of-the-art
1.1 Introduction
1.2 Formal Concept Analysis
1.2.1 Implications, Denitions and Association Rules
1.2.2 Conceptual Stability
1.2.3 Many-valued Contexts
1.2.4 Conceptual Scaling
1.2.5 Calculating a concept lattice
1.2.6 Pattern structures
1.3 Information Retrieval
1.4 A survey on Formal Concept Analysis based Information Retrieval approaches
1.4.1 Pre-FCA history – A lattice to model the description and document spaces
1.4.2 FCA meets IR
1.4.3 Enriching the description space through external knowledge sources
1.4.4 Relevance Feedback and Automatic Retrieval
1.4.5 Applications and Systems
1.5 Conclusions
Chapter 2 Contributions on Retrieval
2.1 Prologue
2.2 Introduction
2.3 Background
2.3.1 Information content as a semantic relation measure
2.3.2 The principles of Concept Lattice-based Ranking
2.4 CLAIRE – Concept Lattices for Information Retrieval
2.4.1 Motivation for a new approach for Information Retrieval based on Formal Concept Analysis
2.4.2 The principles of CLAIRE
2.4.3 The implementation of CLAIRE as a Knowledge Discovery in Databases process
2.4.4 Step 1 – Document Classication
2.4.5 Step 2 – Concept Lattice Navigation
2.4.6 Step 3 – Formal Concept Ranking
2.5 Experimental Evaluation
2.5.1 Experimental setting
2.5.2 Evaluation measures
2.5.3 Results
2.5.4 Query analysis
2.6 Conclusions
Chapter 3 Contributions on Indexing
3.1 Introduction
3.2 Background
3.2.1 The vector space model for retrieval (VSM)
3.2.2 Interval Pattern Structures
3.2.3 Relational Concept Analysis (RCA)
3.3 CLAIRE and the vector space model
3.3.1 Querying
3.3.2 Retrieving documents with ip-CLAIRE
3.3.3 Experimental results
3.3.4 Discussion on the capabilities of ip-CLAIRE
3.4 A model for heterogeneous indexing
3.4.1 Inspiring problem – Latent Semantic Indexing
3.4.2 Adapting RCA for pattern structures
3.4.3 Discussion on the heterogeneous pattern structures model
3.5 Conclusions
Chapter 4 Beyond indexing: Biclustering and its applications
4.1 Introduction
4.2 Background
4.2.1 Biclustering
4.2.2 The links between FCA and biclustering
4.2.3 Triadic Concept Analysis
4.3 Biclustering using FCA
4.3.1 Biclustering using partitions
4.3.2 Formalizations
4.3.3 Partition Space and Partition Pattern Structures
4.3.4 Experiments
4.3.5 Discussion on Biclustering and FCA
4.4 FCA and Biclustering: Two additional models
4.4.1 Scaled Partition Pattern Structures
4.4.2 Interval Pattern Structure Approach
4.4.3 Triadic Concept Analysis Approach
4.4.4 Experiments
4.4.5 Discussion
4.5 Applications
4.5.1 Recommender Systems
4.5.2 Functional Dependencies
4.6 Conclusions
Conclusions and perspectives
Bibliography