Efficient query answering in the presence of RDFS constraints 


RDF Schema and entailment

A valuable feature of RDF is RDF Schema (RDFS), which allows enhancing the descriptions in RDF graphs. RDFS triples declare semantic constraints between the classes and the properties used in those graphs.
Figure 2.1 (bottom) shows the allowed constraints and how to express them; domain and range denote respectively the first and second attribute of every property. The RDFS constraints (Figure 2.1) are interpreted under the open-world assumption (OWA) [3].
More specifically, when working with the RDF data model, if the triples hasFriend rdfs:domain Person and Anne hasFriend Marie hold in the graph, then so does the triple Anne rdf:type Person. The latter is due to the rdfs:domain constraint in Figure 2.1.
RDF entailment. Implicit triples are an important RDF feature: they are considered part of the RDF graph even though they are not explicitly present in it, e.g., Anne rdf:type Person above. W3C names RDF entailment the mechanism through which, based on a set of explicit triples and some entailment rules, implicit RDF triples are derived. We denote by ⊢iRDF immediate entailment, i.e., the process of deriving new triples through a single application of an entailment rule. More generally, a triple s p o is entailed by a graph G, denoted G ⊢RDF s p o, if and only if there is a sequence of applications of immediate entailment rules that leads from G to s p o (where at each step of the entailment sequence, the triples previously entailed are also taken into account).
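To make the rdfs:domain rule concrete, the following is a minimal Python sketch (all names are illustrative; no RDF library is used) performing a single immediate-entailment step over a graph of (subject, property, object) triples:

```python
# Hypothetical sketch of one application of the rdfs:domain entailment rule.
# Triples are plain (subject, property, object) string tuples.
def apply_domain_rule(graph):
    """Derive s rdf:type C for every triple s p o where p rdfs:domain C."""
    derived = set()
    for (p, d, c) in graph:
        if d == "rdfs:domain":              # p has domain c
            for (s, p2, o) in graph:
                if p2 == p:                 # s uses property p
                    derived.add((s, "rdf:type", c))
    return derived - graph                  # only the new (implicit) triples

G = {
    ("hasFriend", "rdfs:domain", "Person"),
    ("Anne", "hasFriend", "Marie"),
}
print(apply_domain_rule(G))  # {('Anne', 'rdf:type', 'Person')}
```

A single application of the rule derives exactly the implicit triple discussed above.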
Example 2 (RDFS). Assume that the RDF graph G in Example 1 is extended with the following constraints.

Book rdfs:subClassOf Publication

The resulting graph is depicted in Figure 2.3. Constraints are represented by blue edges, while implicit triples are depicted by gray dashed-line edges.
Saturation. The immediate entailment rules allow defining the finite saturation (a.k.a. closure) of an RDF graph G, which is the RDF graph G∞ defined as the fixed-point obtained by repeatedly applying ⊢iRDF rules on G.
The saturation of an RDF graph is unique (up to blank node renaming), and does not contain implicit triples (they have all been made explicit by saturation). An obvious connection holds between the triples entailed by a graph G and its saturation: G ⊢RDF s p o if and only if s p o ∈ G∞.
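The fixed-point computation can be sketched as a small Python loop over a handful of illustrative RDFS rules (a toy sketch, not the full set of W3C entailment rules; all names are illustrative):

```python
# Toy saturation: repeatedly apply immediate entailment until fixpoint.
RDF_TYPE, SUBCLASS, SUBPROP, DOMAIN, RANGE = (
    "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf",
    "rdfs:domain", "rdfs:range")

def immediate_entailment(g):
    """One round of ⊢iRDF over four instance-level RDFS rules."""
    new = set()
    for (s, p, o) in g:
        for (s2, p2, o2) in g:
            if p == SUBCLASS and p2 == RDF_TYPE and o2 == s:
                new.add((s2, RDF_TYPE, o))   # instances of subclass typed by superclass
            if p == SUBPROP and p2 == s:
                new.add((s2, o, o2))         # subproperty triples propagated
            if p == DOMAIN and p2 == s:
                new.add((s2, RDF_TYPE, o))   # subjects typed by the domain
            if p == RANGE and p2 == s:
                new.add((o2, RDF_TYPE, o))   # objects typed by the range
    return new

def saturate(g):
    g = set(g)
    while True:
        new = immediate_entailment(g) - g
        if not new:
            return g                         # fixpoint reached: this is G∞
        g |= new

G = {("doi1", "rdf:type", "Book"),
     ("Book", "rdfs:subClassOf", "Publication")}
assert ("doi1", "rdf:type", "Publication") in saturate(G)
```

Since each round only adds triples built from the finite vocabulary of G, the loop terminates, matching the finiteness of G∞ stated above.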
RDF entailment is part of the RDF standard itself; in particular, the answers to a query posed on G must take into account all triples in G∞, since the semantics of an RDF graph is its saturation [106].

BGP Queries

We consider the well-known subset of SPARQL consisting of (unions of) basic graph pattern (BGP) queries, modeling the SPARQL conjunctive queries. Subject of several recent works [48, 93, 49, 87], BGP queries are the most widely used subset of SPARQL queries in real-world applications [87]. A BGP is a set of triple patterns, or triples/atoms in short. Each triple has a subject, property and object, some of which can be variables.
Notations. In the following we use the conjunctive query notation q(x̄) :- t1, …, tn, where {t1, …, tn} is a BGP; the query head variables x̄ are called distinguished variables, and are a subset of the variables occurring in t1, …, tn; for boolean queries x̄ is empty. The head of q is q(x̄), and the body of q is t1, …, tn. We use x, y, z, etc. to denote variables in queries. We denote by VarBl(q) the set of variables and blank nodes occurring in the query q.
Query evaluation. Given a query q and an RDF graph G, the evaluation of q against G is:

q(G) = {σ(x̄) | σ : VarBl(q) → Val(G) is a total assignment such that σ(t1) ∈ G, σ(t2) ∈ G, …, σ(tn) ∈ G}

where we denote by σ(t) the result of replacing every occurrence of a variable or blank node e ∈ VarBl(q) in the triple t by the value σ(e) ∈ Val(G).
Note that evaluation treats the blank nodes in a query exactly as it treats non-distinguished variables [4]. Thus, in the sequel, without loss of generality, we consider queries where all blank nodes have been replaced by (new) distinct non-distinguished variables.
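The definition above can be sketched as a naive backtracking evaluator that enumerates assignments of query variables to graph values and keeps those mapping every triple pattern into G (an illustrative toy, not how a SPARQL engine is implemented; variables are marked with a leading '?'):

```python
# Naive BGP evaluation: find all assignments σ such that σ(t) ∈ G for
# every triple pattern t, then project on the distinguished variables.
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def evaluate(bgp, head, graph):
    results = set()

    def match(patterns, binding):
        if not patterns:
            results.add(tuple(binding[v] for v in head))
            return
        s, p, o = patterns[0]
        for (gs, gp, go) in graph:           # try to map the pattern onto each triple
            b = dict(binding)
            ok = True
            for q_term, g_term in ((s, gs), (p, gp), (o, go)):
                if is_var(q_term):
                    if b.setdefault(q_term, g_term) != g_term:
                        ok = False           # conflicts with an earlier binding
                elif q_term != g_term:
                    ok = False               # constant mismatch
            if ok:
                match(patterns[1:], b)

    match(bgp, {})
    return results

G = {("doi1", "hasAuthor", "_:b1"), ("_:b1", "hasName", "J.L. Borges")}
q = [("?x1", "hasAuthor", "?x2"), ("?x2", "hasName", "?x3")]
print(evaluate(q, ["?x3"], G))  # {('J.L. Borges',)}
```

Blank nodes in a query would simply be renamed to fresh '?'-variables before evaluation, exactly as the remark above allows.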
Query answering. The evaluation of q against G uses only G's explicit triples, thus may lead to an incomplete answer set. The (complete) answer set of q against G is obtained by the evaluation of q against G∞, denoted q(G∞).
Example 3 (Query answering). The following query asks for the names of authors of books somehow connected to the literal 1949:

q(x3) :- x1 hasAuthor x2, x2 hasName x3, x1 x4 "1949"
Its answer against the graph in Figure 2.3 is q(G∞) = {⟨"J. L. Borges"⟩}. The answer results from G ⊢RDF doi1 hasAuthor _:b1 and the assignment σ = {x1 → doi1, x2 → _:b1, x3 → "J. L. Borges", x4 → publishedIn}. Observe that evaluating q directly against G leads to the empty answer, which is obviously incomplete.

Query answering techniques

Answering queries over data in the presence of deductive constraints requires a reasoning step in order to compute complete query answers. Two main query answering techniques exist:
Saturation-based query answering. Compiles the constraints into the database by making all implicit data explicit. This method is straightforward and easy to implement. Its disadvantages are that dataset saturation requires computation time and storage space for all the entailed triples; moreover, the saturation must be recomputed upon every update. Incremental algorithms for saturation maintenance have been proposed in previous work [48]. However, the recursive nature of entailment makes this process costly (in time and space) and this method unsuitable for datasets with a high rate of updates. Further, for some ontology languages saturation may be infinite (see Section 2.2).
Table 2.1, extracted from the cited work [48], presents the characteristics of well-known datasets and shows that saturation adds between 10% and 41% to the dataset size, while multiset-based saturation (required by the incremental saturation maintenance technique presented by the authors) increases the size by between 116% and 227%.
Reformulation-based query answering. Compiles the constraints into a modified query which, evaluated over the explicit data only, computes all the answers due to explicit and/or implicit data. The main advantage of this method is that it is robust to updates: there is no need to (re)compute the closure of the dataset. On the other hand, reformulated queries are in general complex, hence inefficiently handled by modern query processing engines.

The database fragment of RDF

The database (DB) fragment of RDF [48] is, to the best of our knowledge, the most expressive RDF fragment for which both saturation- and reformulation-based RDF query answering have been defined and practically experimented with. This DB fragment is defined by:
Restricting RDF entailment to the RDF Schema constraints only (Figure 2.1), a.k.a. RDFS entailment. Consequently, the DB fragment focuses only on the application domain knowledge, a.k.a. ontological knowledge, and not on the RDF meta-model knowledge, which mainly begets high-level typing of the subject, property and object values found in triples with abstract RDF built-in classes, e.g., rdf:Resource, rdfs:Class, etc.
Not restricting RDF graphs in any way. In other words, any triple allowed by the RDF specification is also allowed in the DB fragment.
In this DB fragment of RDF, saturation-based query answering amounts to precomputing the saturation of a database db using its RDFS constraints in a forward-chaining fashion, so that the evaluation of every incoming query q against the saturation yields the correct answer set [48]: q(db∞) = q(Saturate(db)). This technique follows directly from the definitions in Section 2.1 and Section 2.1.2, and the W3C's RDF and SPARQL recommendations.
Reformulation-based query answering, in contrast, leaves the database db untouched and reformulates every incoming query q using the RDFS constraints in a backward-chaining fashion, Reformulate(q, db) = qref, so that the relational evaluation of this reformulation against the (non-saturated) database yields the correct answer set [48]: q(db∞) = qref(db). The Reformulate algorithm, introduced in [49] and extended in [48], exhaustively applies a set of 13 reformulation rules. Starting from the incoming BGP query q to answer against db, the algorithm produces a union of BGP queries retrieving the correct answer set from the database, even if the latter is not saturated.
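As a rough illustration only (not the actual 13-rule Reformulate algorithm of [48, 49]), the backward-chaining idea can be sketched for two RDFS constraint kinds: an rdf:type atom is replaced either by a subclass atom, or by an atom using a property whose domain is the sought class, until a fixpoint yields a union of BGP queries:

```python
# Simplified backward-chaining reformulation using rdfs:subClassOf and
# rdfs:domain only; queries are tuples of (s, p, o) triple patterns.
def reformulate(bgp, constraints):
    queries = {tuple(bgp)}
    while True:
        new = set()
        for q in queries:
            for i, (s, p, o) in enumerate(q):
                if p != "rdf:type":
                    continue
                for (c1, r, c2) in constraints:
                    if r == "rdfs:subClassOf" and c2 == o:
                        # ?x type C  <=  ?x type C' for C' subclass of C
                        new.add(q[:i] + ((s, "rdf:type", c1),) + q[i+1:])
                    if r == "rdfs:domain" and c2 == o:
                        # ?x type C  <=  ?x p ?fresh for p with domain C
                        fresh = f"?v{len(q)}_{i}"   # fresh non-distinguished variable
                        new.add(q[:i] + ((s, c1, fresh),) + q[i+1:])
        if new <= queries:
            return queries                          # fixpoint: union of BGP queries
        queries |= new

C = {("Book", "rdfs:subClassOf", "Publication"),
     ("hasAuthor", "rdfs:domain", "Book")}
union = reformulate([("?x", "rdf:type", "Publication")], C)
```

Evaluating this union over the non-saturated database retrieves Publications stored as such, those stored as Books, and those merely having an author, mirroring how qref(db) recovers the complete answer set.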


DL-LiteR

As commonly known, a Description Logic knowledge base (KB) K consists of a TBox T (ontology, or axiom set) and an ABox A (database, or fact set), denoted K = ⟨T, A⟩, with T expressing constraints on A.
Most popular Description Logic dialects [13], and in particular DL-LiteR [32], build T and A from a set NC of concept names (unary predicates), a set NR of role names (binary predicates), and a set NI of individuals (constants). The ABox consists of a finite number of concept assertions of the form A(a) with A ∈ NC and a ∈ NI, and of role assertions of the form R(a, b) with R ∈ NR and a, b ∈ NI. The TBox is a set of axioms, whose expressive power is defined by the ontology language. The DL-LiteR description logic [32], which is the first-order logic foundation of the W3C's OWL2 QL standard for managing semantic-rich Web data, is a significant extension of the subset of RDF (comprising RDF Schema) which can be translated into description logics, a.k.a. the DL fragment of RDF; DL-LiteR is also a fragment of Datalog [30].
Given a role R, its inverse, denoted R⁻, is the set {(b, a) | R(a, b) ∈ A}. We denote by N±R the set of roles made of all role names together with their inverses: N±R = NR ∪ {r⁻ | r ∈ NR}. For instance, supervisedBy and supervisedBy⁻, whose meaning is supervises, are in N±R. A DL-LiteR TBox constraint is either:
(i) a concept inclusion of the form C1 ⊑ C2 or C1 ⊑ ¬C2, where each of C1, C2 is either a concept from NC, or ∃R for some R ∈ N±R, and ¬C2 is the complement of C2. Here, ∃R denotes the set of constants occurring in the first position in role R (i.e., the projection on the first attribute of R). For instance, ∃supervisedBy is the set of those supervised by somebody, while ∃supervisedBy⁻ is the set of all supervisors (i.e., the projection on the first attribute of supervisedBy⁻, hence on the second of supervisedBy);
(ii) a role inclusion of the form R1 ⊑ R2 or R1 ⊑ ¬R2, with R1, R2 ∈ N±R.
Observe that the left-hand sides of the constraints are negation-free; in DL-LiteR, negation can only appear in the right-hand side of a constraint. Constraints featuring negation allow expressing a particular form of integrity constraints: disjointness or exclusion constraints. The next example illustrates DL-LiteR KBs.
Example 5 (DL-LiteR KB). Consider the DL-LiteR TBox T in Table 2.2 expressing constraints on the Researcher and PhDStudent concepts, and on the worksWith and supervisedBy roles. It states that PhD students are researchers (T1), researchers work with researchers (T2)(T3), working with someone is a symmetric relation (T4), being supervised by someone implies working with her/him (T5), only PhD students are supervised (T6), and they cannot supervise someone (T7).
Now consider the ABox A below, for the same concepts and roles:
(A1) worksWith(Ioana, Francois)
(A2) supervisedBy(Damian, Ioana)
(A3) supervisedBy(Damian, Francois)
It states that Ioana works with François (A1), and that Damian is supervised by both Ioana (A2) and François (A3).
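Assuming the Table 2.2 axioms are as paraphrased above, their effect on this ABox can be sketched as a small forward-chaining loop (a toy sketch; DL-LiteR systems typically answer queries by reformulation rather than by materializing consequences):

```python
# Toy forward chaining over the example KB. Facts are tuples:
# ("Concept", a) for concept assertions, ("role", a, b) for role assertions.
def saturate_kb(abox):
    facts = set(abox)
    while True:
        new = set()
        for f in facts:
            if f[0] == "supervisedBy":
                _, a, b = f
                new.add(("PhDStudent", a))     # T6: only PhD students are supervised
                new.add(("worksWith", a, b))   # T5: supervision implies working with
            if f[0] == "worksWith":
                _, a, b = f
                new.add(("worksWith", b, a))   # T4: worksWith is symmetric
                new.add(("Researcher", a))     # T2/T3: co-workers are researchers
                new.add(("Researcher", b))
            if f[0] == "PhDStudent":
                new.add(("Researcher", f[1]))  # T1: PhD students are researchers
        if new <= facts:
            return facts                       # fixpoint
        facts |= new

A = {("worksWith", "Ioana", "Francois"),
     ("supervisedBy", "Damian", "Ioana"),
     ("supervisedBy", "Damian", "Francois")}
K = saturate_kb(A)
assert ("PhDStudent", "Damian") in K           # entailed via T6
assert ("worksWith", "Ioana", "Damian") in K   # entailed via T5 then T4
```

Note that the negative axiom (T7) is not an inference rule: it would instead be checked against the derived facts to detect inconsistency, and here no one is supervised by Damian, so the KB is consistent.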
The semantics of inclusion constraints is defined, as customary, in terms of their FOL interpretations. Tables 2.3 and 2.4 provide the FOL and relational notations expressing these constraints equivalently.

Table of contents :

1 Introduction 
1.1 Big Data
1.2 Semantic Web
1.3 Motivations and studied problem
1.4 Contributions and outline
2 Preliminaries 
2.1 RDF
2.1.1 RDF Schema and entailment
2.1.2 BGP Queries
2.1.3 Query answering techniques
2.1.4 The database fragment of RDF
2.2 DL-LiteR
2.2.1 Queries
2.2.2 Query answering techniques
3 Efficient query answering in the presence of RDFS constraints 
3.1 Motivation
3.2 Efficient query answering
3.2.1 Cost model
3.2.2 Exhaustive query cover algorithm (ECov)
3.2.3 Greedy query cover algorithm (GCov)
3.3 Experimental evaluation
3.3.1 Settings
3.3.2 Optimized reformulation
3.3.3 Comparison with saturation
3.3.4 Experiment conclusion
3.4 Related work
3.5 Conclusion
4 Efficient FOL reducible query answering
4.1 Evaluating reformulated subqueries can be (very) hard
4.2 Optimization framework
4.3 Cover-based query answering in DL-LiteR
4.4 Cover-based query optimization in DL-LiteR
4.4.1 Safe covers optimization space
4.4.2 Generalized covers optimization space
4.4.3 Cost-based cover search algorithms
4.5 Experimental evaluation
4.5.1 Experimental settings
4.5.2 Our cost estimation function
4.5.3 Search space and EDL running time
4.5.4 Evaluation time of reformulated queries
4.5.5 Time-limited GDL
4.5.6 Experiment conclusions
4.6 Related work and conclusion
4.6.1 Relationship with prior work on reformulation-based query answering
4.6.2 Relationship with prior work on multi-query optimization
5 Conclusion and perspectives 
5.1 Summary
5.2 Ongoing Work: Towards Scalable Hybrid Stores
5.2.1 Motivating Example and Challenges
5.2.2 Architecture and state of advancement
5.2.3 Related Work on Hybrid Stores
5.3 Perspectives
