RDF data model and SPARQL query language

Get Complete Project Material File(s) Now! »

RDF data model and SPARQL query language

We present the basics of the RDF graph data model , how they can be enriched with ontological knowledge using RDF Schema , how RDF en-tailment can be used to make explicit the implicit information RDF graphs encode , and finally, how they can be queried using the widely-considered SPARQL Basic Graph Pattern queries , a.k.a. SPARQL conjunctive queries.

RDF graphs

RDF is a W3C recommendation published first in 2004 [Res, 2004], and revised in 2014 [RDF, 2014a]. It defines abstract model of an RDF graph based on three types of values: IRIs (Internationalized Resource Identifiers), literals and blank nodes.

An IRI is a compact sequence of characters used to identify an abstract or a physical resource. The IRI standard [RFC, b] extends the previous Uniform Resource Identifier standard [RFC, a], which was limited to ASCII characters only. For example, any URL, like https://starwars.com/databank/luke-skywalker is a valid IRI representing the Star Wars character, Luke Skywalker. In the following, we will denote the set of all IRIS by I . To simplify IRIs, RDF allows defining namespaces within an RDF graph. A namespace is an abbreviation of the prefix of a IRI. For example, if we define the names-pace sw as an abbreviation of the IRI prefix https://starwars.com/databank/, then sw:luke-skywalker is equivalent to the full abovementioned IRI. The standard allows specifying a default namespace, for which there is no need to specify a prefix. When the default namespace is understood, a simple IRI such as :luke-skywalker can be under-stood to mean the full-length URI above. For readability, we will shorten it in our further examples into :Luke.

A literal is a string that represents a value. For example, a literal can represent a name, e.g., the string “Luke”, or a number, e.g., “5”. RDF enables to group values into data types like Integer, but in this thesis, for simplicity, we will only consider plain literals without data type; similarly, we will ignore the possible language tags attached to literals, e.g., “@fr”, “@en”. Hence, the literal set is the set of all strings; it will be denoted by L .

A blank node represents an anonymous resource, either a literal or an IRI. We will represent blank nodes using the prefix :. For example, :b is a blank node having the name b. The name of a blank node is a local identifier within the graph at hand. Hence blank nodes can be assimilated to database labeled nulls [Abiteboul et al., 1995, Goasdoue´ et al., 2013]. The set of blank nodes will be denoted by B.

Using the above sets of resources I ; L ; B, we can formalize the RDF triples which compose an RDF graph:
Definition 2.1 (RDF triple). An RDF triple (or triple in short) is a triple of RDF resources (s; p; o) belonging to the product: (I [B) I (L [I [B):
The resource s is the subject of this triple, p is its property and o is its object.
For example, the triple (:Luke; :firstName; “Luke”) states that Luke’s first name is Luke, while the triple (:Luke; :pilotOf; :bs) states that Luke is the pilot of something, represented the blank node :bs.
Definition 2.2 (RDF graph). An RDF graph is a set of RDF triples. Considering an RDF graph G, we denote by Val(G) the set of all values (IRIs, blank nodes and literals) occurring in G, and by Bl(G) its set of blank nodes.
Example 2.1 (Sample RDF graph). Let us consider a first RDF graph G1 stating that Luke uses something represented by the blank node :bd and both Luke and the thing represented by :bd are pilot of a thing represented by :bs.
:Luke :uses :bd
:Luke :pilotOf :bs
:bd :pilotOf :bs
In this example, :bd and :bs may represent distinct entities (e.g., a droid and a star-ship, respectively) or the same entity. The notion of homomorphism between RDF graphs allows characterizing whether an RDF graph simply entails, i.e., is more specific than or subsumed by, another.
Definition 2.3 (RDF graph homomorphism). Let G and G0 be two RDF graphs. A homo-morphism from G to G0 is a substitution ’ of Bl(G) by Val(G), and is the identity for the other G values (IRIs and literals), such that ’(G) G0, where ’(G) = f(’(s); ’(p); ’(o)) j ( s; p; o) 2 Gg.
From now on, we write G0 j=’ G to state that G0 simply entails G, as witnessed by the homomorphism ’ from G to G0.
Example 2.2 (RDF graph homomorphism). Let us consider the graph G1 from Exam-ple 2.1 and the novel graph G2 stating that Luke uses and is a pilot a self-driven spaceship (pilot of itself):

RDF Schema

RDF Schema (RDFS), which is part of the RDF standard [RDF, 2014c], introduces the notion of classes, which are groups of resources; classes are themselves resources. This standard also defines two namespaces: rdf and rdfs. The property rdf:type is used to type a resource, i.e., to express that a resource is an instance of a class. For example, the triple (:Luke; rdf:type; :Person) states that Luke is an instance of the class Person.

RDFS also defines four properties used to state constraints or relationships between classes and properties:
The property rdfs:subClassOf (abbreviated sc) is used to specify that a class is a subclass (specialization) of another;
rdfs:subPropertyOf (abbreviated sp) allows to state that a property is a subproperty (specialization) of another;
The properties rdfs:domain and rdfs:range (abbreviated -d and ,!r respectively) specify that resources appearing as the first (respectively, the second) argument of a property have a certain type.
The IRIs ; sc; sp; -d and ,!r are called the built-in properties. Table 2.1 sums up the short notations we adopt for these properties. We call RDFS properties the built-in properties except . A triple in which the property is an RDFS property is called a schema triple or, more precisely, an RDFS triple.

The RDFS ontology of a graph is the set of its RDFS triples:
Definition 2.4 (RDFS Ontology). The RDFS ontology of an RDF graph G is the set of schema triples contained in G. A graph is called an RDFS ontology if it contains only schema triples.

RDF entailment rules

The semantics of an RDF graph is given by the explicit triples it contains, as well as the implicit triples that can be derived from it using RDF entailment rules.
We assume given a set of variables V disjoint from the RDF resources I [ L [ B.
A triple pattern is a triple of values belonging to: (I [B[V) (I [V) (L [I [B[V):
A basic graph pattern (BGP) is a set of triple patterns. It generalizes the notion of RDF graph by also allowing variables in the subject, property and object positions.
For a BGP P, we note Var(P) the set of variables occurring in P and Bl(P) its set of blank nodes.

Definition 2.5 (BGP to RDF graph homomorphism). A homomorphism from a BGP P to an RDF graph G is a substitution ’ of Bl(P) [ Var(P) by Val(G) and is the identity elsewhere, such that ’(P) G with ’(P) = f(’(s); ’(p); ’(o)) j (s; p; o) 2 Pg.
We write G j=’ P to state that ’ is a homomorphism from P to G.
Note that blank nodes are processed as variables in the definition of homomorphism.
Below, we define a syntax for RDF entailment rules based on basic graph patterns.

Should we translate these rules into first-order logic, using a single ternary predicate to denote triples, we would obtain specific tuple-generating dependencies (TGDs) [Abiteboul et al., 1995] or existential rules, e.g., [Mugnier and Thomazo, 2014]. In particular, these rules allow one to assert the existence of unknown entities, thanks to existentially quantified variables in the head of TGDs / existential rules. However, the set of built-in RDF entailment rules that we consider next do not have this feature: these rules would be logically translated into range-restricted, or datalog, rules (in which variables that occur in a rule head also occur in the rule body and are universally quantified [Abiteboul et al., 1995]). In the next chapters, we mainly work with built-in RDF entailment rules, and extend some re-sults to more general RDF entailment rules in Section 4.6, which justifies the following definitions that go beyond built-in RDF entailment rules.

Definition 2.6 (RDF entailment rule). An RDF entailment rule r is of the form body(r) ! head(r), where body(r) and head(r) are basic graph patterns, containing no blank node, respectively called body and head of the rule r.

The set of built-in RDF entailment rules is defined in [RDF, 2014b]. These rules produce implicit triples by exploiting the RDFS ontology of an RDF graph. In this thesis, we consider the rule set defined in Table 2.2, denoted by RRDFS; in the table, all values except RDFS properties denote variables. For example, for Rule rdfs2, we have: body(rdfs2) = f(p; -d; c); (s; p; o)g head(rdfs2) = f(s; ; c)g where p; c; s and o are variables. This rule specifies that the subject of a triple belongs to the domain of the triple property.

We define how RDF entailment rules directly entail implicit triples from explicit ones.

BGP Queries

A popular fragment of the SPARQL query language for RDF graphs is that of basic graph pattern queries, also called the SPARQL conjunctive queries. A basic graph pattern query is defined as follows:

Definition 2.10 (BGP query). A BGP query (BGPQ) q is of the form q( x¯) P, where P is a BGP also denoted by body(q) and x¯ Var(P). The variables x¯ are called the answer variables of q and the arity of q is jx¯j. The other variables are the called existential variables of q.

Example 2.6 (BGP query). Consider the following BGP query: q(x; y) (x; y; z); (y; sp; :uses); (z; ; :bc)
It asks for the subject and the property of each triple such that the triple property is a subproperty of :uses and the triple object has a type. Actually, there are two ways of in-terpreting the blank node :bc: it can be seen either as an existential variable (here,“the triple object has a type”) or as a specific entity (here, “the triple object has type :bc”).
These two ways of understanding a blank node correspond to two variants of query eval-uation: the standard evaluation treats blank nodes like existential variables (as per Definition 2.5), while the non-standard evaluation treats them like IRIs and literals.

Partially instantiated BGPQs
Partially instantiated BGPQs are a slight generalization of BGPQs, considered in the con-text of reformulation-based query answering [Goasdoue´ et al., 2013, Bursztyn et al., 2015]. Starting from a BGPQ q, partial instantiation replaces some variables with values from I [ L [ B, as specified by a substitution ; the partially instantiated query is denoted q . Observe that when = ;, q coincides with q. Further, due to , and in contrast with standard BGPQs, some answer variables of q can be bound:

Example 2.7 (Partially instantiated BGPQ). Consider the BGPQ asking for who is using which kind of object: q(x; y) (x; :uses; z); (z; ; y); (y; sc; :Object), and the substitution = fx 7!:Lukeg. The partially instantiated BGPQ q is: q (:Luke; y) (:Luke; :uses; z); (z; ; y); (y; sc; :Object).

BGPQ standard evaluation
The evaluation of a (partially instantiated) BGPQ is defined in terms of the homomor-phisms that exist between its BGP body (Definition 2.5) and the interrogated RDF graph, i.e., in terms of the possible matches of the BGP onto the RDF graph explicit triples.

Definition 2.11 (Standard BGPQ evaluation). The answer set to a partially instantiated
BGPQ q on an RDF graph G is: q (G) = f’( x¯ ) j G j=’ body(q) g
where x¯ and body(q) denote the result of replacing the variables in x¯ and body(q), respectively, according to .
If x¯ = ;, q is a Boolean query and the answer to q is false when q(G) = ; and true when q(G) = fhig.
We remark that the evaluation of BGPQs allows blank nodes from the RDF graph to appear in the answers. Also, with the standard evaluation, blank nodes occurring in BGPQs have the same behaviour as existential variables.

Example 2.8 ( Standard BGPQ evaluation). Consider the query q from Example 2.6, and the graph Gex in Figure 2.1. The evaluation of q on the graph Gex yields the answer set:
q(Gex) = fh:Luke; :pilotOfi; h:Rey; :usesWeaponig
This evaluation is derived from ’1 and ’2, two homomorphisms from body(q) to Gex, defined as follows:
’1 = fx 7!:Luke; y 7!:pilotOf; z 7!:spaceship1g
’2 = fx 7!:Rey; y 7!:usesWeapon; z 7!:bls; :bc 7!:LightSaberg
Note that in the above standard BGPQ evaluation example, ’2 maps the blank node :bc to another value. We may want to avoid this behaviour of standard query evaluation, and consider the blank node :bc in the query q as a value from the graph Gex, i.e., a “constant” and not an existential variable; there is actually no way to write a simple query asking for the entities of type :bc in Gex. Considering blank nodes in queries as constants is enabled by non-standard query evaluation.

Non-standard query evaluation
Non-standard query evaluation relies on non-standard BGP to RDF graph homomor-phisms:

Definition 2.12 (Non-standard BGP to RDF graph homomorphism). A non-standard ho-momorphism from a BGP P to an RDF graph G is a substitution ’ of Var(P) by Val(G) and is the identity elsewhere, such that for any triple (s; p; o) 2 P, the triple (’(s); ’(p); ’(o)) is in G.
Based on this, we define:

Definition 2.13 (Non-standard evaluation). Let q be a partially instantiated BGPQ. The non-standard evaluation to q on an RDF graph G is: q (G) = f’( x¯ ) j G j=’ body(q) g where ’ is a non-standard homomorphism from body(q) to G. z}|{
We remark that for a BGPQ q that contains no blank node, q(G) is equal to q(G). We now consider the following example, where the query contains a blank node:

Example 2.9 (Non-standard evaluation). Consider again the BGPQ from the Example 2.6.
Its non-standard evaluation on Gex contains only one tuple: h:Luke; :pilotOfi.
In this thesis, we adopt non-standard BGPQ evaluation by default, hence the blank node in queries should always be considered as constants and not as existential variables. It is always possible to transform a query q used with standard evaluation into a query q0 used with non-standard evaluation in such a way that their evaluation coincide, z}|{ i.e., q0(G) is equal to q(G) for any G. The transformation simply replaces each blank node in q with fresh existential variable. For simplicity, we will not use the notation z}|{ anymore in the sequel of the text, i.e., q(G) will denote the non-standard evaluation of q on G. We will also use the expression “query evaluation” instead of “non-standard evaluation”, except when we want to emphasize the evaluation behaviour w.r.t. blank nodes in the query.

Minimisation of a union of BGPQs
We also consider unions of BGPQs, denoted by UBGPQs, in which all the BGPQs have the same arity. We extend this notion to unions of partially instantiated BGPQs, in which the BGPQs are associated with a (possibly empty) substitution. The evaluation of a UBGPQ on a graph G is the union of the evaluations of its BGPQs on G.
A union of BGPQs may contain some redundant BGPQs, in the sense that some BGPQ may be contained in another:

Definition 2.14 (BGPQ Containment). Given two BGPQs q1 and q2, we say that q1 is contained in q2, when for each RDF graph G, q1(G) q2(G).
So, in a union of BGPQs, removing a BGPQ contained in another BGPQ of the union does not change the evaluation of the union on any RDF graph. We say that a UBGPQ is minimal, when no such removal is possible:

Definition 2.15 (Minimal UBGPQs). A union of BGP queries is minimal, when none of its BGPQs is contained in another.
The notion of BGP to RDF (non-standard) homomorphism is naturally extended to that of a BGP to BGP (non-standard) homomorphism, by simply allowing variables in its range. Then a homomorphism h from a BGPQ q1( x¯) P1 to a BGPQ q2(¯y) P2 with jx¯j = jy¯j is a homomorphism h from P1 to P2 such that h( x¯) = y¯. This notion is in turn extended to partially instantiated BGPQs q01 and q02, respectively obtained from q1 and q2 by substitutions 1 and 2: a homomorphism h from q01 to q02 is a homomorphism from 1(P1) to 2(P2) such that h( 1( x¯)) = 2(¯y). It is well-known that q1 is contained in q2 if and only if there is a homomorphism h from q2 and q1, and this result extends to partially instantiated BGPQs.
Hence, an algorithm to produce minimal UBGPQs can be built on an algorithm check-ing the existence of a BGPQ to BGPQ homomorphism as follows: while there are q1 and q2 two distinct queries in the union such that q1 is contained in q2, remove q1 from the union.

Query answering

The answers of a BGPQ on a graph w.r.t. a set of RDF entailment rule is defined as the evaluation (Definition 2.11) of the query on the saturation of the graph by the RDF entailment rules at hand:
Definition 2.16 (BGPQ answers). The set of answers of a BGPQ q on an RDF graph G w.r.t. a RDF entailment rule set R, is denoted q(G; R) and is defined by: q(G; R) = q(GR)
With general RDF entailment rules, the problem of query answering is not decidable [Baget et al., 2011]. In contrast, if one uses the rule set RRDFS shown in Table 2.2, the query answering problem is decidable, since the saturation of a graph is always finite.
There are two main techniques for solving the query answering problem; one based on graph saturation and another based on query reformulation. Of course, these techniques only work under certain conditions on the query, the rules and the graph.
The saturation-based query answering technique is based on Definition 2.16. It assumes that the RDF graph is stored in a database system, capable of evaluating BGPQs.

The query answering process is then is divided into two steps :
1. Graph saturation. The saturation GR of the RDF graph G w.r.t. the RDF entail-ment rules R is computed and stored in the database system.
2. Query evaluation. Every incoming query q is evaluated on the saturation of the graph by the database system, i.e., the system computes q(GR).
The reformulation-based query answering technique is also composed of two steps:
1. Query reformulation. Every incoming query q is reformulated into a reformulated query Q, typically a union of BGPQs, using the ontology O of G and the rules R. Hence, the reformulation step is performed without accessing data triples of G,
2. Reformulation evaluation The evaluation of Q on the graph G is performed by a database system. It is guaranteed to return the answers of q on G w.r.t. R.

A reformulation Q of a query q w.r.t. an ontology O and a set of RDF entailment rules R is such that for any RDF graph G with ontology O, the reformulation-based technique for query answering is sound, i.e., Q(G) q(GR), and complete, i.e., q(GR) Q(G).

Data integration

The aim of data integration [Doan et al., 2012, Lenzerini, 2002, Abiteboul et al., 2011] is to provide a uniform interface to a multitude of data sources. This interface is a global schema (or mediated schema), i.e., a set of relations, on which a user poses queries. The relations of the global schema are connected by semantic mappings with the data sources. In particular, we will consider the case when the mappings are specified through view definitions [Halevy, 2000, Halevy, 2001].

There are two main approaches for answering queries on the global schema. The first one, called the warehousing approach, materializes the global schema data using data provided by the sources. Hence it reduces the problem of query answering to standard database query evaluation on the materialization. In contrast, in the mediation-based approach, data remains in the sources; the query on the global schema is rewritten into (sub-)queries on the sources, whose results may be processed and assembled within a mediator query engine. In this thesis, we mostly investigate the mediator approach.

Below, we first recall in Section 2.2.1 the theory of data integration, while Sec-tion 2.2.2 presents the Global As View integration framework and Section 2.2.3, the Local As View framework. Section 2.2.4 introduces the natural generalization of these frame-works: the Global Local As View integration.

Table of contents :

1 Introduction
2 Preliminaries
2.1 RDF data model and SPARQL query language
2.1.1 RDF graphs
2.1.2 RDF Schema
2.1.3 RDF entailment rules
2.1.4 BGP Queries
2.1.5 Query answering
2.2 Data integration
2.2.1 Theory of data integration
2.2.2 Global As View data integration
2.2.3 Local As View data integration
2.2.4 Global Local As View data integration
2.3 Summary
3 RDF query answering
3.1 Motivation and state of the art
3.1.1 RDF representations
3.1.2 Query answering techniques
3.1.3 RDF storage layouts
3.2 Complete RDFS query reformulation
3.2.1 Preliminaries: RDFS ontology and RRDFS rule set properties
3.2.2 Overview of the query reformulation technique
3.2.3 Reformulation rules associated with Rc
3.2.4 Reformulation algorithm associated with Rc
3.2.5 Reformulation with Ra
3.2.6 Reformulation with Rc [ Ra
3.2.7 Experiments
3.2.8 Reformulation for Ra-compliant graphs
3.3 RDF storage layouts for ecient query answering
3.3.1 Preliminaries
3.3.2 BGPQ answering on the T layout
3.3.3 BGPQ answering on the CP layout
3.3.4 BGPQ answering based on the TCP layout
3.3.5 Summary-based query pruning
3.3.6 Experimental evaluation
3.4 Summary
4 RDF integration of heterogeneous data sources
4.1 Motivation and state of the art
4.1.1 Mediator data models and query languages
4.1.2 Mapping Language
4.1.3 Contributions
4.2 RDF Integration Systems
4.2.1 RDF Integration System (RIS) Definition
4.2.2 Query answering problem
4.3 Query answering techniques on RDF Integration Systems
4.3.1 Materialization-based query answering strategies: MAT and MAT-CA
4.3.2 Rewriting-based query answering strategies: REW-CA, REW-C and REW
4.3.3 Rewriting fully-reformulated queries using LAV mappings: REWCA
4.3.4 Rewriting partially-reformulated queries using saturated LAV mappings: REW-C
4.3.5 Rewriting queries using saturated mappings and ontology LAV mappings: REW
4.3.6 Remarks on related techniques
4.3.7 Landscape of query answering strategies
4.4 A Platform for RDF Integration Systems: Obi-Wan
4.4.1 Query answering in Obi-Wan
4.4.2 Query rewriting and mediated plan optimizations
4.5 Experimental evaluation
4.5.1 Experimental scenarios
4.5.2 Query answering performance
4.6 Extending the framework to more general rules
4.6.1 Restricted RIS
4.6.2 Correctness of the Method
4.7 Summary
5 Conclusion and perspectives
Bibliography