RDF integration of heterogeneous data sources


RDF data model and SPARQL query language

We present the basics of the RDF graph data model (Section 2.1.1), how RDF graphs can be enriched with ontological knowledge using RDF Schema (Section 2.1.2), how RDF entailment can be used to make explicit the implicit information that RDF graphs encode (Section 2.1.3), and finally how they can be queried using the widely considered SPARQL Basic Graph Pattern queries, a.k.a. SPARQL conjunctive queries (Section 2.1.4).

RDF graphs

RDF is a W3C recommendation first published in 2004 [Res, 2004] and revised in 2014 [RDF, 2014a]. It defines the abstract model of an RDF graph based on three types of values: IRIs (Internationalized Resource Identifiers), literals and blank nodes.
An IRI is a compact sequence of characters used to identify an abstract or a physical resource. The IRI standard [RFC, b] extends the previous Uniform Resource Identifier standard [RFC, a], which was limited to ASCII characters only. For example, any URL, such as https://starwars.com/databank/luke-skywalker, is a valid IRI; this one represents the Star Wars character Luke Skywalker. In the following, we denote the set of all IRIs by $I$. To simplify IRIs, RDF allows defining namespaces within an RDF graph. A namespace is an abbreviation of the prefix of an IRI. For example, if we define the namespace sw as an abbreviation of the IRI prefix https://starwars.com/databank/, then sw:luke-skywalker is equivalent to the full IRI above. The standard allows specifying a default namespace, for which there is no need to specify a prefix. When the default namespace is understood, a simple IRI such as :luke-skywalker denotes the full-length IRI above. For readability, we shorten it to :Luke in our further examples.
A literal is a string that represents a value. For example, a literal can represent a name, e.g., the string “Luke”, or a number, e.g., “5”. RDF allows values to be grouped into data types such as Integer, but in this thesis, for simplicity, we only consider plain literals without data type; similarly, we ignore the language tags that may be attached to literals, e.g., “@fr”, “@en”. Hence, the literal set is the set of all strings; it is denoted by $L$.
A blank node represents an anonymous resource, either a literal or an IRI. We represent blank nodes using the prefix _:. For example, _:b is a blank node named b. The name of a blank node is a local identifier within the graph at hand. Hence, blank nodes can be assimilated to the labeled nulls of databases [Abiteboul et al., 1995, Goasdoué et al., 2013]. The set of blank nodes is denoted by $B$.
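The namespace mechanism described above can be sketched in a few lines of Python. This is only an illustration: the prefix table and the function name `expand` are assumptions made for the example, and the default namespace is assumed to be the same prefix as sw, as the text suggests.

```python
# Hypothetical prefix table. The sw prefix comes from the text; binding the
# default (empty) prefix to the same IRI prefix is an assumption.
PREFIXES = {
    "sw": "https://starwars.com/databank/",
    "": "https://starwars.com/databank/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'sw:luke-skywalker' into a full IRI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {name!r}")
    return prefixes[prefix] + local


print(expand("sw:luke-skywalker"))
print(expand(":luke-skywalker"))
```

Both calls print the same full IRI, which is the sense in which the two abbreviated forms are equivalent.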
Using the above sets of resources $I$, $L$ and $B$, we can formalize the RDF triples which compose an RDF graph:
Definition 2.1 (RDF triple). An RDF triple (or triple in short) is a triple of RDF resources $(s, p, o)$ belonging to the product $(I \cup B) \times I \times (L \cup I \cup B)$.
The resource s is the subject of this triple, p is its property and o is its object.
For example, the triple (:Luke, :firstName, “Luke”) states that Luke’s first name is Luke, while the triple (:Luke, :pilotOf, _:bs) states that Luke is the pilot of something, represented by the blank node _:bs.
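As a minimal sketch of Definition 2.1, the following Python snippet checks whether a triple is well formed. The string encoding is an assumption made for the example: blank nodes start with `_:`, literals are wrapped in double quotes, and every other string is treated as an IRI. The names `kind` and `is_rdf_triple` are hypothetical.

```python
def kind(value):
    """Classify a value under the assumed string encoding."""
    if value.startswith("_:"):
        return "blank"
    if len(value) >= 2 and value[0] == value[-1] == '"':
        return "literal"
    return "iri"


def is_rdf_triple(triple):
    """Check membership in (I ∪ B) × I × (L ∪ I ∪ B)."""
    s, p, o = triple
    # The object may be of any kind, so only subject and property are checked.
    return kind(s) in {"iri", "blank"} and kind(p) == "iri"


print(is_rdf_triple((":Luke", ":firstName", '"Luke"')))
print(is_rdf_triple(('"Luke"', ":firstName", ":Luke")))
```

The second triple is rejected because a literal cannot appear in the subject position.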
Definition 2.2 (RDF graph). An RDF graph is a set of RDF triples. Considering an RDF graph G, we denote by Val(G) the set of all values (IRIs, blank nodes and literals) occurring in G, and by Bl(G) its set of blank nodes.
Example 2.1 (Sample RDF graph). Let us consider a first RDF graph G1 stating that Luke uses something represented by the blank node _:bd, and that both Luke and the thing represented by _:bd are pilots of a thing represented by _:bs.
:Luke :uses _:bd
:Luke :pilotOf _:bs
_:bd :pilotOf _:bs
In this example, _:bd and _:bs may represent distinct entities (e.g., a droid and a starship, respectively) or the same entity. The notion of homomorphism between RDF graphs allows characterizing whether an RDF graph simply entails, i.e., is more specific than or subsumed by, another.
Definition 2.3 (RDF graph homomorphism). Let $G$ and $G'$ be two RDF graphs. A homomorphism from $G$ to $G'$ is a substitution $\varphi$ of $\mathrm{Bl}(G)$ by $\mathrm{Val}(G')$, which is the identity for the other values of $G$ (IRIs and literals), such that $\varphi(G) \subseteq G'$, where $\varphi(G) = \{(\varphi(s), \varphi(p), \varphi(o)) \mid (s, p, o) \in G\}$.
From now on, we write $G' \models^{\varphi} G$ to state that $G'$ simply entails $G$, as witnessed by the homomorphism $\varphi$ from $G$ to $G'$.
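To make the definition concrete, here is a brute-force Python sketch that searches for a homomorphism between two small graphs. It is an illustration under assumptions, not an efficient algorithm: graphs are sets of string triples, blank nodes are the strings starting with `_:`, and the second graph below is a hypothetical one written for this example (a single blank node that is used by Luke, piloted by Luke, and pilot of itself).

```python
from itertools import product


def blank_nodes(graph):
    return sorted({v for t in graph for v in t if v.startswith("_:")})


def values(graph):
    return sorted({v for t in graph for v in t})


def find_homomorphism(g, g_prime):
    """Return a substitution phi with phi(g) ⊆ g_prime, or None if none exists."""
    blanks = blank_nodes(g)
    # Try every assignment of the blank nodes of g to values of g_prime.
    for image in product(values(g_prime), repeat=len(blanks)):
        phi = dict(zip(blanks, image))
        mapped = {tuple(phi.get(v, v) for v in t) for t in g}
        if mapped <= g_prime:
            return phi
    return None


# The graph of Example 2.1.
G1 = {
    (":Luke", ":uses", "_:bd"),
    (":Luke", ":pilotOf", "_:bs"),
    ("_:bd", ":pilotOf", "_:bs"),
}
# Hypothetical second graph (an assumption for this sketch).
G2 = {
    (":Luke", ":uses", "_:b"),
    (":Luke", ":pilotOf", "_:b"),
    ("_:b", ":pilotOf", "_:b"),
}

print(find_homomorphism(G1, G2))
print(find_homomorphism(G2, G1))
```

Under these assumptions, a homomorphism exists from G1 to G2 (both blank nodes of G1 are mapped to `_:b`), so G2 simply entails G1, while no homomorphism exists in the other direction.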
Example 2.2 (RDF graph homomorphism). Let us consider the graph G1 from Example 2.1 and the novel graph G2 stating that Luke uses and is a pilot of a self-driven spaceship (pilot of itself):

RDF Schema

RDF Schema (RDFS), which is part of the RDF standard [RDF, 2014c], introduces the notion of classes, which are groups of resources; classes are themselves resources. This standard also defines two namespaces: rdf and rdfs. The property rdf:type is used to type a resource, i.e., to express that a resource is an instance of a class. For example, the triple (:Luke, rdf:type, :Person) states that Luke is an instance of the class Person.
RDFS also defines four properties used to state constraints or relationships between classes and properties:
The property rdfs:subClassOf (abbreviated $\prec_{sc}$) specifies that a class is a subclass (specialization) of another;
The property rdfs:subPropertyOf (abbreviated $\prec_{sp}$) states that a property is a subproperty (specialization) of another;
The properties rdfs:domain and rdfs:range (abbreviated $\hookleftarrow_d$ and $\hookrightarrow_r$, respectively) specify that resources appearing as the first (respectively, the second) argument of a property have a certain type.
The IRIs rdf:type (abbreviated $\tau$), $\prec_{sc}$, $\prec_{sp}$, $\hookleftarrow_d$ and $\hookrightarrow_r$ are called the built-in properties. Table 2.1 sums up the short notations we adopt for these properties. We call RDFS properties the built-in properties except $\tau$. A triple whose property is an RDFS property is called a schema triple or, more precisely, an RDFS triple.
The RDFS ontology of a graph is the set of its RDFS triples:
Definition 2.4 (RDFS Ontology). The RDFS ontology of an RDF graph G is the set of schema triples contained in G. A graph is called an RDFS ontology if it contains only schema triples.
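Definition 2.4 amounts to a filter on the property position, as the following Python sketch shows. The sample graph and its class and property names (other than :Luke and :Person) are hypothetical, and triples are encoded as tuples of strings.

```python
# The four RDFS properties; rdf:type is built-in but not an RDFS property.
RDFS_PROPERTIES = {
    "rdfs:subClassOf",
    "rdfs:subPropertyOf",
    "rdfs:domain",
    "rdfs:range",
}


def rdfs_ontology(graph):
    """Return the schema triples of a graph."""
    return {t for t in graph if t[1] in RDFS_PROPERTIES}


def is_rdfs_ontology(graph):
    """A graph is an RDFS ontology if it contains only schema triples."""
    return rdfs_ontology(graph) == set(graph)


# Hypothetical sample graph, for illustration only.
graph = {
    (":Luke", "rdf:type", ":Person"),
    (":Jedi", "rdfs:subClassOf", ":Person"),
    (":pilotOf", "rdfs:domain", ":Pilot"),
}
print(sorted(rdfs_ontology(graph)))
print(is_rdfs_ontology(graph))
```

The typing triple about :Luke is a data triple, so the sample graph as a whole is not an RDFS ontology, whereas its extracted ontology is.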

RDF entailment rules

The semantics of an RDF graph is given by the explicit triples it contains, as well as the implicit triples that can be derived from it using RDF entailment rules.
We assume given a set of variables $V$ disjoint from the RDF resources $I \cup L \cup B$.
A triple pattern is a triple of values belonging to $(I \cup B \cup V) \times (I \cup V) \times (L \cup I \cup B \cup V)$.
A basic graph pattern (BGP) is a set of triple patterns. It generalizes the notion of RDF graph by also allowing variables in the subject, property and object positions.
For a BGP $P$, we denote by $\mathrm{Var}(P)$ the set of variables occurring in $P$ and by $\mathrm{Bl}(P)$ its set of blank nodes.
Definition 2.5 (BGP to RDF graph homomorphism). A homomorphism from a BGP $P$ to an RDF graph $G$ is a substitution $\varphi$ of $\mathrm{Bl}(P) \cup \mathrm{Var}(P)$ by $\mathrm{Val}(G)$, which is the identity elsewhere, such that $\varphi(P) \subseteq G$, with $\varphi(P) = \{(\varphi(s), \varphi(p), \varphi(o)) \mid (s, p, o) \in P\}$.
We write $G \models^{\varphi} P$ to state that $\varphi$ is a homomorphism from $P$ to $G$.
Note that blank nodes are processed as variables in the definition of homomorphism.
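The following Python sketch enumerates the homomorphisms from a BGP to a graph by simple backtracking. It is a minimal illustration, assuming variables are written with a leading `?` (as in SPARQL) and blank nodes with a leading `_:`; blank nodes of the pattern are handled exactly like variables, as noted above. The function names are hypothetical.

```python
def is_placeholder(term):
    """Variables and blank nodes of a pattern can both be substituted."""
    return term.startswith("?") or term.startswith("_:")


def homomorphisms(bgp, graph):
    """Yield every substitution phi such that phi(bgp) ⊆ graph."""

    def extend(patterns, phi):
        if not patterns:
            yield dict(phi)
            return
        first, rest = patterns[0], patterns[1:]
        for triple in graph:
            candidate = dict(phi)
            ok = True
            for term, value in zip(first, triple):
                if is_placeholder(term):
                    # Bind the placeholder, or check an earlier binding.
                    if candidate.setdefault(term, value) != value:
                        ok = False
                        break
                elif term != value:
                    ok = False
                    break
            if ok:
                yield from extend(rest, candidate)

    yield from extend(list(bgp), {})


# The graph of Example 2.1.
G1 = {
    (":Luke", ":uses", "_:bd"),
    (":Luke", ":pilotOf", "_:bs"),
    ("_:bd", ":pilotOf", "_:bs"),
}
pattern = [(":Luke", ":uses", "?d"), ("?d", ":pilotOf", "?s")]
print(list(homomorphisms(pattern, G1)))
```

Here the pattern asks for something Luke uses that is itself a pilot of something; the only homomorphism binds `?d` to `_:bd` and `?s` to `_:bs`.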
Below, we define a syntax for RDF entailment rules based on basic graph patterns.
Should we translate these rules into first-order logic, using a single ternary predicate to denote triples, we would obtain specific tuple-generating dependencies (TGDs) [Abiteboul et al., 1995] or existential rules, e.g., [Mugnier and Thomazo, 2014]. In particular, these rules allow one to assert the existence of unknown entities, thanks to existentially quantified variables in the head of TGDs / existential rules. However, the set of built-in RDF entailment rules that we consider next does not have this feature: these rules would be logically translated into range-restricted, or datalog, rules (in which variables that occur in a rule head also occur in the rule body and are universally quantified [Abiteboul et al., 1995]). In the next chapters, we mainly work with built-in RDF entailment rules, and extend some results to more general RDF entailment rules in Section 4.6, which justifies the following definitions that go beyond built-in RDF entailment rules.
Definition 2.6 (RDF entailment rule). An RDF entailment rule $r$ is of the form $\mathrm{body}(r) \rightarrow \mathrm{head}(r)$, where $\mathrm{body}(r)$ and $\mathrm{head}(r)$ are basic graph patterns containing no blank node, respectively called the body and the head of the rule $r$.
The set of built-in RDF entailment rules is defined in [RDF, 2014b]. These rules produce implicit triples by exploiting the RDFS ontology of an RDF graph. In this thesis, we consider the rule set defined in Table 2.2, denoted by $R_{RDFS}$; in the table, all values except RDFS properties denote variables. For example, for Rule rdfs2, we have:
$\mathrm{body}(\mathrm{rdfs2}) = \{(p, \hookleftarrow_d, c), (s, p, o)\}$
$\mathrm{head}(\mathrm{rdfs2}) = \{(s, \tau, c)\}$
where $p$, $c$, $s$ and $o$ are variables, $\hookleftarrow_d$ denotes rdfs:domain and $\tau$ denotes rdf:type. This rule specifies that the subject of a triple is an instance of the domain of the triple's property.
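Rule rdfs2 can be illustrated with a direct Python transcription: join each domain declaration with the data triples that use the declared property. This is only a sketch; the function name and the sample graph (including the class :Pilot) are hypothetical.

```python
def apply_rdfs2(graph):
    """Apply rdfs2: (p, rdfs:domain, c), (s, p, o) -> (s, rdf:type, c)."""
    domains = {(p, c) for (p, d, c) in graph if d == "rdfs:domain"}
    return {
        (s, "rdf:type", c)
        for (s, p, o) in graph
        for (q, c) in domains
        if p == q
    }


# Hypothetical sample graph, for illustration only.
graph = {
    (":pilotOf", "rdfs:domain", ":Pilot"),
    (":Luke", ":pilotOf", "_:bs"),
}
print(apply_rdfs2(graph))
```

Since :pilotOf has domain :Pilot in this sample graph, the rule derives that :Luke is an instance of :Pilot.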
We define how RDF entailment rules directly entail implicit triples from explicit ones.
Definition 2.7 (Direct entailment). The direct entailment of an RDF graph $G$ with a set of RDF entailment rules $R$, denoted by $C_{G,R}$, characterizes the set of implicit triples resulting from triggering (a.k.a. firing) the rules in $R$ using the explicit triples of $G$ only. It is defined as:
$C_{G,R} = \{\varphi(\mathrm{head}(r))^{safe} \mid \exists r \in R,\ G \models^{\varphi} \mathrm{body}(r) \text{ and there is no } \varphi' \text{ extension of } \varphi \text{ s.t. } G \models^{\varphi'} \mathrm{head}(r)\}$
where $\varphi(\mathrm{head}(r))^{safe}$ is obtained from $\varphi(\mathrm{head}(r))$ by replacing each variable in $\mathrm{Var}(\mathrm{head}(r)) \setminus \mathrm{Var}(\mathrm{body}(r))$ by a fresh blank node. Note that the condition “there is no $\varphi'$ extension of $\varphi$ s.t. $G \models^{\varphi'} \mathrm{body}(r) \cup \mathrm{head}(r)$” prevents the production of an obviously redundant set of triples.
Without loss of generality, as in the RDF standard, we only consider well-formed entailed triples, i.e., triples from $(I \cup B) \times I \times (L \cup I \cup B)$.
Example 2.4 (Direct entailment). Consider $G_3$, the graph of Example 2.1 extended with the ontology $O_{ex}$ of Example 2.3:
$G_3 = \{(\text{:Rey}, \text{:usesWeapon}, \text{\_:bs})\} \cup O_{ex}$
as well as the rule rdfs7:
$(p_1, \prec_{sp}, p_2), (s, p_1, o) \rightarrow (s, p_2, o)$
The rule rdfs7 applies to $G_3$, i.e., $G_3 \models^{\varphi} \mathrm{body}(\mathrm{rdfs7})$, through the homomorphism $\varphi$ defined as $\{p_1 \mapsto \text{:usesWeapon}, p_2 \mapsto \text{:uses}, s \mapsto \text{:Rey}, o \mapsto \text{\_:bls}\}$. The rules rdfs11, ext2, ext3 and ext4 also apply to $G_3$.
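The direct entailment of Definition 2.7 can be sketched in Python for the datalog case, where every head variable also occurs in the body, so no fresh blank node is needed and the redundancy condition reduces to checking that the instantiated head is not already in the graph. The encoding is an assumption: variables start with `?`, rules are (body, head) pairs of pattern lists, and the graph below is only the fragment of $G_3$ that the homomorphism above relies on, with the blank node named as in the definition of $G_3$. The subproperty triple is the one that $\varphi$ requires to be present.

```python
def matches(bgp, graph, phi=None):
    """Yield every substitution of the variables of bgp that maps it into graph."""
    phi = phi or {}
    if not bgp:
        yield phi
        return
    first, *rest = bgp
    for triple in graph:
        candidate = dict(phi)
        if all(
            candidate.setdefault(t, v) == v if t.startswith("?") else t == v
            for t, v in zip(first, triple)
        ):
            yield from matches(rest, graph, candidate)


def direct_entailment(graph, rules):
    """Triples produced by firing each rule once on the explicit triples only."""
    entailed = set()
    for body, head in rules:
        for phi in matches(body, graph):
            instance = {tuple(phi.get(t, t) for t in triple) for triple in head}
            # Skip rule applications whose head is already satisfied.
            if not instance <= graph:
                entailed |= instance
    return entailed


RDFS7 = (
    [("?p1", "rdfs:subPropertyOf", "?p2"), ("?s", "?p1", "?o")],
    [("?s", "?p2", "?o")],
)
# Assumed fragment of G3: the data triple plus one ontology triple.
G3_FRAGMENT = {
    (":Rey", ":usesWeapon", "_:bs"),
    (":usesWeapon", "rdfs:subPropertyOf", ":uses"),
}
print(direct_entailment(G3_FRAGMENT, [RDFS7]))
```

On this fragment, firing rdfs7 yields the single implicit triple stating that :Rey uses the resource denoted by the blank node. The sketch does not filter ill-formed triples (e.g., a literal in subject position), which a complete version would have to do.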

Table of contents :

1 Introduction
2 Preliminaries
2.1 RDF data model and SPARQL query language
2.1.1 RDF graphs
2.1.2 RDF Schema
2.1.3 RDF entailment rules
2.1.4 BGP Queries
2.1.5 Query answering
2.2 Data integration
2.2.1 Theory of data integration
2.2.2 Global As View data integration
2.2.3 Local As View data integration
2.2.4 Global Local As View data integration
2.3 Summary
3 RDF query answering
3.1 Motivation and state of the art
3.1.1 RDF representations
3.1.2 Query answering techniques
3.1.3 RDF storage layouts
3.2 Complete RDFS query reformulation
3.2.1 Preliminaries: RDFS ontology and RRDFS rule set properties
3.2.2 Overview of the query reformulation technique
3.2.3 Reformulation rules associated with Rc
3.2.4 Reformulation algorithm associated with Rc
3.2.5 Reformulation with Ra
3.2.6 Reformulation with Rc ∪ Ra
3.2.7 Experiments
3.2.8 Reformulation for Ra-compliant graphs
3.3 RDF storage layouts for efficient query answering
3.3.1 Preliminaries
3.3.2 BGPQ answering on the T layout
3.3.3 BGPQ answering on the CP layout
3.3.4 BGPQ answering based on the TCP layout
3.3.5 Summary-based query pruning
3.3.6 Experimental evaluation
3.4 Summary
4 RDF integration of heterogeneous data sources
4.1 Motivation and state of the art
4.1.1 Mediator data models and query languages
4.1.2 Mapping Language
4.1.3 Contributions
4.2 RDF Integration Systems
4.2.1 RDF Integration System (RIS) Definition
4.2.2 Query answering problem
4.3 Query answering techniques on RDF Integration Systems
4.3.1 Materialization-based query answering strategies: MAT and MAT-CA
4.3.2 Rewriting-based query answering strategies: REW-CA, REW-C and REW
4.3.3 Rewriting fully-reformulated queries using LAV mappings: REW-CA
4.3.4 Rewriting partially-reformulated queries using saturated LAV mappings: REW-C
4.3.5 Rewriting queries using saturated mappings and ontology LAV mappings: REW
4.3.6 Remarks on related techniques
4.3.7 Landscape of query answering strategies
4.4 A Platform for RDF Integration Systems: Obi-Wan
4.4.1 Query answering in Obi-Wan
4.4.2 Query rewriting and mediated plan optimizations
4.5 Experimental evaluation
4.5.1 Experimental scenarios
4.5.2 Query answering performance
4.6 Extending the framework to more general rules
4.6.1 Restricted RIS
4.6.2 Correctness of the Method
4.7 Summary
5 Conclusion and perspectives
Bibliography