Get Complete Project Material File(s) Now! »

## data graphs and the resource description frame-work

rdf data model This work targets directed graphs with labeled nodes and edges. In particular, we focus on data graphs described in the Resource Description Framework (RDF) [8], the World Wide Web Consortium (W3C) standard for representing Web data. We consider only well-formed RDF data, as per the RDF specification [8], defined over three pairwise disjoint sets: the set of Uniform Resource Identi-fiers (URIs) U, the set of typed or untyped literals (constants) L, and the set of blank nodes (unknown URIs or literals) B. Blank nodes are an essential feature of RDF that allows us to support unknown URI or literal tokens. They are conceptually similar to the labeled nulls or variables used in incomplete relational databases [15], as shown in [16].

rdf graphs An RDF graph is a finite set of triples of the form s p o, such that s 2 (U [ B), p 2 U, and o 2 (U [ B [ L). The triple s p o states that the subject s has the property p, and the value of the property is the object o. The RDF property rdf:type is used to attach types, i.e., classes, to an RDF node, which may have zero, one, or several types. Note here that an RDF type attached to an RDF node using the rdf:type property is an RDF node itself, whereas the types in typed literals are concatenated to their raw constant string. Also, since literals do not appear as subjects of rdf:type triples, they cannot be typed via the rdf:type property.

As our running example, Figure 3 shows a sample RDF graph G describing a university department including professors, graduate students, articles they wrote, and courses they teach and/or take. Here and in the sequel, RDF nodes shown in gray are classes (types). Further, RDF nodes whose labels appear enclosed in quotes, e.g., “d1”, are literals, whereas the others are URIs.

rdfs ontology We can enhance the resource descriptions com-prised in an RDF graph by declaring ontological constraints stating relationships among its types and properties. For instance, we may add to the graph G in Figure 3 the ontology O consisting of the triples listed in Table 1 to state that graduate students, respectively, professors are instructors, that anyone who takes a course is a student, what can be taken is a course, and that advising someone entails knowing them.

From ontological constraints and explicit triples, we may derive implicit triples. For instance, from G and O, it follows that p1 is of type Instructor, which is modeled by the implicit triple p1 rdf:type Instructor; we say this triple holds in G, even though it is not explicitly part of it. Other implicit triples obtained from this ontology based on G are: p2 rdf:type Instructor, p2 rdf:type Student, and p1 knows p2.

Given a graph G (which may include a set of ontological constraints, denoted O), we obtain the saturation of G by adding to G:

1. All the implicit triples derived from G and O, then

2. All the implicit triples derived from those in Step 1 and O, and so on

until a fixed point, denoted G¥, is reached.

We add these triples based on RDF entailment rules from the RDF standard. In this work, we consider the widely-used RDF Schema (RDFS) entailment rules. They exploit the simple RDFS ontology lan-guage based on the four standard properties illustrated above, which we denote subClass, subProperty, domain and range, to add new onto-logical constraints or facts. The saturation G¥ of a graph G comprising RDFS ontological constraints is finite [17], unique and can be com-puted in polynomial time, e.g., by leveraging a database management system [16]. Crucially, G¥ materializes the semantics of G.

terminology We call the triples from G whose property is rdf:type type triples, those whose property is among the standard four RDFS ones schema triples, and we call all the other data triples.

We say an RDF node is typed in G if the node is the sub-ject of at least one type triple in G. In the graph shown in Fig-ure 3, p1 rdf:type Pro f essor is a type triple, hence the node p1 is typed, Pro f essor rdfs:subClassOf Instructor is a schema triple, whereas p1 advises p 2 is a data triple.

Further, we say a URI from G is a class node if:

1. It appears as the subject or the object of an rdfs:subClassOf triple, or the object of an rdfs:domain triple or an rdfs:range triple; or

2. It appears as the object of an rdf:type triple; or

3. It appears as the subject of an rdf:type triple with the object rdfs:Class.

We call property node a URI appearing:

1. As the subject or the object of an rdfs:subPropertyOf triple, or as the subject of an rdfs:domain triple or an rdfs:range triple; or

2. As the subject of an rdf:type triple with object rdf:Property.

Together, the class and property nodes are the schema nodes; all non-schema nodes are data nodes. In Figure 3, Professor and GradStudent are class nodes. If we consider the aforementioned ontology O, takes, advises, and knows are property nodes. Finally, the a, p, c, and d RDF nodes are data nodes.

It is important to stress that not all nodes in an RDF graph are typed, e.g., this is only true for p1 and p2 in Figure 3. Further, some nodes may have several types, in particular due to saturation (e.g., p2 is of types GradStudent and Instructor) but not only, e.g., p1 could also be of type ForeignStudent, etc.

### queries on rdf graphs

We can query RDF graphs by means of SPARQL queries as defined in [9]. SPARQL is a declarative query language similar to SQL. A SPARQL query contains a graph pattern that is used for data selection. In this thesis, we focus on basic graph patterns (BGPs), i.e., sets of triple patterns. A query processing engine matches triples of a BGP against G. Each triple of the BGP may contain values (RDF nodes) or variables (strings preceded with the ? character). For example, the query below finds all instances of the Professor class for the graph in Figure 3.

SELECT ?instance {

?instance rdf:type Professor .

}

The result of the query is f(p1)g.

SPARQL queries may contain a prefix, which allows us to abbreviate URIs. For instance, common abbreviations recognized by SPARQL query processing engines, which are consistent with those we intro-duced in Section 2.1, are:

rdf http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs http://www.w3.org/2000/01/rdf-schema#

Using SPARQL, we can express more sophisticated queries such as aggregates. The following query finds the number of courses per instructor for the graph in Figure 3.

PREFIX re: <http://running.example/>

SELECT ?instructor COUNT(?course) AS ?numberOfCourses { ?instructor re:teaches ?course .

}

GROUP BY ?course

The result of the query is f(p2, 1), (p3, 2), (p4, 1)g. Note here that the results does not contain p1 because this RDF node does not have a value for the property teaches.

Hereafter, when we discuss queries issued against an RDF graph G, we refer to the semantics of SPARQL queries. As hinted in Section 2.1, a query processing engine assumes that the semantics of G (the dataset) is its saturation G¥ (materialized or not). For instance, consider the graph G in Figure 3 with the ontology O stated in Table 1 and the following query Q:

SELECT ?instance {

?instance rdf:type Instructor .

}

The result of Q(G) is empty, whereas the result of Q(G¥) returns nodes p1 and p2. Notice how this time, through the type inference, we find that there are two instructors. Interestingly, neither of the two queries about instructors finds all the p RDF nodes (fp1, p 2, p3, p 4g): we will explore this observation in Section 3.3.

Similarly, given that G¥ contains more data, its summary may be different from the one of the original graph G (more on this also in Chapter 3). Therefore, from now on, we consider SPARQL queries issued against G as queries on G¥ (unless explicitly stated otherwise).

**relational data warehouses and aggregation**

We recall here a set of classical notions from the relational data man-agement literature: Online Analytical Processing, aggregate queries, and aggregate lattices. Then, we recall one of the most efficient algo-rithms introduced in this relational setting for efficiently evaluating lattices of aggregate queries over large volumes of data that provides an efficient way of computing the result of a GROUP BY CUBE query over a multidimensional aggregate lattice. Even though this algorithm does not apply to the heterogeneous RDF graphs we consider in this thesis, we build upon it to devise our novel algorithm specifically for this setting (in Chapter 4).

online analytical processing (olap) Online analytical pro-cessing (OLAP, in short) is a major class of applications of structured database management systems, and it has contributed to making data management the multibillion-dollar industry it is nowadays. OLAP systems enable businesses to derive critical insights from old and new data characterizing the functioning of a real-life process by efficiently evaluating aggregate queries that compute a global view from millions of individual data points.

In a typical application, managers of a retail chain use OLAP to synthesize and then analyze the sales of various products in various locations, stores over different periods. In an academic setting, we may use OLAP to analyze students’ registration to individual courses, compare average grades across disciplines, etc.

facts, dimensions, and measures OLAP facilitates multidi-mensional analytics and is typically backed by a relational data ware-house (DW) [18]. Such a warehouse is a specially organized database, which features one or more fact tables. A fact table states facts, i.e., raw information about a specific topic, e.g., “a client buys a product in a given store at a given date and time”, or “a student takes a course in a certain term in a certain year”. Each fact is uniquely identified by a combination of dimensions, e.g., the client and the product in the former example, or the student, the course, the year, and the professor teaching the course that year, in the latter example.

Further, to each fact, there may be associated one or several measures, which depends on the aggregate query asked over the data warehouse. For instance, if the retail manager is interested in following product sales, the number of products bought in every sale is a good measure; instead, if we examine the financial aspect, the price at which the sale took place is the right measure to use. In the academic example, if we follow the attendance of a course, then each fact simply counts as 1; instead, if we study the average grades, the grade obtained by each student in each course is the good measure to use.

Last but not least, an aggregate query uses an aggregate function over the measure values corresponding to all the facts. In our example, one can count the sales or the students; sum the numbers of sold products or the number of student registrations across several years; average the grades of all students who took the same course at the same time, or even find the maximum/minimum grade given each year, etc.

multidimensional aggregate lattice Experience has shown that the OLAP analysis needed by executives is rarely satisfied by running one aggregate query. Instead, we need many queries to find interesting insights into the data. Further, many aggregate queries that are asked on a given data warehouse have things in common. For instance, one may require the GPA of students in a course by year when the course is given, or by year and student gender (to see if there is a significant difference in performance by gender), or by year and teaching assistant if we want to see whether there are strong differences across the different lab groups, etc.

To this end, as one of its core data structures, OLAP employs a mul-tidimensional data array called OLAP cube. The OLAP cube spans the data on its multiple dimensions and stores, in each multidimensional cell, a measure value. Typically, relational DWs organize the data in a star schema or a snowflake schema (a normalized star schema) with one or several fact tables. The performance of OLAP depends both on the data layout and the query evaluation algorithms. The latter may lead to expensive computation; thus, their design accounts for data volume and data complexity (in particular, the number of dimensions). Generally speaking, in a data warehouse where N dimensions have been identified, and if one fixes a measure and an aggregate func-tion, given that each subset of dimension can be used for aggregation, there are 2N possible ways to group the facts, thus, 2N aggregates to compute.

An important insight that has revolutionized the efficiency of OLAP processing is the following. Let S1 be a set of one or more dimensions, and S2 S 1 be a dimension set that is its strictly larger superset. Then:

Any group of facts in the aggregate query determined by the dimension set S1 is completely included in a group of facts in the aggregate query determined by S2; this suggests reusing the effort needed to group facts on S2 to group by S1 at a lower computational cost;

If the aggregate function is sum, count, min, or max, we can compute the aggregate function results for the aggregate query corresponding to S1, directly from the results corresponding to S2, with no need to consult the base data (the facts themselves) or their measure values; in the case of avg, we can reconstruct the aggregated result based on sum and count.

This has lead to the introduction of a so-called multidimensional aggregate lattice that compactly represents the 2N aggregates that can be obtained from N independent dimensions, and also captures how each of them can be computed based on the results of the other. The node at the bottom of the lattice (usually not shown) corresponds to the empty set of dimensions: data is not aggregated at all (this node corresponds to the raw set of facts). At the top of the lattice, data is aggregated by all the N dimensions. At each intermediary level, we have nodes whose parents in the lattice have one more dimension, and whose children have one dimension less.

classical one-pass lattice computation For efficiency, in relational DWs, nodes in a multidimensional lattice are often com-puted from one of their parents to reuse computation and limit the number of passes over the data. In some works, the reusing of the other aggregates’ results is sometimes referred to as “summarizabil-ity” [19], [20]. Among the existing algorithms [21], ArrayCube [22] computes the whole lattice in a single pass over the data. Given a set of N dimensions, a measure, and an aggregate function, it relies on an array representation of data and evaluates 2N nodes through a Minimum Memory Spanning Tree, as we recall below.

array representation of data The distinct values of each dimension are ordered, leading to a set of cells, each corresponding to a unique combination of indices of values along the N dimensions (axes). In Example 3, assuming nationality 2 {A, B, F, L, N}, gender 2 {F, M} and company/area 2 {A, D, M, N} (we denote initials of the respective values in Figure 4), the multidimensional space has 40 cells, e.g., in cell 0, nationality=A, gender=F and company/area=A; in cell 1, nationality=B, gender=F and company/area=A. Each cell of the N-dimensional array contains the value of the aggregated measure over all facts in that cell; in Example 3, this is the count of CEOs. Further, cells are grouped in partitions: each partition is a contiguous part of the array, containing the cells corresponding to a predefined number of distinct values along each dimension, e.g., if this is 2, the 40-cell array has 6 partitions. Figure 4(a) shows the array. Note that an initial pass over the data is required to bring it from the relational to the array representation.

minimum memory spanning tree (mmst) The dimensions in Example 3 determine the lattice in Figure 4(b). To evaluate all nodes, ArrayCube chooses, for each non-root node A, a parent node in the lattice to compute the aggregate in A from, hence forming a spanning tree covering all the nodes in the lattice. The memory needed to evaluate all the aggregates in one pass over the data depends on the ordering of dimensions, their numbers of distinct values, and the partition size. ArrayCube chooses the tree that minimizes the overall memory needed; this is called the Minimum Memory Spanning Tree, or MMST in short.

lattice computation The algorithm proceeds as follows. The MMST is instantiated, allocating to each node the required memory. Partitions are loaded from the array representation of data, one at a time, into the root of the MMST. The content of each cell in the root is propagated to the children and used to incrementally update the aggregated measures of all the nodes in the MMST. Once a partition is evaluated, each node checks if it is time to store its memory content to disk. For instance, after scanning partition P1 in Figure 4(a), the subarray with nationality 2 {A, B} and company/area 2 {A, D} is exhausted. Thus, the counts of CEOs with either of the two nationalities, and A or D company area are computed. Now, A6 (Figure 4(b)) can store its result to disk and reuse the memory in the subsequent computation. Similarly, once processed, the two subarrays of P2, nationality 2 {A, B}, and company/area 2 {M, N}; nationality 2 {A, B}, are exhausted, and both A6 and A5 can store their results to disk. A6 stores its result to disk after every partition, whereas A5 does this after every two partitions.

**Table of contents :**

**1 introduction **

1.1 Motivation and outline

1.2 Manuscript organization

**2 preliminaries **

2.1 Data graphs and the Resource Description Framework

2.2 Queries on RDF graphs

2.3 Relational data warehouses and aggregation

**3 rdf graph summarization **

3.1 Motivation

3.2 Background: quotient RDF graph summarization

3.2.1 Quotient graphs

3.2.2 Prior work

3.2.3 Limitations of the prior work

3.2.4 Notation

3.3 Data graph summarization

3.3.1 Data property cliques

3.3.2 Strong and weak node equivalences

3.3.3 Weak and strong summarization

3.4 Typed data graph summarization

3.4.1 Data-then-type summarization

3.4.2 Type-then-data summarization

3.5 Summarization of graphs with RDFS ontologies

3.5.1 Type-then-data summarization using most general types

3.5.2 Interactions between summarization and saturation

3.5.3 Shortcut results

3.5.4 Relationships between summaries

3.6 From summaries to visualizations

3.6.1 Leaf and type inlining

3.6.2 Summary statistics

3.6.3 Visualizing very large summaries

3.7 Summarization algorithms

3.7.1 Global data graph summarization

3.7.2 Incremental data graph summarization

3.7.3 Global and incremental typed graph summarization

3.7.4 Parallel summarization

3.7.5 Parallel data graph summarization

3.7.6 Parallel typed graph summarization

3.7.7 Apache Spark implementation specifics

3.8 Centralized summarization experiments

3.8.1 Centralized algorithms compared

3.8.2 Datasets

3.8.3 Summary size

3.8.4 Summarization time

3.8.5 Summary precision

3.8.6 Experimental conclusions

3.9 Parallel summarization experiments

3.9.1 Configuration

3.9.2 Speedup through parallelism

3.9.3 Scalability study

3.9.4 Experimental conclusions

3.10 Non-quotient RDF graph summarization

3.11 Conclusion

**4 discovering interesting aggregates in rdf graphs **

4.1 Motivation

4.2 Problem statement and notation

4.3 Overview of the approach

4.4 Lattice-based computation

4.4.1 Incorrectness in the RDF setting

4.4.2 MVDCube algorithm

4.5 Early-stop aggregate pruning

4.5.1 Early-stop principle

4.5.2 Estimating the interestingness score

4.5.3 Other interestingness functions

4.5.4 Plugging early-stop into MVDCube

4.6 Implementation details

4.6.1 Offline processing

4.6.2 Online processing

4.7 Experimental evaluation

4.7.1 Analysis of example results

4.7.2 Benefits of derived properties

4.7.3 Analysis of MVDCube against PGCube

4.7.4 Impact of early-stop on MVDCube

4.7.5 Scalability study

4.7.6 Experimental conclusions

4.8 Related work

4.9 Conclusion

**5 conclusion and perspectives **

5.1 Thesis conclusion

5.2 Future work perspectives

**bibliography**