Get Complete Project Material File(s) Now! »

## A novel algorithm: the Provenance-Aware Chase & Backchase

We dedicate this section to showing a different and much more efficient approach of the backchase phase, and to presenting a high level overview of the resulting novel reformulation algorithm, the Provenance-Aware Chase & Backchase.

Indeed, we will sketch in the following (and demonstrate with our experimental evaluation) how, by our new strategy, the performance of a complete search for minimal reformulations can be significantly more improved than just by the naive bottom-up strategy and the corresponding pruning. The essential way of achieving such performance improvement is that of replacing the potentially exponential number of subquery chases with a single chase of the universal plan.

To ensure a sound and complete reformulation algorithm, this single chase should in turn be able to retain all the relevant effect of the individual subquery chases. To achieve such behaviour, we will thus instrument this chase with provenance annotations, whose final purpose will be to reflect the minimal reformulations, that is, the subqueries of the universal plan that turn out to be (minimally) equivalent to Q. The ability to maintain and propagate in an unexpensive fashion such provenance information during a single chase of the universal plan, would then spare the exponentially many chases of its subqueries, which constitute the performance issue in the C&B. By design, such approach will also avoid the fruitlessness and redundancy in subquery chases.

### The Provenance-Aware Chase

We will describe in this subsection the Provenance-Aware Chase, further denoted pa_chase. As is the case for the Standard Chase and the Conservative Chase, the pa_chase is an iterative procedure consisting in a sequence of steps. As announced in Section 1.2, the pa_chase essentially consists in instrumenting the cs_chase with provenance. pa_chase steps will thus take as input a provenance-adorned sk_body and an sk_unit_constraint 6 and yield as output a provenance-adorned sk_body.

#### Provenance formulae and provenance-adorned sk_bodies

To describe provenance-adorned sk_bodies, we will first introduce the concept of provenance formulae and their associated operations.

Definition 1.3.63 (Provenance formula). Given a finite set of symbols P, called a provenance vocabulary, a provenance formula over P is either

• True, or

• False, or

• a boolean formula in DNF over the provenance symbols, using the *(AND) and +(OR) operators: F = C1 + … + Cn, where Ci = S1i ∗ … ∗ Sni i , Sj i ∈ P and Ci is called a provenance conjunct. We will further call the symbols in P provenance terms.

We denote by ProvForms(P) the set of all provenance formulae over P.Note that we can view provenance conjuncts as subsets of P and provenance formulae as subsets of the power set of P, P(P), such that False = ∅ and True = P(P). We will hereafter use the standard set operations symbols with straightforward semantics for provenance formulae and provenance conjuncts.

**The Provenance-Aware Chase & Backchase**

We have presented in Section 1.2 an overall view of our reformulation algorithm, the Provenance- Aware Chase & Backchase (ProvC&B). Based on the concepts and statements of the previous subsections, we are now ready to detail this algorithm and prove its soundness and completeness. We start by detailing some terminology that allows us to fully explicitate ProvC&B. Universal plan and universal body. Given a query Q formulated against a source schema S, a set of constraints C and a target schema T, we denote by QC the result of a terminating standard chase sequence over Q with C.

We further denote by BU and call the universal body, the restriction of body(QC) to the target schema T (recall that this restriction means that we consider the maximal sub-body of body(QC) induced by the relational atoms in T). Note that BU is a closed body, that is, BU = BU. We will further call the query U = Query(Head(Q),BU) the universal plan. For every subquery sq of U we denote by Bsq its corresponding body. Note that Bsq ⊆ BU. We recall below the properties of the universal plan, restating them using bodies:

**Table of contents :**

Introduction

**1 A complete yet practical algorithm for finding minimal query reformulations under constraints **

1.1 Overview of the Chase & Backchase

1.2 A novel algorithm: the Provenance-Aware Chase & Backchase

1.3 Formal presentation and guarantees of ProvC&B

1.3.1 Preliminaries: atoms, queries and constraints

1.3.2 The Standard Chase

1.3.2.1 Bodies

1.3.2.2 Homomorphisms of bodies

1.3.2.3 Standard Chase steps and sequences

1.3.2.4 Properties of the Standard Chase

1.3.3 The Conservative Chase

1.3.3.1 Skolem terms, sk_bodies and sk_constraints

1.3.3.2 Homomorphisms of sk_bodies

1.3.3.3 Conservative Chase steps and sequences

1.3.3.4 Properties of terminating Conservative Chase sequences

1.3.3.5 The Conservative Chase and the Standard Chase

1.3.3.6 Termination of the Conservative Chase

1.3.3.7 Splitting sk_constraints into sk_unit_constraints

1.3.4 The Provenance-Aware Chase

1.3.4.1 Provenance formulae and provenance-adorned sk_bodies

1.3.4.2 Provenance-Aware Chase steps and sequences

1.3.4.3 The Provenance Pick, the Provenance-Aware Chase and the Conservative Chase

1.3.4.4 Termination of the Provenance-Aware Chase

1.3.4.5 The Provenance-Aware Chase and the Standard Chase

1.3.5 The Provenance-Aware Chase & Backchase

1.4 Implementation

1.5 Experiments

1.6 Mininum-cost reformulations with ProvC&B

1.6.1 Cost-based pruned Provenance-Aware Chase steps

1.6.2 Cost-based pruned ProvC&B

1.6.3 Initial experimental evaluation

1.7 Related work

**2 A theoretical and practical approach to finding XPath rewritings with a single-level of intersection of multiple views **

2.1 View-based rewritings

2.1.1 XP queries and tree patterns

2.1.2 XP∩−simple , XP∩ , DAG patterns

2.1.3 Pattern satisfiability, containment and equivalence

2.1.4 Interleavings

2.1.5 Union-freeness, dominant interleavings and DAG-tree equivalence

2.1.6 The view-based rewriting problem for XP∩

2.1.7 A sound and complete rewriting algorithm

2.1.8 Interesting XP fragments

2.2 Rewritings, equivalence and union-freeness

2.2.1 Rewritings and the DAG-tree equivalence

2.2.2 DAG-tree equivalence and union-freeness

2.3 A rule-based algorithm for directly constructing the dominant interleaving

2.3.1 Global flow of APPLY-RULES

2.3.2 The rewrite rules of APPLY-RULES

2.3.3 Complexity of APPLY-RULES

2.3.4 Using APPLY-RULES for union-freeness, equivalence and rewritings

2.4 Achieving PTIME completeness

2.4.1 Completeness in PTIME for XP∩−simple (XPes) DAGs

2.4.2 Completeness in PTIME for XPes queries

2.4.3 Completeness in PTIME for XP// akin patterns

2.5 Implementation and optimizations

2.6 Experiments

2.6.1 Documents, queries and views

2.6.2 REWRITE vs

2.6.3 Rewrite time

2.6.4 Evaluation time

2.6.5 Discussion

2.7 Related Work

Conclusions and future directions

**A Additional topics **

A.1 Efficient multi-dimensional indexing (the ACM SIGMOD Programming Contest 2012)

A.2 Web source selection for wrapper inference

**B Condensé de la thèse en français **

B.1 Introduction

B.2 Condensé du premier chapitre

B.3 Condensé du deuxième chapitre