Get Complete Project Material File(s) Now! »

## Tuple-Independent Databases

As we already mentioned in the introduction, one way to work with uncertain data would be to explicitly represent all possible states of the data (called the possible worlds) and associate to each world a probability value. The probability that a Boolean query satisfies a probabilistic database would then be the sum of the probabilities of the possible worlds that satisfy the query. The problem with this approach is that this is not a very compact way of storing the data, so that it is not feasible in practice to represent the data and query it in this way. Nevertheless, we can efficiently represent uncertain data if we make some independence assumptions: each tuple has a probability to be present or absent independently of the other tuples.

### Provenance and Knowledge Compilation Circuit Classes

Boolean circuits and functions. A (Boolean) valuation of a set V is a function : V ! {0, 1}. A Boolean function ‘ on variables V is a mapping that associates to each valuation of V a Boolean value in {0, 1} called the evaluation of ‘ according to . A (Boolean) circuit C = (G,W, goutput, μ) is a directed acyclic graph (G,W) whose vertices G are called gates, whose edges W are called wires, where goutput 2 G is the output gate, and where each gate g 2 G has a type μ(g) among var (a variable gate), NOT, OR, AND. The inputs of a gate g 2 G are the gates g0 2 G such that (g0, g) 2 W; the fan-in of g is its number of inputs. We require NOT-gates to have fan-in 1 and var-gates to have fan-in 0. The treewidth of C, and its size, are those of the graph (G,W). The set Cvar of variable gates of C are those of type var. Given a valuation of Cvar, we extend it to an evaluation of C by mapping each variable g 2 Cvar to (g), and evaluating the other gates according to their type. We recall the convention that AND-gates (resp., OR-gates) with no input evaluate to 1 (resp., 0). The Boolean function on Cvar captured by the circuit is the one that maps to the evaluation of goutput under . Two circuits are equivalent if they capture the same function.

#### Preliminaries on Probabilistic Graph Homomorphism

We first provide some formal definitions of the concepts we use in this chapter, and introduce the probabilistic graph homomorphism problem and the different classes of graphs that we consider.

Labels. Througout this chapter, will always denote a finite non-empty set of labels. When || > 1, we say that we are in the labeled setting; when || = 1, in the unlabeled setting. We will always consider that is fixed, i.e., will never be part of the input for problems that we consider.

Graphs and homomorphisms. We consider directed graphs with edge labels from , as defined in Section 1.1. We recall that we do not allow multi-edges: an edge e has a unique label (e). When || = 1, i.e., in the unlabeled setting, we simply write (V,E) for the graph and a ! b for an edge. A graph H0 = (V 0,E0, 0) is a subgraph of the graph H = (V,E, ), written H0 H, when we have V 0 = V , E0 E, and when 0 is |E0 , i.e., the restriction of to E0. (Note that, in a slightly non-standard way, we impose that subgraphs have the same set of vertices than the original graph; this will simplify some notation.) A graph homomorphism h from some graph G = (VG,EG, G) to some graph H = (VH,EH, H) is a function h : VG ! VH such that, for all (u, v) 2 EG, we have (h(u), h(v)) 2 EH and further H((h(u), h(v))) = G((u, v)). In other word, it is a homomorphism between G and H, when seen as relational instances. A match of G in H is the image in H of such a homomorphism h, i.e., the graph with vertices h(u) for u 2 VG and edges (h(u), h(v))) for (u, v) 2 EG. Note that two different homomorphisms may define the same match. Also note that two distinct nodes of G could have the same image by h, so a match of G in H is not necessarily homomorphic to G. We write G ; H when there exists a homomorphism from G to H. We call two graphs G and G0 equivalent if, for any graph H, we have G ; H iff G0 ; H. It is easily seen that G and G0 are equivalent if and only if G ; G0 and G0 ; G.

Probabilistic graphs. A probability distribution on graphs is a function Pr from a finite set W of graphs (called the possible worlds of Pr) to values in [0; 1] represented as rational numbers, such that the probabilities of all possible worlds sum to 1, namely, P H2W Pr(H) = 1.

A probabilistic graph is intuitively a concise representation of a probability distribution. Formally, it is a pair (H, ) where H is a graph with edge labels from and where is a probability function : E ! [0; 1] that maps every edge e of H to a probability (e), represented as a rational number. Note that each edge (u, v) in a probabilistic graph (H, ) is annotated both with a label ((u, v)) 2 , and a probability ((u, v)).

**Table of contents :**

Abstract

Remerciements

Table of Contents

General Introduction

Limits of Combined Tractability of PQE

Fixed-Parameter Tractability of Provenance Computation

From Cycluits to d-DNNFs and Lower Bounds

d-DNNFs for Safe Queries

Structure of the Thesis

**1 Background and General Preliminaries **

1.1 Relational Databases, Graphs, Hypergraphs

1.2 Query Languages

1.3 Tuple-Independent Databases

1.4 Complexity Classes

1.5 Trees, Treewidth, Pathwidth

1.6 Tree Automata and Tree Encodings

1.7 Provenance and Knowledge Compilation Circuit Classes

**2 Limits of Combined Tractability of PQE **

2.1 Introduction

2.2 Preliminaries on Probabilistic Graph Homomorphism

2.3 Disconnected Case

2.4 Labeled Connected Queries

2.5 Unlabeled Connected Queries

**3 Fixed Parameter Tractability of Provenance Computation **

3.1 Introduction

3.2 Approaches for Tractability

3.3 Conjunctive Queries on Treelike Instances

3.4 CFG-Datalog on Treelike Instances

3.5 Translation to Automata

3.6 Provenance Cycluits

3.7 Proof of Translation

**4 From Cycluits to d-DNNFs and Lower Bounds**

4.1 Introduction

4.2 Preliminaries on Tree Decompositions

4.3 Upper Bound

4.4 Proof of the Upper Bound

4.5 Application to PQE of CFG-Datalog

4.6 Lower Bounds on OBDDs

4.7 Lower Bounds on (d-)SDNNFs

4.8 Application to Query Lineages

Conclusion and Perspectives