Get Complete Project Material File(s) Now! »

## Fixed-Parameter Tractability of Provenance Com-putation

We now lower our expectations regarding the complexity in the query, since we realized that cases where PQE is tractable in combined complexity are very restricted. We would like to find a more expressive query language and classes of instances for which the combined complexity is tractable in the data and “reasonable” in the query, this being motivated by the fact that the data is usually much bigger than the queries. To reach this goal, we will first focus on the combined complexity of provenance computation on non-probabilistic relational databases. Indeed, a common technique to evaluate queries on probabilistic databases is the intensional approach: first compute a representation of the lineage (or provenance) of the query on the database, which intuitively describes how the query depends on the possible database facts; then use this lineage to compute probabilities eﬃciently. Hence in Chapter 3 of the thesis we will focus on the computation of provenance information, with the hope that the lineages that we will obtain can be used in the probabilistic case.

Our starting point is the work of Courcelle, as initially presented in [Courcelle 1990] and summarized in [Flum, Frick, and Grohe 2002], which shows that for any fixed Boolean MSO query Q and constant bound k, we can evaluate Q in linear time on instances whose treewidth is bounded by k. Intuitively, instances of bounded treewidth can be decomposed into small “parts” that are themselves connected following the structure of a tree (we talk of a tree decomposition). As we already mentioned, bounding the treewidth of instances is a natural criterion to ensure that many problems that are hard on arbitrary instances become tractable. The proof of Courcelle’s result is as follows: the query Q is translated into a tree automaton A, independently from the data but depending on the treewidth parameter k and the query. The instance I of treewidth 6 k is then transformed into what is called a tree encoding T , which is just a tree decomposition encoded with a finite alphabet so as to be processable by a tree automaton. The automaton A is designed to ensure that T is accepted by A iﬀ Q holds on I (written I |= Q), so that we can evaluate Q on I in linear-time data complexity. Moreover, [Amarilli, Bourhis, and Senellart 2015] shows how to use these automata to compute provenance circuits, again in linear-time data complexity. However, this tells little about the combined complexity. Indeed the complexity of computing the automaton is nonelementary in the MSO query, making it the bottleneck of the whole process. Hence the questions that we investigate in Chapter 3: Which queries can be eﬃciently translated into automata? For these queries, can we eﬃciently compute provenance information?

Results. We will use the framework of parameterized complexity theory [Flum and Grohe 2006] in order to better understand what makes the translation to automata intractable. Parameterized complexity theory is a branch of complexity theory introduced in the nineties by Downey and Fellows (see, e.g., [Downey and Fellows 1992]) to study the complexity of a problem depending on some input parameter. In our case, rather than restricting to a fixed class of “eﬃcient” queries, we study parameterized query classes, i.e., we define an eﬃcient class of queries for each value of the parameter. We further make the standard assumption that the signature is fixed; in particular, its arity is constant. This allows us to aim for low combined complexity for query evaluation, namely, fixed parameter tractability with linear-time complexity in the product of the input query and instance, which we call FPT-bilinear complexity.

The translation of restricted query fragments to tree automata on treelike instances has already been used in the context of guarded logics and other fragments, to decide satisfiability [Benedikt, Bourhis, and Vanden Boom 2016] and containment [Barceló, Romero, and Vardi 2014]. To do this, one usually establishes a treelike model property to restrict the search to models of low treewidth (but dependent on the formula), and then translate the formula to an automaton, so that the problems reduce to emptiness testing. Inspired by these fragments, we consider the language of clique-frontier-guarded Datalog (CFG-Datalog), and show an eﬃcient FPT-linear (in the query) translation procedure for this language, parameterized by the body size of rules. This implies FPT-bilinear combined complexity of query evaluation on treelike instances (parameterized by the body size and the treewidth bound). We show how the tractability of this language captures the tractability of such query classes as two-way regular path queries [Barceló 2013] and α-acyclic conjunctive queries. Regular path queries are an important building block of query languages for graph databases, while the class of α-acyclic queries [Yannakakis 1981] is the main known class of conjunctive queries to enjoy tractable combined complexity on arbitrary instances.

Instead of the bottom-up tree automata used in [Courcelle 1990; Flum, Frick, and Grohe 2002], and later in [Amarilli, Bourhis, and Senellart 2015] (which extended the result of Courcelle to probabilistic evaluation, but still in data complexity), we use the formalism of alternating two-way automata. These automata are more succinct than bottom-up automata, hence the translation to them is more eﬃcient. In fact, we prove that bottom-up tree automata are not succinct enough even to translate α-acyclic conjunctive queries eﬃciently.

We can use our CFG-Datalog language to understand what make the translation to automata eﬃcient for conjunctive queries. For conjunctive queries, we show that the treewidth of queries is not the right parameter to ensure eﬃcient translation: we prove that bounded-treewidth conjunctive queries cannot be eﬃciently translated to automata at all, so we cannot hope to show combined tractability for them via automata methods. By contrast, CFG-Datalog implies the combined tractability of bounded-treewidth queries with an additional requirement (interfaces between bags must be clique-guarded), which is the notion of simplicial decompositions previously studied by Tarjan in [Tarjan 1985]. CFG-Datalog can be understood as an extension of this fragment to disjunction, clique-guardedness, stratified negation, and inflationary fixpoints, that preserves tractability.

We then focus on the computation of provenance information for our Datalog fragment on bounded-treewidth instances. As we mentioned, [Amarilli, Bourhis, and Senellart 2015] showed how to compute provenance Boolean circuits from the bottom-up tree automata used by Courcelle. However, for our alternating two-way automata, the computation of provenance as Boolean circuits is not straightforward anymore. For this reason, we introduce a notion of cyclic provenance circuits, that we call cycluits. These cycluits are well-suited as a provenance representation for alternating two-way automata that encode CFG-Datalog queries, as they naturally deal with both recursion and two-way traversal of a treelike instance. While we believe that this natural generalization of Boolean circuits may be of independent interest, it does not seem to have been studied in detail. We show that cycluits can be evaluated in linear time. We further show that the provenance of an alternating two-way automata on a tree can be represented as a cycluit in FPT-bilinear time, generalizing results on bottom-up automata and circuits from [Amarilli, Bourhis, and Senellart 2015].

This implies we can compute in FPT-bilinear time the provenance cycluit of a CFG-Datalog query on a treelike instance (parameterized by the query body size and instance treewidth). We will then be able to use these cycluits for probabilistic evaluation, following the intensional approach to PQE.

**From Cycluits to d-DNNFs and Lower Bounds**

In Chapter 4 of the thesis we will investigate the connections between various provenance representations that can be used by the intensional approach for PQE, and outline the consequences for Chapter 3. We recall that the intensional approach consists of two steps. First, compute a representation of the lineage (or provenance) of the query on the database. The provenance of a Boolean query on a database is a Boolean function with facts of the database as variables that describes how the query depends on these facts. It can be represented with any formalism that is used to represent Boolean functions: Boolean formulas, Boolean circuits, binary decision diagrams, etc. Second, compute the probability of this Boolean function, which is precisely the probability that the probabilistic database satisfies the query. This can be seen as a weighted version of model counting (here, counting the number of satisfying assignments of a Boolean function). In order for this second step to be tractable, the representation formalism cannot be arbitrary. The field of knowledge compilation studies (among others) which formalisms allow tractable weighted model counting, the properties of these formalisms and the links between them. Thus, to evaluate queries on probabilistic databases, we can use knowledge compilation algorithms to translate circuits (or even the cycluits that we computed in Chapter 3) to tractable formalisms; conversely, lower bounds in knowledge compilation can identify the limits of the intensional approach.

In this part of the thesis, we study the relationship between two kinds of tractable circuit classes in knowledge compilation: width-based classes, specifically, bounded-treewidth and bounded-pathwidth circuits; and structure-based classes, specifically, OBDDs (ordered binary decision diagrams [Bryant 1992], following a variable order) and d-SDNNFs (structured deterministic decomposable negation normal forms [Pi-patsrisawat and Darwiche 2008], following a v-tree). Circuits of bounded treewidth can be obtained as a result of practical query evaluation [Jha, Olteanu, and Suciu 2010; Amarilli, Bourhis, and Senellart 2015], whereas OBDDs and d-DNNFs have been studied to show theoretical characterizations of the query lineages they can represent [Jha and Suciu 2011]. Both classes enjoy tractable probabilistic compu-tation: for width-based classes, using message passing [Lauritzen and Spiegelhalter 1988], in time linear in the circuit and exponential in the treewidth; for OBDDs and d-SDNNFs, in linear time by definition of the class [Darwiche 2001]. Hence the question that we study in Chapter 4: Can we compile width-based classes eﬃciently into structure-based classes?

Results. We first study how to perform this transformation, and show correspond-ing upper bounds. Existing work has already studied the compilation of bounded-pathwidth circuits to OBDDs [Jha and Suciu 2012; Amarilli, Bourhis, and Senellart 2016]. Accordingly, we focus on compiling bounded-treewidth circuits to d-SDNNF circuits. We show that we can transform a Boolean circuit C of treewidth k into a d-SDNNF equivalent to C in time O(|C| × f(k)) where f is singly exponential. This allows us to be competitive with message passing (also singly exponential in k), while being more modular. Beyond probabilistic query evaluation, our result implies that all tractable tasks on d-SDNNFs (e.g., enumeration [Amarilli, Bourhis, Jachiet, and Mengel 2017] and MAP inference [Fierens et al. 2015]) are also tractable on bounded-treewidth circuits.

We can then use this result to lift the combined tractability of provenance computation for CFG-Datalog on bounded-treewidth databases to probabilistic evaluation. Indeed, the treewidth of the constructed provenance cycluit is linear in the size of the Datalog query. However, we cannot directly apply our transformation from bounded-treewidth Boolean circuits to d-SDNNFs, because here we have a cycluit. Hence the first step is to transform this cycluit into a circuit, while preserving

a bound on the treewidth of the result. We show that a cycluit of treewidth k can be converted into an equivalent circuit whose treewidth is singly exponential in k, in FPT-linear time (parameterized by k). We can then apply our transformation to this circuit, which in the end shows that we can compute a d-SDNNF representation of the provenance of a CFG-Datalog query Q of bounded rule-size on an instance I of bounded treewidth with a complexity linear in the data and doubly exponential in the query. This d-SDNNF then allows us to solve PQE for Q on I with the same complexity. While the complexity in the query is not polynomial time, we consider that it is a reasonable one, given that our language of CFG-Datalog is quite expressive (at least, much more so than the very restricted classes of CQs that we study in Chapter 2).

Second, we study lower bounds on how eﬃciently we can convert from width-based to structure-based classes. Our bounds already apply to a weaker formalism of width-based circuits, namely monotone conjunctive normal form (CNF) and disjunctive normal form (DNF) formulas of bounded width. We connect the pathwidth of CNFs/DNF formulas with the minimal size of their OBDD representations by showing that any OBDD for a monotone CNF or DNF must be of width exponential in the pathwidth of the formula, up to factors depending in the formula arity (maximal size of clauses) and degree (maximal number of variable occurrences). Because it applies to any monotone CNF/DNF, this result generalizes several existing lower bounds in knowledge compilation that exponentially separate CNFs from OBDDs, such as [Devadas 1993] and [Bova and Slivovsky 2017, Theorem 19]. We also prove an analogue for treewidth and (d-)SDNNFs: again up to factors in the formula arity and degree, any d-SDNNF (resp., SDNNF) for a monotone DNF (resp., CNF) must have size that is exponential in the treewidth of the formula.

To prove our lower bounds, we rephrase pathwidth and treewidth to new notions of pathsplitwidth and treesplitwidth, which intuitively measure the performance of a variable ordering or v-tree. We also use the disjoint non-covering prime implicant sets (dncpi-sets), a tool introduced in [Amarilli, Bourhis, and Senellart 2016; Amarilli 2016] and that can be used to derive lower bounds on OBDD width. We show how they can also imply lower bounds on d-SDNNF size, using the recent communication complexity approach of [Bova, Capelli, Mengel, and Slivovsky 2016].

We then apply our lower bounds to intensional query evaluation on relational databases. We reuse the notion of intricate queries of [Amarilli, Bourhis, and Senellart 2016], and show that d-SDNNF representations of the lineage of these queries have size exponential in the treewidth of any input instance. This extends the result of [Amarilli, Bourhis, and Senellart 2016] from OBDDs to d-SDNNFs. As in [Amarilli, Bourhis, and Senellart 2016], this result shows that, on arity-2 signatures and under constructibility assumptions, treewidth is the right parameter on instance families to ensure that all queries (in monadic second-order) have tractable d-SDNNF lineage representations.

### d-DNNFs for Safe Queries

As a last contribution, I have studied the connections between the intensional approach and the safe UCQs [Dalvi and Suciu 2012]. We make no assumption on the shape of the input data and only care about data complexity. As we already mentioned, when Q is a union of conjunctive queries (UCQ), a dichotomy result is provided by the celebrated result of [Dalvi and Suciu 2012]: either Q is safe and PQE is PTIME, or Q is unsafe and PQE is #P-hard. The algorithm to compute the probability of a safe UCQ exploits the first-order structure of the query to find a so-called safe query plan (using extended relational operators that can manipulate probabilities) and can be implemented within a RDBMS. This approach is referred to as extensional query evaluation, or lifted inference.

This is diﬀerent from the intensional query evaluation that we have followed so far, where one first computes a representation of the lineage/provenance of the query Q on the database I, and then performs weighted model counting on the lineage to obtain the probability. To ensure that model counting is tractable, we use the structure of the query to represent the lineage in tractable formalisms from the field of knowledge compilation, such as read-once Boolean formulas, free or ordered binary decision diagrams (FBDDs, OBDDs), deterministic decomposable normal forms (d-DNNFs), decision decomposable normal forms (dec-DNNFs), deterministic decomposable circuits (d-Ds), etc. The main advantage of this approach compared to lifted inference is that the provenance can help explain the query answer (this is also true for non-probabilistic evaluation, as provenance is defined for non-probabilistic data). Moreover, having the lineage in a tractable knowledge compilation formalism can be useful for other applications: we can for instance easily recompute the query result if the tuple probabilities are updated, or we can compute the most probable state of the database if we assume that the query is satisfied.

A natural question is to ask if the extensional and the intensional approaches are equally powerful. What we call the q9 conjecture, formulated in [Jha and Suciu 2013; Dalvi and Suciu 2012], states that, for safe queries, extensional query evaluation is strictly more powerful than the knowledge compilation approach. Or, in other words, that there exists a query which is safe (i.e., that can be handled by the extensional approach) whose lineages on arbitrary databases cannot be computed in PTIME in a knowledge compilation formalism that allows tractable weighted model counting (i.e., cannot be handled by the intensional approach). Note that the conjecture depends on which tractable formalism we consider. The conjecture has recently been shown in [Beame, Li, Roy, and Suciu 2017] to hold for the formalism of dec-DNNFs (including OBDDs and FBDDs), which captures the execution of modern model counting algorithms. Another independent result [Bova and Szeider 2017] shows that the conjecture also holds when we consider the formalism of d-SDNNFs. However the status of the conjecture is still unknown for more expressive formalisms, namely, d-DNNFs and d-Ds. Indeed, it could be the case that the conjecture does not hold for such expressive formalisms, which would imply that the reason why PQE is PTIME for safe queries is because we can build deterministic decomposable circuits in PTIME for them.

Results. I have investigated whether we can disprove the conjecture for the classes of d-DNNFs and d-Ds. More specifically, we focus on a class of queries (the so-called H-queries) that was proposed in [Jha and Suciu 2013; Dalvi and Suciu 2012] to separate the two approaches, and that was used to prove the conjecture for dec-DNNFs [Beame, Li, Roy, and Suciu 2017] and d-SDNNFs [Bova and Szeider 2017]. We develop a new technique to build d-DNNFs and d-Ds in polynomial time for the H-queries, based on what we call nice Boolean functions. Because we were not able to prove that this technique works for all safe H-queries, we test this technique with the help of the SAT solver Glucose [Audemard and Simon 2009] on all the H queries up to a certain size parameter, that we generated automatically (amounting to about 20 million nontrivial queries). We found no query on which it does not work, hence we conjecture that our technique can build d-Ds for all safe H-queries. Interestingly, we found a few queries for which we can build d-Ds with a single internal negation at the very top, whereas we do not know if we can build d-DNNFs.

In order to compute all these H-queries, we had to solve a task of independent interest, namely, computing explicitly the list of all monotone Boolean functions on 7 variables, up to equivalence. This task had previously been undertaken by Cazé, Humphries, and Gutkin in [Cazé, Humphries, and Gutkin 2013] and by Stephen and Yusun in [Stephen and Yusun 2014]. We reused parts of the code from [Cazé, Humphries, and Gutkin 2013] and confirmed the number of such functions: 490 013 148.

**Structure of the Thesis**

We first give technical preliminaries in Chapter 1. We then move on to the content of the thesis, starting with our work on the combined complexity of probabilistic graph homomorphism in Chapter 2. We continue in Chapter 3 with the evaluation of CFG-Datalog on bounded-treewidth non-probabilistic instances. Last, we show in Chapter 4 how to lift these results to probabilistic evaluation and also how to derive lower bounds on knowledge compilation formalisms. We then conclude.

The last contribution of my PhD research is not included in the thesis but was published at AMW’2018 [Monet and Olteanu 2018]. Chapters 2 to 4 of this thesis can be read independently, except for Section 4.5 in Chapter 4, which depends on Chapter 3.

#### Relational Databases, Graphs, Hypergraphs

Relational databases. A relational signature σ is a finite set of relation names (or relational symbols) written R, S, T , . . . , each with its associated arity written arity(R) ∈ N. The arity arity(σ) of σ is the maximal arity of a relation in σ. A σ-instance I (or σ-database, or simply instance or database when the signature is clear from context) is a finite set of facts (or tuples) on σ, i.e., R(a1, . . . , aarity(R )) with R ∈ σ (sometimes simply noted R(a)), and where ai is what we call an element. The active domain dom(I) consists of the elements occurring in I, and the size of I, denoted |I|, is the number of tuples that I contains.

Example 1.1.1. Table 1.1 shows an example of relational instance I on signature σ = {R, S, T } with arity(R) = arity(S) = 2 and arity(T ) = 3. The active domain of I is dom(I) = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11} and its size is |I| = 11. C

A subinstance of I is a σ-instance that is included in I (as a set of tuples). An isomorphism between two σ-instances I and I0 is a bijective function f : dom(I) → dom(I0) such that for every relation name R, for each tuple (a1, . . . , aarity(R)) ∈ dom(I)arity(R), we have R(a1, . . . , aarity(R)) ∈ I if and only if R(f(a01), . . . , f(a0arity(R))) ∈ I0. When there exists such an isomorphism, we say that I and I0 are isomorphic: intuitively, isomorphic databases have exactly the same structure and diﬀer only by the name of the elements in their active domains. A homomorphism between two σ-instances I and I0 is a function h : dom(I) → dom(I0) such that for every tuple R(a1, . . . , aarity(R)) ∈ I, we have R(h(a01), . . . , h(a0arity(R))) ∈ I0. Hence, an isomorphism is simply a homomorphism that is bijective and whose inverse is also a homomorphism.

Graphs. In this thesis we consider various definitions of graphs, depending on what we need:

• Let σ be a finite non-empty set of labels. A directed graph with edge labels from σ is a triple G = (V, E, λ) with V a set of vertices, with E ⊆ V 2 a set of R edges, and with λ : E → σ a labeling function. We often write a −→ b for an edge e = (a, b) with label λ(e) = R.

• A directed unlabeled graph G = (V, E) consists of a set V of vertices and a set E ⊆ V 2 of edges, which we write a → b . We can equivalently see a directed unlabeled graph as a directed labeled graph whose set of labels contains only one element.

• An undirected labeled (resp., unlabeled) graph is just like a directed labeled (resp., unlabeled) graph, except that we have a → b ∈ E iﬀ b → a ∈ E. In the unlabeled case, we sometimes write the edges {a, b} ∈ E.

Note that we do not allow multi-edges: there is at most one edge a → b between two elements a and b, and this edge has a unique label λ(e) in the case of a labeled graph. We will sometimes only write “graph” when the kind of graph is clear from context. All the graphs that we consider in this thesis are finite, meaning that V (hence E) is a finite set.

Observe that a directed labeled graph can be seen as a relational instance in which all relational symbols have arity 2. These databases are in fact called graph databases and play an important role in many applications (social media, transportation networks, etc.).

Hypergraphs. Hypergraphs are a generalization of undirected unlabeled graphs where an edge can contain more than two vertices. Formally, a hypergraph H = (V, E) consists of a set V of vertices and a set E ⊆ 2V of hyperedges (or simply edges) which are subsets of V . For a node v of H, we write E(v) for the set of edges of H that contain v. The arity of H, written arity(H), is the maximal size of an edge of H. The degree of H, written degree(H), is the maximal number of edges to which a vertex belongs, i.e., maxv∈V |E(v)|. As with graphs, all hypergraphs that we will consider are finite. Given a σ-instance I on signature σ, we define the hypergraph associated to I as the hypergraph H = (V, E), whose set of vertices V is dom(I) and such that for every subset e of dom(I), e is in E if and only if I has a fact containing exactly the elements of e.

**Table of contents :**

**General Introduction **

Limits of Combined Tractability of PQE

Fixed-Parameter Tractability of Provenance Computation

From Cycluits to d-DNNFs and Lower Bounds

d-DNNFs for Safe Queries

Structure of the Thesis

**1 Background and General Preliminaries **

1.1 Relational Databases, Graphs, Hypergraphs

1.2 Query Languages

1.3 Tuple-Independent Databases

1.4 Complexity Classes

1.5 Trees, Treewidth, Pathwidth

1.6 Tree Automata and Tree Encodings

1.7 Provenance and Knowledge Compilation Circuit Classes

**2 Limits of Combined Tractability of PQE **

2.1 Introduction

2.2 Preliminaries on Probabilistic Graph Homomorphism

2.3 Disconnected Case

2.4 Labeled Connected Queries

2.5 Unlabeled Connected Queries

**3 Fixed Parameter Tractability of Provenance Computation **

3.1 Introduction

3.2 Approaches for Tractability

3.3 Conjunctive Queries on Treelike Instances

3.4 CFG-Datalog on Treelike Instances

3.5 Translation to Automata

3.6 Provenance Cycluits

3.7 Proof of Translation

**4 From Cycluits to d-DNNFs and Lower Bounds **

4.1 Introduction

4.2 Preliminaries on Tree Decompositions

4.3 Upper Bound

4.4 Proof of the Upper Bound

4.5 Application to PQE of CFG-Datalog

4.6 Lower Bounds on OBDDs

4.7 Lower Bounds on (d-)SDNNFs

4.8 Application to Query Lineages

**Conclusion and Perspectives **

Summary

Open Questions and Directions for Future Work

Perspectives

Appendix: résumé en français

Limites de la tractabilité combinée de PQE

Tractabilité à paramètre fixé du calcul de provenance

Des cycluits aux d-DNNFs et bornes inférieures

Structure de la thèse

Self-References

Other References