A Query Engine for Probabilistic XML

Get Complete Project Material File(s) Now! »

Probabilistic XML Query Lineage: Boolean Projection of the Query

A Boolean query over a deterministic document returns either true or false. A Boolean query 𝑄 over a p-document P returns the probability that 𝑄 is true in P, that is, the sum of the probabilities of all possible documents of P where 𝑄 is true. A fundamental observation [50], due to the locally monotone nature of the query language, is that the probability of a query over a p-document is the probability that one of the matches of the query over the underlying deterministic document remains in P. We will use this fact to transform our probabilistic XML querying problem into a probability computation problem over a propositional formula in disjunctive normal form (DNF).

Complexity of Query Evaluation: Problem Statement

It is easy to see that any DNF formula can be generated as the lineage of even a trivial XPath query such as //A [34]. The probability 𝑝 of a propositional formula 𝜙 in DNF denotes the probability that a random, independent, truth assignment to the variables (or events) succeeds to satisfy the formula 𝜙. If there are 𝑛 Boolean variables, then there are 2𝑛 possible assignments.
The problem of counting the exact number of solutions to a DNF or its probability is known to be #P-hard in general, even when all events have the same probability [56]. This implies in particular that there is no polynomial time algorithm for the exact solution if 𝑃 ̸= NP. Note that #P is an intractable complexity class which contains NP [56].
In the context of a p-document of PrXMLcie , this enumeration of assignments is conceptually equivalent to an evaluation of queries in each possible world or ordinary document, of the p-document following the distribution of events. The sum of probabilities of possible worlds {P1, …,P𝑛} that satisfy the query (i.e., a clause in the probabilistic lineage DNF is satisfied), is the probability value that answers the stated problem.
However, evaluation of DNF probability is commonly resolved by an approximation scheme, especially when an estimation of this quantity is sufficient in practice. For example, an additive approximation following a basic Monte Carlo sampling can be very effective and leads to good accuracy. Nevertheless, the na¨ıve Monte Carlo performs badly for the case where the number of satisfying assignments to a DNF is exponentially small, with respect to the total (huge) number of possible assignments. From an approximation point of view, we would need exponentially many samples from the set of possible assignments to get a given number of successes.
Instead of picking assignments uniformly at random and testing each clause, we could sample from the set of assignments that satisfy at least one clause (but not uniformly). Many strategies for sampling only the important space were introduced from which we cite the algorithms of Karp, Luby, and Madras in [32]. Also, when the number of literals per clause is constant, Ajtai and Wigderson [7] provide an (𝜀, 𝛿 = 0) approximation algorithm that has a run time polynomial in the approximation parameters and the input DNF length. Given a tolerated error 𝜀 on the result, and a reliability factor 𝛿 for the approximation, we can thus find in the literature a number of proposed fully-polynomial randomized approximation schemes, or FPRAS, for the DNF probability problem. Still, a FPRAS will always have a run time polynomial in the DNF size, the error (1/𝜀), and, the reliability factor (log 1/𝛿). This fact leaves the problem open for big DNFs, where highly correlated clauses may even break out more difficulties.

Analyzing and Decomposing the DNF

Our compilation tree is based on three operations (independence detection, factorization, inconsistency detection) that are repeatedly applied on the original formula. All three operations yield trees whose leaves are formulas in DNF and whose internal nodes are Boolean operations (independent disjunction ∨, independent conjunction ∧, and mutually exclusive disjunction +) that support efficient probability computations.

Table of contents :

Introduction
1 Preliminaries
1.1 Probabilistic XML
1.1.1 Probabilistic XML Models
1.1.2 From PrXMLmux,ind to PrXMLcie
1.2 Querying Probabilistic XML
1.2.1 Supported Queries
1.2.2 Probabilistic XML Query Lineage: Boolean Projection of the Query
1.2.3 Probabilistic XML Query Lineage: Set of Answers
1.2.4 Complexity of Query Evaluation: Problem Statement
2 Related Work
2.1 Relational Probabilistic Databases
2.1.1 SPROUT in MayBMS
2.2 Probabilistic XML
2.2.1 EvalDP
2.2.2 DNF Probability Estimation
2.2.3 Systems
3 Algorithms and Cost Models
3.1 Exact Computation
3.1.1 Possible Worlds (Na¨ıve Algorithm)
3.1.2 Inclusion–Exclusion (Sieve)
3.2 Approximation Algorithms
3.2.1 Additive Monte-Carlo
3.2.2 Multiplicative/Biased Monte-Carlo
3.2.3 Self-Adjusting Coverage Algorithm
4 Computation Strategies
4.1 Compiling The Probabilistic Lineage
4.1.1 Analyzing and Decomposing the DNF
4.1.2 From a Compilation Tree to an Evaluation Tree
4.2 Propagation of Approximation Parameters
4.2.1 Propagation Between Approximated Nodes
4.2.2 Propagation Between Exact and Approximated Nodes
4.2.3 Combining Additive and Multiplicative Approximations
4.2.4 Propagation of Error Bounds: Conclusions
4.2.5 Propagation of Approximation Guarantee 1 − 𝛿
4.3 Exploring the Space of Possible Evaluation Plans
4.3.1 A Randomized Approach
4.3.2 The Deterministic Case
4.3.3 Deterministic vs. Randomized Exploration
5 ProApproX: A Query Engine for Probabilistic XML
5.1 System Overview
5.1.1 Implementation
5.1.2 Query Processing
5.2 ProApproX 2.0
5.3 Earlier Version
6 Experiments 61
6.1 Experimental Setup
6.1.1 Data and Queries
6.1.2 Methodology
6.2 Comparison with EvalDP
6.3 Performance over the Movies Dataset
6.4 Comparison with SPROUT
6.5 Performance over Synthetic Data
Conclusions
Bibliography