Get Complete Project Material File(s) Now! »

## Small Deterministic Automata for Regular Path Queries

The only previous algorithms for CQA on streams without approximation [Gauwin et al. 2009] applies to queries defined by deterministic NWAs. The first reason why this algorithm could not be applied in practice is related to the determinization algorithm for NWAs from [Alur & Madhusudan 2009, Debarbieux et al. 2015], see also Section 2.4.3. The NWAs obtained from the navigational XPath queries of the XPathMark benchmark with the compiler from [Debarbieux et al. 2015, Sebastian 2016] are often not deterministic and can not be determinized with this algorithm neither. For simple queries of the benchmark, the determinization algorithm produces huge automata, while for others its termination could not be observed after few hours.

Our first contribution that we published in [Boneva et al. 2020] shows that all navigational forward XPath queries from the XPathMark benchmark can be compiled to small deterministic NWAs nevertheless. The trick is to use stepwise hedge automata (SHAs) as intermediates. These are a variant of stepwise tree automata [Carme et al. 2004] that we introduce. SHAs are a mixture of standard tree automata and finite state automata on words (Nfas), avoiding the trouble with the notion of determinism of the quite different notion hedge automata from [Comon et al. 2007, Murata 2000, Thatcher 1967] pointed out in [Martens & Niehren 2007] (see also Section 3.3.3). SHAs have a natural notion of bottom-up and left-to-right determinism, mixing the notions of bottom-up determinism of tree automata with the notion of left-to-right determinism for Nfas. In contrast to NWAs, however, they do not support any form of top-down determinism. It turns out that the determinization algorithm for NWAs from [Alur & Madhusudan 2009, Debarbieux et al. 2015] behaves very badly for NWAs that do nontrivial work in a top-down manner. What this means can be formalized syntactically by not having the single-entry property. In contrast, all NWAs obtained by compilation from SHAs have the single-entry property (up to minor details). A different compiler to that from [Debarbieux et al. 2015] for mapping path queries to NWAs can be obtained by compiling them in a first step to SHAs as intermediates, and then compiling the SHAs obtained to NWAs. The NWAs obtained this way have the single-entry property. Using this different compiler indeed solves the problem, since the the NWAs obtained thereby can be determinized in practice by the algorithm [Alur & Madhusudan 2009, Debarbieux et al. 2015]. An alternative solution can be obtained by determinizing the SHA obtained from the path queries directly, and then compiling them to NWAs while preserving the determinism. This alternative solution has the advantage that it permits to apply unique minimization to the deterministic SHAs, while unique minimization is not available for deterministic NWAs [Alur et al. 2005]. A further difficulty that we needed to overcome here is worth mentioning. The NWAs used by [Debarbieux et al. 2015] are symbolic, using descriptors for the complex labels of Xml nodes. These descriptors needed to eliminated before eterminization, increasing the size of the NWAs considerably. We show in this thesis that we can compile path queries to NWAs with else rules, by adapting the previous compiler. We then lift the determinization algorithm of [Alur & Madhusudan 2009, Debarbieux et al. 2015] to NWAs with else rules, thus avoiding any size increase all over. See Section 2.4.3 for further details.

### Efficient CQA Algorithm on Streams

The second problem with the approximation-free CQA algorithm on Xml streams from [Gauwin et al. 2009] is its high polynomial time complexity. Given a query defined by a deterministic NWA this CQA algorithm requires quadratic time per event in the size of the NWA.

Now that we have deterministic NWAs of small size for the queries of interest, one can hope to make this algorithm run in practice. The quadratic running time, however, is still too big. To see this we note that the sizes of the deterministic automata for the queries A1-A8 for the XPathMark benchmark is roughly between 400 and 2500. Therefore quadratic factor per event will be between 4002 = 160:000 and 62:500:000:000, which clearly is too big.

We contribute in Chapter 4 a new algorithm that we did not yet submit to a conference. This algorithm can find the certain query answers of a monadic query defined by a deterministic SHA in combined linear time in the size of the automaton and the stream, plus a polynomial time depending of the number certain query answers. The per-event time of our streaming algorithm is linear in the number of states of the SHA, if the polynomial time for the actual creation of the certain

queries answers is taken apart. The number of states for the automata obtained for the XPathMark benchmark is between 38 and 124, so the factor per event is reasonably small for our new algorithm. The shift of the model of deterministic automata from NWAs to SHAs is one of the keys that led us to the finding of the new algorithm for CQA, the first efficient algorithm without approximation. It makes us believe that CQA may be feasible in practice, in contrast to what we believed before. This conjecture still needs to be proven experimentally. We implemented a prototype of our algorithm, but it needs more work to become competitive with the best existing streaming algorithm for XPath queries from [Sebastian 2016]. In particular, we need to develop and integrate a projection technique for SHAs for replacing those for NWAs [Sebastian & Niehren 2016].

#### Complexity of CQA on Hyperstreams

As the third contribution, we determine the complexity of the decision version of CQA problem for hyperstreams [Boneva et al. 2019]. We show that CQA is Exp-complete, if queries are represented by nondeterministic SHAs, and Pspacecomplete in the case of SHAs. We also obtain a positive result for linear hyperstreams without compression, where CQA is in PTime.

Establishing the lower bounds is not very difficult, when knowing the complexity classes of standard automata problems, in particular of the universality problem and of tree automata and of finite state automata. Also the nonemptiness problems for a number of automata is relevant here. For the upper bounds some more work is needed. A hyperstream can be identified with a compressed pattern P for data hedges, whose variables do also denote data hedges. Let Inst(P) be the set of all completions of hyperstream P, that is the set of ground instances of the pattern. We first reduce CQA for general queries to CQA for Boolean queries. For the latter, CQA can be identified with the problem of regular pattern inclusion, i.e. whether Inst(P) L(A) for a given compressed pattern for unranked trees P and an SHA A with language L(A). Via determinization regular pattern matching can be reduced to the problem of regular pattern matching, i.e. whether Inst(P)\L(A) 6= ;. What is more tedious is to deal with the unrankedness of data hedges, and that pattern variables also match data hedges and not only unranked trees. The automata for the data hedges are then taken from the class of SHAs. The general idea to get rid of the unrankedness is to use a reduction to the case of ranked trees. But then variables for hedges have to be replaced by variables for contexts. Furthermore, one has to deal with the fact, that not all ranked trees are encodings of unranked trees. We do so by considering an generalized CQA problem, where one can express regular constraints on the values of the variables.

**An Approximation Algorithm for CQA on Hyperstreams**

The last contribution is the first algorithm that approximates CQA on hyperstreams for queries defined by SHAs. Our algorithm is in polynomial time for compressionfree hyperstreams, and correct in that it computes only certain query answers at any event. It applies to the general case, where the CQA problem is Exp-complete. However when restricted to the simpler case where the hyperstreams are linear and compression-free and that SHAs are deterministic, it is complete in that it outputs all certain query answers at every event.

As a first approximation we make the hyperstreams linear by replacing all occurrences of free variables in the hyperstream by fresh variables. In the third approximation we make hyperstreams compression free, by replacing its bound variables by fresh bound variables. For the second approximation step, we introduce the notion of strong certainty by exploiting the accessibility relation of the SHAs S defining the query, mainly in the same manner than in the efficient streaming algorithm for SHAs. The idea is to distinguish sets of states q of S that are safe for a hyperstream in that S must reach some final state when starting with q for all completions of the hyperstream. For instance, consider the hyperstream X = Y P where Y is a free variable and P a pattern. If Q is safe for P then the set of states that are safe for X is: safe(Q) = fq j not exists q0 62 Q: acc(q; q0)g.

**Table of contents :**

Notations

**1 Introduction **

1.1 Communication over Streams

1.1.1 Complex Event Processing

1.1.2 Query Languages

1.1.3 Certain Query Answering (CQA)

1.1.4 Quality Criteria

1.2 Problem: CQA on Hyperstreams

1.2.1 Hyperstreams

1.2.2 Finding Certain Query Answers

1.3 Contributions

1.3.1 Small Deterministic Automata for Regular Path Queries

1.3.2 Efficient CQA Algorithm on Streams

1.3.3 Complexity of CQA on Hyperstreams

1.3.4 An Approximation Algorithm for CQA on Hyperstreams

1.4 Further Related Work

1.5 Publications

1.6 Organization of the thesis

**2 Preliminaries **

2.1 Words, Trees, Nested Words and Languages

2.1.1 Words

2.1.2 Trees

2.1.3 Nested Words

2.2 Patterns

2.2.1 Definition

2.2.2 Identification to Logical Structures

2.3 Queries

2.3.1 V-Structures and Sequenced V-Structures

2.3.2 Certain answers and non-answers

2.4 Automata and Regular Expressions

2.4.1 Word Automata

2.4.2 Stepwise Tree Automata (STAs)

2.4.3 Nested Word Automata (NWAs)

2.4.4 Nested Regular Expressions

2.5 FXP

**3 Small Deterministic Automata for Navigational Queries **

3.1 Introduction

3.2 From FXP to Nested Regular Expressions

3.3 Stepwise Hedge Automata (SHAs)

3.3.1 Evaluation on Nested Words

3.3.2 Relation to STAs and NWAs

3.3.3 Determinization

3.3.4 Completeness and Pseudo-Completeness

3.3.5 Universality and Intersection Nonemptiness Problems

3.4 Compiler from nRegExp to SHAs

3.5 Reducing the size of (d)SHAs

3.5.1 Symbolic Apply Rules

3.5.2 Cleaning Methods for Determinized SHAs

3.6 Experimental Results for XPath Queries

**4 Certain Query Answering on Streams **

4.1 Modeling Streams of Hedges

4.1.1 … As String Patterns With Parentheses

4.1.2 … As Nested Patterns

4.2 About CQA Algorithms on Streams

4.3 A Streaming Algorithm for Boolean CQA

4.4 Certain Query Answering for Monadic Queries

4.4.1 Position-annotated patterns

4.4.2 Main differences with Boolean CQA

4.4.3 Description of the Algorithm

4.4.4 Correctness and Complexity of the Algorithm

4.5 Experiments

**5 Hyperstreams and Certain Query Answering **

5.1 Hyperstreams

5.1.1 Hyperstreams of Nested Words

5.1.2 Hyperstreams of Ranked Trees With Context Variables

5.2 Certain Query (Non) Answering on Hyperstreams

5.2.1 Definitions

5.2.2 From the Non-Boolean Cases to the Boolean Cases

**6 Complexity of Certain Query Answering **

6.1 Introduction

6.2 -Algebras

6.3 Inhabitation for Tree Automata

6.3.1 Tree Automata

6.3.2 Intersection NonEmptiness

6.3.3 Tree Inhabitation

6.3.4 Context Inhabitation

6.4 Evaluation of Compressed Tree Patterns over Ntas

6.5 Regular Matching and Inclusion

6.5.1 Lower Bounds

6.5.2 Upper Bounds

6.6 Adding Regular Constraints

6.7 Encoding Patterns for Unranked Trees

6.8 Linearity Restriction

**7 Approximating CQA on Hyperstreams **

7.1 Transitions for SHAs

7.2 Eliminating Hard Constraints: Linear Certainty

7.3 Safety Approximations

7.3.1 Safety Approximation by Accessibility

7.3.2 Safety Approximation by Accessibility and Self Loops

7.4 Strong Certainty

7.4.1 Parameterized Strong Certainty

7.4.2 Examples of Concrete Strong Certainty

7.5 Outlook

**8 Conclusion **

**Appendices**

.1 Navigational Forward XPath Queries of [Franceschet ]

.2 Additional Forward XPath Queries

.3 Deterministic NWAs for the expression ch(a + b)

**Bibliography**