Query Evaluation Algorithms for Full XPath

Get Complete Project Material File(s) Now! »

Formal Semantics of XPath 3.0 in λXP

We introduce λXP, a typed first-order logical language for querying data trees based on first principles, supporting path navigation, higher-order functions, recursion, aggregates, tuples, lists, strings, and numbers.
To the best of our knowledge, the language λXP is diﬀerent to previous higher-order query languages [Benedikt 2015], in that λXP relies on a mode systems which permits to distinguish logical subformulas for which solutions must be searched, and functional subformulas that can be evaluated deterministically such as functional programs.
The purpose of λXP in this thesis is to serve as a core language for implementing evaluators for XPATH 3.0. Therefore, we show how to compile all of XPATH 3.0 to λXP with one minor exception. The compiler can be understood as a compact for-malization of what is described informally in the specification of XPATH 3.0. Even though formal, we believe that some of the details of the XPATH 3.0 specification by the W3C are easier to understand through our compiler to λXP.

Early Query Answering by Early Nwas

Queries in the navigational fragment of λXP can be compiled to nested work au-tomata (NWAs) as shown in [Gauwin 2009b], under the assumption that there are no backwards axis.
NWAs are the natural kind of pushdown machines for processing nested words such as XML streams [Alur 2004]. They can be determinized in contrast to standard pushdown automata, basically since they have the same expressiveness as tree au-tomata. Furthermore, EQA for queries defined by deterministic NWAs can be done in cubic time per event [Gauwin 2009b, Madhusudan 2009]. This is an interesting result, but since the resulting algorithm always requires cubic time in practice, this is by far too slow. Furthermore, determinism is assumed, which must be obtained by static determinization, which may lead to an exponential explosion – which does arise for many practically relevant queries – and even before the cubic time algorithm is applied. We propose an approximation of EQA for navigational λXP queries, which does not require static determinization. For this we introduce so called early NWAs, which besides final states have selection and rejection states. Whenever a selection (rejection) state is reached, any continuation of the nested word will be accepted (rejected). We then show how to introduce appropriate selection and rejection states when compiling navigational λXP queries to early NWAs. What we obtain is an approximation of EQA, since it may still be the case that some candidate is safe for selection even though the early NWA is not in a selection state. But deciding whether an early NWA is in a selection state is much easier that deciding the aliveness of candidates, since the latter depends on its stack and its state. Indeed, the approximation is indeed exact for all positive navigational XPATH queries without valid or unsatisfiable subfilters as shown in [Lick 2013]. Our streaming algorithm for navigational λXP queries runs the corresponding early NWA for all possible candidates, while performing on-the-fly determinization. A candidate is created and buﬀered only if it does not go into a selection or rejection state. A candidate from the buﬀer is output when the state of the candidate is a selection state, and discarded when it is a selection or rejection state.
In order to avoid inspecting all candidates at all events, we propose a stack and state sharing algorithm. The idea is that we can share the computation of any two candidates in the same state, so that the running time does not depend on the number of buﬀered candidates, but only on the number of states of the deterministic NWA created by on-the-fly determinization. Our streaming algorithm with stack-and-state sharing for answering queries by early NWAs is original and nontrivial, and enables tight upper bounds for time and space complexity that we prove.

Mapping to Nested Words and Back to Xml Documents

From an XML data tree we can produce a nested word by the linearization of data trees as in Section 2.2.2. This requires abstraction to maximal unary signature XML data trees obtained by concatenation of data values using a special separation symbol. Such nested words can be interpreted as XSLT programs. The nested word obtained from the XML data tree in Figure 2.1 is given in Figure 2.7. In turn it can be interpreted as a representation of the XSLT program in Figure 2.8. Indeed, this XSLT program evaluates to the original XML document in Figure 2.6.
We summarize the correlations XML documents, data trees, and nested words via the bijections depicted in Figure 2.9. It allows to identify the nested word for some XML document D by nw (D) = lin(relconcat (data-model(D))).

Table of contents :

1 Introduction
1.1 Xml Processing based on XPath
1.1.1 Schema Validation
1.1.2 Query-based Transformation
1.1.3 XPath
1.1.4 Regular Extension of XPath
1.1.5 In Memory Evaluation
1.1.6 Streaming Evaluation
1.2 Open Problems
1.2.1 Low Coverage
1.2.2 Low Latency
1.2.3 Low Time Efficiency
1.2.4 Lack of Formal Semantics
1.3 Contributions
1.3.1 Formal Semantics of XPath 3.0 in λXP
1.3.2 Early Query Answering by Early Nwas
1.3.3 Networks of Early Nwas
1.3.4 Projection for Nwas
1.3.5 Implementation and Experiments
1.4 Organization
1.5 Publications
2 Preliminaries
2.1 Data Trees
2.1.1 Definition
2.1.2 Logical Structure
2.1.3 Finite State Abstractions
2.1.4 Json Trees
2.2 Nested Words
2.2.1 Definition
2.2.2 Linearization of Data Trees
2.3 Xml Data Trees
2.3.1 Xml Data Model
2.3.2 Xml Data Model Specifics
2.3.3 Mapping to Nested Words and Back to Xml Documents
2.4 Nested Word Automata
2.4.1 Definition
2.4.2 Runs of Nested Words
2.4.3 Runs on Marked Trees
2.4.4 Determinization of NWAs
2.5 Recursive Functions
2.5.1 Partial Orders and Fixed Points
2.5.2 Lambda and Letrec Notation
I A Formal Semantics of XPath 3.0
3 Navigational XPath 3.0
3.1 Overview
3.2 Grammar
3.2.1 Ebnf
3.2.2 Expressions
3.2.3 Parse Tree
3.3 Path Expressions
3.3.1 Forward and Backward
3.3.2 Label Tests
3.3.3 Kind Tests
3.3.4 Relative, Absolute, and Abbreviated Expressions
3.3.5 Filters
3.3.6 Compositions
4 Types and Values
4.1 Sequence Type Expressions
4.2 Basic Types
4.2.1 Atomic Types
4.2.2 Union Types
4.2.3 Node Types
4.2.4 Function Types
4.3 Sequence Types
4.4 Subtyping
4.4.1 Item Types
4.4.2 Sequence Types
4.5 Dynamic Type Checking
5 Full XPath 3.0
5.1 Grammar
5.1.1 Basic Expressions
5.1.2 Navigational Expressions
5.1.3 Postfix Expressions
5.1.4 Terminal Expressions
5.1.5 Parse Tree
5.2 Values
5.2.1 Strings and Numbers
5.2.2 Booleans
5.2.3 Functions
5.2.4 Sequences
5.2.5 Conversions
5.3 Positions
5.4 Navigation
5.4.1 Path Expressions
5.4.2 Map Operator
5.5 First-Order Connectives
5.5.1 Arithmetics
5.6 Data Comparisons
5.6.1 Atomic
5.6.2 Nodes
5.6.3 Sequences
5.7 Ordered Sets
5.7.1 Duplicate-free Sequences
5.7.2 Computing Sets when possible
5.8 Error Handling
5.8.1 Kinds of Errors
5.8.2 Try and Catch Expressions
5.9 Documents
5.10 Regular Extension of XPath 3.0
5.11 XPathMark Benchmark
6 λXP
6.1 Types and Values
6.1.1 Typing Parameters
6.1.2 Atomic Types
6.1.3 Types
6.1.4 Subtyping
6.1.5 Equality Types
6.1.6 Admissible Types
6.2 λXP Queries
6.2.1 Philosophy of Sets and Functions
6.2.2 Constants
6.2.3 Syntax
6.2.4 Positive Contexts
6.2.5 Type System
6.2.6 Library Functions
6.2.7 Recursive Functions
6.2.8 Regular Axes
6.2.9 Datalog
6.2.10 Data Comparisons
6.3 Semantics
6.3.1 Complete Partial Orders
6.3.2 Denotational Semantics
6.3.3 Well-definedness
6.3.4 Correctness
6.4 Safety Restrictions
7 XPath to λXP
7.1 Instantiation of λXP
7.2 Library functions
7.2.1 Access to Node Labels
7.2.2 Lists of Nodes
7.2.3 String Data Value of Nodes
7.2.4 Atomization of Sequences
7.2.5 Numerics
7.2.6 Booleans
7.2.7 Strings
7.2.8 EQNames
7.2.9 Positions
7.2.10 Roots and Input Documents
7.3 Compiling the XPath 3.0 Grammar to λXP
7.3.1 Navigational XPath Expressions
7.3.2 Arbitrary XPath Expressions
7.3.3 Ordering Mode of XQuery
II Early Query Answering for Navigational XPath
8 Early Nested Word Automata
8.1 Introduction
8.2 Automata Queries
8.2.1 Tuple Selection
8.2.2 Monadic Queries
8.2.3 Queries by Automata
8.2.4 Earliest Query Answering
8.3 Early Nested Word Automata
8.4 Fxp
8.5 Compiler from FXP to Early Nested Word Automata
8.5.1 ENWA Descriptors
8.5.2 When Variables Must be Bound
8.5.3 Construction of ENWA Descriptors
8.5.4 Construction of the eNwa descriptor DF (E)
8.5.5 Correctness
8.5.6 Size of Automata Descriptors
8.5.7 Time to Compute Automata Descriptors
8.6 Early Query Answering
8.6.1 On-the-fly Instantiation and Determinization
8.6.2 Streaming Algorithm for Deterministic ENWAs
8.6.3 Adding Stack-and-State Sharing
8.7 QuiXPath tool
8.7.1 Implementation, Tools, and Applications
8.7.2 Benchmarks
8.7.3 Performance Comparison
8.7.4 Performance Analysis
8.7.5 Detailed Analysis
9 Projection
9.1 Introduction
9.2 A Nested Word Automaton
9.3 Projection NWAs
9.3.1 Projected Nested Words
9.3.2 Projection Nested Word Automata
9.3.3 Examples of PNwas and Runs
9.3.4 Evaluation of Projection Nwas
9.4 Irrelevant Labels and Prefixes of Nested Words
9.5 Projection from Nwas to PNwas
9.6 Node Selection
9.7 Experiments
III Query Evaluation Algorithms for Full XPath
10 In-Memory Evaluation of λXP Queries
10.1 Simplification of λXP Queries
10.1.1 Simple Queries
10.1.2 Decomposition into Networks of Navigational Queries
10.1.3 Backward Axes Elimination
10.2 Evaluation Algorithm
10.2.1 Specification
10.2.2 Implementation
10.2.3 Examples
11 Streaming Evaluation of λXP Queries
11.1 Streaming Evaluators
11.1.1 Restrictions
11.1.2 Linearizing a Sequence of Trees
11.1.3 Updates
11.1.4 Extended Types
11.1.5 Open Values
11.1.6 Registrations
11.1.7 Program and Query Evaluators
11.1.8 Running Evaluators over Streams
11.2 Evaluators
11.2.1 Generic Functions
11.2.2 Inductive Construction of Evaluators
11.3 Examples
11.3.1 Navigational Queries
11.3.2 A Network of Queries
11.4 Tuple Sharing
11.5 Experiments
12 Conclusion and Outlook
A XPathMark Benchmark
A.1 Downward Queries
A.2 Axis Queries
A.3 Comparison Queries
A.4 Aggregation Queries
A.5 Position Queries
A.6 Closure Queries
B XPath to λXP Compiler
B.1 Basic Expressions
B.1.1 XPath expressions
B.1.2 Nonterminals to define inline function expressions
B.1.3 Sequence constructions
B.1.4 Single expressions
B.1.5 Sequence decompositions
B.1.6 Let expressions
B.1.7 Quantified expressions
B.1.8 First-Order expressions
B.1.9 Comparisons
B.1.10 Concatenation of strings
B.1.11 Range expressions
B.1.12 Arithmetic Expressions
B.1.13 Ordered Sets
B.1.14 Type checks and casts
B.1.15 Unary Expressions
B.1.16 Value expressions
B.1.17 Comparison operators
B.1.18 Simple Map expressions
B.2 Navigational Expressions
B.2.1 Absolute path expressions
B.2.2 Relative path expressions
B.2.3 Steps
B.2.4 Axis steps and Filters
B.2.5 Axis Steps
B.2.6 Node Tests, Name Tests, and Wildcards
B.3 Postfix expressions
B.3.1 Postfix expressions
B.3.2 Argument Lists and Predicates
B.3.3 Primary expressions
B.3.4 Literals
B.3.5 Variables
B.3.6 Parenthesized expressions and the empty list
B.3.7 Context Item, Position, and Size
B.3.8 Function Calls
B.3.9 Functions
B.3.10 Simple Types
B.4 Sequence Type Expressions
B.4.1 Top level Sequence Type Matching
B.4.2 Item types
B.4.3 Atomic and union types
B.4.4 Kind Tests
B.4.5 Node and Type Names
B.4.6 Function Tests
B.4.7 EQNames
B.5 Terminal Expressions
Bibliography