Properties of Selectivity Estimation Techniques


Query Evaluation Strategy

The strategy used to evaluate an XPath query may affect both the size of the buffering space B and the processing time; in the worst case, B may reach the document size |D|. For example, consider the document D and the query Q: //A[./F]/C shown in figure 1.5. In the so-called lazy approach, B = n, which here is of the order of |D|, since the predicate of A is not evaluated until </A> arrives. All nodes from C1 to Cn must therefore be buffered, which increases the buffering size remarkably. In the so-called eager approach, B = 0 because the predicate of A is evaluated to true the moment element <F> arrives. Each <Ci> can thus be flushed as a query result the moment it arrives and never needs to be buffered, which clearly improves the buffering space performance.
[Figure 1.5: the query Q = //A[./F]/C, the XML document D (a root A with a child F and children C1, ..., Cn), and the summary of D: A (1), F (1), C (n).]
The performance prediction model should be time and space efficient no matter which query evaluation technique is used. In the example of figure 1.5, a possible way to reduce the buffering space of the lazy approach is to evaluate Q on the summary of D. The number in brackets to the right of each element (in the summary of D) represents its frequency. By applying the lazy approach to the summary, we buffer only one element C together with its frequency (n), and the number of matches we obtain is the value of n.
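To make the contrast concrete, the following minimal sketch (our own illustration, not the thesis's algorithm) replays the scenario of figure 1.5 over a SAX-like event stream and reports the peak buffer size B and the number of matches under both strategies. It assumes a single A element whose F child arrives before the C children, as in the figure:

# Lazy vs. eager evaluation of Q = //A[./F]/C on a stream of (kind, tag) events.
def evaluate(events, eager):
    buffered, peak, matches = [], 0, 0
    predicate_holds = False              # truth of the predicate [./F] of A
    for kind, tag in events:
        if kind == "start" and tag == "F":
            predicate_holds = True       # eager: predicate known true from here on
        elif kind == "start" and tag == "C":
            if eager and predicate_holds:
                matches += 1             # flush <C> immediately: B stays 0
            else:
                buffered.append(tag)     # lazy: hold every <C> until </A>
        elif kind == "end" and tag == "A":
            if predicate_holds:
                matches += len(buffered) # lazy: flush buffered <C> nodes at </A>
            buffered.clear()
        peak = max(peak, len(buffered))
    return peak, matches

n = 1000
doc = ([("start", "A"), ("start", "F"), ("end", "F")]
       + [("start", "C"), ("end", "C")] * n
       + [("end", "A")])
print(evaluate(doc, eager=False))   # (1000, 1000): B = n in the lazy approach
print(evaluate(doc, eager=True))    # (0, 1000):    B = 0 in the eager approach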

Evolution and Data Set Updating

When the underlying XML document (data set) D is updated, i.e. some elements are added or deleted, the performance prediction model should be able to model precisely the change in the cost for a given query Q on D (without the need to rebuild its « tables »). All the above challenges to performance prediction will be addressed in later chapters.

Properties of Selectivity Estimation Techniques

The design and choice of a particular selectivity estimation technique depend on the problem being solved with it; the technique must therefore be constructed with the needs of that particular problem in mind [Aggarwal 2007].
In general, we would like to construct the synopsis structure in such a way that it has wide applicability across broad classes of problems. In our work, applicability to streams of XML data makes the space and time efficiency of construction critical.
When looking for an efficient, capable (general enough) and accurate selectivity estimation technique for XPath queries, there are several issues that need to be addressed. Some of these issues can be summarized as follows:
• It must be practical: in general, one of the main uses of selectivity estimation techniques is to accelerate the query evaluation process. Thus, while theoretical guarantees are important for any proposed approach, practical considerations matter even more. The performance characteristics of the selectivity estimation process are a crucial aspect of any approach: estimating the selectivity of a query or sub-query must be much faster than actually evaluating it. In other words, the cost savings obtained on the query evaluation process by using the selectivity information must exceed the cost of performing the selectivity estimation itself. In addition, the summary structure(s) required by the selectivity estimation process must be efficient in terms of memory consumption.
• It should support structural and data value queries: in principle, all XML query languages can involve structural conditions in addition to value-based conditions. Therefore, any complete selectivity estimation system for XML queries must maintain statistical summary information about both the structure and the data values of the underlying XML documents. A recommended way of doing this is to apply the XMill approach [Liefke 2000]: separate the structural part of the XML document from the data part, then group the related data values according to their paths and data types into homogeneous sets. A suitable summary structure can then easily be selected for each set. For example, the most common approach to summarizing numerical data values is to use histograms or wavelets, while several tree synopses can be used to summarize the structural part.
• One pass constraint: in streaming applications, the streams of XML data typically contain a large number of items, so the contents of the stream cannot be examined more than once during the course of computation. Therefore, all algorithms that construct summary structures or data value summaries should be designed under a one pass constraint.
• It should be strongly capable: the standard query languages for XML [Bray 2008], namely XPath [Berglund 2010] and XQuery [Boag 2010], are very rich languages. They provide large sets of functions and features, including structure- and content-based search, join, and aggregation operations. A good selectivity estimation approach should therefore be able to provide accurate estimates for a wide range of these features. In addition, it should maintain a set of special summary information about the underlying source XML documents. For example, a general assumption of a uniform distribution of the element structure and the data values may lead to many estimation errors because of the irregular nature of many XML documents.
• It must be accurate: providing an accurate estimation to the query optimizer can effectively accelerate the evaluation process of any query. Conversely, providing the query optimizer with incorrect selectivity information will lead it to incorrect decisions and consequently to inefficient execution plans.
• It must evolve and be incremental: when the underlying XML document is updated, i.e. some elements are added or deleted, the selectivity estimation technique should be updated as well (without the need to rebuild it) so that it keeps providing accurate selectivity estimates for a given query.
• It should be independent: it is recommended that the selectivity estimation process be independent of the actual evaluation process, which facilitates its use with different query engines that apply different evaluation mechanisms. This property is also an advantage for the software engineering of the corresponding module(s).
• Time and space efficiency: in many traditional synopsis methods for static data sets (such as histograms), the underlying dynamic programming methodologies require super-linear space and time, which is not acceptable for a data stream [Aggarwal 2007]. In particular, for space efficiency it is not desirable to have a complexity that is more than linear in the size of the stream.

Path/Twig Selectivity Estimation Techniques

In this section, we give an overview of the literature related to selectivity estimation approaches in the XML domain. Estimation techniques can be classified, in terms of the structure used for collecting the summary information, into two main classes:
1. Synopsis-based estimation techniques: this class of estimation techniques uses tree or graph structures for representing the summary information of the source XML documents.
2. Histogram-based estimation techniques: this class of estimation techniques uses statistical histograms for capturing the summary information of the source XML documents.

Synopsis-Based Estimation Techniques

[Aboulnaga 2001] have presented two different techniques for capturing the structure of XML documents and for providing accurate cardinality estimates for path expressions. The presented techniques only support cardinality estimation of simple path expressions without predicates and without so-called recursive axes (repeated node labels in the expression). Moreover, the models cannot be applied to twigs.
The first technique presented in this paper is a summarizing tree structure called a path tree. A path tree is a tree containing each distinct rooted path in the database (or data set), in which every node is labeled with a tag name and records the frequency of the document nodes reachable by its rooted path.
To estimate the selectivity of a given path expression p of the form s1/s2/…/sn, the path tree is scanned looking for all nodes whose tags match the first tag of the path expression. From every such node, a downward navigation follows child pointers, matching tags in the path expression against tags in the path tree. This leads to a set of path tree nodes that all correspond to the query path expression; the selectivity of the query path expression is the total frequency of these nodes.
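The construction and estimation procedures can be sketched as follows (a minimal illustration in our own formulation; [Aboulnaga 2001] do not prescribe code, and the ElementTree input is an assumption):

import xml.etree.ElementTree as ET

class PTNode:
    def __init__(self, tag):
        self.tag, self.freq, self.children = tag, 0, {}

def insert(node, elem):
    child = node.children.setdefault(elem.tag, PTNode(elem.tag))
    child.freq += 1                      # one more instance of this rooted path
    for sub in elem:
        insert(child, sub)

def build_path_tree(doc_root):
    root = PTNode(None)                  # virtual node above the document root
    insert(root, doc_root)
    return root

def selectivity(path_tree, steps):
    total, stack = 0, [path_tree]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node.tag == steps[0]:         # candidate start of a match
            m = node
            for s in steps[1:]:          # navigate child pointers downward
                m = m.children.get(s)
                if m is None:
                    break
            if m is not None:
                total += m.freq          # frequency of the matched node
    return total

doc = ET.fromstring("<A><F/><C/><C/><C/><B><C/></B></A>")
pt = build_path_tree(doc)
print(selectivity(pt, ["A", "C"]))       # 3 : C children directly under A
print(selectivity(pt, ["C"]))            # 4 : every C in the document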
The problem is that the path tree constructed from a large XML document may be larger than the memory available for processing. To solve this problem, the authors described different summarization techniques based on deleting low-frequency nodes and replacing them with *-nodes (star nodes). Each *-node, denoted by the special tag name "*", represents a set of deleted nodes and inherits their structural properties as well as their frequencies. Unfortunately, the path tree is not formally defined in this work and, to the best of our knowledge, it is not defined in the literature.
The second technique presented in this paper is a statistical structure called the Markov table (MT). This table, implemented as an ordinary hash table, contains every distinct path of length up to m together with its selectivity. Thus, the frequency of a path of length n can be retrieved directly from the table if n ≤ m, or computed, if n > m, by a formula that correlates the frequency of a tag with the frequencies of its m − 1 predecessors. Since the size of a Markov table may exceed the available main memory, the authors present different summarization techniques which work as in the case of the path tree: they delete low-frequency paths and replace them with *-paths.
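As an illustration, the following sketch applies the short-memory idea for m = 2: the table stores the frequencies of all paths of length one and two, and a longer path t1/…/tn is estimated by chaining conditional frequencies, f(t1/…/tn) ≈ f(t1/t2) × ∏ f(t(i−1)/ti) / f(t(i−1)) for i = 3..n. The table contents are invented for the example, and the code is our rendering of the approach rather than the paper's verbatim definition:

# Markov table for m = 2: frequencies of all tags and tag pairs.
markov_table = {"A": 10, "B": 50, "C": 200, "A/B": 40, "B/C": 180}

def estimate(path, table):
    steps = path.split("/")
    if len(steps) == 1:
        return float(table.get(path, 0))
    est = float(table.get("/".join(steps[:2]), 0))
    for i in range(2, len(steps)):
        pair = table.get(f"{steps[i-1]}/{steps[i]}", 0)
        single = table.get(steps[i-1], 0)
        est *= pair / single if single else 0.0   # conditional step frequency
    return est

print(estimate("A/B/C", markov_table))   # 40 * 180/50 = 144.0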
XPATHLEARNER [Lim 2002] is an on-line learning method for estimating the selectivity of XML path expressions: it builds statistical summaries, in the form of a Markov histogram, from path selectivities gathered from the target XML engine. It employs the same summarization and estimation techniques as [Aboulnaga 2001].
The novelty of XPATHLEARNER is represented by the fact that it collects the required statistics from the query feedback in an on-line manner, without accessing and scanning the original XML data, which is in general resource-consuming.
These statistics are used to learn both tag and value distributions of input queries and, when needed, to change the actual configuration of the underlying Markov histogram in order to improve the accuracy of approximate answers. From this point of view, XPATHLEARNER can be regarded as a workload-aware method.
An important difference between XPATHLEARNER and the MT of [Aboulnaga 2001] is that XPATHLEARNER supports the handling of predicates (to further refine the selected node set) by storing statistical information for each distinct tag-value pair in the source XML document.
[Zhang 2005] have presented COMET (Cost Model Evolution by Training) for the cost modeling of complex XML operators. It exploits a set of system catalogue statistics that summarizes the XML data, a set of simple path statistics, and a statistical learning technique called transform regression, instead of detailed analytical models, to estimate the selectivity of path expressions. The technique used to store the statistics is the path tree of [Aboulnaga 2001].
This work is more oriented toward XML repositories consisting of a large corpus of relatively small XML documents. The initial focus is only on the CPU cost: the authors developed a CPU cost model for the XNAV operator, an adaptation of TurboXPath [Josifovski 2005]. The idea was taken from previous work in which statistical learning methods are used to develop cost models of complex user-defined functions [He 2004] and [Lee 2004].
The system can automatically adapt to changes over time in the query workload and in the system environment. The optimizer estimates the cost of each operator in the query plan (navigation operator, join operator) and then combines these costs using an appropriate formula. The statistical model can be updated either at periodic intervals or when the cost-estimation error exceeds a specified threshold; updating it involves either re-computing the model from scratch or using an incremental update method.
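As a rough illustration of this update policy (the function name, period, and error threshold below are our assumptions, not values from the paper), the trigger can be sketched as:

# Retrain the cost model on a fixed period or when relative error is too high.
def should_update(queries_seen, observed_cost, estimated_cost,
                  period=1000, max_rel_error=0.25):
    rel_error = abs(observed_cost - estimated_cost) / max(observed_cost, 1e-9)
    return queries_seen % period == 0 or rel_error > max_rel_error

print(should_update(42, observed_cost=120.0, estimated_cost=80.0))   # True: ~33% error
print(should_update(42, observed_cost=120.0, estimated_cost=110.0))  # False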


Histogram-Based Estimation Techniques

As mentioned before, this class of estimation techniques uses statistical histograms for capturing the summary information of the source XML documents. Below, we give a survey of the existing work.
[Freire 2002] have presented an XML Schema-based statistics collection technique called StatiX. This technique leverages the available information in the XML Schema to capture both structural and value statistics about the source XML documents. These structural and value statistics are collected in the form of histograms. The StatiX system is employed in LegoDB [Bohannon 2002]. LegoDB is a cost-based XML-to-relational storage mapping engine, which tries to generate efficient relational configurations for the XML documents.
The StatiX system consists of two main components. The first is the XML Schema validator, which simultaneously validates the document against its associated schema and gathers the associated statistics. It assigns globally unique identifiers (IDs) to all instances of the types defined in the schema. Using these IDs, structural histograms are constructed to summarize information about the connected edges, and value histograms are constructed for types that are defined in terms of base types such as integers. The gathered statistics are stored using equi-depth histograms (wherein the frequency assigned to each bucket of the histogram is the same). The second component is the XML Schema transformer, which enables statistics collection at different levels of granularity. Although StatiX is used in the context of the LegoDB system and the presented experimental results indicate highly accurate query estimates, the technique can only be applied to documents described by XML Schemas, with no clear view as to how it could be extended to deal with schema-less documents. Moreover, [Freire 2002] does not give a clear algorithm for estimating the cardinality of an XQuery expression, and there is no clear definition of the supported features and expressions of the language.
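For concreteness, here is a small sketch (ours, not StatiX code) of the equi-depth bucketing mentioned above, in which bucket boundaries are chosen so that each bucket receives the same number of values rather than spanning equal-width ranges:

# Build equi-depth buckets as (low, high, count) triples over sorted values.
def equi_depth(values, buckets):
    vs = sorted(values)
    size = len(vs) // buckets            # values per bucket (last absorbs the rest)
    bounds = []
    for b in range(buckets):
        lo = vs[b * size]
        hi = vs[-1] if b == buckets - 1 else vs[(b + 1) * size - 1]
        count = len(vs) - size * (buckets - 1) if b == buckets - 1 else size
        bounds.append((lo, hi, count))
    return bounds

print(equi_depth(list(range(1, 21)), 4))
# [(1, 5, 5), (6, 10, 5), (11, 15, 5), (16, 20, 5)]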
[Wu 2002] have presented an approach that maps XML data into a 2D space and maintains statistics for the data falling into each pre-defined grid cell of the workspace. In this approach, each node x of an XML document D is associated with a pair of numbers, start(x) and end(x), numeric labels representing the pre-order and post-order ranks of the node in D. The interval of each descendant node is strictly included in the interval of its ancestors. For each basic predicate P, a two-dimensional histogram summary data structure is built; these structures are collectively named position histograms. In a position histogram, the start values are represented on the x-axis and the end values on the y-axis, and each grid cell represents a range of start position values and a range of end position values. The histogram maintains a count of the nodes that satisfy the conditions of predicate P and whose start and end positions fall within the ranges of the grid cell. Since the start and end positions of a node always satisfy start(x) ≤ end(x), no node can fall into the area below the diagonal of the matrix, so only the grid cells on the upper left of the diagonal can have a count greater than zero.
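The following minimal sketch builds such labels and fills a small position histogram grid. It is our own illustration under stated assumptions: [Wu 2002] define the labels via pre-order and post-order ranks, whereas we use a single document-order clock, which preserves the same nesting and start ≤ end properties:

import xml.etree.ElementTree as ET

def label(root):
    labels, clock = [], 0
    def walk(elem):
        nonlocal clock
        clock += 1
        start = clock
        for child in elem:
            walk(child)
        clock += 1
        labels.append((elem.tag, start, clock))   # (tag, start(x), end(x))
    walk(root)
    return labels

def position_histogram(labels, predicate, grid=4):
    span = max(end for _, _, end in labels)
    cell = lambda v: min(grid - 1, v * grid // (span + 1))
    hist = [[0] * grid for _ in range(grid)]
    for tag, start, end in labels:
        if predicate(tag):
            hist[cell(start)][cell(end)] += 1     # row: start range, col: end range
    return hist

doc = ET.fromstring("<A><F/><C/><C/><C/></A>")
for row in position_histogram(label(doc), lambda t: t == "C"):
    print(row)    # non-zero counts appear only on or above the diagonal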

Table of contents:

1 Introduction 
1.1 Introduction
1.1.1 Preliminaries
1.1.1.1 Data Model of XML Document
1.1.1.2 XPath
1.1.1.3 Recursion in XML Document
1.1.1.4 Document Depth
1.1.1.5 Stream-querying Process
1.1.1.6 Stream-filtering Process
1.1.1.7 Synopsis
1.1.1.8 Selectivity Estimation Technique
1.1.1.9 Performance Prediction Model
1.2 Challenges
1.2.1 The Expressiveness of XPath
1.2.2 Structure of XML Data Set
1.2.3 Query Evaluation Strategy
1.2.4 Evolution and Data Set Updating
1.3 Contributions
1.4 Thesis Organisation
1.4.1 The Dependency of Thesis’s Chapters
2 State of the Art 
2.1 Introduction
2.2 Selectivity Estimation
2.2.1 Properties of Selectivity Estimation Techniques
2.2.2 Path/Twig Selectivity Estimation Techniques
2.2.2.1 Synopsis-Based Estimation Techniques
2.2.2.2 Histogram-Based Estimation Techniques
2.2.3 Summary – The Choice of the Path tree Synopsis
2.3 Stream-processing Approaches
2.3.1 Stream-filtering Algorithms
2.3.2 Stream-querying Algorithms
2.3.3 Summary – Lazy Stream-querying Algorithm LQ
3 Path tree: Definition, Construction, and Updating 
3.1 Introduction
3.1.1 The XML Data Model
3.2 Path tree Definition
3.3 Path tree Construction: Automata Technique
3.3.1 Automaton Definition A
3.3.2 Automata Transformation into a Graph Doc(A)
3.3.3 Automata Minimization AMin
3.3.4 Example of Path tree Construction: Automata Technique
3.4 Path tree Construction: Streaming Technique
3.4.1 Path tree Construction
3.4.2 Path tree Updating
4 Selectivity Estimation Techniques 
4.1 Introduction
4.2 Lazy Stream-querying Algorithm
4.2.1 Query Preprocessing
4.2.2 LQ – Blocks Extension
4.2.3 Examples of Query Processing Using LQ-Extended
4.2.3.1 Query Processing – Simple Path
4.2.3.2 Query Processing – Twig Path
4.3 Selectivity Estimation Algorithm
4.3.1 Examples of the Selectivity Estimation Process
4.3.1.1 Selectivity Estimation – Simple Path
4.3.1.2 Selectivity Estimation – Twig Path
4.3.2 Accuracy of the Selectivity Estimation Technique
5 Performance Prediction Model 
5.1 Introduction
5.2 Performance Prediction Model- Preliminaries
5.2.1 Performance Prediction Model – Motivations
5.2.2 Performance Measurements Towards the Optimization of Stream-processing for XML Data
5.2.2.1 Prototype O-Search
5.2.2.2 Experimental Results
5.2.2.3 Conclusion
5.2.3 Performance Prediction Model – General Structure
5.3 Performance Prediction Model – Simple Path
5.3.1 Lazy Stream-querying Algorithm (LQ)
5.3.2 Building the Mathematical Model
5.3.3 Building the Prediction Model
5.3.3.1 Prediction Rules
5.3.4 User Protocol
5.3.5 Experimental Results
5.3.5.1 Experimental Setup
5.3.5.2 Quality of Model Prediction
5.3.5.3 Impact of Using Metadata in our Model on the Performance
5.3.5.4 Model Portability on Other Machines
5.3.6 Conclusion
5.4 Performance Prediction Model – Twig Path
5.4.1 Lazy Stream-querying Algorithm (LQ)
5.4.2 Building the Mathematical Model
5.4.3 Building the Prediction Model
5.4.3.1 Example of the Selectivity Estimation Process
5.4.4 Experimental Results
5.4.4.1 Experimental Setup
5.4.4.2 Accuracy of the Selectivity Estimation
5.4.4.3 Efficiency of the Selectivity Estimation Algorithm
5.4.4.4 Comparing our Approach with the other Approaches
5.4.5 Use Case: Online Stream-querying System
5.4.5.1 Online Stream-querying System
5.4.6 Conclusion and Future Work
6 Conclusion and Perspectives 
6.1 Conclusion
6.2 Future Work
6.2.1 Stream-processing
6.2.2 Selectivity Estimation Technique
6.2.3 Parallel Processing
Bibliography 

