Histogram-Based Estimation Techniques
As mentioned before, this class of estimation techniques uses statistical histograms to capture summary information about the source XML documents. Below, we survey the existing work.
[Freire 2002] have presented an XML Schema-based statistics collection technique called StatiX. This technique leverages the information available in the XML Schema to capture both structural and value statistics about the source XML documents. These statistics are collected in the form of histograms. The StatiX system is employed in LegoDB [Bohannon 2002], a cost-based XML-to-relational storage mapping engine which tries to generate efficient relational configurations for XML documents.
The StatiX system consists of two main components. The first is the XML Schema validator, which simultaneously validates the document against its associated schema and gathers the associated statistics. It assigns globally unique identifiers (IDs) to all instances of the types defined in the schema. Using these IDs, structural histograms are constructed to summarize information about the connected edges, while value histograms are constructed for types that are defined in terms of base types such as integers. The gathered statistics are stored using equi-depth histograms (wherein the total frequency assigned to each bucket of the histogram is the same). The second component is the XML Schema transformer, which enables statistics collection at different levels of granularity. Although StatiX is used in the context of the LegoDB system and the presented experimental results indicate highly accurate query estimates, the technique can only be applied to documents described by XML Schemas, with no clear view as to how it can be extended to deal with schema-less documents. Moreover, [Freire 2002] does not give a clear algorithm for estimating the cardinality of XQuery expressions, and there is no clear definition of the supported features and expressions of the language.
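As a rough illustration of equi-depth bucketing (not the StatiX implementation; the function name and bucket layout here are our own), each bucket receives the same number of values:

```python
def equi_depth_histogram(values, num_buckets):
    """Partition the sorted values into buckets holding (nearly) the
    same number of values; each bucket stores (low, high, count)."""
    values = sorted(values)
    n = len(values)
    buckets = []
    for b in range(num_buckets):
        lo = b * n // num_buckets
        hi = (b + 1) * n // num_buckets
        if lo < hi:  # skip empty buckets when num_buckets > n
            buckets.append((values[lo], values[hi - 1], hi - lo))
    return buckets

hist = equi_depth_histogram([1, 2, 2, 3, 5, 8, 9, 12, 15, 20], 5)
# every bucket holds two of the ten values
```

Unlike equi-width histograms, bucket boundaries here adapt to the data distribution, which keeps the per-bucket estimation error roughly uniform.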
[Wu 2002] have presented an approach for mapping XML data into a 2D space and maintaining statistics for the data which falls into each pre-defined grid cell over the workspace. In this approach, each node x in an XML document D is associated with a pair of numbers, start(x) and end(x): numeric labels representing the pre-order and post-order ranks of the node in D. Each descendant node has an interval that is strictly included in its ancestor's interval. For each basic predicate P, a two-dimensional histogram summary data structure, called a position histogram, is built. In this data structure, the start values are represented by the x-axis while the end values are represented by the y-axis. Each grid cell in the histogram covers a range of start position values and a range of end position values, and maintains a count of the nodes that satisfy predicate P and whose start and end positions fall within the ranges of the cell. Since the start and end positions of a node always satisfy start <= end, no node can fall into the area below the diagonal of the matrix; only the grid cells on the upper left of the diagonal can have a count greater than zero.
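The interval labeling and the grid bucketing behind position histograms can be sketched as follows (a simplified illustration; the dictionary tree representation and the bucket width are our own assumptions):

```python
from collections import Counter

def assign_intervals(node, counter=None, out=None):
    """DFS labeling: start(x) is taken on entry and end(x) on exit, so a
    descendant's [start, end] interval nests strictly inside its
    ancestor's, and start <= end always holds."""
    if counter is None:
        counter, out = [0], []
    counter[0] += 1
    start = counter[0]
    for child in node.get("children", []):
        assign_intervals(child, counter, out)
    counter[0] += 1
    out.append((node["tag"], start, counter[0]))
    return out

def position_histogram(labels, predicate, bucket_width):
    """Count, per grid cell (start bucket, end bucket), the nodes that
    satisfy the predicate; cells below the diagonal stay empty."""
    hist = Counter()
    for tag, start, end in labels:
        if predicate(tag):
            hist[(start // bucket_width, end // bucket_width)] += 1
    return hist

doc = {"tag": "a", "children": [
    {"tag": "b", "children": [{"tag": "c", "children": []}]},
    {"tag": "d", "children": []},
]}
labels = assign_intervals(doc)            # e.g. ("b", 2, 5) nests in ("a", 1, 8)
hist = position_histogram(labels, lambda t: t == "b", bucket_width=4)
```

Here the predicate is a simple tag test, standing in for the basic predicates of the paper.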
Given a predicate P1 associated with a position histogram H1 and a predicate P2 associated with a position histogram H2, estimating the number of pairs of nodes (u, v), where u satisfies P1, v satisfies P2, and u is an ancestor of v, is done either in an ancestor-based or in a descendant-based fashion. The ancestor-based estimation finds the number of descendants that join with each ancestor grid cell, while the descendant-based estimation finds the number of ancestors that join with each descendant grid cell.
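The ancestor-based variant can be sketched as below; this simplified version counts only the descendant cells that lie strictly inside an ancestor cell's ranges and omits the proportional handling of partially overlapping boundary cells that a full implementation would need:

```python
def estimate_ad_pairs(h_anc, h_desc):
    """Ancestor-based estimate of (u, v) pairs where u satisfies P1,
    v satisfies P2, and u is an ancestor of v.  A descendant cell
    (m, n) lies strictly inside ancestor cell (i, j) when m > i
    (strictly later start bucket) and n < j (strictly earlier end
    bucket), since grid buckets partition the axes in order."""
    total = 0
    for (i, j), a_cnt in h_anc.items():
        inner = sum(c for (m, n), c in h_desc.items() if m > i and n < j)
        total += a_cnt * inner
    return total

h_anc = {(0, 3): 2}              # 2 ancestor candidates in cell (0, 3)
h_desc = {(1, 2): 5, (0, 1): 1}  # only cell (1, 2) is strictly inside
# estimate_ad_pairs(h_anc, h_desc) counts 2 * 5 = 10 candidate pairs
```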
The authors presented another type of histogram, named the coverage histogram, to increase the accuracy of the estimation when schema information is available. For a given predicate P, the schema information can reveal that two nodes satisfying the predicate never stand in an ancestor-descendant relationship. To deal with this no-overlap situation, additional information is stored in the form of a coverage histogram: for a predicate P, Cvg_P[i][j][m][n] represents the number of nodes in cell (i, j) that are descendants of nodes in cell (m, n), both satisfying P. Although the model handles the estimation of twig queries well, it is limited to ancestor-descendant paths, with no clear way to extend it to the other axes. Moreover, the number of constructed position histograms is proportional to the number of predicates of interest, which is relatively high.
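Populating such a coverage histogram can be sketched as follows (our own simplified formulation; the `bucket` function mapping a position to its grid index is an assumption):

```python
from collections import Counter

def build_coverage(ad_pairs, bucket):
    """Cvg[(i, j, m, n)]: number of (descendant, ancestor) node pairs,
    both satisfying P, where the descendant's (start, end) labels fall
    in grid cell (i, j) and its ancestor's in grid cell (m, n)."""
    cvg = Counter()
    for (d_start, d_end), (a_start, a_end) in ad_pairs:
        cell = (bucket(d_start), bucket(d_end), bucket(a_start), bucket(a_end))
        cvg[cell] += 1
    return cvg

# One descendant labeled (3, 4) under an ancestor labeled (1, 8),
# with grid cells of width 4:
cvg = build_coverage([((3, 4), (1, 8))], bucket=lambda v: v // 4)
```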
Follow-up work has improved on the ideas of interval histograms by leveraging adaptive sampling techniques [Wang 2003]. The proposed technique treats every element in a node set as an interval when the node set acts as the ancestor set in the join, and as a point when it acts as the descendant set. Two auxiliary tables are then constructed for each element set: one records the coverage information when the element set acts as the ancestor set, while the other captures the start position of each element when the set acts as the descendant set. To improve the accuracy of the estimates, sampling-based algorithms are used instead of the two-dimensional uniform distribution assumption of [Wu 2002].
[Wang 2004] have proposed a framework for XML path selectivity estimation in a dynamic context using a special histogram structure named the bloom histogram (BH). A BH summarizes the frequency statistics of the paths in the XML data. Given an XML document D, a path-count table T(path, count) is constructed such that for each path_i in D there is a tuple t_i in T with t_i.path = path_i and t_i.count = count_i, where count_i is the number of occurrences of path_i. Using T, a bloom histogram H is constructed by sorting the frequency values and then grouping paths with similar frequency values into buckets. Bloom filters are used to represent the set of paths in each bucket so that queried paths can be quickly located.
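The construction of a bloom histogram can be sketched as follows; this is our own illustration, with an assumed filter size and hash scheme, and with equal-size buckets standing in for the paper's frequency-similarity grouping:

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter over a fixed-size bit array (illustrative only)."""
    def __init__(self, size=256, hashes=3):
        self.bits, self.size, self.hashes = 0, size, hashes

    def _positions(self, item):
        for k in range(self.hashes):
            digest = hashlib.sha256(f"{k}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def bloom_histogram(path_counts, num_buckets):
    """Sort paths by frequency and group paths of similar frequency into
    buckets; each bucket keeps a bloom filter of its paths and the
    average frequency of the bucket."""
    items = sorted(path_counts.items(), key=lambda kv: kv[1])
    per = max(1, -(-len(items) // num_buckets))  # ceiling division
    buckets = []
    for i in range(0, len(items), per):
        chunk = items[i:i + per]
        bf = BloomFilter()
        for path, _ in chunk:
            bf.add(path)
        buckets.append((bf, sum(c for _, c in chunk) / len(chunk)))
    return buckets

def estimate(buckets, path):
    """Return the average frequency of the first bucket whose bloom
    filter reports the path (0 if no bucket matches)."""
    for bf, avg in buckets:
        if path in bf:
            return avg
    return 0

hist = bloom_histogram({"/a/b": 100, "/a/c": 98, "/a/b/d": 5, "/a/c/e": 7}, 2)
```

A queried path is looked up in each bucket's filter and answered with that bucket's average frequency, which is what bounds the estimation error by the bucket granularity (at the cost of bloom-filter false positives).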
To deal with XML data updates and the dynamic context, the authors proposed a dynamic summary component: an intermediate data structure from which the bloom histogram can be recomputed periodically. When data updates arrive, not only is the XML data updated, but the updated paths are also extracted, grouped, and propagated to the dynamic summaries. Although the bloom histogram is designed to deal with data updates and its estimation error is theoretically bounded by its size, it is very limited in scope, as it deals only with simple path expressions of the form /p1/p2/…/pn and //p1/p2/…/pn.
[Li 2006] have described a framework for estimating the selectivity of XPath expressions with a main focus on the order-based axes (following, preceding, following-sibling, and preceding-sibling). They use a path encoding scheme to aggregate the path and order information of the XML data. The encoding scheme assigns an integer to each distinct root-to-leaf path in the source XML document and stores these codes in an encoding table. Each node in the document is then associated with a path id that indicates the type of path in which the node occurs. Additionally, they designed a PathId-Frequency table in which each tuple represents a distinct element tag together with its associated path ids and their frequencies. To capture the order information, a Path-order table associated with each distinct element tag name records the sibling-order information based on the path ids. Figure 2.7 illustrates an example XML document and its corresponding path encoding scheme.
For estimating the cardinality of XPath expressions, the authors introduced the path join algorithm. Given an XPath query Q, the path join retrieves, for each element tag in Q, a set of path ids and the corresponding frequencies from the PathId-Frequency table. For each pair of adjacent element tags in Q, a nested loop determines the containment relationships between the path ids in their sets; path ids that clearly do not contribute to the query result are removed, and the frequency values of the remaining path ids are used to estimate the query size. The algorithm uses the information in the Path-order table to compute the selectivity of any following-sibling or preceding-sibling axes that occur in Q. An XPath expression involving the preceding or following axis is converted into a set of XPath expressions involving only preceding-sibling or following-sibling axes, according to the path ids of the nodes associated with the preceding or following axes after the path id join; the estimation result is then the sum of the selectivities of this set of path expressions. The authors also introduced two compact summary structures, called the p-histogram and the o-histogram, to summarize the path and order information of the XML data, respectively. A p-histogram is built for each distinct element tag to summarize the PathId-Frequency information; each of its buckets contains a set of path ids and their average frequency. Based on the observation that the Path-order table is very sparse and that the majority of its cells are zero, the o-histogram summarizes the path-order information by storing only the cells with non-zero values. Although the proposed model is the first work to address the cardinality estimation of XPath expressions with order-based axes, it is unfortunately not clear how it could be extended to support predicates.
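A simplified version of such a path encoding can be sketched as follows (our own illustration; it assigns ids to every root-to-node path rather than only root-to-leaf paths, and builds a bare-bones PathId-Frequency table):

```python
def encode_paths(node, prefix="", enc=None, freq=None):
    """Assign an integer id to each distinct path from the root and
    record, per element tag, how many nodes occur under each path id
    (a simplified PathId-Frequency table)."""
    if enc is None:
        enc, freq = {}, {}
    path = prefix + "/" + node["tag"]
    pid = enc.setdefault(path, len(enc))       # new paths get fresh ids
    freq.setdefault(node["tag"], {}).setdefault(pid, 0)
    freq[node["tag"]][pid] += 1
    for child in node.get("children", []):
        encode_paths(child, path, enc, freq)
    return enc, freq

doc = {"tag": "a", "children": [
    {"tag": "b", "children": [{"tag": "c", "children": []}]},
    {"tag": "b", "children": [{"tag": "c", "children": []}]},
    {"tag": "d", "children": []},
]}
enc, freq = encode_paths(doc)
# both b nodes share path id enc["/a/b"], with frequency 2
```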
Summary – The Choice of the Path Tree Synopsis
The main uses of selectivity estimation techniques are to accelerate the query evaluation process and to estimate the cost of a given query. A good technique should provide accurate estimates for a large fragment of XPath, and should support both structural and data value queries. The synopsis of the estimation technique should be constructed rapidly (in one pass over the XML data) for the different types of XML data sets (deep, large, recursive, etc.). In addition, the summary structure(s) required for the selectivity estimation process must be efficient in terms of memory and space consumption.
Our main objective is to build a selectivity estimation technique for the fragment of Forward XPath with the above-mentioned features.
Below, we summarize and compare the related work based on the following criteria:
• The Fragment of XPath: Some techniques estimate the selectivity of path expressions only and do not support twigs, e.g., [Aboulnaga 2001], [Wang 2004], [Li 2006], and [Fisher 2007]. Others support twigs with structural queries only, and thus cannot support twigs with text(), e.g., [Zhang 2006b] and [Polyzotis 2004a].
The XSeed technique [Zhang 2006b] cannot process nested expressions (nested predicates) [Sakr 2010], while the TreeSketch technique [Polyzotis 2004a] supports neither queries with ancestor-descendant relationships nor queries with text() [Luo 2009].
XCLUSTER [Polyzotis 2006] addresses the summarization problem for structured XML content, but its construction time is unknown. Furthermore, as mentioned in [Sakr 2010], it does not process nested expressions (nested predicates).
Some work has also been conducted to estimate the selectivity of XQuery: the design and implementation of a relational algebraic framework for estimating the selectivity of XQuery expressions was described in [Sakr 2007] and [Sakr 2008].
• The construction time of the synopsis: Few papers report the time needed to construct their synopses (summaries). The construction of the TreeSketch synopsis [Polyzotis 2004a] for the complex TreeBank data set (86 MiB, depth 36) took more than 4 days; this result was confirmed in [Luo 2009]. XSeed treats the structural information in a multi-layer manner, and the XSeed synopsis is simpler and more accurate than the TreeSketch synopsis. However, although the construction of XSeed is generally faster than that of TreeSketch, it is still time-consuming for complex data sets.
Other techniques do not report the construction time of their synopses (summaries) at all, for example XCLUSTER [Polyzotis 2006].
• Recursion in the data set: The XSeed and XCLUSTER synopses are more general than the TreeSketch synopsis, because the latter does not support recursion in the data sets, as explained in [Zhang 2006b].
• Selectivity of structural queries and synopsis size: Several structure synopses, such as Correlated Suffix Trees [Chen 2001], Twig-XSketch [Polyzotis 2004b], TreeSketch [Polyzotis 2004a], and XSeed [Zhang 2006b], store some form of compressed tree structure together with simple statistics such as node counts, child node counts, etc. Due to the loss of information (in particular, the structure of the original data set), selectivity estimation heavily relies on the statistical assumptions of independence and uniformity; consequently, these synopses can suffer from poor accuracy when the assumptions do not hold. Moreover, they cannot be evaluated by ordinary query evaluation algorithms and require specialized estimation algorithms.
• Incremental update: A minimal synopsis size seems desirable but is not necessarily best, because incremental maintenance would then be difficult [Goldman 1997]. This is the case for many selectivity estimation techniques, such as Correlated Suffix Trees [Chen 2001], TreeSketch, and XSeed.
The path tree structure synopsis was introduced by [Aboulnaga 2001] to estimate the selectivity of path expressions only, and was later used by [Zhang 2005]. To the best of our knowledge, however, this structure has not been formally defined in the literature. It has notable advantages: it retains complete structural information and can be evaluated by streaming algorithms. This synopsis captures the structure of the XML document and, using an efficient stream-querying algorithm, permits efficient selectivity estimation for any query belonging to the fragment of Forward XPath.
We propose to use a stream-querying algorithm and an adapted path tree syn-opsis to optimize and improve the space consumption of the selectivity estimation process.
In Section 2.3, we present several stream-processing approaches and compare them to find the best approach for traversing the path tree synopsis and efficiently estimating the selectivity of any query belonging to the fragment of Forward XPath.
Then, in Chapter 3, we formally define the path tree synopsis and give algorithms to construct and update it.
Much research has been conducted on processing XML documents in a streaming fashion. The approaches to evaluating XPath queries on streams of XML data can be categorized as follows: (1) stream-filtering: determining whether there exists at least one match of the query Q in the XML document D, yielding a boolean output, for example XTrie [Chan 2002]; (2) stream-querying: finding which parts of D match the query Q, i.e., outputting all answer nodes of D that satisfy Q, for example XSQ [Peng 2003].
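The difference between the two modes can be illustrated with a minimal stream-querying sketch for a simple child-axis path (our own example, unrelated to the cited systems):

```python
import io
import xml.etree.ElementTree as ET

def stream_query(xml_text, steps):
    """Stream-querying: emit the text of every node whose root-to-node
    tag path equals `steps`, using a tag stack over parse events.
    A stream-filter would instead stop at the first match and return
    a boolean."""
    stack, answers = [], []
    for event, elem in ET.iterparse(io.StringIO(xml_text), events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:  # "end": the element's text is now available
            if stack == steps:
                answers.append(elem.text)
            stack.pop()
    return answers

answers = stream_query("<a><b>1</b><c><b>2</b></c><b>3</b></a>", ["a", "b"])
# matches the two b children of a, but not the b nested under c
```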
A filtering system delivers documents to users based on their expressed interests (queries). Figure 2.8 shows the context in which a filtering system operates. There are two main sets of inputs to the system: user profiles (queries) and the stream of documents. User profiles describe the information preferences of individual users. These profiles may be created by the users themselves, e.g., by choosing items in a Graphical User Interface, or may be created automatically by the system using machine learning techniques. The user profiles are converted into a format that can be efficiently stored and evaluated by the filtering system. These profiles are effectively standing queries which are applied to all incoming documents. Hereafter, profiles and queries are used interchangeably.