ROSES Query Language and Logical Algebra – Project topics materials

Get Complete Project Material File(s) Now! »

XML-stream query engines

Even if RSS/Atom feeds may be represented using the XML data format, they exhibit a nearly flat structure, and it is more convenient to consider RSS/Atom items as textual relational tuples. Below we describe several significative XML-stream DSMSs proposed in the literature. NiagaraCQ Niagara system endeavours to answer queries over distributed XML documents crawled across a large network [NDM+01]. It proposes an XML-based query language that facilitates join specification over XML documents and the construction of complex results. The NiagaraCQ subsystem [CDTW00] proposes some continuous queries sharing techniques based on subquery grouping. In this project, Viglas and Naughton [VN02] introduce a cost model based upon stream rates to enable query optimization over streaming data. Niagara is conceived to process XML-streams of non-textual data, such as datastreams generated by RFID sensors or stock market datastreams. We work with datastreams that have a strong textual component. This difference in the type of handled data entails a substantial difference on the type of operators and, consequently, the costs models for each system diverge considerably.
OptimAX OptimAX is a project of the Gemo group at INRIA. The OptimAX system deals with Active XML (AXML) query optimization in a distributed setting. An Active XML document is a document including several WebService calls. These service calls are mainly XQuery queries returning an asynchronous stream of XML results. Abiteboul et al. [AMZ07, AMZ08] have extended the AXML language with three new operators (newNode, send and receive). AXML Web Services may include recursive Web Service calls, thus many call activation strategies may be carried out. The authors propose a cost-based optimization model for AXML documents computation. Their approach consists in building an initial AXML execution plan and iteratively improving this plan through the use of different heuristics.
In [AM07] Abiteboul and Marinoiu propose a monitoring overlay for P2P networks. Their system called P2P Monitor allows to handle streamed monitoring data across the P2P network. Peers produce monitoring data that may be filtered and/or reused by other peers. In this article the authors mainly propose an efficient filtering technique and a stream reuse mechanism. The filtering technique is based on the Atomic Event Set (AES) [NACP01] and YFilter [DFFT02] algorithms. AES is used in a first step to evaluate simple filtering conditions, afterwards YFilter evaluates the complex queries if still needed. While using a different approach in ROSES to optimize queries on continuous feeds, i.e. we optimize large sets of queries by filtering factorization, OptimAX proposes interesting clues to extend our optimization model to distributed settings.

Multi-query optimization problem

Multi-query optimization (MQO) has first been studied in the context of DBMSs where different users could request one-shot queries simultaneously to a system. For instance, in a system where various users can send different queries within the same time interval, one possible execution strategy is batching the queries. However, a major problem with this approach is the effect on the response time. It is unacceptable to delay a user’s query due to other more expensive queries. A better approach should consider reusing shared intermediate results between the queries [Sel88, SG90]. In the context of DSMS, MQO consists in exploiting the fact that a set of continuous queries can be evaluated more efficiently together than independently, because queries often share state and computation. Solutions using these observations are based on different mutualizing methods:
• Predicate indexing: used for indexing subscriptions in publish/subscribe systems, e.g. indexing by ranked posting lists [WGMB+09].
• Join interval indexing: used for sharing join computations, they index the intervals of the predicate values on join queries (e.g., [AXYY09]).
• Sharing states in global NFA: used for sequence detection, e.g., with YFilter [DFFT02] or with the pub/sub system Cayuga [DGH+06].
• Join graphs, e.g. [HDG+07], or Similarity joins hashing, e.g. [TRFZ07]; and finally.
• Sub-query factorization: with the HA algorithm [SG90], NiagaraCQ [CDTW00], TelegraphCQ [CCD+03a], STREAM [ABW06] and RUMOR [HRK+09]. 2
In ROSES, we have followed the factorization approach which appeared to be the most promising for a cost-based multi-query optimization solution.

Data model and logical algebra

The ROSES Data Model and Operator Algebra borrow from state-of-the-art datastream models and in particular from that proposed by Krämer in [Krä07], with specific modeling choices adapted to RSS/Atom syndication and aggregation. In this Section we first introduce the ROSES Data Model. Afterwards we describe the Logical Algebra accompanied by its operator properties, which will be used later in the normalization and optimization processes.

Data model

The ROSES Algebra A ROSES is defined by a domain of data D and a set of operations O: A ROSES = (D,O). D contains two types of data: streams (S) and windows (W): D = (S,W). We define a ROSES stream as a full-fledged datastream of annotated ROSES items. More precisely, a ROSES stream S 2 S is a (possibly infinite) set of ROSES elements e = (t, i,A), where:
• t is a timestamp: t belongs to a discrete time domain T (as proposed in [BDE+97]) and represents the System acquisition timestamp of item i. For simplicity and without loss of generality, we map this timestamp discrete domain to the set of natural numbers N (T = N), i.e. to successive system acquisition moments we associate successive timestamps values in N [SW04].
• i is a ROSES item, which will be described in more detail below.
• A is an annotation set, referring to all items that have been joined with item i. More formally, a ROSES annotation is a set of couples (j, I), where j is a join identifier and I is a set of items. Annotations are produced by join operations, thus streams not resulting from joins have an empty annotation set. The semantics of annotation join is further detailed in the next Section.
Moreover, a ROSES stream presents the following two properties:
• the set of elements e for a given timestamp t is finite, and.
• i behaves as a key in the set of elements e = (t, i,A) for a given stream S, i.e. 8(t, i,A) 2 S : @(t0, i,A0) 2 S, t 6= t0.

READ Discussion about the Concentration of the Partition Function

Snapshot-reducibility and rewriting rules

The snapshot-reducibility is a well-known concept in the temporal database community [SJS00]. It guarantees that the semantics of a relational, non-temporal operator is preserved in its more complex, temporal counterpart. In other words, a temporal operator may be emulated by applying its relational counterpart at every time instant. More precisely, for all t 2 T the set of items associated to time t and produced by a temporal operator is equal to the set of items produced by the analogous relational operator on the set of the input items associated to the same t [BBJ98]: if opT (S) = S0 and 8t 2 T : S0(t) = opR(S(t)) , opT is snapshot-reducible (2.7).
Our temporal operators of selection, union and join are snapshot-reducibles, thus the algebraic properties they have in the relational model may be exported to the temporal one. Recall that our annotation join operator do not alter the content of its input items, but it only add the references to the matching items. Further, the predicates of the selection operators do not concern the items annotated by joins, this makes it possible to commute joins and selections. Table 2.3 lists some of the algebraic equivalences derived from the extended relational model. We will see later their interest in order to normalize the publication queries (in Section 3.3), and then to reorganize the query execution plans and find the optimal ones (in Section 3.4).

The VCA predicate factorization algorithm

The need to exhaustively search for minimal Steiner subtrees makes previous algorithm unviable on large subsumption graphs. For this reason we have devised a new approximate algorithm for minimal-cost Steiner Trees, called VCA (Very Clever Algorithm), see Algorithms 3.5 and 3.6. This algorithm takes into account the peculiarities of predicate subsumption graphs and the corresponding cost-model in ROSES. VCA is a greedy algorithm, iteratively improving an existing filter plan according to a local heuristic which estimates the benefit of adding new intermediate predicates. VCA starts from an initially correct plan where the root (predicate true) is directly connected to each predicate in the set Preds of target predicates, e.g. Preds = {a ^ b, a ^ b ^ c, b ^ c ^ d, d ^ e} in Figure 3.7. This corresponds to a plan where all filtering predicates are evaluated independently on the source feed and whose cost is proportional to the number of predicates (the selectivity of root true multiplied by the number of filter predicates).
We note Border the set of children of the root in the current plan, i.e. Border = Preds at the beginning. The Border essentially contains the set of vertices to be potentially factorized at each iteration step of the VCA algorithm. For each Border we consider a set of candidates for predicate factorization, denoted by Cands, containing nodes that subsume at least two predicates in the current border. Note that the intersection between Border and Cands is not necessarily empty, e.g., in Figure 3.7 a^b belongs at the beginning to both Border and Cands (it subsumes border nodes a ^ b, itself, and a ^ b ^ c).
Based on these two sets, VCA iteratively tries to replace with some node in Cands all the nodes it subsumes in current Border. Namely, VCA checks if adding a node n 2 Cands in the tree is beneficial. In such case, the algorithm replaces by n all the nodes in the Border that are subsumed by n. The cost of the final tree obviously depends on the choice of these candidates and it is guided by two measures:
1. the selectivity of the candidate predicates and.
2. the number of existing predicates they factorize.

Table of contents :

List of Figures
List of Tables
List of Algorithms
Introduction
1 State of the Art
1.1 Existing RSS aggregation tools
1.2 Pub/sub systems
1.3 Datastream Management Systems
1.3.1 STREAM
1.3.2 Aurora
1.3.3 PIPES
1.3.4 TelegraphCQ
1.3.5 XML-stream query engines
1.4 Multi-query optimization problem
1.4.1 RUMOR
1.4.2 Widom et al.
1.4.3 Liu et al.
1.4.4 Conclusion
2 ROSES Query Language and Logical Algebra
2.1 ROSES query language
2.1.1 Registering sources
2.1.2 Publication language
2.1.3 Subscription language
2.2 Data model and logical algebra
2.2.1 Data model
2.2.2 Logical algebra
2.2.2.1 Logical operators
2.2.2.2 Snapshot-reducibility and rewriting rules
3 Multi-query Processing and Optimization
3.1 Query processing and cost model
3.1.1 Multi-query graphs
3.1.2 Query processing and cost model
3.2 Multi-query optimization problem
3.3 Query normalization
3.3.1 Query logical model and query normalization
3.3.2 Global normal query graph
3.4 Factorization algorithms
3.4.1 Query factorization
3.4.2 The factorization algorithms
3.4.2.1 The STA algorithm
3.4.2.2 The VCA predicate factorization algorithm
3.4.2.3 Finding the best candidates with VCB
3.5 Runtime optimization
3.5.1 Runtime optimization strategy
3.5.2 When to recompute the filtering trees
3.6 Experimental evaluation
3.6.1 The ROSES query generator
3.6.2 Experiments
3.6.2.1 Experience I: Conjunctive queries
3.6.2.2 Experience II: Complex queries
3.6.2.3 Experience III: Multiple sources
3.6.2.4 Experience IV: Cost model validation
4 ROSES System Architecture and Prototype
4.1 ROSES system architecture
4.1.1 Acquisition module
4.1.2 Evaluation module
4.1.3 Dissemination module
4.2 Prototype implementation details
4.3 Overview of the ROSES client functionalities
4.3.1 The ROSES Query Builder web application
Conclusion and Future Research Directions
Appendices
A.1 Complete extended-BNF grammar for ROSES query language
A.2 properties.xml: the ROSES configuration file
A.3 Résumé en français
A.3.1 Le langage ROSES
A.3.2 Modèle de Données et Algèbre
A.3.3 Évaluation de Requêtes
A.3.4 Optimisation de requêtes
A.3.5 Conclusion et Perspectives
Bibliography References