The query processing unit: a building block for composable query processing architectures .

Get Complete Project Material File(s) Now! »

Query processing system performance evaluation

The aspects of a query processing system’s performance can be categorized in two groups: eciency and eectiveness [33]. We can measure eciency with metrics such as response time, throughput, and scalability. Eectiveness is a measure of how well a query processing system achieves its intended purpose. It involves metrics such as precision (the fraction of useful information returned by the query) and recall (the fraction of data items in the corpus that satisfy a query returned by the query).
Finally, two other important factors in the design of query processing systems are availability and operational cost. Availability is important, due to the negative eects of downtime in user serving systems.
It is especially relevant in the design of distributed query processing systems due to the multitude of faults that can impact the operation of a distributed system [78]. Moreover, the changes in system infrastructure brought by the cloud infrastructure model (ne-grained billing, independent scaling of dierent resources) allow more cost-eective system designs [143].

Evaluating Eciency

In this work, we focus on applications that issue interactive queries, in which the most visible aspect of eciency is the response time experienced by a client between issuing a query and receiving the corresponding response. Query processing systems that serve such applications need to be able to process large volumes of user requests, while keeping the response time of individual requests low. Because of that, an important eciency metric is how response time scales as the system’s throughput increases. The relation between response time and throughput, as well as between the oered load and the throughput achieved by the system, characterize the query processing system’s scalability.

Response time

Response time, the delay between making a request and receiving the corresponding response, is among the most important metrics for the quality of a user-facing service.
A number of studies and experiments have studied the eects of response time to user experience.
Results show that response time is among the factors with the most signicant eect in users’ subjective perception of the quality of a system. Users have been shown to perceive websites that load faster as more interesting [112]. On the other hand, long response times increase user frustration [39] and even compromise user’s conceptions of the security of the system [28]. Industry reports have indicated that even small increases in user-perceived response times can result in drops in web trac, and therefore sales. Experiments by the Google and Bing search engines have shown that longer page loading times have a signicant impact on metrics such as time to click, repeated site usage, and queries per visit. A study from Akamai on the impact of travel site performance on consumers showed that more than half of the users will wait three seconds or less before abandoning the site [9].

Evaluating Eectiveness

Eectiveness is a measure of how well a query processing system achieves its intended purpose. In the eld of information retrieval, which covers the problems associated with searching human-language data, the key notion linked with eectiveness is relevance [33]. In information retrieval, given a user’s information need, represented by a search query, the search engine (the system responsible for query processing) computes a relevance score for each document (e-mail message, webpage, news article), and returns a ranked list of results.
Recall and precision are metrics often used to quantify the relevance of query results:
Recall is the fraction of the relevant documents that are returned by the query. A recall value equal to 1 indicates that all relevant documents are returned by the query A recall value of less that 1 indicates that some relevant documents are not returned (\false negatives »).
Precision is the fraction of relevant documents among the documents contained in the query result.
A precision value equal to 1 indicates that all documents returned by the query are relevant. A precision value of less than 1 indicates that some of the returned documents are not relevant (\false
positives »).
The dierence between the information retrieval query model, and the query model that we consider in this work is in the value space of relevance. In information retrieval, relevance is a spectrum: Documents can be more or less relevant to a given query. Information retrieval systems assign a score to each document for a given query. In the scope of this work, we consider relevance as a binary metric: A data item is either relevant (satises the given query) or it is not. However, as described in Section 2.2.2.2, query results can, similarly to information retrieval, include non-relevant results (false positives), or not include relevant results (false negatives). Therefore, we argue that the recall and precision are meaningful metrics for evaluating the eectiveness of the query processing tier.

Freshness

Traditional database systems often keep derived state consistent with the corpus by updating both in a single transaction. For example, when executing an UPDATE statement, MariaDB updates a table’s secondary indexes in the same transaction as the table rows [91]. However, in systems that implement asynchronous (lazy) derived state maintenance policies [111, 130, 135] derived state can become stale with respect to corpus.
Stale derived state may introduce false positives and false negatives to query results:
False positives: A data item d that satised a query q has been deleted (or updated so that it does not satisfy q), but the corresponding derived state has not yet been updated to re ect this change. Serving q by reading from the stale derived state, includes d in response of q, introducing a false positive.
False negatives: A data item d that satised a query q has been created (or updated so that it satises q), but the corresponding derived state has not yet been updated to re ect this change. Serving q by reading from the stale derived state, does not include d in response of q, introducing a false negative.
False positives and false negatives aecting the recall and precision of query processing. We use the notion of freshness to refer to the measure of consistency between corpus and derived state A number of metrics for measuring data freshness have been proposed in the literature [30]:
Currency measures the time between a change in the source data and that change being re ected in the derived state. In caching systems, the terms recency [31] and age [42] have been used to describe this metric.
Obsolescence measures the number of updates to source data since derived state was last updated. Work on query systems has dened the obsolescence cost [60] of a query to represent the penalty of basing a query result on obsolescent materialized view. This cost is computed as a function of the number of insertions, updates, and deletions that cause deviation between the materialized view and the base table.
Freshness-rate measures the percentage of derived state entries that are up-to-date with the source data. This metric has been used to quantify the freshness of web pages [81] and local databases copies [42].

Query Processing in Relational Database Systems

The database component responsible for query processing is the query processor. Given a declarative query (e.g. written in SQL), the role of the query processor is to validate it, optimize it into a procedural data ow execution plan, and execute that plan.
Query processing consists of three main phases [72, 80]. First, the query processor parses the given query and generates a parse tree that represents the query’s structure. Second, the query processor performs semantic analysis in order to transform the parse tree into a relational algebra expression tree. Finally, the query processor produces a query execution plan, which indicates the operations to be performed, the order in which they are to be evaluated, the algorithm chosen for each operation, and the way intermediate results are to be passed from one operation to another. We describe in more detail the phases involved in query processing below. Query Parsing. The goal of the query parsing phase is to translate a given query into an internal representation that can be processed by later phases, commonly a tree of relational operators [131]. While building the internal representation of the query, the query processor checks the query’s syntax, veries that the relation names that appear in the query are valid names of relations in the database, and veries that the user is authorized to execute the query.
Query Rewrite. In the query rewrite phase, the query processor transforms the internal representation of the query in order to carry out optimizations that are benecial regardless of the physical state of the system (the size of tables, presence of indices, locations of copies of tables etc.). Typical transformations include:
Elimination of redundant predicates and simplication of expressions: This includes the evaluation of constant arithmetic expressions, short-circuit evaluation of logical expressions via satis ability tests, and using the transitivity of predicates to induce new predicates. Adding new transitive predicates increases the ability of the following phase (query optimization) to construct query plans that lter data early in execution, and make use of index-based access methods.
View expansion, sub-query un-nesting: For each view reference that appears in the query, the query processor retrieves the view denition and rewrites the query to replace that view with its denition. In addition, this phase attens nested queries when possible.
Semantic optimization: In many cases, integrity constrains dened by the schema can be used
to simplify queries. An important such optimization is redundant join elimination (for example, a query that joins two tables but does not make use of the columns of one of the tables). Query Optimization. In the query optimization phase, the optimizer (the query processor component responsible for query optimizations) transforms the internal query representation into an ecient plan for executing the query.
The optimizer is responsible for decisions such as which indices to use to execute a query, which algorithms to use to execute the join operations, and in which order to execute a query’s operations. In a distributed system, the optimizer also decides about the placement of computations across the systems.
The foundational paper by Selinger et al. on System R [123] decomposes the problem of query optimization into three distinct sub-problems: cost estimation, relational equivalences that dene a search space, and cost-based search. The optimizer assigns a cost estimate to the execution of each component of a query, measured in terms of I/O and CPU cost. To do so, the optimizer relies on pre-computed statistics about the contents of each relation, and heuristics for determining the cardinality of the query output.
It then uses a dynamic programming algorithm to construct a query execution plan based on these cost estimates.
Query Execution All query processing algorithm implementations iterate over the members of their input sets. In [65], Graefe models these algorithms as algebra operators consuming zero or more inputs and producing one or more outputs.
A query execution engine | the query processor’s component responsible for executing the query execution plan | consists of a collection of operators, and mechanisms to execute complex expressions using multiple operators. Query engines execute query plans by pipelined query operators; The output of one operator is streamed into the next operator without materializing the intermediate query results. An advantage of this model is that all iterators have the same interface; As a result, a consumer-producer relationship can exist between any two iterators; Operators can, because of that, be combined into arbitrarily complex query evaluation plans.
There are two approaches for implementing pipelining. traditionally, query execution engines have implemented demand-driven pipelining: Each time an operator needs an record it pulls the next record from its input operator, and wait input that operator produces the a record. That input operator might in turn require itself a record from its input operator, and so on. This approach has been popularized by its used in the Volcano system [64].
Conversely, in data-driven pipelining, records are pushed from source operators towards destination operators. This is commonly used in streaming systems.
Since query execution plans are algebra expressions, they can be expressed as trees; Tree’s nodes are operators and edges represent consumer-producer relationships between operators. More generally, for queries with common sub-expressions, the query execution plan is a directed acyclic graph [65].
The concept of implementing a query execution engine as a directed acyclic graph of relational algebra operators, each executing a basic operation, forms the basis of our query engine architectural model. In Chapter 5, we describe how this concept can be generalized to include stateful operators that implement derived state structures, including indexes, materialized views and caches, as well as \meta-operators ». We characterize as meta-operators (or routing) those operators that perform query processing control tasks, such as managing index partitions and load balancing, rather data manipulation tasks.

READ SELECTIVE LASER MELTING

Query Optimization and Materialized Views

Materialized views add further consideration to query optimization:
Rewriting queries to use materialized views. The query processor may produce a more ecient query plan by rewriting the query to make use of an available materialized view.
Replacing the use of a materialized view with its denition. In some cases, replacing the materialized view with its denition in a given query, rather than directly reading from the view’s contents, may oer more optimization options. For example, consider a case in which a materialized view does not include indexes that can be used to speed up a certain query, but the underlying relations do. Using the views denition instead of its contents enables query execution to take advantage of those indexes.

Materialized View Selection

Materializing an appropriate set of views and processing queries using these views can signicantly speed up query processing since the access to materialized views can be much faster than recomputing the views. In principle, materializing all queries that a system may receive can achieve the optimal query response time. However, maintaining a materialized view incurs a maintenance cost. In addition, query results may be too large to t in the available storage space. There is therefore a need for selecting a set of views to materialize by taking into account query processing cost, view maintenance cost and storage space. The problem of choosing which views to materialize in order to achieve a desirable balance among these three parameters is known as the view selection problem [69, 92].

Table of contents :

1 Introduction
1.1 Contributions
1.2 Thesis outline
2 Preliminaries
2.1 System Model
2.1.1 Data storage tier
2.1.1.1 System model
2.1.1.2 Data model
2.1.1.3 Data storage tier API
2.1.2 Query processing tier
2.1.3 Query language
2.1.4 Derived state
2.2 Query processing system performance evaluation
2.2.1 Evaluating Eciency
2.2.1.1 Response time
2.2.2 Evaluating Eectiveness
2.2.2.1 Recall and precision
2.2.2.2 Freshness
2.2.3 Other aspects of query processing system design
2.2.3.1 Availability
2.2.3.2 Operational Cost
2.3 Conclusion
3 Background
3.1 Query Processing in Relational Database Systems
3.1.1 Materialized Views
3.1.1.1 View maintenance
3.1.1.2 Query Optimization and Materialized Views
3.1.1.3 Materialized View Selection
3.1.2 Distributed Query Processing
3.1.3 Caching
3.2 Query Processing in Non-Relational Database Systems
3.2.1 Non-relational Database Data models
3.2.2 Partitioning
3.2.3 Replication
3.2.4 Query Processing
3.2.4.1 Secondary indexes
3.2.4.2 Partitioning and Secondary Indexes
3.2.4.3 Query Planning and Execution
3.3 Conclusion
4 The design space of geo-distributed query processing
4.1 The use of derived state in query processing
4.2 Design decisions and trade-os in derive state based query processing systems
4.2.1 Derived state maintenance schemes
4.2.2 Derived state partitioning
4.2.3 Derived state placement
4.2.3.1 Geo-replication
4.2.3.2 Geo-partitioning
4.2.3.3 Multi-cloud and Query federation
4.3 Conclusion
5 A design pattern for exible query processing architectures
5.1 Overview and design rationale
5.2 The query processing unit: a building block for composable query processing architectures .
5.2.1 The Query Processing Unit component model
5.2.1.1 Query, input stream, and domain interfaces
5.2.1.2 Conguration
5.2.1.3 Local graph view
5.2.1.4 Query Processing State
5.2.1.5 Initialization, query processing, and input stream processor methods
5.2.2 Query Processing Unit component model specication
5.2.2.1 Query interface
5.2.2.2 Query processing state
5.2.2.3 Initialization method
5.2.2.4 Query processing method
5.2.2.5 Input stream processor method
5.2.3 Stream semantics
5.2.4 QPU classes
5.2.4.1 QPU class case studies
5.3 QPU-based query processing systems
5.3.1 Query processing system architecture
5.3.1.1 QPU-graph topology rules
5.3.2 Query execution
5.3.3 Query execution data structures
5.3.3.1 Query parse tree
5.3.3.2 Domain tree
5.4 Discussion
5.4.1 QPU-based query processing systems and ad-hoc queries
5.4.2 Query processing unit consistency semantics
5.5 Conclusion
6 Case studies
6.1 Flexible secondary index partitioning
6.1.1 Write path
6.1.1.1 Graph topology and placement
6.1.1.2 Index partition conguration and graph initialization
6.1.2 Read path
6.2 Federated secondary attribute search for multi-cloud object storage
6.2.1 Predicate-based indexing
6.3 Materialized view middleware
6.3.1 Partial materialization
6.3.2 Placing materialized views at the edge
7 Proteus: Towards application-specic query processing systems
7.1 Query processing domain dissemination
7.1.1 Domain interface
7.1.2 Query processing domain discovery mechanism
7.2 Query processing unit service
7.2.1 QPU service architecture
7.3 Architecture specication language
7.4 Query processing system deployment
7.5 Implementation
8 Evaluation
8.1 Placing materialized views at the edge
8.1.1 Experimental scenario
8.1.2 Experimental Setup
8.1.3 Query processing performance
8.1.4 Freshness
8.1.4.1 Freshness vs Throughput
8.1.4.2 Freshness vs round-trip latency
8.1.5 Data transfer between sites
8.1.6 Conclusion
8.2 Federated metadata search for multi-cloud object storage
8.2.1 Experimental scenario
8.2.2 Methodology
8.2.3 Experimental Setup
8.2.4 Query processing performance
8.2.5 Freshness
8.2.6 Data transfer between storage locations
8.2.7 Conclusion
8.3 Conclusions
9 Related work
9.1 Secondary indexing in distributed data stores
9.1.1 Consistency between corpus and index
9.2 Materialized views
9.3 Result caching
9.4 Geo-distributed query processing
9.5 Modular & Composable architectures
9.6 State and computation placement
9.6.1 Computation placement
9.6.2 State placement
9.7 Distributed computation models
9.7.1 MapReduce
9.7.2 Data ow engines
10 Future Work and Discussion
10.1 Future Work
10.1.1 Placement policies
10.1.2 Dynamic query engine architectures
10.1.3 Consistent distributed queries
10.2 Discussion
10.2.1 Scope
10.2.2 Fault Tolerance
10.2.3 Data storage tier APIs
10.2.4 Derived state in geo-replicated databases
11 Conclusion