Bleach: a Distributed Stream Data Cleaning System


Large-scale Data Analysis Systems

Over the last decade, numerous large-scale data analysis systems have emerged to address the big data challenge. Unlike a traditional DBMS, which runs on a single machine, these systems run on a cluster of machines (nodes) in a Shared Nothing Architecture [4], where all machines are connected by a high-speed network and each has its own local disk and local main memory [5], as shown in Figure 2.1. To achieve parallel processing, these systems divide the datasets to be analysed into partitions, which are distributed across different machines.
Structured Query Language (SQL) is the declarative language most widely used for analyzing data in large-scale data analysis systems. Users can specify an analysis task as an SQL query, and the system optimizes and executes it. To clarify, we focus on systems for large-scale analysis, namely the field called Online Analytical Processing (OLAP), in which workloads are read-only, as opposed to Online Transaction Processing (OLTP).
We classify these systems into two categories, Parallel DBMSs and SQL-on-Hadoop Systems, based on their storage layers: Parallel DBMSs store their data inside their database instances, while in SQL-on-Hadoop Systems, data is kept in the Hadoop distributed file system.

Parallel DBMSs

Parallel DBMSs were the earliest systems to make parallel data processing available to a wide range of users; in these systems, each node in the cluster is a database instance. Many of them are inspired by the pioneering work on Gamma [7] and Teradata [8]. They achieve high performance and scalability by distributing the rows of a relational table across the nodes of the cluster. Such a horizontal partitioning scheme enables SQL operators like selection, aggregation, join and projection to be executed in parallel over different partitions of tables located on different nodes. Many commercial implementations are available, including Greenplum [9], Netezza [10], Aster nCluster [12] and DB2 Parallel Edition [27], as well as open-source projects such as MySQL Cluster [11], Postgres-XC [13] and Stado [14].
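To make the horizontal partitioning scheme concrete, the following self-contained Java sketch (a toy, in-memory example with a hypothetical table and a hash-based routing choice) assigns each row to one of several "nodes" by hashing a partitioning key, then evaluates a selection plus aggregation over all partitions in parallel. In a real parallel DBMS, the partitions live on different machines and a query coordinator merges the per-partition results.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: horizontal (row-wise) hash partitioning of a table
// across N nodes, as done in parallel DBMSs. Table and query are hypothetical.
public class HorizontalPartitioning {
    record Row(int userId, String country, double amount) {}

    public static void main(String[] args) {
        int numNodes = 4;
        List<List<Row>> nodes = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) nodes.add(new ArrayList<>());

        List<Row> table = List.of(
            new Row(1, "FR", 10.0), new Row(2, "DE", 25.5),
            new Row(3, "FR", 7.2),  new Row(4, "US", 99.0));

        // Each row is routed by a hash of its partitioning key (userId),
        // so every node holds a disjoint subset of the table's rows.
        for (Row r : table)
            nodes.get(Math.floorMod(Integer.hashCode(r.userId()), numNodes)).add(r);

        // A selection plus aggregation, e.g. SUM(amount) WHERE country = 'FR',
        // can now run over all partitions in parallel; partial results are merged.
        double total = nodes.parallelStream()
            .flatMap(List::stream)
            .filter(r -> r.country().equals("FR"))
            .mapToDouble(Row::amount)
            .sum();
        System.out.println("SUM(amount) WHERE country='FR' = " + total);
    }
}
```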
Other systems, like Amazon Redshift [15], ParAccel [16], Sybase IQ [25] and Vertica [26], partition tables vertically, collocating entire columns instead of collocating rows as in a horizontal partitioning scheme. When executing user queries, such systems can access precisely the data they need rather than scanning and discarding unwanted data in rows. These column-oriented systems have been shown to use CPU, memory and I/O resources more efficiently than row-oriented systems in large-scale data analytics.
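The benefit claimed for column stores can be seen in miniature in the Java sketch below (hypothetical table and query): a row-oriented scan must touch every attribute of every tuple, while the column-oriented layout reads only the single contiguous array that the aggregate actually needs.

```java
// Illustrative sketch contrasting row- and column-oriented layouts
// for the query SELECT SUM(amount). Data is hypothetical.
public class ColumnVsRow {
    public static void main(String[] args) {
        // Row-oriented layout: all attributes of a tuple are adjacent, so the
        // scan reads userId and country even though the query needs neither.
        Object[][] rows = {
            {1, "FR", 10.0}, {2, "DE", 25.5}, {3, "FR", 7.2}, {4, "US", 99.0}
        };
        double rowSum = 0;
        for (Object[] row : rows) rowSum += (double) row[2];

        // Column-oriented layout: 'amount' is stored as one contiguous array,
        // so the same aggregate touches only the bytes it actually needs.
        double[] amount = {10.0, 25.5, 7.2, 99.0};
        double colSum = 0;
        for (double a : amount) colSum += a;

        System.out.println(rowSum + " == " + colSum);
    }
}
```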
For parallel DBMSs, data preparation is a mandatory and time-consuming step. Data cleaning must be performed in advance to guarantee the quality of the data. As parallel DBMSs are built on traditional DBMSs, they all require data to be loaded before any query can be executed. Each datum in the tuples must be parsed and verified so that the data conform to a well-defined schema. For large amounts of data, this loading procedure may take hours or even days to finish, even with parallel loading across multiple machines. Moreover, to benefit from the technologies developed over the past 30 years in DBMSs, parallel DBMSs provide optimizations like indexing and compression, which also necessitate a data preprocessing phase.
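The per-tuple work performed at load time can be sketched as follows. The schema, input data and rejection policy here are hypothetical, but the pattern of parsing and verifying every field against a well-defined schema is what makes loading expensive at scale.

```java
import java.util.List;

// Minimal sketch of per-tuple validation during data loading:
// each field is parsed and checked against a fixed, hypothetical schema
// (id INT, country CHAR(2), amount DOUBLE) before the row is accepted.
public class LoadTimeValidation {
    public static void main(String[] args) {
        List<String> lines = List.of("1,FR,10.0", "2,DE,not-a-number");
        for (String line : lines) {
            String[] fields = line.split(",");
            try {
                int id = Integer.parseInt(fields[0]);
                if (fields[1].length() != 2)
                    throw new IllegalArgumentException("bad country code");
                double amount = Double.parseDouble(fields[2]);
                System.out.printf("loaded (%d, %s, %.1f)%n", id, fields[1], amount);
            } catch (RuntimeException e) {
                // Tuples that violate the schema are rejected at load time.
                System.out.println("rejected: " + line + " (" + e.getMessage() + ")");
            }
        }
    }
}
```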

SQL-on-Hadoop Systems

A milestone in big data research is the MapReduce framework [45], the most popular framework for processing vast datasets on clusters of machines, mainly because of its simplicity. The open-source Apache Hadoop implementation of MapReduce has contributed to its widespread use in both industry and academia. Hadoop consists of two main parts: the Hadoop distributed file system (HDFS) and MapReduce for distributed processing. Instead of loading data into a DBMS, Hadoop users can process data in any arbitrary format in situ, as long as the data is stored in HDFS.
Hadoop Basics: The MapReduce programming model [34] consists of two functions: map(k1, v1) and reduce(k2, list(v2)). Users implement their processing logic by providing customized map and reduce functions. The map(k1, v1) function accepts input records in the form of key-value pairs and outputs one or more intermediate key-value pairs of the form (k2, v2). The reduce(k2, list(v2)) function is invoked for every unique key k2 and its corresponding list of values list(v2) from the map output, and emits zero or more key-value pairs of the form (k3, v3) as the final results. The transfer of data from mappers to reducers is called the shuffle phase, in which users can also specify a partition(k2) function to control how the intermediate key-value pairs are distributed among the reduce tasks.
HDFS is designed to be resilient to hardware failures and focuses on high-throughput data access. An HDFS cluster employs a master-slave architecture consisting of a NameNode (the master) and multiple DataNodes (the slaves). The NameNode manages the file system namespace and regulates client access to files, while the DataNodes serve read and write requests from clients. In HDFS, a file is split into one or more blocks, which are replicated for fault tolerance and stored on a set of DataNodes.
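To make the model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; the input and output paths passed via args are placeholders. The map function emits a (word, 1) pair for every word, the reduce function sums the counts per unique word, and Hadoop's default HashPartitioner plays the role of partition(k2).

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in the line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce(k2, list(v2)) -> list(k3, v3): sum the counts for each unique word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically launched with hadoop jar, pointing at an input directory in HDFS and a not-yet-existing output directory; each map task processes one HDFS block of the input in place.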


Table of Contents

Abstract
Résumé
Acknowledgments
Table of Contents
1 Introduction 
1.1 Context
1.2 Data Preparation
1.2.1 Data Loading
1.2.2 Data Cleaning
1.3 Contributions
1.3.1 DiNoDB: an Interactive-speed Query Engine for Temporary Data
1.3.2 Bleach: a Distributed Stream Data Cleaning System
1.4 Thesis Outline
2 Background and Preliminaries 
2.1 Large-scale Data Analysis Systems
2.1.1 Parallel DBMSs
2.1.2 SQL-on-Hadoop Systems
2.2 Streaming Data Processing
2.2.1 Apache Storm
2.2.2 Spark Streaming
3 DiNoDB: an Interactive-speed Query Engine for Temporary Data 
3.1 Introduction
3.2 Applications and use cases
3.2.1 Machine learning
3.2.2 Data exploration
3.3 DiNoDB high-level design
3.4 DiNoDB I/O decorators
3.4.1 Positional maps
3.4.2 Vertical indexes
3.4.3 Statistics
3.4.4 Data samples
3.4.5 Implementation details
3.5 The DiNoDB interactive query engine
3.5.1 DiNoDB clients
3.5.2 DiNoDB nodes
3.5.3 Fault tolerance
3.6 Experimental evaluation
3.6.1 Experimental Setup
3.6.2 Experiments with synthetic data
3.6.2.1 Random queries (stressing PM)
3.6.2.2 Key attribute based queries (stressing VI)
3.6.2.3 Break-even point
3.6.2.4 Impact of data format
3.6.2.5 Impact of approximate positional maps
3.6.2.6 Scalability
3.6.3 Experiments with real life data
3.6.3.1 Experiment on machine learning
3.6.3.2 Experiment on data exploration
3.6.4 Impala with DiNoDB I/O decorators
3.7 Related work
3.8 Conclusion
4 Bleach: a Distributed Stream Data Cleaning System 
4.1 Introduction
4.2 Preliminaries
4.2.1 Background and Definitions
4.2.2 Challenges and Goals
4.3 Violation Detection
4.3.1 The Ingress Router
4.3.2 The Detect Worker
4.3.3 The Egress Router
4.4 Violation Repair
4.4.1 The Ingress Router
4.4.2 The Repair Worker
4.4.3 The Coordinator
4.4.4 The Aggregator
4.5 Dynamic rule management
4.6 Windowing
4.6.1 Basic Windowing
4.6.2 Bleach Windowing
4.6.3 Discussion
4.7 Rule Dependency
4.7.1 Multi-Stage Bleach
4.7.2 Dynamic Rule Management
4.8 Evaluation
4.8.1 Comparing Coordination Approaches
4.8.2 Comparing Windowing Strategies
4.8.3 Comparing Different Window Sizes
4.8.4 Dynamic Rule Management
4.8.5 Comparing Bleach to a Baseline Approach
4.9 Related work
4.10 Conclusion
5 Conclusion and Future Work
5.1 Future Work
5.1.1 Stream Holistic Data Cleaning
5.1.2 Unified Stream Decorators
A Summary in French 
A.1 Introduction
A.1.1 Contexte
A.1.2 Data Preparation
A.1.2.1 Chargement des Données
A.1.2.2 Nettoyage des Données
A.1.3 Contribution
A.1.3.1 DiNoDB: un moteur de requête à vitesse interactive pour les données temporaires
A.1.3.2 Bleach: un système de nettoyage de données de flux distribué
A.2 DiNoDB: un moteur de requête à vitesse interactive pour les données temporaires
A.2.1 Introduction
A.2.2 Applications et cas d’utilisation
A.2.3 DiNoDB conception de haut niveau
A.2.4 DiNoDB I/O decorators
A.2.5 Le moteur de requête interactif DiNoDB
A.2.5.1 Clients DiNoDB
A.2.5.2 Noeuds DiNoDB
A.2.6 Conclusion
A.3 Bleach: un système de nettoyage de données de flux distribué
A.3.1 Introduction
A.3.2 Défis et buts
A.3.3 Détection de violation
A.3.4 Réparation de violation
A.3.5 Conclusion
A.4 Conclusion

