
## Data Completeness Overview

Given the impact of data completeness on the accuracy and reliability of analyses, contributions on data completeness have a long history. The earliest interest can be traced back to statistics, where mathematical models were proposed to study the distributions of variables and to detect phenomena related to missing data [Afi+66]. In databases, an early extension of the relational model for identifying missing values is the Null value. It was quickly shown that this mechanism lacks the expressiveness needed to address the missing data representation problem [Gra77]. In this section, we describe the different steps in data completeness studies.

Tackling data completeness is not a single-step operation. It requires a pipeline of tasks for understanding the underlying features of the data before an efficient solution can be provided. The literature distinguishes three main categories of research on data completeness: (1) representation systems (or models), which identify and model missing data; (2) explanatory studies, which use correlation-based techniques to detect the origin and behavior of missingness; and (3) cleaning strategies, mostly data enrichment models (Figure 2.3).

### Data Completeness Representation Models

The interest in representing missing information in databases is as old as the domain itself. The belief that a database contains a complete representation of the world it observes intrigued researchers early on. Data may be missing in two ways: attribute values may be missing, or entire tuples may be absent from the database.

The first contribution to missing values representation is the Null value formalized by Imielinski and Lipski [Imi+88b]. Widely used since, this special value indicates the absence of a value for an attribute. It allows query answers to be computed without ignoring incomplete tuples, and avoids falling back on a default such as 0, which could distort the data distribution and falsify results (aggregations, for example). Many works followed, adopting this representation system to build query evaluation systems [Cod79; Imi+84] or quality assessment metrics [Rei86]. Further missing-value representations were later proposed in response to the weaknesses of the Null value system. We discuss these ideas in the next subsection.

The assumption that a database includes every observed fact from the real world and can only miss attribute values is known as the Closed World Assumption (CWA). In other words, under the CWA, « what is not known to be true must be false ». If we search an airline company's database for flights linking Paris to New York, we expect to find all flights operated by this company. The price of a flight, its duration, or other attribute values may be missing, but all flights are listed. It turns out that in practice the CWA rarely holds [Mot89]. In many cases, the database does not represent the whole information but includes only a subset of the tuples. The opposite assumption is called the Open World Assumption (OWA). Suppose we take a flight to New York and, upon landing, want to book a hotel. It is hard to find a database that includes all hotels in New York and its area. It is more practical to accept that a database may miss records. However, under this setting, any study becomes much harder to conduct because of the missing tuples. Indeed, no precise characterization of the real world can state what is complete or missing in the available database, since we ignore the extent of what lies outside it. Works such as the possible worlds theory [Abi+87] try to cope with this lack of knowledge about the ideal representation. In the next sections, we review in more depth the representation models for missing values (Section 2.4.1) and missing tuples (Section 2.4.2).

#### Missing Values Representation

The representation of missing values in databases has motivated a large number of research contributions. It was the first exploration track for extending the relational model designed by Codd [Cod70] in order to increase the data model's expressiveness. The Null value representation system suffers from several limitations [Mey98]: Null values allow expressing neither constraints on missing data nor the semantics of equality/inequality between multiple null values.

The earliest work on extending Codd's model was a proposal by Codd himself [Cod79]. He suggested using a special symbol '@' to mark any field with an unknown value, creating the Codd tables. The semantics of a Codd table can be interpreted as a database of multiple instances of the same relation, each obtained by replacing '@' with some value of the corresponding attribute domain. Imielinski and Lipski [Imi+84] showed that such a system does not allow expressing join queries. Codd tables associate a single interpretation with each occurrence of the '@' symbol: all variables are distinct, and the same variable cannot be assigned to different fields in one instance, which partially explains the expressiveness limitation. Codd tables are usually related to the semantics of Null values in SQL. V-tables overcome this repetition issue of Codd tables and thereby allow expressing join queries: if two attribute values are both marked as missing but constrained to be equal, the join on this attribute can be evaluated. This extension still does not strongly cover the SPJ fragment of the relational algebra following Imielinski's criteria. Imielinski addressed the representation issues of Codd tables and v-tables by proposing a new system that supports defining constraints over missing value representations: C-tables. He demonstrated that this system is a strong system for SPJUDR queries (the Select-Project-Join-Union-Difference-Rename fragment of the relational algebra). C-tables include a condition column that expresses a condition on the tuple containing a marked null value. This extension makes it possible to fully evaluate join queries.
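The difference between Codd tables and v-tables can be illustrated by enumerating possible worlds. The following sketch (hypothetical code, assuming a toy two-value attribute domain) shows that a v-table with a repeated variable admits fewer possible worlds than a Codd table with two independent unknowns, and that the equality between the two missing fields, which is what enables join evaluation, holds in every world:

```python
# Illustrative sketch only: a Codd table treats every '@' as an independent
# unknown, while a v-table may repeat a named variable across fields.
from itertools import product

DOMAIN = ["a", "b"]  # assumed finite attribute domain

def possible_worlds(table):
    """Enumerate all complete instances of a table whose cells are either
    constants or variable names (strings starting with 'x')."""
    variables = sorted({cell for row in table for cell in row
                        if cell.startswith("x")})
    worlds = []
    for values in product(DOMAIN, repeat=len(variables)):
        binding = dict(zip(variables, values))
        worlds.append([tuple(binding.get(c, c) for c in row) for row in table])
    return worlds

# Codd table: the two unknowns are distinct variables -> 4 possible worlds.
codd = [("a", "x1"), ("b", "x2")]
# v-table: the same variable occurs in both rows -> only 2 possible worlds,
# and in every world the two missing values are equal (join-compatible).
vtab = [("a", "x1"), ("b", "x1")]

print(len(possible_worlds(codd)))  # 4
print(len(possible_worlds(vtab)))  # 2
```

The enumeration makes the expressiveness gap concrete: no Codd table can force the two unknowns to coincide, whereas the v-table does so in every possible world.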

We refer to [Mey98], which establishes a comparative table assessing the expressive power of the different representation systems extending the relational model, following the study in [Imi+84]. Table 2.1 summarizes the representation expressiveness of these systems. As explained earlier, c-tables achieve the largest extension of the relational model.

**Relative Information Model**

Our contribution falls within the setting of relative information completeness, a particular setting of the "Partially-Closed World Assumption" (see Chapter 2) in which the ideal database is no longer virtual but materialized. We start by defining the notion of constrained tables, which is essential in this chapter.

**Constrained Tables**

The main extension that distinguishes our data model from the relational model is the possibility to define reference tables representing completeness constraints over data tables.

Definition 3.1 (Reference and Constrained Tables). Let D and R be two relational tables and A the set of attributes of R. If A is a key in table D, table R is called a reference table for data table D and the pair T = (D, R) is called a constrained table.

In the sequel, given a table T, we denote by Atts(T) the set of attributes of T.

A reference is either readily available or may be obtained by querying an existing database. For example, when studying the spatio-temporal completeness of some data, the reference is the spatial and temporal coverage of the data, as illustrated in the following examples. One may define different references depending on the purpose of the analysis, while respecting the condition that the schema of the reference table coincides with the schema of the data table restricted to its primary key attributes.

Example 3.1 (Constrained tables). Consider Table 3.1, which presents several data tables with spatial or temporal attributes, and a data table Measures whose completeness is to be analyzed w.r.t. different possible references. The primary key of Measures is {Floor, Room, Week, Day}. Consider the following candidate reference tables for Measures:

• R1 = Map × Weeks × Days, which contains all combinations of localities and days.

• R2 = πFloor,Room(Meeting) × Weeks × Days, which restricts to room R1 on both floors.

• R3 = πFloor,Room(Sensor), which considers an external table defining the spatial locations of sensors.

Only R1 and R2, whose schema is {Floor, Room, Week, Day}, can be considered as references for Measures, while R3, which does not contain the temporal attributes Week and Day, cannot be a reference for this table. However, R3 can serve as a reference for πFloor,Room(Measures) = {(F1, R1), (F1, R2), (F2, R1)} since Atts(R3) = {Floor, Room}.
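Definition 3.1 reduces the reference check to a schema comparison against the key of the data table. The following minimal sketch (the schemas mirror the example above; the function itself is ours, for illustration only) makes the check explicit:

```python
# Sketch of the reference-table condition from Definition 3.1:
# R is a reference for D iff Atts(R) coincides with the schema of D
# restricted to its primary-key attributes.

def is_reference(ref_atts, data_key_atts):
    """Return True iff the candidate's schema equals D's key attributes."""
    return set(ref_atts) == set(data_key_atts)

measures_key = {"Floor", "Room", "Week", "Day"}
R1_atts = {"Floor", "Room", "Week", "Day"}  # Map x Weeks x Days
R3_atts = {"Floor", "Room"}                 # projection of Sensor

print(is_reference(R1_atts, measures_key))       # True
print(is_reference(R3_atts, measures_key))       # False
print(is_reference(R3_atts, {"Floor", "Room"}))  # True: fits the projection
```

As in the example, R3 fails against Measures but qualifies as a reference for the spatial projection of Measures.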

Incompleteness may come in different flavors: missing tuples or missing values. Under some restrictions, the notion of constrained tables defined above can be adapted to the case of missing values. Precisely, when in a constrained table T = (D, R) the attribute set Atts(R) is a key for D, one can extend D with all tuples from R − πAtts(R)(D), assigning Null to the attributes in Atts(D) − Atts(R).
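This adaptation can be sketched as follows (a hypothetical helper, with tuples as Python tuples and None playing the role of SQL Null; the data is illustrative):

```python
# Sketch: turning missing tuples into missing values. Assuming Atts(R)
# is a key for D, extend D with one Null-padded tuple for each reference
# tuple absent from the projection of D on the key attributes.

def extend_with_nulls(D, R, key_atts, all_atts):
    """Return D plus Null-padded tuples for the reference tuples in R - D."""
    key_idx = [all_atts.index(a) for a in key_atts]
    present = {tuple(row[i] for i in key_idx) for row in D}
    extended = list(D)
    for ref in R:
        if ref not in present:
            row = [None] * len(all_atts)  # None stands for SQL Null
            for i, a in enumerate(key_atts):
                row[all_atts.index(a)] = ref[i]
            extended.append(tuple(row))
    return extended

all_atts = ["Floor", "Room", "KWH"]
D = [("F1", "R1", 12.0)]
R = [("F1", "R1"), ("F1", "R2")]
print(extend_with_nulls(D, R, ["Floor", "Room"], all_atts))
# [('F1', 'R1', 12.0), ('F1', 'R2', None)]
```

The missing reference tuple (F1, R2) resurfaces as a tuple with a Null measurement, so value-level imputation techniques become applicable.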

**Assessing Data Completeness**

The notion of constrained tables provides the means for assessing the completeness of data in a rather intuitive way, as presented below.

Definition 3.2 (Table Completeness). A constrained table T = (D, R) with reference attributes A is complete iff R ⊆ πA(D).

Proposition 3.2.1 (Transitivity). Completeness is transitive: if R′ ⊆ R and T = (D, R) is complete, then the constrained table T′ = (D, R′) is also complete. The same holds for any D′ such that D ⊆ D′.

Example 3.3 (Table Completeness). Observe the data and reference tables in Table 3.1. Table 3.3 depicts the complete and incomplete constrained tables among the candidates.
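Definition 3.2 amounts to a subset test between the reference and a projection of the data. A minimal sketch on toy data (the tuples below are illustrative, not taken from Table 3.1):

```python
# Sketch of Definition 3.2: T = (D, R) is complete iff every reference
# tuple appears in the projection of D on the reference attributes A.

def is_complete(D, R, att_idx):
    """Test R subset-of pi_A(D), with A given as column indices into D."""
    projection = {tuple(row[i] for i in att_idx) for row in D}
    return set(R) <= projection

# Toy Measures-like table: (Floor, Room, Week, Day, KWH).
D = [("F1", "R1", "W1", "Mon", 12.0), ("F1", "R2", "W1", "Mon", 9.5)]
R = [("F1", "R1"), ("F1", "R2")]

print(is_complete(D, R, [0, 1]))                    # True
print(is_complete(D, R + [("F2", "R1")], [0, 1]))   # False: (F2, R1) missing
```

Transitivity (Proposition 3.2.1) follows directly from the subset test: shrinking R, or growing D, can only preserve completeness.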

**Translating Pattern Algebra Expression into SQL**

The pattern algebra computes pattern covers of query results without fully evaluating these queries on the data and their reference tables. All pattern algebra operators except the folding operator can be expressed in the relational algebra (Section 3.5 of Chapter 3). We have also shown that every pattern query expression can be rewritten into an equivalent expression with a single final folding step that produces a minimal cover result (the safe projection operator is an exception to this rule, since it requires a minimal cover as input to ensure the presence of all generalizations; as shown in Section 3.4.3, one solution to this restriction is to implement safe projection by a join query over a pattern table and its complement). We can therefore apply the same rules for translating pattern algebra expressions into SQL as defined for the relational algebra. The following example illustrates this translation.

Example 4.1. Consider the constrained data table T = (D, R) and its minimal pattern cover P∗(T), shown in Table 4.1.
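As a rough illustration of this translation (the pattern table schema and the wildcard marker '*' below are simplified assumptions, not the exact encoding of Chapter 3), a pattern table can be stored as a plain relation, and a selection over it maps to an ordinary SQL WHERE clause, exactly as for relational selection:

```python
import sqlite3

# Hypothetical pattern table: '*' stands for the wildcard covering all
# values of an attribute. A pattern selection on Floor = 'F1' keeps the
# patterns whose Floor component is 'F1' or the wildcard, i.e. a plain
# SQL WHERE clause -- the same translation rule as relational selection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pattern_cover (Floor TEXT, Room TEXT, Week TEXT)")
conn.executemany("INSERT INTO pattern_cover VALUES (?, ?, ?)",
                 [("F1", "*", "*"), ("F2", "R1", "*"), ("*", "R2", "W1")])

rows = conn.execute(
    "SELECT Floor, Room, Week FROM pattern_cover "
    "WHERE Floor = 'F1' OR Floor = '*'").fetchall()
print(sorted(rows))
conn.close()
```

Only the final folding step falls outside this direct translation, which is why it is applied once, at the end, as discussed above.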

**Table of contents :**

**1 Introduction and Motivation **

1.1 General Context and Motivation

1.2 EBITA and Smart Campus

1.3 Challenges by Example

1.3.1 Challenge #1: Complete and Missing Data Representation

1.3.2 Challenge #2: Query Result Annotation

1.3.3 Challenge #3: Aggregate Queries Correctness

1.3.4 Challenge #4: Aggregate Query Imputation

1.3.5 Challenge #5: Data Fragments Summarization

1.4 Thesis Contributions

1.5 Thesis Outline

**I Relative Completeness Representation**

**2 Data Completeness Representation **

2.1 Introduction

2.2 Data Quality

2.2.1 Taxonomies of Data Quality Problems

2.2.2 Data Quality Dimensions

2.3 Data Completeness Overview

2.4 Data Completeness Representation Models

2.4.1 Missing Values Representation

2.4.2 Missing Tuples Representation

2.5 Summary

**3 Pattern Model and Algebra **

3.1 Introduction

3.2 Relative Information Model

3.2.1 Constrained Tables

3.2.2 Assessing Data Completeness

3.3 The Pattern Model

3.3.1 Partition Patterns

3.3.2 Pattern Semantics

3.3.3 Pattern Covers

3.4 The Pattern Algebra

3.4.1 Pattern Operators

3.4.2 Rewriting Rules and Optimization

3.4.3 Safe Projection

3.5 Pattern Queries

3.6 Independent References

3.7 Summary

**4 Pattern Algebra Implementation and Experiments **

4.1 Introduction

4.2 Translating Pattern Algebra Expression into SQL

4.3 Folding Data

4.4 Folding Patterns

4.5 Experiments

4.5.1 Datasets

4.5.2 Pattern table generation

4.5.3 Pattern Query Processing

4.5.4 Folding pattern query results

4.6 Summary

**II Incomplete Query Result Imputation **

**5 Data and query result imputation techniques **

5.1 Introduction

5.2 Handling the Missing Data Problem

5.3 Data Imputation

5.3.1 Human Based Imputation

5.3.2 Automatic Data Imputation

5.3.3 Summary

5.4 Query-driven Imputation

5.4.1 Approximate Query Processing

5.4.2 Dynamic Imputation

5.4.3 Missing Tuples Impact on Query Results

5.5 Summary

**6 Query Result Imputation for Aggregation Queries **

6.1 Introduction

6.2 Motivation

6.3 Imputation Model

6.3.1 Aggregate Queries and Query Patterns

6.3.2 Imputation Rules and Imputation Queries

6.4 Query Imputation Process

6.4.1 Step 1: Annotating Query Results

6.4.2 Step 2: Generate Candidate Imputations

6.4.3 Step 3: Imputation Strategy

6.4.4 Step 4: Imputation Query Generation

6.5 Implementation

6.5.1 Partition Patterns Classification

6.5.2 Imputation Query SQL Implementation

6.6 Experiments

6.6.1 Query Result Annotation

6.6.2 Query Result Imputation

6.7 Summary

**III Reasoning With Fragment Summaries **

**7 Summarizing and comparing data fragments using patterns **

7.1 Introduction

7.2 Motivation

7.3 Fragment and Summary Model

7.3.1 Data Fragments

7.3.2 Fragment Summaries

7.4 Reasoning with Fragment Summaries

7.4.1 Formal Reasoning Model

7.4.2 Reasoning with Queries

7.5 Experiments

7.6 Related Work

7.7 Summary

**IV Conclusion and Future Work **

**8 Conclusion and perspectives **

8.1 General Conclusion

8.2 Perspectives on the Completeness Model

8.2.1 User-Friendly Interface

8.2.2 Incremental Minimal Covers

8.3 Perspectives on Query Result Imputation

8.3.1 Imputation Quality Model

8.3.2 Imputation Strategy

8.3.3 Shared Query Result Imputations

**Appendix A Résumé en Français **