System Overview and Background

In this chapter, we give an overview of the design of our active learning-based explore-by-example system, present the essential background on data exploration, explore-by-example, active learning, and label-noise learning, and introduce a recently proposed polytope-based data space model that inspired our work.

System Overview

Our data exploration system is depicted in Figure 2.1. The main concepts and modules are described as follows.
Data space. When a user comes to explore a database, she is presented with the database schema for browsing. Based on her best understanding of the (implicit) exploration goal, she may choose a set of attributes, $\{A_i\}$, $i = 1, \ldots, d$, from a table for consideration. These attributes form a superset of the relevant attributes that will eventually be discovered by the system. Let us consider the projection of the underlying table onto $\{A_i\}$, and pivot the projected table such that each $A_i$ becomes a dimension and the projected tuples are mapped to points in this $d$-dimensional space; the resulting space is called a data space where the user exploration will take place.
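As a minimal illustration (a sketch only; the table, attribute names, and values below are hypothetical), the data space can be obtained by projecting the table onto the chosen attributes and treating each projected tuple as a point:

```python
import pandas as pd

# Hypothetical cars table; the user selects the attributes {A_i} she considers
# possibly relevant, e.g., length and width (a superset of the truly relevant ones).
cars = pd.DataFrame({
    "id":     [1, 2, 3, 4],
    "length": [4.5, 3.9, 5.1, 4.2],   # meters
    "width":  [1.8, 1.7, 1.9, 1.75],  # meters
    "color":  ["red", "blue", "red", "green"],
})

selected_attrs = ["length", "width"]          # the chosen {A_i}, here d = 2
data_space = cars[selected_attrs].to_numpy()  # each row becomes a point in the d-dimensional data space
print(data_space.shape)                       # (4, 2)
```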
Initial examples. To bootstrap data exploration, the user is asked to give a positive example and a negative example. If she does not have such examples at hand, the system can run initial sampling [31, 58] over the data space to help her find them. Since the initial sampling problem has been studied before, our work in this thesis focuses on data exploration after such initial examples are identified.
Iterative learning and exploration. The iterative exploration starts with a given positive example set and a negative example set, which form the initial labeled dataset. In each iteration, the labeled dataset is used to train a user interest model, which is fast due to the small size of the training data. Until the model converges or reaches a user-specified accuracy level, it is used to explore the data space and retrieve a new example for display. In the next iteration, the user labels this example as positive or negative; such feedback can be collected explicitly through a graphical interface [30] or implicitly based on user behavior. The newly labeled example is added to the labeled dataset, and the above process repeats.
Convergence and final retrieval. At each iteration, our system assesses the current model to decide whether more iterations are needed. The process is terminated when the model accuracy has reached a user-defined threshold or the user stops data exploration. At this point, the model for the positive class is translated to a query which will retrieve from the database all the tuples classified as relevant. To provide better interpretability, our system can visualize the final (nonparametric) model and fit a corresponding parametric model, the form of which can be suggested by the user after seeing the visualization.
Our system aims to (1) achieve accurate results with a minimum number of examples presented to the user (i.e., the amount of user labeling effort) and (2) restrict per-iteration time to within a few seconds. The per-iteration time refers to the time cost for one round of exploration, including classification, convergence detection, and space exploration.
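To make the iterative process concrete, the following is a minimal, self-contained sketch of such a loop (not our actual implementation): it assumes scikit-learn's SVC as the classifier, uncertainty sampling as the exploration strategy, and a simulated user whose interest is the hypothetical predicate length * width > 20, standing in for real feedback.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical 2-d data space (e.g., length and width), 1,000 unlabeled points.
pool = rng.uniform(0, 10, size=(1000, 2))

# Simulated user interest ("length * width > 20"); it stands in for the real
# user's feedback and is never shown to the learning algorithm.
def user_label(x):
    return int(x[0] * x[1] > 20)

# Initial positive and negative examples bootstrap the labeled dataset.
labeled_idx = [int(np.argmax(pool[:, 0] * pool[:, 1])),   # clearly positive
               int(np.argmin(pool[:, 0] * pool[:, 1]))]   # clearly negative
labels = [user_label(pool[i]) for i in labeled_idx]

for _ in range(30):  # budget on user labeling effort
    # Train the user interest model; this is fast because the training set is small.
    model = SVC(kernel="rbf", gamma="scale").fit(pool[labeled_idx], labels)
    # Explore the data space: pick the unlabeled point closest to the decision boundary.
    candidates = [i for i in range(len(pool)) if i not in labeled_idx]
    distances = np.abs(model.decision_function(pool[candidates]))
    next_idx = candidates[int(np.argmin(distances))]
    # The user labels the new example, which is added to the labeled dataset.
    labeled_idx.append(next_idx)
    labels.append(user_label(pool[next_idx]))

# Final retrieval: all tuples classified as relevant by the learned model.
relevant = pool[model.predict(pool) == 1]
print(f"{len(relevant)} tuples retrieved as relevant")
```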

Related Work

We survey data exploration frameworks and techniques related to our work in Section 2.2.1, describe active learning-based data exploration in detail in Section 2.2.2, and finally survey label-noise learning in Section 2.2.3.

Data Exploration Frameworks and Related Techniques

Our active learning-based interactive data exploration system aims to help the user formulate queries interactively, by example, while minimizing her labeling effort. We next review related data exploration frameworks from different perspectives (interactive data exploration, query formulation, and example-based exploration) as well as recent techniques for minimizing user labeling effort (active learning and active search).

Data exploration frameworks

Interactive data exploration has attracted significant recent interest. Faceted search [9, 52, 75] iteratively recommends query attributes for drilling down into the database, but the user is often asked to provide attribute values until the desired tuple(s) are returned [52, 74, 75] or to supply an “interestingness” measure and its threshold [29]. Semantic windows [51] are pre-defined multidimensional shape-based and content-based predicates that a user can explore. These methods differ from our active learning approach. Recent work also supports time series data [65] or avoids false discoveries of statistical patterns [101] during interactive data exploration.
The active learning-based explore-by-example approach offers potential benefits over faceted search and semantic windows for several reasons.
First, the user interest may involve varying degrees of complexity, or the user may not have prior knowledge about this complexity and hence expects the system to learn a model as complex as necessary for her interest. More specifically, if the user knows the relevant attributes and just wants to set the appropriate value for each attribute, an interactive GUI or faceted search [9, 52, 75] may be sufficient. However, the user interest may involve more complex constraints, e.g., $\left(\frac{rowc - a_1}{b_1}\right)^2 + \left(\frac{colc - a_2}{b_2}\right)^2 < c$ and $x + 2.5 \cdot \log_{10}(y) < 23.3$ from the SDSS example query set [81], or $length \cdot width > c$ as seen in our car database. If the user knows the function shape and constants in advance, some of the above examples (e.g., the ellipse pattern) can be supported by semantic windows [51] as predefined patterns. However, if the user does not have such prior knowledge, explore-by-example may work regardless of how complex the predicates are and which functions and constants are used in them.
Second, increased dimensionality makes it harder for semantic windows to scale. For example, when the interest of the SDSS user involves both the ellipse and log patterns above, it becomes more difficult for both the system and the user to handle multiple patterns during data exploration (even if such patterns can be predefined). In contrast, as the dimensionality increases, explore-by-example can keep the same user interface and handle the increased complexity via its learning algorithm “behind the scenes”.
Query formulation has been surveyed in [24]. The closest to our work is LifeJoin [25], to which we compare in Section 4.3.3. Query By Output (QBO) [88] takes the output of one query on a database and constructs another query such that the two queries are instance-equivalent on that database. Dataplay [2] provides a GUI for users to directly construct and manipulate query trees; it assumes that the user can specify the value assignments used in her intended query, an assumption that we cannot make, and learns conjunctions of quantified Horn expressions (with if-then semantics) over nested relations [1].
Example-based exploration is a specific framework for data exploration that removes the need for the user to specify a query. Earlier work on example-based exploration focused on a visualization front-end that aims to minimize the user effort to learn the SQL syntax [48, 68]. Recent work [63] proposes exemplar queries, which treat a query as a sample from the desired result set and retrieve other tuples based on similarity metrics, but for graph data only. The work [79] considers data warehouses with complex schemas and efficiently learns minimal project-join queries from a few example tuples; it does not consider selection with complex predicates, which is a focus of our work. The work [50] helps users construct join queries for exploring relational databases, and [54] does so by asking the user to determine whether a given output table is the result of her intended query on a given database.
For a more comprehensive survey of example-based exploration, see [57].

Related techniques for minimizing user labeling effort

The following techniques – active learning and active search – have been widely used to minimize user labeling effort for data exploration.
Active Learning. Tong and Koller [87] provide a theoretical motivation for selecting new examples using the notion of a version space, but with unknown convergence speed. Related to our work is a lower bound on the probability of misclassification error on the unlabeled training set [20]. However, it relies on user labeling of an additional sample from the unlabeled pool, which is not required in our work. Recent papers [34, 41–43] offer probabilistic bounds for the classification error and sample complexity. Our work differs in that we focus on the F-score, which suits selective user interest queries (i.e., imbalanced classes in classification).
Most learning theory makes no convexity assumptions on the data or class distributions [44]. Clustering techniques may assume that the data is clustered in convex sets [44], but they address a different problem from ours. In active learning, convexity assumptions arise in the version space [11, 87], which is the set of classifiers consistent with the training data. Our algorithm can work with any classifier derived via the version space, while also incorporating the new polytope model. In Section 2.2.2, we describe the essential components of active learning-based data exploration in more detail.
Active Search. Active search [39, 59, 90] aims to maximize the number of positive examples discovered, called the Target Set Accuracy, within a limited budget of user labeling effort. In comparison, our work aims to maximize the F-score of the model learned to approximate the true user interest. We report the performance difference from [39] in our experimental evaluation. The works [59, 90] use a kernel function to measure the similarity of items in order to maximize the utility of a set of selected items. Such kernel methods impose a smoothness requirement, i.e., similar items must have similar utility values, and need training data to tune the kernel for each use case (user interest); neither property suits our problem setting.


Active Learning for Data Exploration

The problem of dynamically seeking the next example for labeling from a large database of unlabeled tuples based on a few labeled examples is closely related to active learning. Recent results on active learning are surveyed in [78]. Below, we summarize those results most relevant to our work.
Pool-Based Sampling. Many real-world problems fit the following scenario: there is a small set of labeled data L and a large pool of unlabeled data U available. In active learning, an example is chosen from the pool in a greedy fashion, according to a utility measure used to evaluate all instances in the pool (or, if U is large, a subsample thereof). In our setting of database exploration, the labeled data L is what the user has provided thus far. The pool U is a subsample of size m of the unlabeled part of the database. The utility measure depends on the classifier in use, as discussed below.
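A minimal sketch of pool-based sampling under these assumptions (the classifier, utility measure, and data below are illustrative: prediction uncertainty serves as the utility, and a random subsample plays the role of the pool U when the unlabeled data is large):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def choose_next_example(model, unlabeled_db, m=1000, rng=None):
    """Greedily pick the instance with the highest utility from a pool U,
    where U is a subsample of size m of the unlabeled data."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(unlabeled_db), size=min(m, len(unlabeled_db)), replace=False)
    pool = unlabeled_db[idx]
    # Utility: prediction uncertainty, largest when the class probabilities are near 0.5.
    proba = model.predict_proba(pool)[:, 1]
    utility = 1.0 - 2.0 * np.abs(proba - 0.5)
    return idx[int(np.argmax(utility))]

# Usage with hypothetical data: 100,000 unlabeled tuples, 10 labels collected so far.
rng = np.random.default_rng(0)
unlabeled_db = rng.uniform(0, 1, size=(100_000, 3))
X_labeled = rng.uniform(0, 1, size=(10, 3))
y_labeled = (X_labeled[:, 0] > np.median(X_labeled[:, 0])).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_labeled, y_labeled)
print("next example to label:", choose_next_example(model, unlabeled_db, rng=rng))
```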
Classification Model. Previous explore-by-example systems [31, 32] used decision trees to build a classification model. Decision trees work well if the user interest pattern is a hyper-rectangle in the data space, whereas real-world applications may involve more complex predicates. To support higher complexity, our system uses more powerful classifiers such as Support Vector Machines or Gradient Boosting models. The new techniques proposed in our work do not depend on the specific classifier; they can work with most existing classifiers. However, the implementation of active learning does depend on the classifier in use. For ease of exposition, in this thesis we use Support Vector Machines as an example classifier.
Support Vector Machines. Support Vector Machines have been widely used for classification, regression, and other learning tasks such as outlier detection. For a classification problem, a support vector machine (SVM) constructs a hyperplane in a high- or infinite-dimensional space that has the largest distance to the nearest training examples of any class, aiming to minimize the generalization error of the classifier.
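As a small illustration of the margin idea (synthetic data and a linear kernel; a sketch rather than the classifiers used later in the thesis), the distance of each training point to the SVM's hyperplane w·x + b = 0 can be computed directly, and the support vectors are the points attaining the smallest such distance:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two synthetic classes in a 2-d space.
X = np.vstack([rng.normal(loc=[2.0, 2.0], scale=0.8, size=(20, 2)),
               rng.normal(loc=[6.0, 6.0], scale=0.8, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Distance of each training point to the separating hyperplane w.x + b = 0.
w, b = svm.coef_[0], svm.intercept_[0]
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The SVM chooses the hyperplane that maximizes the smallest of these distances
# (the margin); the closest training points are the support vectors.
print("margin (distance of the closest training point):", distances.min())
print("number of support vectors:", len(svm.support_vectors_))
```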

Table of contents:

1 Introduction 
1.1 Technical Challenges
1.2 Main Ideas
1.3 Contributions
1.3.1 Dual-Space Model
1.3.2 High-dimensional Exploration
1.3.3 Learning with Label Noise
1.4 Thesis Outline
2 System Overview and Background 
2.1 System Overview
2.2 Related Work
2.2.1 Data Exploration Frameworks and Related Techniques
2.2.2 Active Learning for Data Exploration
2.2.3 Label-Noise Learning
2.3 A Polytope Model in Data Space
2.3.1 Three-Set Partitioning
2.3.2 Lower Bound of F-score
3 Dual-Space Model 
3.1 Dual-Space Model and Algorithm
3.2 Extension for Categorical Attributes
3.3 Optimizations
3.3.1 Sampling for Creating the Evaluation Set (OPT1)
3.3.2 Distributed Computing (OPT2)
3.3.3 Sampling for Pool-based Active Learning (OPT3)
3.4 Differences from Prior Work
3.5 Summary
4 High-dimensional Exploration 
4.1 Factorization
4.1.1 Factorized Dual-Space Model
4.1.2 Factorized Classifier and Sample Acquisition Strategy
4.1.3 Lower Bound of F-score with Factorization
4.1.4 Testing Assumptions
4.2 Online Feature Selection
4.3 Experimental Evaluation
4.3.1 Experimental Setup
4.3.2 Dual-Space Algorithm with Factorization
4.3.3 Comparison to Alternative Systems
4.3.4 User Study using a Car Database
4.4 Summary
5 Learning with Label Noise 
5.1 Label Noise Models and Analysis
5.1.1 Label Noise Models
5.1.2 Analysis of Automatic Noise Distillation
5.2 Robust Dual-Space Model
5.2.1 Data Space Model Refinement
5.2.2 Adaptive Auto-Labeling
5.2.3 Robust Dual-Space Model for Data Exploration
5.3 High-dimensional Exploration with Label Noise
5.3.1 Factorized Robust Dual-Space Model and Algorithm
5.3.2 Effect of Factorization
5.4 Optimizations
5.4.1 Optimization on Interactive Performance
5.4.2 Fine-tuning the Classifier
5.5 Simulation of Label Noise Models
5.6 Experimental Evaluation
5.6.1 Experimental Setup
5.6.2 Comparison of Data Cleansing Methods
5.6.3 Effect of Fine-tuning the Classifier
5.6.4 Evaluation of Robust Dual-Space Algorithm
5.6.5 Evaluation of Optimization on Interactive Performance
5.6.6 Evaluation on User Study Queries
5.7 Summary
6 Conclusions and Future Work 
6.1 Thesis Summary
6.2 Future Work
Bibliography 
Appendix A User Interest Queries
Appendix B Additional Plots for Label Noise Experiments
