Get Complete Project Material File(s) Now! »

## Active Learning and Version Space Algorithms

In this first section, we provide an overview of the active learning domain and, in particular, version space techniques.

**Active Learning**

Settles [2012] and Monarch [2021] provide an overview of the most relevant results in the active learning (AL) literature. We focus our attention on a particular form called pool-based active learning, in which a classification model is built over a large data set by relying on user feedback over targeted data samples. More precisely, based on the current classification model, the active learning algorithm can inspect the data pool and extract the data example that, if labeled, will result in the fastest increase in classification accuracy. In doing so, active learners can build an accurate classifier with few labeled examples and, consequently, minimal user eﬀort.

In practice, AL is usually employed in domains where labeling data is expen-sive and labeled data sets are scarce. For example, recent fields of application in-clude biomedical image segmentation [Yang et al., 2017], health monitoring [Bull et al., 2018], and crisis management [Pohl et al., 2018]. Active learning has also been applied in crowd-sourcing scenarios [Song et al., 2018, Hantke et al., 2017], where multiple annotators work on the same labeling task. In our work, we in-tend to apply active learning, in particular the version space algorithms, in the setting of human-in-the-loop model development.

**The Version Space**

Mitchell [1977, 1982] was the first to introduce the concept of Version Space (VS). In his work, Mitchell considered the concept learning problem, in which one aims to identify the true concept (hypothesis) h⇤ 2 H ⇢ {h : x 7!y 2 {±1}} from its evaluations over a finite set of examples L = {(xi, yi)}mi=1 with yi = h⇤(xi). To determine h⇤, Mitchell introduced a special subset of classifiers called the version space V, defined as the set of all hypotheses perfectly matching the training data:

V = {h 2 H : h(xi) = yi for all (xi, yi) 2 L} (2.1)

The main interest of considering the version space is it possesses two main properties:

• Assuming no labeling mistakes in L, the true concept h⇤ must be inside V, that is, h⇤ 2 V.

• The version space V tends to shrink as more labeled data becomes available.

From these two properties, one can interpret the version space V as the set of all candidate hypotheses for h⇤. In particular, if enough labeled data are available, the version space is reduced to a single element that must correspond to the true concept, that is, V = {h⇤}. Additionally, in case the available labeled data is not enough to identify h⇤, Mitchell [1982, 1997] also proposed two relevant ideas:

• Majority Voting : We can approximate h⇤ by a majority voting among candidate hypotheses h 2 V.

• Labeling new examples: Another option is to add new labeled examples to L. In this case, Mitchell proposes to label the objects x over which the candidate hypotheses disagree the most. This way, each new labeled example would eliminate a large portion of candidates from V and, consequently, help quickly identify h⇤.

In recent years, the idea of version space for tackling concept learning tasks and, in general, supervised learning problems has largely been discarded because of its natural limitations. First, version-space-based algorithms like FIND-S and CANDIDATE ELIMINATION [Mitchell, 1982, 1997] need to maintain some represen-tation of V, which can be exponential in size even for simple concepts [Haussler, 1988, Sauer, 1972]. In addition, its intolerance to label noise makes it impracti-cal to most use cases; even though a few works had some success in proposing a noise-tolerant VS strategy [Hirsh, 1990, Hong and Tseng, 1997], they can still suﬀer from the exponential size problem. However, despite its limitations, VS-based techniques have still found relative success in the Active Learning scenario, which we discuss next.

### Version Space Methods in Active Learning

One domain where version-space-based techniques have found their success is in Active Learning (AL). Based on the groundwork set by Mitchell [1982, 1997], several AL strategies have been devised to explore the version space properties. More precisely, these techniques aim to reduce the version space V as quickly as possible until we are left with a single hypothesis achieving perfect classification. Various strategies have been developed to estimate the VS reduction of each sample, each oﬀering a diﬀerent trade-oﬀ in terms of performance, eﬃciency, and theoretical guarantees. Below, we give an overview of the most relevant algorithms in the AL literature: Generalized Binary Search (GBS) [Dasgupta, 2005, Golovin and Krause, 2011]: Also called the Version Space Bisection rule, this algorithm implements the data labeling strategy proposed by Mitchell [1977, 1982], searching for a point x for which the classifiers in V disagree the most. More precisely, let us define Vx,y = {h 2 V : h(x) = y} and let ⇡ be any probability distribution over the set of classifiers corresponding to our prior knowledge about the user interest (e.g. uniform). Then, the GBS strategy selects the point x for which the sets Vx,y have approximately the same size for all labels, that is, it minimizes maxy ⇡(Vx,y).

The main interest behind GBS comes from its strong theoretical guarantees. If we define the cost of any active learner A as the average number of queries required to reduce V to a single element (and, consequently, identifying the user interest), then one can show that cost(GBS) OP T • 1 + log minh ⇡(h) where OP T = minA cost(A). Because of this, GBS is said to possess near-optimal performance guarantees.

However, implementing the bisection rule is prohibitively expensive: for each instance x in the data set, one must evaluate h(x) and ⇡(h) for each hypothesis h in the version space, which is often exponential in size O(mD), where m is the size of the data set and D is the VC-dimension [Sauer, 1972]. Due to this limitation, several approximations of the bisection rule have been introduced in the literature.

Simple Margin [Tong and Koller, 2001]: Simple Margin is a rough approxima-tion of the bisection rule for SVM classifiers, leveraging the heuristic that data points close to the SVM’s decision boundary closely bisect the version space V. However, this method can suﬀer from slow convergence in practice, especially when V is very asymmetric. In addition, Simple Margin does not possess any theoretical guarantees on performance.

Kernel Query-by-Committee (KQBC) [Gilad-Bachrach et al., 2006] and Query-by-Disagreement (QBD) [Settles, 2012]: These algorithms approx-imate the bisection rule by constructing two representative hypotheses in the version space and then select any data point for which these two classifiers dis-agree. In particular, KQBC samples two random hypotheses from the version space, while QBD trains one positive and one negatively biased classifier. Again, these methods do not provide any theoretical results on performance and, on top of that, they can suﬀer from slow convergence since the selected point is only guaranteed to “cut” V, but possibly not by a significant amount.

ALuMA [Gonen et al., 2013]: ALuMA is an application of the Generalized Bi-nary Search [Golovin and Krause, 2011] algorithm for linear classifiers and uses a sampling technique for estimating the version space reduction of any sample. It can also support kernels via a preprocessing step, but this procedure can be very expensive to run. Additionally, ALuMA was also show to possess similar theoretical guarantees on performance as the original GBS algorithm. Finally, ALuMA was shown to outperform several other version space algorithms, includ-ing Simple Margin and Query-by-Committee [Seung et al., 1992]. Hence, we use ALuMA as a baseline version space algorithm in this work.

More recently, version-space-based techniques have also been successfully ap-plied to a broader range of scenarios, such as cost-sensitive classification [Krish-namurthy et al., 2017] and batch-mode active learning [Chen and Krause, 2013]. In this work, we focus on how version space techniques can be applied to the interactive model development problem.

**Data Exploration and Model Development**

Next, we consider other forms of interactive model development in the literature, including a recent line of work on human-in-the-loop data exploration systems.

Data Programming under Weak Supervision. One form of interactive model development is the data programming framework introduced by SNORKEL [Rat-ner et al., 2016, Bach et al., 2017], which has gained attraction in recent years.

In this framework, an expert user writes a set of labeling functions represent-ing simple heuristics used for labeling data instances. By treating these labeling functions as weak learners, SNORKEL can build an accurate classifier without having the user manually label any data instance. In more recent work, a data programming system called SNUBA [Varma and Ré, 2018] was designed to auto-matically generate labeling functions from an initial labeled data set, leading to improved classification accuracy and further reduction of the user manual eﬀort. However, this system still requires an initial labeled set in the order of hundreds to thousands of labeled instances, which is nontrivial to obtain, especially in the new trend of “Machine Learning for Everyone” where an everyday user may start with no labeled data at all.

Multi-Criteria Decision Aid. Another interesting form of human-in-the-loop data exploration comes from the Multi-Criteria Decision Aid (MCDA) framework [Figueira et al., 2005, Nemery and Ishizaka, 2013]. In MCDA, the main objective is to design a system capable of helping users make complex decisions over large data sets when facing multiple conflicting criteria. More precisely, the user has to specify a collection of functions {ui}i computing relevance scores for each data object x, and the system will help the user decide on which elements are the best according to the specified criteria. For example, consider the case of a user looking to buy a car. In this case, the user could specify a few criteria, such as the vehicle’s price, speed, and size, and then ask the system to find the cars oﬀering the best trade-oﬀ according to these metrics.

One popular approach to MCDA is to rely on the Pareto optimality crite-ria [Grierson, 2008, Talukder and Deb, 2020]. More precisely, these methods focus on retrieving the data points x on the Pareto front defined by {ui(x)}i since they cannot be dominated by any other object in the data set. However, such methods still require the user to further investigate and refine the set of Pareto optimal elements, which may be non-trivial to do for a high number of criteria [Talukder and Deb, 2020]. Another common approach to MCDA involves Multi-Attribute Utility Theory (MAUT) [Nemery and Ishizaka, 2013, Bresson et al., 2020] meth-ods. Its main idea is to train, with the user’s help, an aggregation function A that combines all criteria into a single value condensing the overall utility of a data object. However, one problem with these methods lies in interpretability: finding the right balance between the interpretability of the utility score and the repre-sentation power of A is a non-trivial issue. One approach to this problem is to consider the Choquet integral aggregator [Bresson et al., 2020], which can model interactions between criteria while maintaining some degree of interpretation by the user.

Interactive Data Exploration. In the interactive data exploration domain, the main objective is to design a system that guides the user towards discovering relevant data points in a large data set. One recent line of work is “explore-by-example” systems [Dimitriadou et al., 2014, 2016, Liu et al., 2017, Huang et al., 2018], which leverage active learning techniques to construct an accurate model of the user interest from a small set of user-labeled data points. These human-in-the-loop systems require minimizing both the user labeling eﬀort and the running time to find a new data point for labeling in each iteration. In particular, recent work [Huang et al., 2018] is shown to outperform prior work [Dimitriadou et al., 2014, 2016] via the following technique:

Dual-Space Model (DSM) [Huang et al., 2018]: In this work, the user interest pattern is assumed to form a convex object in the data space. By leveraging this property, this work proposes a dual-space model (DSM) that builds a polytope model of the user interest on top of any active learning strategy. In particular, this polytope model partitions the data space into three segments: the positive region, where points are guaranteed to be interesting, the negative region, where points are guaranteed to be uninteresting, and the remaining unknown region. In doing so, DSM can use the polytope model to automatically label most of the data points selected by the active learner and, consequently, significantly reduce the user labeling eﬀort.

In addition, DSM can also leverage a factorization structure provided by the user in a limited form. When the user pattern is a conjunction of convex or concave sub-patterns, their system factorizes the data space into low-dimensional subspaces and runs the dual-space model in each subspace. In doing so, the original complex exploration process is broken into several simpler tasks, thus significantly improving the performance of their system.

#### Interpretable Models and Feature Selection

This section explores the Machine Learning and Statistics literature for works that could help us develop a classification model mimicking the user decision-making process under factorization, as described in Section 1.4. In particular, we focus our attention on two main lines of work: Interpretable Machine Learning and Feature Selection.

Interpretable Machine Learning. In the Interpretable Machine Learning domain, the main objective is to design machine learning models whose pre-dictive processes can be easily interpreted by humans. More precisely, besides striving for accurate predictions, interpretable models aim to provide people a better understanding of “how” and “why” predictions are made. In particular, the factorized classifier we aim to build is also interpretable given that it mimics the reasoning of real users, making this line of work very suitable for drawing inspiration from. In what follows, we focus our attention on a few most relevant works in this domain, leaving a complete survey on the topic to Molnar [2019].

• Logistic Regression: Logistic Regression is one of the simplest and most studied classification models in the literature. It starts by assuming that the positive and negative classes can be well-separated by a linear hyper-plane b+ hx, wi, and constructs a probabilistic model of the label Y 2 {±1}

on the form P(Y = y|X = x) = (y(b + hx, wi)) with being the sigmoid function. Due to its simplicity, the logistic regression classifier can be rel-atively easy to interpret: not only it leads to linear decision boundaries, but its weights also represent a rough measure of the feature importance. However, this model’s simplicity can also be a disadvantage since it be-comes unable to accurately learn more complex interest patterns: in fact, none of the user patterns we observed in practice are linearly separable; see Appendix D for more information.

• Decision Rules [Holte, 1993, Letham et al., 2015]: Decision rules attempt to break the user interest pattern as a collection of simple IF-THEN con-ditions. For example, over a dataset containing cars, this algorithm may generate conditions on the form “IF price > 100k THEN Y = 0” or “IF color = red AND horsepower > 100 THEN Y = 1”. As we can see, de-cision rules are simple to interpret by humans since their decisions corre-spond to combinations of simple logical formulas. Algorithms can also im-prove interpretability by reducing the number of features composing each rule [Holte, 1993] and minimizing conflicts between diﬀerent rules [Letham et al., 2015]. However, rule-based learning also has some fundamental dis-advantages: first, most models cannot deal with numerical attributes and require an initial discretization step; finally, IF-THEN rules usually fail to describe linear relationships between features and output and often require the combination of several rules to compensate for this deficiency.

• Projection Retrieval for Classification (PRC) [Fiterau and Dubrawski, 2012]: PRC methods assume that labels can be accurately inferred from a small selection of features and aim to find which sets of attributes are more rele-vant for diﬀerent data points. In other words, PRC algorithms compute a mapping between data points and low-dimensional projections, eﬀectively partitioning the data into sets that can be accurately classified with a small number of targeted features. Similar to our factorized classifier idea, PRC also attempts to break the complex user labeling process into a collection of simple, low-dimensional decisions. However, both methods still diﬀer in their execution of this idea: PRC models the labeling process as a sin-gle low-dimensional decision for each point, while factorization models the label of any point as a set of several low-dimensional decisions.

In summary, we can achieve interpretability through very diﬀerent means. Some methods bet on a simple relationship between features and output, while others attempt to break a complex decision as a collection of easy-to-interpret classifiers. However, despite having similar motivations, none of the above tech-niques directly addresses the factorization of the user decision-making process as observed in our user studies.

Feature Selection. Feature selection methods attempt to find a small set of most relevant features for prediction while having minimal impact on perfor-mance. In the factorization-learning scenario, although these methods cannot directly deal with multiple subspaces of features (at least not without the sub-space labels), we expect that some form of feature selection will be necessary to induce a low-dimensional representation in each subspace. In particular, we will focus our attention on the following lines of work:

• Lasso-based penalization [Tibshirani, 1996]: In high-dimensional scenarios, the interpretability of linear classifiers becomes compromised by the high number of parameters to evaluate. Lasso-based penalties were introduced to counter this problem: by including an L1-based penalty term to the classification loss, the optimal weight vector is guaranteed to be sparse, thus eﬀectively removing irrelevant features from consideration. In more recent times, Lasso-based penalization has been successfully applied in a wide range of scenarios [Hastie et al., 2015] such as Compressed Sensing and Matrix Completion.

• Group Lasso [Yuan and Lin, 2006]: In some applications, one may be in-terested in selecting whole groups of features instead of individual ones. In such cases, the Lasso-based penalization scheme has been extended into the Group Lasso, an L2-based penalty function that allows for naturally grouping diﬀerent sets of features. This grouped selection property is par-ticularly important for selecting categorical variables [Chiquet et al., 2016] once they are encoded into groups of numerical features.

• Importance-based selection [Breiman, 2001, Strobl et al., 2007]: Another common feature selection technique is to rank features based on their im-portance. Importance is often calculated through some particular property of the model under consideration. For example, linear classifiers define im-portance as the absolute value of its weights, while decision trees measure the mean decrease in impurity for each feature. However, some methods can also compute importance in a model-agnostic fashion. Importance per-mutation [Strobl et al., 2007], for instance, measures the impact on the performance caused by randomly permuting the values of each feature; intuitively, more important features should lead to a bigger impact on per-formance. Overall, feature-importance methods can be reasonably eﬀective in practice and also eﬃcient to run.

**Table of contents :**

**1 Introduction **

1.1 Technical Challenges of Interactive Model Development

1.2 Version Space Algorithms

1.3 Factorization Structure

1.4 Learning a Factorized Classification Model

1.5 Thesis Organization

**2 LiteratureReview **

2.1 Active Learning and Version Space Algorithms

2.1.1 Active Learning

2.1.2 The Version Space

2.1.3 Version Space Methods in Active Learning

2.2 Data Exploration and Model Development

2.3 Interpretable Models and Feature Selection

**3 Generalized Binary Search over Kernel Classifiers **

3.1 Generalized Binary Search Algorithm Overview

3.2 Parameterized Hypothesis Space

3.3 Kernel Classifiers

3.4 Dimensionality Reduction

3.5 Approximations for Large Data Sets

3.5.1 The Majority Vote Classifier

3.5.2 Approximations for Majority Vote

3.6 Chapter Summary

**4 Implementation and Optimizations **

4.0.1 Computing the Partial Cholesky Factors `j

4.1 Estimating the Cut Probabilities via Sampling

4.2 Improving the Sample Quality via Rounding

4.2.1 The Ellipsoid Caching Algorithm

4.2.2 An Optimized Rounding Algorithm

4.3 Hit-and-run’s Starting Point

4.4 Experimental Results

4.5 Chapter Summary

**5 AFactorizedVersionSpaceAlgorithm **

5.1 Introduction to Factorized Version Space

5.2 Overview of a Factorized Version Space Algorithm

5.3 Bisection Rule over Factorized Version Space

5.4 Factorized Squared-Loss Strategy

5.5 Factorized Product-Loss Strategy

5.6 Optimization for Categorical Subspaces

5.7 Experimental Results

5.8 Chapter Summary

**6 Learning a Factorized Classifier **

6.1 Learning a Factorized Classifier

6.1.1 Training an Accurate Factorized Model

6.1.2 Feature and Subspace Selection

6.1.3 Understanding the Factorized Linear Model

6.1.4 Factorized Kernel Model

6.1.5 Dealing with Categorical Variables

6.2 The Active Learning Scenario

6.2.1 Uncertainty Sampling for FLM

6.2.2 Swapping Algorithm for FLM

6.2.3 Switching to an Explicitly Factorized Algorithm

6.3 Evaluation in the Batch Learning Setting

6.4 Evaluation in the Active Learning Setting

6.5 Chapter Summary

**7 Conclusions and Future Work **

7.1 Thesis Summary

7.2 Future Work

A Proof of Dimensionality Reduction Lemma (Lemma 3.8) A-1

B Hit-and-Run Implementation B-1

C Rounding Implementation C-1

D User Queries D-1