Get Complete Project Material File(s) Now! »

## Extreme Single Label Classification

**Introduction**

Single label classification is a well studied problem in machine learning for which several eﬀective solutions resulting from decades of research have been derived and now widely applied to solve industrial problems (Bishop, 2006). However, the rapid development of the internet and the increasingly large amount of available labelled data (thanks to social and collaborative websites such as Wikipedia or Flicker) have changed the nature of the problem and made the most of the tradi-tional approaches to single label classification obsolete since they do not scale to extreme classification. Single label classification has been initially tackled with flat techniques. Recently, there has been an increased interest in hierarchical methods because of their reduced complexity compared to flat approaches. In this chap-ter, we present the main contributions in the literature from these two families of approaches that can be applied to solve extreme single label classification.

**Flat approaches**

Flat approaches to multiclass categorization approaches do not rely on a hierarchy at inference to reduce their complexity (conversely to hierarchical approaches that will be discussed in the next section) even though they can exploit existing seman-tic information to increase their accuracy (Weinberger and Chapelle, 2008). This family of method can be divided into two subgroups that are machine learning reductions (Allwein et al., 2001, Dietterich and Bakiri, 1995) and single machine classifiers (Weinberger and Chapelle, 2008, Weston and Watkins, 1999). These two subgroups are rather diﬀerent in the way they approach the multiclass classifica-tion problem. On one hand machine, learning reductions rest on the predictions of independently learned binary classifiers to produce the final prediction for a given test instance. A notable examples of machine learning reductions is the infamous one-versus-all classifier that will be discussed in more details next. One the other hand single machine classifiers are either embedding based methods or extensions of binary classifiers to the multiclass setting and hence generally require a joint learning of all their parameters.

Extending classical binary classifiers such as support vector machines and logis-tic regression to the multiclass case has been extensively studied and has re-sulted in several eﬀective methods such as multiclass support vector machines (M-SVM) (Weston and Watkins, 1999), softmax regression and more recently deep neural networks (Bengio, 2009). However, despite their proven accuracy, these methods do not scale to extreme classification setting due to large their computational burden at both training and inference even though there are recent attempts to parallelize their training (Gopal and Yang). Therefore, we will only focus in this study on methods that are scalable to extreme classification and refer the interested reader to the following works and the references therein (Bengio, 2009, Bordes et al., 2007, Crammer and Singer, 2002, Gopal and Yang, Weston and Watkins, 1999). We describe next machine learning reductions and embed-ding based approaches to multiclass categorization and discuss how they fit in the context of extreme classification.

**Machine Learning Reductions**

Machine learning reductions of multiclass classification to binary classification have been around for a long time (Dietterich and Bakiri, 1995, Sch¨olkopf et al., 1995). However, these works have only been unified recently in the same framework in-cluding approaches such as Error Correcting Output Codes, One-versus-All, One-versus-One, Filter Trees and many more1 (Allwein et al., 2001). The key idea in all these methods is to use existing binary classifiers and combine them in order to have a multiclass classifier. The diﬀerence between these methods is therefore the way the binary classifiers are combined. This guides the choice of the right method for a given task because it governs the final performance of the methods and their complexity. For example, some widely used methods such as One-versus-One classifier (which train a binary classifier for each pair of label and adopt a voting scheme at inference) cannot be used be used in extreme classification prob-lems (despite their proven performances) because their quadratic complexity in the number of labels O(L2). Next we describe two popular and scalable approaches that are One-versus-Rest (OVR) and Error Correcting Output Codes (ECOC).

**Binary Classification**

The task of classification consist in learning, given a set of training examples D = {(xi, yi)}1≤1≤n, a mapping (or hypothesis) from the d-dimensional feature space to the label space : h : X d → Y. The simplest non-trivial classification problem that can be considered is binary classification where the set of labels is reduced to Y = {−1, +1}. This is arguably the most studied machine learning problem because of its obvious practical interest on its own, and also because it is a building block of more complicated machine learning systems such as multiclass classifiers. From an empirical risk minimization (ERM) point of view, learning a binary classifier reduces to finding the best hypothesis h from a hypothesis class H that minimizes the average number of disagreements between the predictions h(xi) and the actual labels yi on the finite training set D. This quantity, called the empirical risk as it is an approximation of the classifier’s expected risk on the distribution P from which the data was drawn, is generally associated with a regularization term to prevent overfitting (Vapnik, 1995). The final empirical risk is expressed as: RD(h) = =1 L(h(xi), yi) + λ||h|| (2.1)

where L is the loss function measuring the penalty of predicting h(xi) when the actual label is yi for a given instance, λ is the weights the importance of the reg-ularization term. The empirical risk minimizer is obtained when the minimum of RD in the hypothesis class H is attained: h∗ = argminh H RD(h). Depending on the diﬃculty of the problem at hand, various types of hypothesis classes such as linear models, kernel machines or neural networks can be used (Bishop, 2006). For instance, If the linear hypothesis class of functions is considered, each hypothesis h is parameterized by a weight vector w such that for a given example its corre-sponding prediction can be expressed as hw(x) = wT x + b where b is a bias term. Therefore, seeking for the best hypothesis h∗ in this case is equivalent to searching for the best weight vector w∗ according to the chosen loss function L.

The natural classification loss function we would like to optimize for is the zero-one loss : L0/1(h(xi), yi) = I(h(xi) =yi). It counts the number of disagreements between the prediction and the actual label. However this loss is not optimization friendly because it is not convex. Therefore, convex surrogates of L0/1 are used in practice hence resulting in various classifiers. Two notable examples of convex surrogates for the zero-one loss are the Hinge loss and the Logistic loss whose respective expressions are given below for an instance (x, y) and a hypothesis h and figure 2.1 shows how they upper bound the zero-one loss function (Bishop,

2006) :

Lhinge(y, h(x)) = max(0, 1 − y • h(x)) (2.2)

Llogistic(y, h(x)) = log(1 + exp(y • h(x))) (2.3)

Figure 2.1: Surrogates loss functions for the zero-one loss

When the linear hypothesis class is considered, the Hinge loss and the Logistic loss respectively give rise to the Support Vector Machine (SVM) and the Logistic Regression (LR) classifiers. Several eﬃcient solvers exist for these two problems23 mainly relying on stochastic gradient and dual coordinate descent algorithms (Bot-tou and Bousquet, 2008, Yu et al., 2011). Both of SVMs and LR models have demonstrated state of the art performances on several large scale binary classi-fication benchmarks 4. They have become methods of choice for solving binary problems but also as for building machine learning reduction based multiclass classifiers. This has been successfully done with One-Versus-All (OVA) and Error Correcting Output Code (ECOC) classifiers as will be discussed next.

**One-versus-Rest Classifier**

One-versus-Rest (OVR) also called One-versus-All (OVA) is the simplest approach among multiclass machine learning reductions to binary. It consist in training training a binary classifier for each category to distinguish its examples from those of the other classes. At inference, all the classifiers are evaluated and the test ex-ample is associated to the category whose classifier is the most confident (has highest prediction). This a winner-takes-all principle. Early work using this strat-egy dates back to (Sch¨olkopf et al., 1995). However a more cited paper about this method is (Rifkin and Klautau, 2004b). In this work, the authors empiri-cally show that when the binary classifiers are correctly calibrated, this approach outperforms other machine learning reductions (such as One-versus-One classifier (which trains a binary classifier for each pair of category) and Error Correcting Output Codes) and multiclass support vector machines (M-SVM) (Crammer and Singer, 2002, Weston and Watkins, 1999). Moreover, OVA is readily paralleliz-able since the binary classifiers can be trained and evaluated independently. This makes it a good candidate for extreme classification despite the class imbalance problem generally faced when training the binary classifiers.

**Error Correcting Output Codes**

Solving multiclass categorization problems with error correcting output codes (ECOC) has been introduced in (Dietterich and Bakiri, 1995). In this early work, the presented method consist in associating each of the L classes with a bit vector also called binary codeword of fixed size de. The set of codewords are the rows of a coding matrix M ∈ {−1, +1}L×de whose columns correspond to binary clas-sification problems. For each column l, its corresponding binary induced problem reduces to discriminate between the examples of the classes whose corresponding codeword Mj are such that Mjl = +1 from those for which Mjl = −1. Therefore, given a set of training examples {(xi, yi)}1≤n and a coding matrix M, the set of positive and negative example examples of the binary problem induced by column l respectively write D+l = {(xi, yi) : Myil = +1} and D−l = {(xi, yi) : Myil = −1}. Binary classifiers also called dichotomizers (hi)1≤i≤de (such as logistic regression or support vector machines previously described) are learned to predict the bit positions of a codeword. A given test instance x is associated to the class whose binary codeword is the closest (according to some distance measure d) to the predicted codeword for x. If we denote this decoding procedure D, we have:

D(x) = argminj d(Mj, h(x)) where h(x) = (h1(x), . . . , hde(x)) and Mj is the code-word of class j. The decoding procedure generally uses the hamming distance dH row separability : dH (y2, y3) = 2 average row separability : 1.83, (std = 1.12)

y1 y2 y3 y4 y5 y6

y1 −1 +1 − 1 +1

y2 +1 − 1 +1 −1 y1 0 4 2 1 2 2

y3

+1 +1 +1 +1 y2 4 0 2 3 2 2

y4 − − −

+1

1 1 1 y3 2 2 0 2 2 2

y5 −1 +1 +1 −1 y4 1 3 2 0 2 1

y6 +1 − − +1

1 1 y5 2 2 2 2 0 4

column separability : dH (c2, c3) = 2 y6 2 2 2 1 4 0

Figure 2.2: Row and column separability properties ensure good discrimination and error correcting capabilities of the the ECOC model measure since it is simple and fast to compute thanks to existing procedures (Pap-palardo et al., 2009). Subsequent studies such as (Allwein et al., 2001) proposed to use ternary codes instead of binary codes to reduce the number of classes involved in the binary induced problems.

Choosing a good coding matrix is a key factor in the success of an ECOC based multiclass categorizer. Mainly, the two following properties (also depicted in fig-ure 2.2) should be ensured:

Row separability : the codewords must be well separated to guarantee error correction capabilities of the method. Indeed, if the smallest hamming dis-tance between two codewords is δ, the ECOC procedure will be able to correct up to δ/2 mistakes of the dichotomizers at inference (Allwein et al., 2001)

Column separability: to avoid correlated errors between dichotomizers, it is necessary that the problems are suﬃciently diﬀerent. This is guaranteed when the columns of the matrix are well separated.

In general, these two properties are taken into account when generating the cod-ing matrix by sampling several matrices such that P (Mij = +1) = 1/2 and P (Mij = −1) = 1/2 and considering the one having the best error correcting capabilities. Another approach is to choose the coding matrix to be a Hadamard matrix (Langford and Beygelzimer, 2005) which guarantees that the distance be-tween any two codewords is L/2.

Another important element in the final performance of a ECOC based classifier is the accuracy of the dichotomizers. Training accurate dichotomizers is a challenging problem when the coding matrix is randomly generated. In this case there is no other way of ensuring accurate dichotomizers than using powerful non linear classifiers such as kernel machines or neural networks as suggested by (Dietterich and Bakiri, 1995) because of the diﬃculty of the binary induced problems. To overcome this limitation of randomly generated codewords, several authors have proposed to learn the coding matrix (Allwein et al., 2001, Gao and Koller, 2011b, Schapire, 1997). An important contribution in this line of work with applications to extreme classification is the work by (Zhao and Xing, 2013). This approach learns a coding matrix M with ”easy” binary induced problems while ensuring well separated codewords to guarantee error correcting capabilities of the method by solving the following problem.

**Table of contents :**

**1 Introduction **

1.1 Challenges in Extreme Classification

1.1.1 Class Imbalance/Data scarsity

1.1.2 High dimensionality/Large sample size

1.1.3 Structure and Label Dependence exploitation

1.1.4 Training/Inference Complexity reduction

1.2 Contributions

1.3 Outline

**2 Extreme Single Label Classification **

2.1 Introduction

2.2 Flat approaches

2.2.1 Machine Learning Reductions

2.2.1.1 Binary Classification

2.2.1.2 One-versus-Rest Classifier

2.2.1.3 Error Correcting Output Codes

2.2.1.4 Discussion

2.2.2 Embedding approaches

2.2.2.1 Sequence of Convex Problems

2.2.2.2 Joint Non Convex Embeding

2.2.2.3 Discussion

2.2.3 Conclusion

2.3 Hierarchical Approaches

2.3.1 Hierarchical Structure Learning

2.3.1.1 Spectral Clustering

2.3.1.2 Learning Class Hierarchies

Tree structured class hierarchies

DAG structured class hierarchies

2.3.1.3 Discussion

2.3.2 Discriminative Models Learning

2.3.2.1 Independent Optimization ofModels: PachinkoMachines

2.3.2.2 Joint Optimization of Models

2.3.2.3 Regularization Based Approaches

Similarity based dependence modeling:

Dissimilarity based dependence modeling:

2.3.2.4 Sequential Learning of Models

Filter Trees

Refinement

Refined Experts

2.3.3 Joint Learning of Models and Hierarchical Structure

2.3.3.1 Fast and Balanced Hierarchies (Deng et al., 2011)

2.3.3.2 Relaxed Discriminant Hierarchies (Gao and Koller, 2011a)

2.3.3.3 Discussion

2.3.4 Conclusion

**3 Extreme Single Label Classification with Compact Ouput Coding **

3.1 Introduction

3.2 Learning Distributed Representation of Classes (LDR)

3.2.1 Principle

3.2.2 Learning Compact Binary Class-codes

3.2.3 Relations to ECOC

3.2.4 Training and inference complexity

3.3 Experiments

3.3.1 Datasets

3.3.2 Experimental setup

3.3.3 Comparison of the methods

3.3.4 Zero-shot learning

3.4 Conclusion

**4 Extreme Multilabel Classification **

4.1 Introduction

4.2 In defense of Hamming Loss

4.3 On Binary Relevance

4.4 Early approaches to MLC

4.4.1 Stacking Binary Relevance

4.4.2 Classifier Chains (CC)

4.4.3 Label Powerset and friends

4.5 Scalable approaches to Extreme MLC

4.5.1 Label Selection Methods

4.5.1.1 Label Space Pruning

4.5.1.2 Column Subset Selection Method

4.5.2 Label Transformation Methods

4.5.2.1 Compressed Sensing

4.5.2.2 Principle Label Space Transformation

4.6 Conclusion

**5 Extreme Multilabel Classification with Bloom Filters **

5.1 Introduction

5.2 Background on Bloom Filters

5.3 Standard Bloom Filters for Multilabel Classification

5.3.1 Encoding and Decoding

5.3.2 Computational Complexity

Criterion under study

Asymptotic Behavior

Non-Asymptotic Considerations

Simulations on Real Datasets

5.4 Extreme MLC with Robust Bloom Filters

5.4.1 Label Clustering

5.4.2 Encoding and decoding

5.4.2.1 Encoding and Hash functions

5.4.2.2 Decoding and Robustness

Proof:

5.5 Experiments

5.5.1 Datasets

RCV-Industries

Wikipedia1k

5.5.2 Evaluation metrics

5.5.3 Baselines and experimental setup

5.5.4 Parameter selection for Standard Bloom Filters

5.5.5 Parameter selection for Robust Bloom Filters

5.5.6 Correlation Decoding (CD) versus Standard Decoding (SD)

5.5.7 Comparative Results

5.5.8 Runtime analysis

5.6 Conclusion

**6 Conclusion and Perspectives **

**Bibliography**