
## Training/Inference Complexity reduction

As for most of the extreme classification problems presented in table 1.1, when the number of categories is very large, both the sample size and the dimensionality of the input space are very large. This results in both a memory and a computational burden when most classical learning machines are used. For example, if the one-versus-rest classifier, which learns one classifier per category (Rifkin and Klautau, 2004b), is used to solve a classification problem with 325K categories, more than 1000 GB are necessary to store all the parameters. Moreover, each of the 325K classifiers must be evaluated before the relevant categories of a given test instance are recovered. The same observation holds for several other methods such as single machine large margin classifiers (Weston and Watkins, 1998) and deep neural networks (Bengio, 2009). Also, these approaches cannot be easily parallelized if one wants to exploit the structure that comes with these problems, as shown in Figure 1.2. Therefore, it is critical to come up with new approaches having sublinear training and inference complexity in the number of categories. This has recently been the main subject of several contributions in both the single label and the multilabel setting. Most of these approaches focus on reducing either training (Dekel and Shamir, 2010) or inference complexity (Bengio et al., 2010, Deng et al., 2011). Therefore, a lot remains to be done in order to make both learning and prediction fast and memory efficient.
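The memory figure above can be checked with a back-of-the-envelope calculation. The feature dimensionality below is an assumption (on the order of a million features, as in Wikipedia-scale text classification benchmarks), not a number taken from the text:

```python
# Back-of-the-envelope memory cost of one-versus-rest at extreme scale.
num_classes = 325_000        # categories in the problem cited in the text
num_features = 1_600_000     # ASSUMED input dimensionality (illustrative)
bytes_per_weight = 4         # one float32 weight per (class, feature) pair

total_bytes = num_classes * num_features * bytes_per_weight
total_gb = total_bytes / 1e9
print(f"{total_gb:.0f} GB of parameters")  # comfortably over 1000 GB
```

With these assumptions the linear one-versus-rest model alone exceeds 2 TB, before counting biases or optimizer state.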

**Contributions**

As stated in the previous section, the presence of a large number of categories gives rise to several (sub-)problems that must be addressed in order to solve extreme classification. In this thesis, however, we focus on training and inference complexity reduction in extreme classification problems. Indeed, making learning feasible and allowing fast inference is a prerequisite to the use of machine learning based techniques in real world applications dealing with a large number of categories. We propose new approaches for efficiently classifying in this setting while maintaining competitive classification performance. Our contribution is twofold and is mainly built on learning low dimensional binary representations of the classes.

The first method we propose deals with extreme single label classification. It uses hierarchical information to learn compact binary codes for the categories. The representation learning procedure uses an auto-encoder based architecture (Bengio et al., 2013). The method bears similarities with Error Correcting Output Codes (ECOC) (Dietterich and Bakiri, 1995). However, the induced binary problems are empirically shown to be easier than those induced by the randomly generated codes in ECOC. Overall, this approach gives competitive performance compared to the classical one-versus-rest method and error correcting output codes.

Our second contribution deals with extreme multilabel classification. It is based on the use of Bloom Filters (Bloom, 1970) for representing subsets of labels with low dimensional binary vectors. The first approach we propose uses standard Bloom Filters to encode and decode the subsets. Even though this method gives competitive performance, it is not robust to individual mistakes of the binary classifiers. To overcome this problem, we propose a second approach that exploits the following key property of extreme multilabel classification problems: when the number of categories is very large, many labels never appear together. This new method is provably robust and gives competitive performance.
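The core idea of representing a label subset as a low dimensional binary vector can be sketched with a textbook Bloom Filter. The hash scheme, code length, and function names below are illustrative assumptions and not the exact construction developed in Chapter 5:

```python
# Minimal Bloom-Filter sketch: a subset of labels -> a short binary code.
# The double-hashing scheme and the sizes are illustrative choices only.
import hashlib

def bit_positions(label, num_hashes=3, num_bits=64):
    """Derive num_hashes bit positions for a label via double hashing."""
    digest = hashlib.sha256(str(label).encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
    return [(h1 + k * h2) % num_bits for k in range(num_hashes)]

def encode(label_subset, num_hashes=3, num_bits=64):
    """Represent a whole subset of labels as one num_bits binary vector."""
    code = [0] * num_bits
    for label in label_subset:
        for pos in bit_positions(label, num_hashes, num_bits):
            code[pos] = 1
    return code

def might_contain(code, label, num_hashes=3, num_bits=64):
    """Membership test: no false negatives, but false positives can occur,
    and a single flipped bit (a classifier mistake) can lose a label --
    the weakness the robust variant is designed to fix."""
    return all(code[pos] for pos in bit_positions(label, num_hashes, num_bits))

code = encode({"cat", "dog"})
```

Note how a 64-bit code stands in for a label space that could contain hundreds of thousands of labels; this is the source of the complexity reduction.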

**Outline**

This section outlines the core chapters of this thesis. The first two chapters are dedicated to extreme single label classification, and the last two focus on extreme multilabel classification.

• Chapter 2 is a review of the main techniques that have been proposed in the literature for extreme single label classification. This chapter is divided into two main sections, respectively about flat approaches and hierarchical methods for extreme single label classification.

• Chapter 3 presents our contribution to extreme single label classification. The material presented in this chapter has been published at the European Conference on Machine Learning (ECML-PKDD) in 2012 (Cissé et al., 2012).

• Chapter 4 reviews the methods recently proposed in the blossoming research field of extreme multilabel classification.

• Chapter 5 presents the Bloom Filter based approaches we propose for extreme multilabel classification. Part of the material presented in this chapter has been published at the Neural Information Processing Systems conference (NIPS) in 2013 (Cissé et al., 2013), and an extended version is in preparation for the Machine Learning Journal.

Single label classification is a well studied problem in machine learning for which several effective solutions, resulting from decades of research, have been derived and are now widely applied to solve industrial problems (Bishop, 2006). However, the rapid development of the internet and the increasingly large amount of available labelled data (thanks to social and collaborative websites such as Wikipedia or Flickr) have changed the nature of the problem and made most of the traditional approaches to single label classification obsolete, since they do not scale to extreme classification. Single label classification was initially tackled with flat techniques. Recently, there has been an increased interest in hierarchical methods because of their reduced complexity compared to flat approaches. In this chapter, we present the main contributions in the literature from these two families of approaches that can be applied to solve extreme single label classification.

**Flat approaches**

Flat approaches to multiclass categorization do not rely on a hierarchy at inference to reduce their complexity (conversely to the hierarchical approaches that will be discussed in the next section), even though they can exploit existing semantic information to increase their accuracy (Weinberger and Chapelle, 2008). This family of methods can be divided into two subgroups: machine learning reductions (Allwein et al., 2001, Dietterich and Bakiri, 1995) and single machine classifiers (Weinberger and Chapelle, 2008, Weston and Watkins, 1999). These two subgroups differ in the way they approach the multiclass classification problem. On the one hand, machine learning reductions rely on the predictions of independently learned binary classifiers to produce the final prediction for a given test instance. A notable example of a machine learning reduction is the well-known one-versus-all classifier, which will be discussed in more detail next. On the other hand, single machine classifiers are either embedding based methods or extensions of binary classifiers to the multiclass setting, and hence generally require a joint learning of all their parameters.

Extending classical binary classifiers such as support vector machines and logistic regression to the multiclass case has been extensively studied and has resulted in several effective methods such as multiclass support vector machines (M-SVM) (Weston and Watkins, 1999), softmax regression and, more recently, deep neural networks (Bengio, 2009). However, despite their proven accuracy, these methods do not scale to the extreme classification setting due to their large computational burden at both training and inference, even though there are recent attempts to parallelize their training (Gopal and Yang). Therefore, in this study we will focus only on methods that are scalable to extreme classification, and refer the interested reader to the following works and the references therein (Bengio, 2009, Bordes et al., 2007, Crammer and Singer, 2002, Gopal and Yang, Weston and Watkins, 1999). We describe next machine learning reductions and embedding based approaches to multiclass categorization and discuss how they fit in the context of extreme classification.

**Machine Learning Reductions**

Machine learning reductions of multiclass classification to binary classification have been around for a long time (Dietterich and Bakiri, 1995, Schölkopf et al., 1995). However, these works have only recently been unified in a common framework, which includes approaches such as Error Correcting Output Codes, One-versus-All, One-versus-One, Filter Trees and many more (Allwein et al., 2001). The key idea in all these methods is to combine existing binary classifiers in order to obtain a multiclass classifier. The difference between these methods is therefore the way the binary classifiers are combined. This guides the choice of the right method for a given task because it governs both the final performance of the methods and their complexity. For example, some widely used methods such as the One-versus-One classifier (which trains a binary classifier for each pair of labels and adopts a voting scheme at inference) cannot be used in extreme classification problems (despite their proven performance) because of their quadratic complexity in the number of labels, O(L2). Next we describe two popular and scalable approaches: One-versus-Rest (OVR) and Error Correcting Output Codes (ECOC).

**Binary Classification**

The task of classification consists in learning, given a set of training examples D = {(xi, yi)}1≤i≤n, a mapping (or hypothesis) h : X → Y from the d-dimensional feature space X to the label space Y. The simplest non-trivial classification problem that can be considered is binary classification, where the set of labels is reduced to Y = {−1, +1}. This is arguably the most studied machine learning problem, both because of its obvious practical interest on its own and because it is a building block of more complicated machine learning systems such as multiclass classifiers. From an empirical risk minimization (ERM) point of view, learning a binary classifier reduces to finding the best hypothesis h from a hypothesis class H that minimizes the average number of disagreements between the predictions h(xi) and the actual labels yi on the finite training set D. This quantity, called the empirical risk as it is an approximation of the classifier's expected risk on the distribution P from which the data was drawn, is generally associated with a regularization term Ω(h) to prevent overfitting (Vapnik, 1995). The final empirical risk is expressed as:

RD(h) = 1/n Σi=1..n L(h(xi), yi) + λ Ω(h) (2.1)
where L is the loss function measuring the penalty of predicting h(xi) when the actual label is yi, and λ weights the importance of the regularization term. The empirical risk minimizer is the hypothesis attaining the minimum of RD over the hypothesis class H: h∗ = argminh∈H RD(h). Depending on the difficulty of the problem at hand, various types of hypothesis classes such as linear models, kernel machines or neural networks can be used (Bishop, 2006). For instance, if the linear hypothesis class is considered, each hypothesis h is parameterized by a weight vector w such that the prediction for a given example can be expressed as hw(x) = wT x + b, where b is a bias term. Searching for the best hypothesis h∗ in this case is therefore equivalent to searching for the best weight vector w∗ according to the chosen loss function L.
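This regularized ERM scheme over the linear hypothesis class can be sketched in a few lines. The toy data, the choice of the logistic loss as L, the squared-norm regularizer, and the optimization hyperparameters are all illustrative assumptions:

```python
# Minimal regularized ERM sketch: gradient descent on the average logistic
# loss of a linear hypothesis h_w(x) = w^T x + b (all settings illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # linearly separable toy labels

w, b = np.zeros(2), 0.0
lam, lr = 0.01, 0.1                              # regularization weight, step size
for _ in range(200):
    margins = y * (X @ w + b)
    # Gradient of the average loss log(1 + exp(-y h(x))) w.r.t. (w, b)
    coeff = -y / (1.0 + np.exp(margins))
    grad_w = (coeff[:, None] * X).mean(axis=0) + lam * w  # + d/dw of lam/2 ||w||^2
    grad_b = coeff.mean()
    w -= lr * grad_w
    b -= lr * grad_b

train_accuracy = np.mean(np.sign(X @ w + b) == y)
```

The minimizer found here approximates h∗ within the linear class; swapping the loss or the regularizer changes the classifier obtained, which is exactly the point made next about surrogate losses.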

The natural classification loss function we would like to optimize is the zero-one loss: L0/1(h(xi), yi) = I(h(xi) ≠ yi). It counts the number of disagreements between the prediction and the actual label. However, this loss is not optimization friendly because it is not convex. Therefore, convex surrogates of L0/1 are used in practice, resulting in various classifiers. Two notable examples of convex surrogates for the zero-one loss are the Hinge loss and the Logistic loss, whose respective expressions are given below for an instance (x, y) and a hypothesis h; figure 2.1 shows how they upper bound the zero-one loss function (Bishop, 2006):

Lhinge(y, h(x)) = max(0, 1 − y • h(x)) (2.2)

Llogistic(y, h(x)) = log(1 + exp(−y • h(x))) (2.3)
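The two surrogates in (2.2) and (2.3) can be checked numerically against the zero-one loss. One caveat worth making explicit: the hinge loss upper bounds the zero-one loss directly, while the logistic loss does so only after rescaling by 1/log 2, which is the convention used in Bishop-style plots (an assumption here, since figure 2.1 is not reproduced):

```python
# Zero-one loss and its two convex surrogates, as functions of y * h(x).
import math

def zero_one(y, score):
    return 0.0 if y * score > 0 else 1.0

def hinge(y, score):
    return max(0.0, 1.0 - y * score)      # equation (2.2)

def logistic(y, score):
    return math.log(1.0 + math.exp(-y * score))  # equation (2.3)

# Both surrogates dominate the zero-one loss over a range of margins
# (logistic after rescaling by 1/log 2).
for m in (-2.0, -0.5, 0.5, 2.0):
    assert hinge(1, m) >= zero_one(1, m)
    assert logistic(1, m) / math.log(2.0) >= zero_one(1, m)
```

Because both surrogates are convex in the score, plugging them into (2.1) yields tractable optimization problems, which is what the next paragraph exploits.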

When the linear hypothesis class is considered, the Hinge loss and the Logistic loss respectively give rise to the Support Vector Machine (SVM) and the Logistic Regression (LR) classifiers. Several efficient solvers exist for these two problems, mainly relying on stochastic gradient and dual coordinate descent algorithms (Bottou and Bousquet, 2008, Yu et al., 2011). Both SVM and LR models have demonstrated state-of-the-art performance on several large scale binary classification benchmarks. They have become methods of choice not only for solving binary problems but also for building machine learning reduction based multiclass classifiers. This has been successfully done with One-versus-All (OVA) and Error Correcting Output Code (ECOC) classifiers, as will be discussed next.

**One-versus-Rest Classifier**

One-versus-Rest (OVR), also called One-versus-All (OVA), is the simplest machine learning reduction of multiclass to binary classification. It consists in training a binary classifier for each category to distinguish its examples from those of the other classes. At inference, all the classifiers are evaluated and the test example is associated to the category whose classifier is the most confident (has the highest prediction), following a winner-takes-all principle. Early work using this strategy dates back to (Schölkopf et al., 1995), but a more frequently cited reference for this method is (Rifkin and Klautau, 2004b). In this work, the authors empirically show that when the binary classifiers are correctly calibrated, this approach outperforms other machine learning reductions (such as the One-versus-One classifier, which trains a binary classifier for each pair of categories, and Error Correcting Output Codes) as well as multiclass support vector machines (M-SVM) (Crammer and Singer, 2002, Weston and Watkins, 1999). Moreover, OVR is readily parallelizable since the binary classifiers can be trained and evaluated independently. This makes it a good candidate for extreme classification, despite the class imbalance problem generally faced when training the binary classifiers.
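The OVR reduction and its winner-takes-all inference can be sketched as follows. The regularized least-squares scorers stand in for any calibrated binary classifier (an illustrative choice, not the specific classifiers discussed above), and the three-class toy data is made up for the example:

```python
# One-versus-Rest sketch: one binary scorer per class, argmax at inference.
import numpy as np

def train_ovr(X, y, num_classes, reg=1e-3):
    """Fit one linear scorer per class on a +1/-1 relabelling of the data.
    Each class's problem is independent, hence trivially parallelizable."""
    d = X.shape[1]
    W = np.zeros((num_classes, d))
    A = X.T @ X + reg * np.eye(d)        # shared Gram matrix, ridge-regularized
    for c in range(num_classes):
        targets = np.where(y == c, 1.0, -1.0)  # class c versus the rest
        W[c] = np.linalg.solve(A, X.T @ targets)
    return W

def predict_ovr(W, X):
    """Evaluate every classifier and keep the most confident one."""
    return np.argmax(X @ W.T, axis=1)    # winner-takes-all

# Toy data: three well-separated Gaussian clusters in 2D.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [-3.0, 3.0], [-3.0, -3.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)

W = train_ovr(X, y, num_classes=3)
train_acc = np.mean(predict_ovr(W, X) == y)
```

The loop over classes is exactly what makes the reduction parallelizable, but it is also what makes inference linear in the number of classes: every scorer must be evaluated for every test point.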

**Error Correcting Output Codes**

Solving multiclass categorization problems with error correcting output codes (ECOC) was introduced in (Dietterich and Bakiri, 1995). In this early work, the presented method consists in associating each of the L classes with a bit vector, also called a binary codeword, of fixed size de. The codewords are the rows of a coding matrix M ∈ {−1, +1}L×de whose columns correspond to binary classification problems. For each column l, the induced binary problem reduces to discriminating between the examples of the classes whose codewords Mj are such that Mjl = +1 and those for which Mjl = −1. Therefore, given a set of training examples {(xi, yi)}1≤i≤n and a coding matrix M, the sets of positive and negative examples of the binary problem induced by column l respectively write D+l = {(xi, yi) : Myil = +1} and D−l = {(xi, yi) : Myil = −1}. Binary classifiers, also called dichotomizers, (hl)1≤l≤de (such as the logistic regression or support vector machine classifiers previously described) are learned to predict the bit positions of a codeword. A given test instance x is associated to the class whose binary codeword is the closest (according to some distance measure d) to the predicted codeword (h1(x), . . . , hde(x)). If we denote this decoding procedure D, we have:

D(x) = argminj∈{1,...,L} d(Mj, (h1(x), . . . , hde(x)))
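The encoding and decoding steps just described can be sketched as follows. The hand-built coding matrix and the simulated dichotomizer outputs are illustrative assumptions; only the coding logic is shown:

```python
# ECOC sketch: coding matrix M in {-1,+1}^{L x d_e}, Hamming decoding.
import numpy as np

# Hand-built code for L = 4 classes and d_e = 6 dichotomizers. Its rows have
# minimum Hamming distance 3, so it can correct one dichotomizer mistake.
M = np.array([
    [ 1,  1,  1,  1,  1,  1],
    [-1, -1, -1,  1,  1,  1],
    [ 1,  1, -1, -1, -1,  1],
    [-1, -1,  1, -1, -1,  1],
])

def decode(predicted_bits, M):
    """Assign the class whose codeword is closest in Hamming distance
    to the vector of dichotomizer predictions."""
    dists = np.sum(M != predicted_bits, axis=1)
    return int(np.argmin(dists))

# Simulate the dichotomizers' output for an example of class 2, with one
# binary classifier making a mistake (a single flipped bit).
true_class = 2
noisy = M[true_class].copy()
noisy[0] *= -1
recovered = decode(noisy, M)  # error-corrected back to the true class
```

Note that each column of M splits all L classes into two groups, so the induced binary problems can be much harder than the one-versus-rest problems, a point taken up in the discussion of coding matrix design below.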

Choosing a good coding matrix is a key factor in the success of an ECOC based multiclass categorizer. Mainly, the two following properties (also depicted in figure 2.2) should be ensured:

• Row separability: the codewords must be well separated to guarantee the error correction capabilities of the method. Indeed, if the smallest Hamming distance between two codewords is δ, the ECOC procedure will be able to correct up to ⌊(δ − 1)/2⌋ mistakes of the dichotomizers at inference (Allwein et al., 2001).

• Column separability: to avoid correlated errors between dichotomizers, it is necessary that the induced binary problems are sufficiently different. This is guaranteed when the columns of the matrix are well separated.

**Table of contents :**

1 Introduction

1.1 Challenges in Extreme Classification

1.1.1 Class Imbalance/Data scarcity

1.1.2 High dimensionality/Large sample size

1.1.3 Structure and Label Dependence exploitation

1.1.4 Training/Inference Complexity reduction

1.2 Contributions

1.3 Outline

**2 Extreme Single Label Classification**

2.1 Introduction

2.2 Flat approaches

2.2.1 Machine Learning Reductions

2.2.1.1 Binary Classification

2.2.1.2 One-versus-Rest Classifier

2.2.1.3 Error Correcting Output Codes

2.2.1.4 Discussion

2.2.2 Embedding approaches

2.2.2.1 Sequence of Convex Problems

2.2.2.2 Joint Non Convex Embedding

2.2.2.3 Discussion

2.2.3 Conclusion

2.3 Hierarchical Approaches

2.3.1 Hierarchical Structure Learning

2.3.1.1 Spectral Clustering

2.3.1.2 Learning Class Hierarchies

2.3.1.3 Discussion

2.3.2 Discriminative Models Learning

2.3.2.1 Independent Optimization of Models: Pachinko Machines

2.3.2.2 Joint Optimization of Models

2.3.2.3 Regularization Based Approaches

Similarity based dependence modeling:

Dissimilarity based dependence modeling:

2.3.2.4 Sequential Learning of Models

2.3.3 Joint Learning of Models and Hierarchical Structure

2.3.3.1 Fast and Balanced Hierarchies (Deng et al., 2011)

2.3.3.2 Relaxed Discriminant Hierarchies (Gao and Koller, 2011a)

2.3.3.3 Discussion

2.3.4 Conclusion

**3 Extreme Single Label Classification with Compact Output Coding**

3.1 Introduction

3.2 Learning Distributed Representation of Classes (LDR)

3.2.1 Principle

3.2.2 Learning Compact Binary Class-codes

3.2.3 Relations to ECOC

3.2.4 Training and inference complexity

3.3 Experiments

3.3.1 Datasets

3.3.2 Experimental setup

3.3.3 Comparison of the methods

3.3.4 Zero-shot learning

3.4 Conclusion


**4 Extreme Multilabel Classification**

4.1 Introduction

4.2 In defense of Hamming Loss

4.3 On Binary Relevance

4.4 Early approaches to MLC

4.4.1 Stacking Binary Relevance

4.4.2 Classifier Chains (CC)

4.4.3 Label Powerset and friends

4.5 Scalable approaches to Extreme MLC

4.5.1 Label Selection Methods

4.5.1.1 Label Space Pruning

4.5.1.2 Column Subset Selection Method

4.5.2 Label Transformation Methods

4.5.2.1 Compressed Sensing

4.5.2.2 Principal Label Space Transformation

4.6 Conclusion

**5 Extreme Multilabel Classification with Bloom Filters**

5.1 Introduction

5.2 Background on Bloom Filters

5.3 Standard Bloom Filters for Multilabel Classification

5.3.1 Encoding and Decoding

5.3.2 Computational Complexity

5.4 Extreme MLC with Robust Bloom Filters

5.4.1 Label Clustering

5.4.2 Encoding and decoding

5.4.2.1 Encoding and Hash functions

5.4.2.2 Decoding and Robustness Proof

5.5 Experiments

5.5.1 Datasets

5.5.2 Evaluation metrics

5.5.3 Baselines and experimental setup

5.5.4 Parameter selection for Standard Bloom Filters

5.5.5 Parameter selection for Robust Bloom Filters

5.5.6 Correlation Decoding (CD) versus Standard Decoding (SD)

5.5.7 Comparative Results

5.5.8 Runtime analysis

5.6 Conclusion

**6 Conclusion and Perspectives**

**Bibliography**