Efficient Learning of Linear SVMs with Stochastic Gradient Descent

Get Complete Project Material File(s) Now! »

New Eﬃcient Algorithms for Support Vector Machines

We now detail the contributions to the field of large-scale machine learning proposed in this dissertation. They can be split in three diﬀerent pieces: (1) a novel generic algorithmic scheme for conceiving online SVMs solvers which have been successfully applied to classification and structured output prediction, (2) a quasi-Newton stochastic gradient algorithm for linear binary SVMs, (3) a method for learning SVMs under ambiguous supervision. Most of them have been the object of peer-reviewed publications in international journals or conference proceedings (see Appendix A).

A New Generation of Online SVM Dual Solvers

We present a new kind of solver for the dual formulation of SVMs. This contribution is actually threefold and takes up the main part of this thesis: it is the topic of both Chapter 4 and Chapter 5 (and also Appendix B).
Figure 1.6: Learning with the Process/Reprocess principle Compared to a standard online process, an additional memory storage is added (green square). (1) At each iteration, a training example is either drawn from the training set ((1a) process) or from the additional memory ((1b) reprocess). (2) The learning algorithm (center) takes this single example as input.
(3) After a learning step on it, this example is either discarded (3a) or stored in the memory (3b). The procedure (1)-(2)-(3) is carried-out until the training set is empty. (4) Anytime, one can have access to the current learnt spam filter.

The Process/Reprocess Principle

These new algorithms perform an online optimization of the dual objective of Support Vector Machines based on a so-called process/reprocess principle: when receiving a new example, they perform a first optimization step similar to that of a common online algorithm. In addition to this PROCESS operation, they perform REPROCESS operations: each of which is a basic optimization step applied to randomly chosen previously seen training examples. Figure 1.6 illustrates this learning scheme. The REPROCESS operations force these algorithms to store a fraction of the train-ing examples to re-visit them now and then. This causes extra-storing and extra-computations compared to standard online algorithms: these methods are not strictly online.6 However these training algorithms still scale better than batch methods because the number of stored examples is usually much smaller then the training set size.
This alternative online behavior presents interesting properties, especially for large-scale ap-plications. Indeed, results provided in this dissertation show that online optimization with the PROCESS/REPROCESS principle leads to algorithms providing fair approximate solutions on the whole course of learning and achieving good accuracies while having low computational costs.

A Carefully Designed Second-Order SGD

Stochastic Gradient Descent is known to be a fast learning algorithm in the large-scale setup. In particular, numerous recent works report great performances for training linear SVMs.
In Chapter 3, we discuss how to train eﬃciently linear SVMs and propose SGD-QN: a stochas-tic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge [Sonnenburg et al., 2008].

A Learning Method for Ambiguously Supervised SVMs

This contribution addresses the fresh problem of learning from ambiguous supervision, focusing on the task of semantic parsing. A learning problem is said to be ambiguously supervised when, for a given training input, a set of output candidates (rather than the only correct output) is provided with no prior of which one is correct. In Chapter 6 is then introduced a new reduction from ambiguous multiclass classification to the problem of noisy label ranking, which we then cast into a SVMs formulation. We propose an online algorithm for learning these SVMs. An empirical validation on semantic parsing data sets demonstrates the eﬃciency of this approach.
This contribution does not directly focus on large-scale learning. In particular, the related experiments concern small-size data sets. Yet, our contribution involves an online algorithm presenting good scaling properties towards large-scale problems.
Moreover, we believe this chapter is important because learning from ambiguous supervision will be a key challenge in the future. Indeed, the cost for producing ambiguously annotated corpora is far less than the one required for producing perfectly annotated ones. Large-scale ambiguously annotated data sets will be likely to appear in the next few years. Being able to properly use them would be rewarding.

Solving SVMs with SMO

Eﬃcient batch numerical algorithms have been developed to solve the SVM QP problem (2.9). The best known methods are the Conjugate Gradient method [Vapnik, 1982, pages 359–362] and the Sequential Minimal Optimization (SMO) [Platt, 1999]. Both methods work by making suc-cessive searches along well chosen directions. Some famous SVM solvers like SVMLight [Joachims, 1999] or SVMTorch [Collobert and Bengio, 2001] propose to use decomposition algorithms to de-fine such directions. This section mainly details SMO, as this is our main reference SVM solver in this thesis. In particular, we compare our methods with the state-of-the-art implementation of SMO, LibSVM [Chang and Lin, 2001 2004]. For a complete review of eﬃcient batch SVM solvers see [Bottou and Lin, 2007].

Solving Linear SVMs

The use of a linear kernel heavily simplifies the SVM optimization problem. Indeed such a kernel allows to explicitly express the parameter vector w. This means (i) no need to use a kernel expansion as in (2.2) anymore, and (ii) no need to store or compute the kernel matrix. Computing gradients of either the primal or dual cost function is cheap and depends only on the sparsity of the instances. The use of linear kernel is thus very interesting when one needs to handle large-scale databases. However this simpler complexity can also result in a loss of accuracy compared to non-linear kernels (e.g. polynomial, RBF, . . . ).
Recent work exhibits new algorithms scaling linearly in time with the number of training ex-amples. SVMPerf [Joachims, 2006] is a simple cutting-plane algorithm for training linear SVMs that is shown to converge in linear time for classification. It is based on SVMstruct, an alterna-tive formulation of the SVM optimization problem originally designed for predicting structured outputs (presented in the next section), that exhibits a diﬀerent form of sparsity compared to the conventional formulation. The algorithm is empirically very fast and has an intuitively meaning-ful stopping criterion. Bundle methods [Smola et al., 2008] perform in a similar way. LibLinear [Hsieh et al., 2008] also reaches good performance on large scale data sets. Employing an eﬃcient dual coordinate descent procedure, it converges in linear time. Special care has been taken to its implementation as described in [Fan et al., 2008]. As a result, experiments show that LibLinear outperforms SVMPerf in practice.

READ Priority preserving control under kinodynamic constraints

From Multiclass Classification to Structured Output Prediction

When using SVMs, structured output prediction is highly related to multiclass classification which is a well-known task in machine learning. The most widely used approaches combine multiple binary classifiers separately trained using either the one-versus-all or one-versus-one scheme (e.g. [Hsu and Lin, 2002]). Alternative proposals [Weston and Watkins, 1998, Cram-mer and Singer, 2001] reformulate the large margin problem to directly address the multiclass problem. These algorithms are more expensive because they must simultaneously handle all the support vectors associated with diﬀerent inter-class boundaries. Unfortunately, rigorous exper-iments [Hsu and Lin, 2002, Rifkin and Klautau, 2004] suggest that this higher cost does not translate into higher generalization performance.
The picture changes when, instead of predicting an atomic class label for each input pattern, one targets to produce complex discrete outputs such as sequences, trees, or graphs. Such problems can still be viewed as multiclass (potential outputs can be enumerated, in theory) but with a number of classes growing exponentially with the characteristic size of the output. Yet, dealing with so many classes in a large margin classifier is infeasible without smart factorizations that leverage the specific structure of the outputs (e.g. Section 2.2 or [Taskar et al., 2005]). This can only be achieved using a direct multiclass formulation because the factorization of the output space implies that all the classes must be handled simultaneously.

Table of contents :

1 Introduction
1.1 Large Scale Machine Learning
1.1.1 Machine Learning
1.1.2 Towards Large Scale Applications
1.1.3 Online Learning
1.1.4 Scope of this Thesis
1.2 New Efficient Algorithms for Support Vector Machines
1.2.1 A New Generation of Online SVM Dual Solvers
1.2.2 A Carefully Designed Second-Order SGD
1.2.3 A Learning Method for Ambiguously Supervised SVMs
1.2.4 Careful Implementations
1.3 Outline of the Thesis
2 Support Vector Machines
2.1 Kernel Classifiers
2.1.1 Support Vector Machines
2.1.2 Solving SVMs with SMO
2.1.3 Online Kernel Classifiers
2.1.4 Solving Linear SVMs
2.2 SVMs for Structured Output Prediction
2.2.1 SVM Formulation
2.2.2 Batch Structured Output Solvers
2.2.3 Online Learning for Structured Outputs
2.3 Summary
3 Efficient Learning of Linear SVMs with Stochastic Gradient Descent
3.1 Stochastic Gradient Descent
3.1.1 Analysis
3.1.2 Scheduling Stochastic Updates to Exploit Sparsity
3.1.3 Implementation
3.2 SGD-QN: A Careful Diagonal Quasi-Newton SGD
3.2.1 Rescaling Matrices
3.2.2 SGD-QN
3.2.3 Experiments
3.3 Summary
4 Large-Scale SVMs for Binary Classification
4.1 The Huller: an Efficient Online Kernel Algorithm
4.1.1 Geometrical Formulation of SVMs
4.1.2 The Huller Algorithm
4.1.3 Experiments
4.1.4 Discussion
4.2 Online LaSVM
4.2.1 Building Blocks
4.2.2 Scheduling
4.2.3 Convergence and Complexity
4.2.4 Implementation Details
4.2.5 Experiments
4.3 Active Selection of Training Examples
4.3.1 Example Selection Strategies
4.3.2 Experiments on Example Selection for Online SVMs
4.3.3 Discussion
4.4 Tracking Guarantees for Online SVMs
4.4.1 Analysis Setup
4.4.2 Duality Lemma
4.4.3 Algorithms and Analysis
4.4.4 Application to LaSVM
4.5 Summary
5 Large-Scale SVMs for Structured Output Prediction
5.1 Structured Output Prediction with LaRank
5.1.1 Elementary Step
5.1.2 Step Selection Strategies
5.1.3 Scheduling
5.1.4 Stopping
5.1.5 Theoretical Analysis
5.2 Multiclass Classification
5.2.1 Multiclass Factorization
5.2.2 LaRank Implementation for Multiclass Classification
5.2.3 Experiments
5.3 Sequence Labeling
5.3.1 Representation and Inference
5.3.2 Training
5.3.3 LaRank Implementations for Sequence Labeling
5.3.4 Experiments
5.4 Summary
6 Learning SVMs under Ambiguous Supervision
6.1 Online Multiclass SVM with Ambiguous Supervision
6.1.1 Classification with Ambiguous Supervision
6.1.2 Online Algorithm
6.2 Sequential Semantic Parser
6.2.1 The OSPAS Algorithm
6.2.2 Experiments
6.3 Summary
7 Conclusion
7.1 Large Scale Perspectives for SVMs
7.1.1 Impact and Limitations of our Contributions
7.1.2 Further Derivations
7.2 AI Directions
7.2.1 Human Homology
7.2.2 Natural Language Understanding
Bibliography
A Personal Bibliography
B Convex Programming with Witness Families
B.1 Feasible Directions
B.2 Witness Families
B.3 Finite Witness Families
B.4 Stochastic Witness Direction Search
B.5 Approximate Witness Direction Search
B.5.1 Example (SMO)
B.5.2 Example (LaSVM)
B.5.3 Example (LaSVM + Gradient Selection)
B.5.4 Example (LaSVM + Active Selection + Randomized Search)
C Learning to Disambiguate Language Using World Knowledge
C.1 Introduction
C.2 Previous Work
C.3 The Concept Labeling Task
C.4 Learning Algorithm
C.5 A Simulation Environment
C.5.1 Universe Definition
C.5.2 Simulation Algorithm
C.6 Experiments
C.7 Weakly Labeled Data
C.8 Conclusion