The pipeline described in Fig. 1.2 has been exploited over the last decade by many approaches on various datasets [Lazebnik et al., 2006, Yang et al., 2009, Perronnin et al., 2010, Liu et al., 2011]. In particular, much effort has been devoted to improving the coding and pooling steps. Fig. 1.3 illustrates the performance of different state-of-the-art methods on the Caltech-101 dataset [Fei-Fei et al., 2007].
Most of them are extensions of the Bag-of-Words model that improve its mid-level representation.
The improvement of the mid-level step since 2006 has significantly boosted performance: for example, using 30 training examples, there is a substantial gain of about 20 points from the baseline work of Lazebnik et al. [Lazebnik et al., 2006] (≈64% in 2006) to the pooling learning method of Feng et al. [Feng et al., 2011] (≈83% in 2011). This work on the BoW model over the years demonstrates, for this particular application task, that the performance of machine learning methods is heavily dependent on the choice of data representation. In particular, as a preliminary to this thesis, we performed a thorough study of the different low-level and mid-level parameters of this pipeline that have an impact on classification performance. This study led to the following publications: [Law et al., 2012a, Law et al., 2014a].
While the methods mentioned above manually tune the extraction process to generate a useful image representation, other methods focus on how best to learn meaningful and useful representations of data directly. The latter approach is known as representation learning and includes distance metric learning.
Metric Learning for Computer Vision
Metrics play an important role in comparing images in many machine learning and computer vision problems. In this section, we briefly present contexts where learning an appropriate metric may be useful.
Some successful examples of applications that greatly depend on the choice of metric are:
- k-Nearest Neighbors (k-NN) classification [Cover and Hart, 1967], where an object is classified by a majority vote of its nearest neighbors: the object is assigned to the class most common among its k nearest neighbors. The nearest neighbors are determined based on a given metric (usually the Euclidean distance in the input space). Notably, recent work [Mensink et al., 2013] has shown that a metric learned for k-NN reaches state-of-the-art performance when new images or classes are integrated into the dataset.
- K-Means clustering [Steinhaus, 1956, MacQueen et al., 1967], which aims at partitioning the training set into K clusters in which each sample belongs to the cluster with the nearest mean. Test samples are assigned to the nearest cluster by distance.
- Information/image retrieval [Salton, 1975, Goodrum, 2000], which returns the samples most similar to a given query.
- Kernel methods [Scholkopf and Smola, 2001], which exploit kernel functions, a special case of similarity metrics. The most popular example of kernel methods is the Support Vector Machine (SVM) model [Cortes and Vapnik, 1995], for which the choice of the kernel, critical to the success of the method, is typically left to the user. Moreover, when the data is multimodal (i.e., heterogeneous), multiple kernel learning (MKL) methods [Bach et al., 2004, Rakotomamonjy et al., 2008] allow data to be integrated into a single, unified space and compared there.
Approaches that transform the input data into another space (usually of lower dimensionality) to make it interpretable are also considered metric learning approaches. A representative example is manifold learning, where the data of interest is assumed to lie on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimensionality, the data can be visualized in the low-dimensional space.
The key idea of manifold learning is to learn an underlying low-dimensional manifold that preserves the distances between observed data points. Good representatives of manifold learning are Multidimensional Scaling (MDS) [Borg and Groenen, 2005] and Isomap [Tenenbaum et al., 2000]. Since manifold learning methods do not exploit data labels to learn the low-dimensional space, they are considered unsupervised metric learning approaches.
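To make the idea concrete, classical MDS can be sketched in a few lines of numpy: the squared-distance matrix is double-centered into a Gram matrix, whose leading eigenpairs give the low-dimensional embedding. This is a minimal illustrative sketch, not a reference implementation (the function name `classical_mds` is ours):

```python
import numpy as np

def classical_mds(X, e=2):
    """Classical MDS: embed n points in e dimensions while preserving
    pairwise Euclidean distances as well as possible."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    J = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    B = -0.5 * J @ sq @ J                                # Gram matrix of centered data
    w, V = np.linalg.eigh(B)                             # ascending eigenvalues
    idx = np.argsort(w)[::-1][:e]                        # keep the top-e eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))  # embedding coordinates
```

When the data actually lies in an e-dimensional subspace, the embedding reproduces the original pairwise distances exactly; otherwise it is the best rank-e approximation of the centered Gram matrix.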
Eigenvector methods Eigenvector methods such as Linear Discriminant Analysis (LDA) [Fisher, 1938] or Principal Component Analysis (PCA) [Galton, 1889, Pearson, 1901, Hotelling, 1933] have been widely used to discover informative linear transformations of the input space. They learn a linear transformation $\mathbf{x} \mapsto \mathbf{L}\mathbf{x}$ that projects the training inputs into another space satisfying some criterion. For instance, PCA projects training inputs into a variance-maximizing subspace, while LDA maximizes the amount of between-class variance relative to the amount of within-class variance. PCA can be viewed as a simple linear form of manifold learning, i.e., characterizing a lower-dimensional region in input space near which the data density is peaked [Bengio et al., 2013].
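The variance-maximizing projection learned by PCA can be sketched as follows; an illustrative numpy version (the name `pca_transform` is ours), where the rows of the learned matrix are the leading eigenvectors of the sample covariance:

```python
import numpy as np

def pca_transform(X, e):
    """Learn a linear map x -> L x onto the e leading principal
    directions, i.e., the variance-maximizing e-dimensional subspace."""
    Xc = X - X.mean(axis=0)               # center the data
    C = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix
    w, V = np.linalg.eigh(C)              # eigh returns ascending eigenvalues
    L = V[:, np.argsort(w)[::-1][:e]].T   # rows of L = top-e eigenvectors
    return L
```

Since the rows of L are orthonormal eigenvectors, applying it as `Xc @ L.T` yields decorrelated coordinates sorted by decreasing variance.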
Supervised Distance Metric Learning
Throughout this thesis, $\mathbb{S}^d$, $\mathbb{S}^d_+$ and $\mathbb{S}^d_{++}$ denote the sets of $d \times d$ real-valued symmetric, symmetric positive semidefinite (PSD) and symmetric positive definite matrices, respectively. The set of considered images is $\mathcal{P} = \{I_i\}_{i=1}^{I}$; each image $I_i$ is represented by a feature vector $\mathbf{x}_i \in \mathbb{R}^d$. For matrices $A \in \mathbb{R}^{b \times c}$ and $B \in \mathbb{R}^{b \times c}$, the Frobenius inner product is denoted $\langle A, B \rangle = \mathrm{tr}(A^\top B)$, where $\mathrm{tr}$ denotes the trace of a matrix. $\Pi_{\mathcal{C}}(\mathbf{x})$ is the Euclidean projection of the vector or matrix $\mathbf{x}$ on the convex set $\mathcal{C}$ (see Chapter 8.1 in [Boyd and Vandenberghe, 2004]). For a given vector $\mathbf{a} = (a_1, \dots, a_d)^\top \in \mathbb{R}^d$, $\mathrm{Diag}(\mathbf{a}) = A \in \mathbb{S}^d$ corresponds to a square diagonal matrix such that $\forall i, A_{ii} = a_i$, where $A = [A_{ij}]$. For a given square matrix $A \in \mathbb{R}^{d \times d}$, $\mathrm{Diag}(A) = \mathbf{a} \in \mathbb{R}^d$ corresponds to the diagonal elements of $A$ arranged in a vector, i.e., $a_i = A_{ii}$. $\lambda(A)$ is the vector of eigenvalues of matrix $A$ arranged in non-increasing order, and $\lambda(A)_i$ is the $i$-th largest eigenvalue of $A$. Finally, for $x \in \mathbb{R}$, let $[x]_+ = \max(0, x)$.
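Several of these notational conventions translate directly into code. The following numpy helpers (names are ours, for illustration only) mirror the Frobenius inner product, the eigenvalue vector $\lambda(A)$, the Euclidean projection onto the PSD cone, and $[x]_+$:

```python
import numpy as np

def frobenius_inner(A, B):
    """<A, B> = tr(A^T B)."""
    return np.trace(A.T @ B)

def eigvals_desc(A):
    """lambda(A): eigenvalues of symmetric A in non-increasing order."""
    return np.sort(np.linalg.eigvalsh(A))[::-1]

def proj_psd(A):
    """Euclidean projection of a symmetric matrix A onto the PSD cone
    S^d_+: clip its negative eigenvalues at zero."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

def pos_part(x):
    """[x]_+ = max(0, x)."""
    return max(0.0, x)
```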
Distance and similarity metrics
The choice of an appropriate metric is crucial in many machine learning and computer vision problems. For some problems, the selected metric and its parameters are fine-tuned by experts, but the choice remains a difficult task in general. Extensive work has been done to learn relevant metrics from labeled or unlabeled data. The most useful property of metrics in this thesis is that they can be used to compare two samples never seen before, i.e., that were not present in the training dataset. We present here metrics widely used in computer vision, especially the Mahalanobis distance metric, which is the focus of this thesis.
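As a concrete reference point, the Mahalanobis distance discussed throughout this thesis can be computed as below; a minimal sketch assuming $M$ is PSD (the function name is ours):

```python
import numpy as np

def mahalanobis_dist(xi, xj, M):
    """D_M(xi, xj) = sqrt((xi - xj)^T M (xi - xj)) for a PSD matrix M.
    M = I recovers the Euclidean distance; with M = L^T L this equals
    the Euclidean distance after the linear map x -> L x."""
    d = xi - xj
    return float(np.sqrt(d @ M @ d))
```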
Loss and surrogate functions
Choosing an appropriate loss function is not an easy task and strongly depends on the problem at hand. In order to explain surrogate functions, we first need to introduce how training data is provided and exploited in supervised machine learning problems. We use binary classification as a reference problem for the explanation. We are given a set of $n$ training samples $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\}$, where each $\mathbf{x}_i$ belongs to some input space $\mathcal{X}$, usually $\mathbb{R}^d$, and $y_i \in \{-1, 1\}$ is the class label of $\mathbf{x}_i$. The goal of classification algorithms is to find a model that maximizes the number of correct labels predicted for a given set of test samples.
For this purpose, we are given a loss function $L : \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The loss function $L$ takes as argument a pair $(\hat{y}, y)$, and its value is interpreted as the cost incurred by predicting the label $\hat{y}$ when the true label is $y$. In the classification context, this loss function is usually the zero-one (0/1) loss, i.e., $L(\hat{y}, y) = 0$ if $\hat{y} = y$, and $L(\hat{y}, y) = 1$ otherwise. The goal is then to find a classifier, represented by a function $h : \mathcal{X} \to \{-1, 1\}$, with the smallest expected loss on a new sample. However, the probability distribution of the variables is usually unknown to the learning algorithm, and computing the exact expected value is not possible. It is therefore approximated by averaging the loss function over the training set (i.e., averaging the number of wrongly classified examples in the training set): $R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(\mathbf{x}_i), y_i)$.
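The empirical risk above is straightforward to compute; here is a small illustrative sketch with the 0/1 loss (function names are ours):

```python
def zero_one_loss(y_pred, y_true):
    """0/1 loss: 0 if the prediction is correct, 1 otherwise."""
    return 0 if y_pred == y_true else 1

def empirical_risk(h, X, y):
    """R_emp(h) = (1/n) sum_i L(h(x_i), y_i): average training error."""
    return sum(zero_one_loss(h(xi), yi) for xi, yi in zip(X, y)) / len(y)
```

Because the 0/1 loss is piecewise constant (hence non-differentiable and non-convex in the model parameters), learning algorithms minimize a surrogate of it instead, which motivates the surrogate functions discussed here.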
Pairwise Constrained Component Analysis (PCCA)
In order to deal with high-dimensional input spaces, PCCA [Mignon and Jurie, 2012] controls the rank of $M \in \mathbb{S}^d_+$ by directly optimizing over the transformation matrix $L \in \mathbb{R}^{e \times d}$, where $e < d$ and $M = L^\top L$ (in the same way as NCA). Optimizing over $L$ ensures that the rank of $M$ is low since $e \geq \mathrm{rank}(L) = \mathrm{rank}(M)$. Their constraints are similar to the ones used in ITML, i.e., $\forall (I_i, I_j) \in \mathcal{S}, D^2_M(I_i, I_j) < 1$ and $\forall (I_i, I_j) \in \mathcal{D}, D^2_M(I_i, I_j) > 1$. Instead of using a hinge loss (or its equivalent formulation with slack variables) to optimize the problem, they use another surrogate of the 0/1 loss, the logistic loss function: $\ell_\beta(x) = \frac{1}{\beta} \ln(1 + e^{\beta x})$. It is a smooth and differentiable approximation of the hinge loss function. Their problem is formulated as $\min_{L} \sum_{(I_i, I_j) \in \mathcal{S} \cup \mathcal{D}} \ell_\beta\big(y_{ij} (D^2_M(I_i, I_j) - 1)\big)$, where $y_{ij} = -1$ if $(I_i, I_j) \in \mathcal{D}$ and $y_{ij} = 1$ otherwise.
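The objective can be evaluated as sketched below. This is an illustrative reimplementation, not the authors' code; the function names and the value of the smoothing parameter beta are assumptions on our part:

```python
import numpy as np

def logistic_loss(x, beta=3.0):
    """Smooth surrogate of the hinge loss: (1/beta) ln(1 + exp(beta x)).
    The value of beta is illustrative; larger beta -> closer to [x]_+."""
    return np.log1p(np.exp(beta * x)) / beta

def pcca_objective(L, pairs, y):
    """Sum of logistic losses over training pairs, with M = L^T L.
    y[k] = +1 for similar pairs (target D^2 < 1) and -1 for dissimilar
    pairs (target D^2 > 1)."""
    total = 0.0
    for (xi, xj), yk in zip(pairs, y):
        d2 = np.sum((L @ (xi - xj)) ** 2)       # D^2_M(I_i, I_j)
        total += logistic_loss(yk * (d2 - 1.0))
    return total
```

Note that computing $D^2_M$ as $\|L(\mathbf{x}_i - \mathbf{x}_j)\|^2$ never forms the $d \times d$ matrix $M$, which is the point of optimizing over $L$ in high dimension.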
The advantage of their method is that it is fast to optimize and returns a low-rank PSD matrix. However, the problem is nonconvex in $L$, and they do not use an explicit regularization term (although they control the rank of $M$ by optimizing over $L$). They therefore use early stopping to avoid overfitting. Structural metric learning Another family of metric learning approaches [McFee and Lanckriet, 2010, Lim et al., 2013] is inspired by structural SVM [Tsochantaridis et al., 2005], which predicts a structured output. Structural SVM can be viewed as a generalization of multi-class SVM [Crammer and Singer, 2002] for which the set of predicted outputs contains structures (instead of labels as in the classic SVM). The output of a structural SVM can be a parse tree, permutation, ranking, sequence alignment, etc. The multi-class SVM [Crammer and Singer, 2002] learns a different model $\mathbf{w}_y \in \mathcal{X}$ for each class $y \in \{1, \dots, K\}$, where $K$ is the number of classes in the training set. For each training example $(\mathbf{x}, y)$, the models are learned so that the prediction score for the true class $y$ is greater (by a margin of 1) than the prediction scores for the other classes: $\forall \bar{y} \neq y, \ \mathbf{w}_y^\top \mathbf{x} \geq \mathbf{w}_{\bar{y}}^\top \mathbf{x} + 1$.
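The multi-class margin constraints can be checked mechanically; the following sketch (a hypothetical helper, not taken from any SVM library) lists the classes whose scores violate the margin for one training example:

```python
import numpy as np

def margin_violations(W, x, y, margin=1.0):
    """For one training example (x, y), list the classes whose score
    violates the constraint w_y^T x >= w_k^T x + margin.
    W holds one row w_k per class."""
    scores = W @ x                     # one prediction score per class
    return [k for k in range(len(W))
            if k != y and scores[y] < scores[k] + margin]
```

In a cutting-plane or stochastic subgradient solver, such violated constraints are exactly the ones that generate an update.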
Table of contents:
1 Big (Visual) Data
1.1 Image Representations for Classification
1.1.1 Visual Bag-of-Words
1.1.2 Deep representations
1.2 Metric Learning for Computer Vision
1.3 Supervised Distance Metric Learning
1.3.2 Distance and similarity metrics
1.3.3 Learning scheme
1.3.4 Review of popular metric learning approaches
1.4 Training Information in Metric Learning
1.4.1 Binary similarity labels
1.4.2 Richer provided information
1.4.3 Quadruplet-wise approaches
1.5 Regularization in Metric Learning
1.5.1 Representative regularization terms
1.5.2 Other regularization methods in Computer Vision
2 Quadruplet-wise Distance Metric Learning
2.2 Quadruplet-wise Similarity Learning Framework
2.2.1 Quadruplet-wise Constraints
2.2.2 Full matrix Mahalanobis distance metric learning
2.2.3 Simplification of the model by optimizing over vectors
2.3 Quadruplet-wise (Qwise) Optimization
2.3.1 Full matrix metric optimization
2.3.2 Vector metric optimization
2.3.3 Implementation details
2.4 Experimental Validation on Relative Attributes
2.4.1 Integrating quadruplet-wise constraints
2.4.2 Classification experiments
2.5 Experimental Validation on Hierarchical Information
2.5.1 Formulation of our metric and constraints
3 Fantope Regularization
3.2 Regularization Scheme
3.2.1 Regularization term linearization
3.2.2 Optimization scheme
3.3 Theoretical Analysis
3.3.1 Concavity analysis
3.3.2 (Super-)Gradient of the regularizer
3.4 Experimental Validation
3.4.1 Synthetic example
3.4.2 Real-world experiments
4 Discovering Important Semantic Regions in Webpages
4.2 Constraint Formalization
4.2.1 Automatic generation of constraints
4.2.2 Similarity information provided by human users
4.2.3 Distance metric formulation
4.3 Visual and Structural Comparisons of Webpages
4.3.1 Regular grid segmentation
4.3.2 Structural segmentation
4.3.3 Integration of structural distance metrics
4.4 Experimental Results
4.4.2 Setup parameter
4.4.3 Evaluation protocol
4.4.4 Learning results without human supervision
4.4.5 Supervised learning results
4.4.6 Structural segmentation maps
A Positive Semidefinite Cone
A.2 Rank of a Matrix
A.3 Projection onto the PSD Cone
B Solver for the Vector Optimization Problem
B.1 Primal Form of the Optimization Problem
B.2 Loss Functions
B.3 Gradient and Hessian Matrices