Image Representations for Classi cation
How to properly represent images for challenging tasks such as classi cation or retrieval, and hence ll the semantic gap, remains a major issue in computer vision. One of the most popular tasks in computer vision is image classi cation which refers to the ability to predict a semantic concept based on the visual content of the image. In this context, the problem of image representation has been extensively studied in the last decades due to its large number of applications. Di erent methodologies have been explored to ful ll this goal. Biologically inspired models [Serre et al., 2007, Theriault et al., 2013] try to mimic the mammalian visual system, and show interesting performances for classi cation and detection. Recently, deep learning has regained a lot of attention due to the large success of deep convolutional networks in the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012)9. Using pixels as input, the network automatically learns useful image representations for the classi cation task. The results [Krizhevsky et al., 2012] reveal that deep learning signi cantly outperforms state-of-the-art computer vision representation competitors. In contexts where fewer training images are available for training, the visual Bag-of-Words (BoW) model [Ma and Manjunath, 1997, Sivic and Zisserman, 2003] proved to be the leading strategy in the last decade and remains a very competitive representation model. We present in the following two popular image representation models used for image classi cation: the visual Bag-of-Words and deep representations.
In the popular classi cation task, many approaches in the last decade have exploited the same classi – cation framework [Lazebnik et al., 2006, Yang et al., 2009, Liu et al., 2011], the only di erence between them is how they ne-tuned the low-level and mid-level feature extraction process to gain in recognition performance.
To better understand this rush for performance, we describe the popular visual Bag-of-Words image representation model that is illustrated in Fig. 1.1 and inspired from the Bag-of-Words used in text information retrieval. The text Bag-of-Words model represents a document by a histogram, it assigns to each term in a document a weight for that term that depends on the number of occurrences of the term
in the document. In the visual BoW model, images are rst decomposed as a set of local features, usually obtained by regular grid-based sampling (i.e., images are segmented as patches that are regularly spaced). Converting the set of local descriptors of an image into the nal image representation is performed by a succession of two steps: coding and pooling. In the original BoW model, coding consists in assigning each local descriptor to the closest visual word, while pooling averages the local descriptor projections. The nal BoW vector, which is the representation of the image, can thus be regarded as a histogram counting the occurrences of each visual word in the image. Since the notion of \word » is not as easily interpretable for image classi cation as for text retrieval, many e orts have been devoted to improve coding and pooling.
Fig. 1.2 illustrates the whole classi cation pipeline of the visual Bag-of-Words model for image clas-si cation. Local features are rst extracted from the input image, and encoded into an o -line trained dictionary. The codes are then pooled to generate the image signature. This mid-level representation is subsequently normalized before training the classi er, which is usually a Support Vector Machine (SVM) [Cortes and Vapnik, 1995] model. Each block of the gure is detailed in the following.
A pioneer work using the visual BoW framework is probably Netra [Ma and Manjunath, 1997] which exploits color feature dictionary learning.
Low-level feature extraction
The rst step of the BoW framework corresponds to local feature extraction. To extract local descriptors, one rst issue is to detect relevant image regions. Many attempts have been done to achieve that goal, generally based on a geometric criterion, using Harris a ne region detector [Harris and Stephens, 1988] or its multi-scale version [Mikolajczyk and Schmid, 2004], SIFT detector [Lowe, 2004], etc. However, for classi cation tasks, most evaluations reveal that a regular grid-based sampling strategy leads to optimal performances [Fei-Fei and Perona, 2005]. In each region of the image, SIFT descriptors [Lowe, 2004] are computed because of their excellent performances attested in various datasets.
Mid-level coding and pooling scheme
We explain here how to compute the mid-level representation of images in order to obtain their BoW representations.
Let X = (x1; : : : ; xj; : : : ; xN ) be the set of local descriptors in an image, where N is the number of local descriptors in the image. In the BoW model, the mid-level signature generation rst requires a set of visual words (also called codewords) bi 2 Rd Mi=1 (where d is the local descriptor’s dimensionality, and M is the number of visual words). This set of visual words is called visual codebook or dictionary, we denote it B.
Table 1.1 gives a matrix illustration of the mid-level representation extraction in the BoW pipeline, for scalar coding and pooling schemes. The set of local descriptors X is represented in columns, while the set of dictionary elements B occupies the rows. One column of the matrix thus represents the encoding of a given local descriptor xj into the codebook, that we denote as f(xj) = uj = (u1;j; u2;j; ; uM;j) 2 RM . In each row, aggregating the codes for a given dictionary element bi results in the pooling operation, denoted as g(X; f).
Codebook Di erent strategies to compute the codebook exist. The codebook can be determined with a static clustering, e.g., Smith and Chang [Smith and Chang, 1997] use a codebook of 166 regular colors de ned a priori. These techniques are generally far from optimal, except in very speci c applications. Usually, the codebook is learned using an unsupervised clustering algorithm applied on local descriptors randomly selected from an image dataset, providing a set of M clusters with centers bi. K-means is widely used in the BoW pipeline. Other approaches [Boureau et al., 2010, Goh et al., 2012] try to in-clude supervision to improve the dictionary learning. However, Coates and Ng [Coates and Ng, 2011] report that dictionary elements learned with \naive » unsupervised methods (e.g., k-means or even ran-dom sampling) are su cient to reach high performances on di erent image datasets. They also claim [Coates and Ng, 2011] that the recognition performance mostly depends on the choice of architecture. Speci cally a good encoding function (i.e., sparse or soft) is required.
Coding The coding step has attracted a lot of attention in the computer vision community, di erent coding methods have thus been proposed. In the original BoW model, the value of ui;j is 1 if bi is the nearest visual word of xj and 0 otherwise. This method is called hard assignment or hard coding. In other methods, such as the Local Soft Coding (LSC) algorithm [Liu et al., 2011], the value of ui;j is between 0 and 1 and grows with the relative proximity between xj and the codeword bi.
Note that some representation models, such as Fisher vector [Perronnin and Dance, 2007] or VLAD [Jegou et al., 2010] descriptors, use a vector representation of ui;j 2 RP , which results in vectors uj in RP M . For instance, the Fisher Vector model [Perronnin et al., 2010] extends the BoW by encoding the average rst- and second-order di erences between the descriptors and codewords.
Pooling The pooling step aggregates the resulting codes ui;j in order to compute the nal vector image representation z = fzigni=1 of the image. The two most popular pooling methods are the sum and the max poolings. Sum pooling counts the number of occurences of each codeword in the image PN (i.e., zi = j=1 ui;j). Max pooling detects for each codeword its maximum score among all the patches of the image (i.e., zi = maxj2f1;:::;Ng ui;j). Sum pooling is particularly useful when hard coding is applied since max pooling would return binary values of zi in this context. In the context of soft coding, where codes are usually real values between 0 and 1, the max pooling plays the role of a codeword detector. Other pooling methods have been proposed. For instance, the BossaNova representation [Avila et al., 2013] keeps more information than the BoW during the pooling step by estimating the distribution of the descriptors around each codeword.
Normalization and learning
Once the signatures of the di erent images in the dataset are computed, the classic approach is to learn a statistical machine learning model, usually an SVM learned using a one-against-all strategy. Some authors normalize the image representations before learning the classifers [Perronnin et al., 2010, Avila et al., 2013]. The choice of normalization also depends on the chosen representation model and classi er model (e.g., linear or non-linear SVM).
The pipeline described in Fig. 1.2 has been exploited in the last decade by many approaches on various datasets [Lazebnik et al., 2006, Yang et al., 2009, Perronnin et al., 2010, Liu et al., 2011]. In particular, many attempts for improving the coding and pooling steps have been done. Fig. 1.3 illustrates the perfor-mance evaluations of di erent state-of-the-art methods on the Caltech-101 [Fei-Fei et al., 2007] dataset. Most of them are extensions of the Bag-of-Words model which improve its mid-level representation.
The improvement of the mid-level step since 2006 signi cantly boosted performances: for exam-ple, using 30 training examples, there is a substantial gain of about 20 points from the baseline work of Lazebnik et al. [Lazebnik et al., 2006] ( 64% in 2006) to the pooling learning method of Feng et al. [Feng et al., 2011] ( 83% in 2011). This work on the BoW model over years demonstrates in this particular application task that the performance of machine learning methods is heavily dependent on the choice of data representations. Especially, as a preliminary of this thesis, we performed a thorough study of the di erent low-level and mid-level parameters of this pipeline that have an impact on classi cation performance. This study led to the following publications [Law et al., 2012a, Law et al., 2014a].
While the methods mentioned above are interested in manually tuning the extraction process to generate a useful image representation, other methods are concerned with questions surrounding how we can best learn meaningful and useful representations of data. The latter approach is known as representation learning and includes distance metric learning.
In the last decade, datasets of labeled images for computer vision tasks (e.g., image classi cation or detection) have grown considerably. Recently, the handcrafted extensions of the BoW pipeline have been substantially outperformed by the latest generation of Convolutional Neural Networks (CNNs) [LeCun et al., 1989] to many tasks that involve very large datasets, particularly in classi cation and de-tection [Krizhevsky et al., 2012, Szegedy et al., 2013]. The general idea of deep representations is to learn a hierarchy of representations from a dataset of images. CNNs have a substantially more sophisticated structure than standard (shallow) representations such as BoWs. They comprise several layers of non-linear feature extractors, and are therefore said to be deep. The representation at each level is composed of lower-level ones. Since they involve a very large number of parameters to learn, they bene t from large scale datasets of images to limit over tting10. Note that an architecture with at least four layers is considered to be a deep representation [Hinton et al., 2006, Bengio et al., 2007].
The architecture of CNNs takes raw input data at the lowest level (i.e., pixels) and processes them via a sequence of basic computational units until the data is transformed to a suitable representation in the higher layers to perform classi cation. Deep connectionist models learn a mapping from in-put data to output classes by attempting to untangle the manifold of the highly nonlinear input space [LeCun et al., 1989]. The strength of these models is that they are learned entirely in a supervised way from the pixel level to the class level.
Furthermore, recent work observed the relevance of deep models for transfer learning. Features learned on the ImageNet dataset may be used successfully in action recognition on a di erent benchmark dataset [Oquab et al., 2014]. Recently, the enthusiasm of computer vision researchers for CNNs has reached the same level as the enthusiasm they had for BoWs some years ago. Many recent works try to extend CNNs to increase performance in the same way as they extended BoW. For instance, at the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 201411, the rst and second places in the localisation and classi cation tracks respectively, were achieved by a very deep architecture that has more than 15 weight layers [Simonyan and Zisserman, 2014]. By contrast, the winner of the ILSVRC-2012 challenge had 8 layers.
The interested reader on deep learning of representations can refer to [Bengio, 2013].
Conclusion The choice of an appropriate image representation model remains a challenging task for the good performance of recognition methods in many computer vision contexts. For instance, deep learning which learns how to represent images directly from pixels, obtains state-of-the results in the context of image classi cation when the model is trained on very large datasets. From this observation, learning representations seems a promising paradigm to investigate for computer vision tasks such as image classi cation. In this thesis, we focus on a special case of representation learning which consists in learning an appropriate distance metric to compare images. For instance, we want to be able to determine whether two images represent the same object or not. For this purpose, we take some given image representation (e.g., BoW or deep features) as input of our model and infer a metric whose goal is to compare two (possibly never seen) images. Our task actually learns a new transformation of the input data such that the Euclidean distance in the transformed space satis es most of the desired properties. We present in the next section some interesting representation learning contexts where an appropriate metric is learned.
Metric Learning for Computer Vision
Metrics play an important role for comparing images in many machine learning and computer vision problems. In this section, we brie y present contexts where learning an appropriate metric may be useful. Some successful examples of applications that greatly depend on the choice of metric are: k-Nearest Neighbors (k-NN) classi cation [Cover and Hart, 1967] where an object is classi ed by a majority vote of its nearest neighbors: the object is assigned to the class most common among its k nearest neighbors. The nearest neighbors are determined based on a given metric (usually the Euclidean distance in the input space). Notably, a recent work [Mensink et al., 2013] has shown that a metric learned for k-NN reaches state-of-the-art performance when new images or classes are integrated in the dataset. K-Means clustering [Steinhaus, 1956, MacQueen et al., 1967] aims at partitioning the training set into K clusters in which each sample belongs to the cluster with the nearest mean. Test samples are assigned to the nearest cluster by distance. Information/Image retrieval [Salton, 1975, Goodrum, 2000] returns (the most) similar samples to a given query. Kernel methods [Scholkopf and Smola, 2001] exploit kernel functions, a special case of similarity metrics. The most popular example of kernel methods is the Support Vector Machines (SVM) model [Cortes and Vapnik, 1995] for which the choice of the kernel, which is critical to the success of the method, is typically left to the user. Moreover, when the data is multimodal (i.e., heterogeneous), multiple kernel learning (MKL) methods [Bach et al., 2004, Rakotomamonjy et al., 2008] allow to integrate data into a single, uni ed space and compare them.
Contexts that transform the input data into another space (usually with lower dimensionality) to make it interpretable are considered as metric learning approaches. Some examples are: Manifold learning Humans often have di culty comprehending data in many dimensions (more than 3). Thus, reducing data to a small number of dimensions is useful for visualization purposes. Moreover, reducing data into fewer dimensions often makes analysis algorithms more e cient, and can help machine learning algorithms make more accurate predictions. One approach to simpli cation is to assume that the data of interest lies on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimensionality, the data can be visualised in the low-dimensional space. The key idea of manifold learning is to learn an underlying low-dimensional manifold preserving the distances between observed data points. Some good representatives of manifold learning are Multidimensional Scaling (MDS) [Borg and Groenen, 2005] and Isomap [Tenenbaum et al., 2000]. Since manifold learn-ing methods do not consider labels of data to learn the low-dimensional space, they are considered as unsupervised metric learning approaches. Eigenvector methods Eigenvector methods such as Linear Discriminant Analysis (LDA) [Fisher, 1938] or Principal Component Analysis (PCA) [Galton, 1889, Pearson, 1901, Hotelling, 1933] have been widely used to discover informative linear transformations of the input space. They learn a linear transformation x 7! Lx that projects the training inputs in another space that satis es some criterion. For instance, PCA projects training inputs into a variance-maximizing subspace while LDA maximizes the amount of between-class variance relative to the amount of within-class variance. PCA can be viewed as a simple linear form of linear manifold learning, i.e., characterizing a lower-dimensional region in input space near which the data density is peaked [Bengio et al., 2013]. Visual high-level attributes While traditional visual recognition approaches map low-level image features directly to object category labels, some recent works have proposed to focus on visual attributes [Farhadi et al., 2009, Lampert et al., 2009]. Visual attributes are high-level descriptions of concepts in images. Generally, they have human-designated names (e.g., striped, four-legged, see Fig. 1.4) and are valuable tools to give a semantic meaning to objects or classes in various problems. They are also easy to interpret and manipulate. Visual attributes have shown their bene t in face veri cation [Kumar et al., 2009] and object classi cation [Lampert et al., 2009, Akata et al., 2013], particularly in the context of zero-shot learning for which the goal is to learn a classi er that must predict novel cat-egories that were omitted from the training set. It is particularly useful for contexts where datasets are large and dynamic (i.e., new images and new classes can be added and the semantics of existing classes might evolve). Indeed, when images of new labels are introduced in the dataset, discriminative models, such as SVM, have to be relearned at a relatively high computational cost in large scale settings (i.e., when the dataset contains more than 10 million images and 10,000 categories). Methods that learn an appropriate metric [Akata et al., 2013, Mensink et al., 2013] have shown promising results in these contexts since the learned metric can be generalized to new images.
In many attribute problems [Parikh and Grauman, 2011, Yu et al., 2013, Akata et al., 2013], a (lin-ear) transformation is learned so that low-level representations of images are projected into a high-level semantic space. Such a space is usually constructed so that each dimension describes the degree of pres-ence of an attribute in a given image. In other words, an image is described by a vector, and each element of the vector is the degree of presence of a given attribute in the image. In the high-level space, images can be semantically compared to one another.
One of the most popular contexts that compare images with attributes is the relative attribute problem [Parikh and Grauman, 2011]. In this problem, the representations of images in the high-level semantic space are learned relatively to the learned representations of other images. The original relative attribute problem considers relations between pairs of classes:
• inequality constraints: i.e., (e) (f): the presence of an attribute is stronger in class (f) than in class (e)
• and equivalence constraints: i.e., (g) (h): the presence of an attribute is equivalent in class (g) and class (h).
This type of relationship is particularly useful when a boolean score for the presence an attribute is di cult to annotate for a class or an image (see Fig. 1.5). Relative attributes have also been used in image retrieval [Kovashka et al., 2012] to nd objects that match semantic queries (e.g., an example query would be \Find a red shoe that is shinier than some given image of shoe »).
Conclusion Similarity metrics are key ingredients of many applications, such as image retrieval. The choice of metric is a di cult task and is determined by the problem at hand. An appropriate metric can be picked by experts in some problems, but it can also be learned to improve performance. In this dissertation, we are interested in supervised distance metric learning that we present in the following.
Throughout this thesis, Sd, Sd+ and Sd++ denote the sets of d d real-valued symmetric, symmetric positive semide nite (PSD) matrices and symmetric positive de nite matrices, respectively. The set of considered images is P = fIigIi=1, each image Ii is represented by a feature vector xi 2 Rd. For matrices A 2 Rb c and B 2 Rb c, denote the Frobenius inner product by hA; Bi = tr(A>B) where tr denotes the trace of a matrix. C(x) is the Euclidean projection of the vector or matrix x on the convex set C (see Chapter 8.1 in [Boyd and Vandenberghe, 2004]). For a given vector a = (a1; : : : ; ad)> 2 Rd, Diag(a) = A 2 Sd corresponds to a square diagonal matrix such that 8i; Aii = ai where A = [Aij]. For a given square matrix A 2 Rd d, Diag(A) = a 2 Rd corresponds to the diagonal elements of A set in a vector: i.e., ai = Aii. (A) is the vector of eigenvalues of matrix A arranged in non-increasing order. (A)i is the i-th largest eigenvalue of A. Finally, for x 2 R, let [x]+ = max(0; x).
Distance and similarity metrics
The choice of an appropriate metric is crucial in many machine learning and computer vision problems For some problems, the selected metric and its parameters are ne-tuned by experts, but its choice remains a di cult task in general. Extensive work has been done to learn relevant metrics from labeled or unlabeled data. The most useful property of metrics in this thesis is that they can be used to compare two never seen samples, i.e., that were not present in the training dataset.
We present here widely used metrics in computer vision, especially the Mahalanobis distance metric which is the focus of this thesis.
Table of contents :
1 Big (Visual) Data
1.1 Image Representations for Classication
1.1.1 Visual Bag-of-Words
1.1.2 Deep representations
1.2 Metric Learning for Computer Vision
1.3 Supervised Distance Metric Learning
1.3.2 Distance and similarity metrics
1.3.3 Learning scheme
1.3.4 Review of popular metric learning approaches
1.4 Training Information in Metric Learning
1.4.1 Binary similarity labels
1.4.2 Richer provided information
1.4.3 Quadruplet-wise approaches
1.5 Regularization in Metric Learning
1.5.1 Representative regularization terms
1.5.2 Other regularization methods in Computer Vision
2 Quadruplet-wise Distance Metric Learning
2.2 Quadruplet-wise Similarity Learning Framework
2.2.1 Quadruplet-wise Constraints
2.2.2 Full matrix Mahalanobis distance metric learning
2.2.3 Simplication of the model by optimizing over vectors
2.3 Quadruplet-wise (Qwise) Optimization
2.3.1 Full matrix metric optimization
2.3.2 Vector metric optimization
2.3.3 Implementation details
2.4 Experimental Validation on Relative Attributes
2.4.1 Integrating quadruplet-wise constraints
2.4.2 Classication experiments
2.5 Experimental Validation on Hierarchical Information
2.5.1 Formulation of our metric and constraints
3 Fantope Regularization
3.2 Regularization Scheme
3.2.1 Regularization term linearization
3.2.2 Optimization scheme
3.3 Theoretical Analysis
3.3.1 Concavity analysis
3.3.2 (Super-)Gradient of the regularizer
3.4 Experimental Validation
3.4.1 Synthetic example
3.4.2 Real-world experiments
4 Discovering Important Semantic Regions in Webpages
4.2 Constraint Formalization
4.2.1 Automatic generation of constraints
4.2.2 Similarity information provided by human users
4.2.3 Distance metric formulation
4.3 Visual and Structural Comparisons of Webpages
4.3.1 Regular grid segmentation
4.3.2 Structural segmentation
4.3.3 Integration of structural distance metrics
4.4 Experimental Results
4.4.2 Setup parameter
4.4.3 Evaluation protocol
4.4.4 Learning results without human supervision
4.4.5 Supervised learning results
4.4.6 Structural segmentation maps
A Positive Semidenite Cone
A.2 Rank of a Matrix
A.3 Projection onto the PSD Cone
B Solver for the Vector Optimization Problem
B.1 Primal Form of the Optimization Problem
B.2 Loss Functions
B.3 Gradient and Hessian Matrices