Mid-level Image Representations: BoW Models and Extensions

Get Complete Project Material File(s) Now! »

Scale Invariant Feature Transformation (SIFT)

SIFT [Lowe, 1999, 2004] is the most widely used local approach for recognition tasks. It was originally proposed as combination of a difference-of-Gaussian (DoG) interest region detector and a histogram of gradient (HoG) locations and orientations feature descriptor. However, both components have also been used in isolation. In particular, a series of studies has confirmed that the SIFT descriptor is suitable for combination with all of the above-mentioned detectors and that it usually achieves good performance [Mikolajczyk and Schmid, 2005]. The SIFT descriptor has 128-dimensional feature vectors. It is invariant to scale, rotation, affine transformations, and partially invariant to illumination changes.
Initially, the SIFT descriptor was proposed to enable efficient point-to-point matching in object recognition tasks [Lowe, 1999]. In more recent works, this technique have been explored in the Bag-of-Words representation [Sivic and Zisserman, 2003], formally introduced in Section 2.2. Several extensions of the original SIFT have been proposed in the literature. For example, PCA-SIFT [Ke and Sukthankar, 2004] applies PCA on normalized gradient patches to reduce the size of the original SIFT descriptor. RIFT [Lazebnik et al., 2005] divides each image patch into concentric rings of equal width to overcome the problem of dominant gradient orientation estimation required by SIFT. GLOH [Mikolajczyk and Schmid, 2005] extends SIFT by changing the location grid to a logpolar one and using PCA to reduce the size. Rank-SIFT [Li et al., 2011] sets each histogram bin to its rank in a sorted array of bins. Also, different ways of extending the SIFT descriptor from grey-level to color images have been proposed by different authors [Bosch et al., 2006; van de Weijer and Schmid, 2006; Burghouts and Geusebroek, 2009; van de Sande et al., 2010].

Speeded Up Robust Features (SURF)

SURF [Bay et al., 2006, 2008] is a scale and rotation-invariant interest point detector and descriptor. The detector is based on the Hessian matrix, but rather than using a different measure for selecting the location and the scale (as was done in the Hessian- Laplace detector), Bay et al. apply the determinant of the Hessian for both. The descriptor, on the other hand, describes a distribution of Haar-wavelet responses within the interest point neighborhood. The SURF descriptor is based on similar properties of localized information and gradient distribution as SIFT, with a complexity stripped down even further.
The main interest of the SURF descriptor lies in its fast computation of approximate differential operators in the scale-space, based on integral images and box-type convolution filters. Moreover, only 64 dimensions are used, reducing the time for feature computation and matching, and increasing simultaneously the robustness.

Mid-level Image Representations: BoW Models and Extensions

Mid-level feature extraction aims at transforming low-level descriptors into a global and richer image representation of intermediate complexity [Boureau et al., 2010a]. That image representation is commonly referred to as mid-level representation, since global features built upon low-level ones typically remain close to image-level information without attempts at high-level semantic analysis. In order to get the mid-level representation, the standard processing pipeline follows three steps [Boureau et al., 2010a]: (i) low-level local feature extraction (previously addressed in the Section 2.1), (ii) coding, which performs a pointwise transformation of the descriptors into a representation better adapted to the task and (iii) pooling, which summarizes the coded features over larger neighborhoods. Classification algorithms are then trained on the mid-level vectors obtained (see Section 2.4). In this section, we approach the family of mid-level representations which is most central to this dissertation. In particular, we focus on the Bag-of-Visual-Words representation (BoW) [Sivic and Zisserman, 2003], the most popular mid-level image representation. Here, instead of an exhaustive survey, we opt for a more formal development: our target is to lay out the mathematical cornerstones common to all BoW representations, exploring how those cornerstones have been established in early works, and how they are evolving in very recent works.

BoW Normalization Techniques

One possibility of increasing the systems’ performance is to carefully examine the feature normalization techniques. In particular, large margin classifiers are known to be sensitive to the way features are scaled (see, for example [Chang and Lin, 2011], in context of SVM). Therefore, despite the fact it has been neglected so far in most research papers in the literature, the mid-level feature normalization is a crucial step. In the Bag-of-Words (BoW) model proposed by Sivic and Zisserman [2003], the vector components are weighted by a tf-idf transformation, where tf means ‘termfrequency’ and idf means ‘inverse document-frequency’. The idea is that word frequency weights words occurring often in a particular document, and thus describe it well, whilst the inverse document frequency downweights words that appear often in the dataset.
A technique usually regarded as part of term weighting is to normalize of the term count by the number of terms in the document (i.e., document length) into a unit-length term frequency vector. This ℓ1 normalization eliminates the difference between long and short documents with similar word distribution. For images, this means normalizing the count of visual words by the total number of local descriptors in each image, which varies greatly according to the complexity of the image scene. Recently, the ℓ1 and ℓ2 normalizations have been widely used to normalize the BoW-based feature vectors, such as [Nister and Stewenius, 2006; J´egou et al., 2010; Perronnin et al., 2010c; Avila et al., 2011; Chatfield et al., 2011; Picard and Gosselin, 2011; Negrel et al., 2012]. The normalization policy can be driven by the kernel choice, and different kernels lead to different normalization strategies. For example, ℓ2 normalization is appropriate when using linear kernels, whereas ℓ1 is optimal when using χ2 or intersection kernels, see [Vedaldi and Zisserman, 2012]. However, there is no general agreement on the benefit of performing these normalizations, because they can discard relevant information.
Contradicting experimental results have been reported in the literature, and the optimal normalization policy remains largely data-dependent. For example, some authors reported that ℓ2 normalization negatively impacts performances, and therefore they chose not performing any normalization method (e.g., [Liu et al., 2011a; Boureau et al., 2011]). In [Yang et al., 2007], this normalization factor is evaluated in two benchmarks, PASCAL VOC and TRECVID. The authors have contradicting observations between the two datasets regarding the normalization factor. In PASCAL VOC, normalized features consistently outperforms un-normalized ones. However, in TRECVID the un-normalized features are always better than their normalized counterparts. According to Yang et al. [2007], a plausible explanation is that, PASCAL VOC has images of various sizes, and its classification performance benefits from the normalization factor which eliminates the difference on image sizes. This is not the case with TRECVID, which contains video frames of identical size, and normalization decreases the performance by suppressing the information on the number of visual words in each video frame.


Machine Learning Algorithms for Image Classification

Machine Learning is concerned with the design and development of algorithms that allow computers to learn based on data. The most fundamental distinction in machine learning is that between supervised and unsupervised learning algorithms.
In supervised learning problems, a machine learning algorithm induces a prediction function using a set of examples, called a training set. Each example consists of a pair formed by an observation annotated with a corresponding label. The goal of the learned function is to predict the correct label associated with any new observation. When the labels are discrete, the task is referred to as a classification problem.
Otherwise, for real-valued labels, the task is referred to as a regression problem. The main goal of a machine learning algorithm is to perform correct predictions for previously unknown observations. Therefore, machine learning is not simply a question of remembering, but mainly of generalizing a model to unknown cases. In practice, a testing set, i.e. a set of examples never seen by the learning algorithm during the training phase, along with a performance measure are thus employed to evaluate the generalization ability of the learned model.
In unsupervised learning problems, one can consider unlabeled training examples and try to uncover regularities in the data. One can also make use of both labeled and unlabeled data for training (typically a small amount of labeled data with a large amount of unlabeled data). This is referred to as semi-supervised learning problem.
In this section, we only consider some of the most successful supervised learning algorithms for image classification problems. We start by presenting the Support Vector Machines (Section 2.4.1), a very popular and powerful learning technique for data classification. Next, we discuss ensemble techniques (Section 2.4.2), a strategy which weighs several individual classifiers, and combines them in order to obtain a classifier that outperforms every single one of them. Finally, we approach the k-Nearest Neighbor classifier (Section 2.4.3).

Biologically-inspired Models

Biologically-inspired computational models for image classification attempt to simulate the process of visual cortex in human vision task [Fukushima and Miyake, 1982; Riesenhuber and Poggio, 1999; Serre et al., 2007; Th´eriault et al., 2012]. Research on biological visual systems has been an important field of study since the awarded work of Hubel and Wiesel [1959, 1968]. Their studies suggested that the processing in the visual cortex follows a hierarchical structure. Thereafter, various hierarchical image classification approaches have been developed. For example, Fukushima and Miyake [1982] proposed Neocognitron, a hierarchical multi-layered network that is capable of merging simple visual features into a more complex whole while retaining some degree of invariance to basic visual transforms.
One biologically-inspired model which has been received attention in recent years comes from the HMAX model of Riesenhuber and Poggio [1999], which focuses less on learning and more on designing simple operations inspired by the visual cortex. This model alternates layers of features extraction with layers of maximum pooling. Several extensions of the HMAX model have been suggested. For instance, Serre et al. [2007] extended the HMAX to add multi-scale representations as well as more complex visual features. Mutch and Lowe [2008] improved the HMAX of Serre et al. [2007] by tuning the complex visual features to the dominant local orientations.
Th´eriault et al. [2011, 2012] proposed to build complex features in terms of the local scales of image structures.
Despite the success of the HMAX model, there are two important limitations of such a model [Han and Vasconcelos, 2010]. First, because the organization of the network lacks a clear computational justification, the HMAX model lacks a principled optimality criterion and training algorithm. That limits its relevance as an explanation for the underlying biological computations. Second, and quite importantly, the HMAX model doesnot account for the psychophysical and physiological evidence on the important role played by visual attention in processes such as object recognition.
Another biologically-inspired model is the Convolutional Neural Network (CNN), introduced by LeCun et al. [1990]. In the original CNN, parameters of the whole network are trained in a supervised manner using the error backpropagation algorithm.
For image classification tasks, several variants of CNN have emerged either supervised feature learning [Nebauer, 1998; LeCun et al., 2004, 2010; Sermanet et al., 2012; Krizhevsky et al., 2012] or unsupervised feature learning [Huang and LeCun, 2006; Ranzato et al., 2007b; Kavukcuoglu et al., 2010; Taylor et al., 2010]. Most forms of CNN models, besides being biologically inspired, should also be considered “deep” models, i.e., models characterized by the presence of several layers of learning nodes (“neurons”), in contrast to the “shallow” models that have at most three layers (input, a single hidden layer, and output). We will explore those “deep” models in more detail in the next section.

Table of contents :

List of Figures
List of Tables
1 Introduction 
1.1 Motivation
1.2 Challenges
1.3 Hypotheses
1.4 Contributions
1.5 Outline
2 Literature Review 
2.1 Low-level Visual Feature Extraction
2.1.1 Feature Spaces
2.1.2 Local versus Global Features
2.1.3 Local Feature Detection Operators
2.1.4 Feature Description
2.2 Mid-level Image Representations: BoW Models and Extensions
2.2.1 Early Techniques
2.2.2 Current Formalism
2.2.3 BoW-based Approaches
2.3 Feature Normalization
2.3.1 Dimensionality Reduction
2.3.2 BoW Normalization Techniques
2.4 Machine Learning Algorithms for Image Classification
2.4.1 Support Vector Machine
2.4.2 Ensemble Techniques
2.4.3 k-Nearest Neighbor
2.4.4 Visual Codebook Learning
2.5 Other Approaches for Image Classification
2.5.1 Biologically-inspired Models
2.5.2 Deep Models
2.5.3 Part-based Category Models
2.6 Conclusion
3 Challenges and Benchmarks Addressed 
3.1 MIRFLICKR Challenge
3.2 ImageCLEF Evaluation Campaign
3.2.1 ImageCLEF 2011 Photo Annotation Challenge
3.2.2 ImageCLEF 2012 Photo Annotation Challenge
3.3 PASCAL VOC Challenge
3.4 15-Scenes Dataset
3.5 Oxford Flowers Dataset
3.6 Conclusion
4 BossaNova Representation 
4.1 Coding & Pooling Matrix Representation
4.2 Early Ideas
4.3 BossaNova Pooling Formalism
4.4 Localized Soft-Assignment Coding
4.5 Normalization Strategy
4.6 Computational Complexity
4.7 BossaNova & Fisher Vector: Pooling Complementarity
4.8 BossaNova as a Fisher Kernel Formalism
4.9 Conclusion
5 Experimental Results 
5.1 BOSSA to BossaNova Improvements Analysis
5.2 BossaNova Parameter Evaluation
5.2.1 Codebook Size
5.2.2 Bin quantization
5.2.3 Minimum Distance αmin
5.3 Comparison of State-of-the-Art Methods
5.3.1 Results for MIRFLICKR
5.3.2 Results for ImageCLEF 2011 Photo Annotation
5.3.3 Results for PASCAL VOC 2007
5.3.4 Results for 15-Scenes
5.4 BossaNova in the ImageCLEF 2012 Challenge
5.5 Conclusion
6 Application: Pornography Detection 
6.1 Related Work
6.2 The Pornography Dataset
6.3 Our Scheme
6.4 Experiments
6.4.1 Experimental Setup
6.4.2 Results
6.4.3 Discussion
6.5 Conclusion
7 Conclusion 
7.1 Contributions
7.2 Future Work
7.3 Publications
A BossaNova Fisher Derivation 


Related Posts