Building a category-level object detector
Suppose we could store all possible images of a given category (for instance, cars). Let us call these images the prototypes. Once a test image is given, we could automatically compare this image to all our prototypes. If one prototype is exactly equal to the test image, the algorithm outputs that the test image contains an object of the same category than the prototype.
Of course, the number of possible images of a given category is enormous, and it is not even nearly possible to collect them all, to store them all, and to compare a test image with all of them.
One more feasible way to build such a detector is to store a reasonable amount of prototypes. In this case, no test image is exactly equal to any prototype. So the algorithm needs to measure the similarity between the test image and the proto-type. If, according to this measure, the test image is similar enough to one of the prototypes, they represent the same category.
One might decompose such a method in three main tasks:
• Picking or making good prototypes that span the set of possible objects in the category, and are dissimilar to objects outside of the category.
• Building a similarity measure that scores high to pairs of images of the same category, and that scores low to other pairs.
• Deciding from the similarity measurements between the test image and the prototypes whether or not it contains an object of the given category.
This thesis focuses on the second point: the construction of a similarity mea-sure.
Aligning to build a similarity measure
The misalignment issue. We would like to build a similar-ity measure to help us deter-mine whether or not two given images are from the same cat-egory. As an example, let us consider the two toy images of cars in Figure 1.1. They cor-respond to different viewpoints, and their shapes are also dissimilar (intra-class variation). They have, however, many common features (headlights, side mirrors, wheels, doors, …) that we might be able to rely on to assess their similarity. Unfortunately, we do not know, a pri-ori, which part of the first image corresponds to which part in the second image. This is the problem of misalignment.
In this section, we try to keep a generic definition of he word « part », which just have to be locally delimited in the image.
Feature matching. One common approach to handle that misalignment prob-lem is to match each part of the first image to the part in the second image which has the most similar local appearance. In addition, in the bag-of-word approach [Sivic and Zisserman, 2003; Lazebnik et al., 2003; Csurka et al., 2004], the local appearances are discretized in a finite set, allowing this type of similarity mea-sures to be computed via a simple histogram comparison, and therefore making it really fast to compute.
The drawback of such techniques is that they do not use the position of each part: a part in the first image can be matched anywhere in the second one. In other word, they waste the geometric information, such that any shuffled versions of two images would have the same similarity measure.
Regular grids. One usual approach [Dalal and Triggs, 2005a; Lazebnik et al., 2006a] to take advantage of the geometric information is to hypothesize that cor-responding parts are roughly at the same position in both images. One way to visualize that technique is to draw a grid on top of the two cars (Figure 1.1). The algorithm compares each cell of the grid of the first image with the cell of the grid of the second image at the same position. However, the same problem arises at a smaller scale: we still do not know how to align elements inside each cell. To tackle that, one can typically use the feature matching approach, described in the previous paragraph. The loss of geometric information is just local, and so is not as serious as in the pure bag-of-word approach.
In the example of Figure 1.1, one can see that the cells at the same position on both images (colored circles) do not match. As a consequence, a similarity measure based on the sum of the local similarities of cells at fixed positions would not work in this example.
However, two commonly used techniques can make it practical:
(1) Using the sliding window approach (Figure 1.2(a)) that consists in com-paring the prototypes with translated (and scaled) versions of the test image. It is equivalent to moving the grid in the test image. We can see in Figure 1.2(a) that, for specific choices of translation, it is possible to match many cells (green circles), but unfortunately there is no translation for which all cells match at the same time.
(b) Increasing the cell size improves toler-
ance to miss-alignment.
(a) Sliding the right grid with different trans-lations allows to match part of the cells (green) but not all simulatenously (red).
(2) Increasing the size of the cells [Lazebnik et al., 2006a]. We can see on figure 1.2(b) that this may allow for many good matches. One extreme example of such a technique is the bag-of-visual-words approach [Sivic and Zisserman, 2003; Lazebnik et al., 2003; Csurka et al., 2004] described in the previous para-graph, where each grid has only one cell. Unfortunately, this kind of method loses all geometric information inside each cell. So the bigger the cells are, the most information is lost.
A direct approach to the misalignment problem
Figure 1.2: In this work, we try to di-rectly find correspondances.
In this thesis, we use a much more di-rect approach to address the misalign-ment issue. Our algorithms try to find correspondences between parts in the first and second image (Figure 1.2), while keeping the geometric relationship between them. This is a hard task that requires more sophisticated techniques. We design a measure of alignment quality, and we define the similarity of the images as the quality of the best possible alignment between them. To compute this quality, we rely on two types of clues: (1) matching parts should have simi-lar local appearance, and (2) the set of correspondences (matches) should satisfy some geometric constraints:
Local appearance. Matching parts should have similar local appearance. This can be easier to estimate than global similarity because, usually, local deforma-tions are less severe than global ones (Figure 1.3). So, we can use a method that assumes local rigidity (or local affine rigidity, since perspective transformation can be assumed locally affine). One can also use local deformation-invariant features. This loses much less exploitable information than global deformation-invariant features.
Figure 1.3: Even though these two cars are impossible to align globally and rigidly, it is possible to do so locally. For instance the two circled wheels are superimposed in this figure to show that they align well.
Geometric constraints. Using local appearance only, one would just match each part of the first image to the most similar part in the second image. This is not satisfying because the rich geometrical relationship between parts is ig-nored. In other words, any spatially shuffled version of the images would give the same similarity measurement. So one usually applies geometric constraints to the matching process. As explained in the two next sections, one can hypothesize a parametric rigid transform, or a deformable alignment.
Aligning rigid objects
Alignment methods were proposed in the 1980s to detect exact replicates of a prototype [Faugeras and Hebert, 1983; Grimson and Lozano-Perez, 1984; Ayache and Faugeras, 1986]. For instance, one can see the results of [Huttenlocher and Ullman, 1987a] in Figure 1.4.
Figure 1.4: The algorithm of [Huttenlocher and Ullman, 1987a] searches for the prototype (left) in the test image (center) by looking for correspondences, generat-ing transformation hypotheses. Then, each hypothesized transformation is verified (right) by transforming the prototype in the test image (right), and checking that they match well, for each.
In such a situation, the alignment problem is reduced to finding the right rotation and translation from the 3D or 2D prototype to the 2D test image. Such transformations are parametric (typically with up to 6 parameters). So one just has to find a few correct correspondences between points in the prototype and test image to estimate the parameters, and, thus, get the transformation for all points.
Of course it is not easy to find even a few accurate correspondences.
Finding good correspondences. More recently, the Scale Invariant Feature Trans-form [Lowe, 2004b] (also known as SIFT) has become the standard tool to find accurate correspondence candidates in the rigid transformation case. First, it ex-tracts repeatable points in both images (i.e. points which are likely to be detected on the same position of objects independently of their viewpoint, sacle,…). Then it computes scale-and-rotation-invariant descriptors of them, and finally generates correspondence candidates between feature points which have similar descriptors and have no other ambiguous possible match.
Selecting coherent correspondences. However the generated correspondence candidates typically contain many outliers. To sort them out, the popular Ran-dom sample consensus [Fischler and Bolles, 1981] (also known as RANSAC) randomly samples a few of them which are used to estimate the parameters of the transformation. Then, it checks how many other correspondence candidates agree with this transformation (inliers). This is iterated many times, and RANSAC picks the transformation which contains the most inliers.
Those techniques combined have been very successful at matching exact same objects, especially when they are flat and highly-textured, and also at matching images of a scene seen from different viewpoints. This has led to many commer-cial systems ( for instance, for 3D reconstruction, dvd/cd/book cover recognition, logo detection, etc.) (e.g. Google Goggles1 , Acute3D2,LTU3,…).
Aligning deformable objects.
However in this thesis we are interested in recognizing object categories (such as cars, dogs, houses). In that case, there is no obvious parametrization of the trans-formation between two images of the same category (due to intra-class variation).
In this situation, it is not possible to compute the global alignment from the prototype to the test image based on a few correct correspondences only. There-fore, we cannot apply the techniques described in the previous section.
Constraining possible alignments. As seen in the previous section, when align-ing rigid objects, the set of possible alignments is rather small / low-dimensional. On the contrary, here, each part of the prototype can potentially match anywhere in the test image. But still not all the alignments are equally desirable. As an example, if the test image is a shuffled version of the prototype, every part of prototype does have a match in the test image, but, the alignment field is highly discontinuous. In contrast, we want to encourage more natural alignments with smooth deformation fields, where two nearby parts of the prototype have nearby matching parts in the test image.
From prototypes to models. Another problem compared to the rigid case is that it is possible that, due to severe intra-class and viewpoint variations, two images from the same category look very different, and therefore we cannot find a good alignment between them. We can only hope that a test image would be similar to at least one of the stored prototypes. So, as explained in section 1.2.1, we need a set of prototypes with the following ideal properties: (1) All the possible images of the given category have a high similarity with at least one prototype. (2) All images not from this category have low similarity with all prototypes. (3) The number of prototypes is small to reduce the computation time.
One can consider that actual images are not the best fit to satisfy those prop-erties. Many works use models rather than prototypes which are more abstract representations of objects from the same category. For instance, in the field of psychology, according to the theory of [Rosch, 1973], the human brain, in order to recognize dogs, do not actually store many images of dogs that it has seen by the past. Rather, it stores a model of an « average dog », and hierarchically, other average models for each dog race, and so on. Examples of model-based imple-mentations are described in the next paragraph.
Influential past works. The early work Pictorial Structure [Fischler and Elschlager, 1973a] uses a hand-designed model ( see Figure 1.5 ) in which a face is decom-posed into several supposedly rigid parts (eyes, mouth, ears,…) connected by stretchable springs. Their algorithm tries to align this model to a given test image, while the springs penalize deformation of the relative positions between parts. This work has introduced the concept of an abstract deformable model that can be aligned to test images. It has been a long source of inspiration: for instance, [Yuille, 1991] extends it by replacing the rigid parts by hand-designed deformable templates (see Figure 1.6).
Figure 1.5: [Fischler and Elschlager, 1973a] match their hand-designed model (left) to the test image (center). The result (right) is the positions of the sub-parts.
In their work, [Lades et al., 1993] extract parts on a regular grid and match prototypes of faces or office objects to the test images to detect these categories (see Figure 1.7). Although it is intellectually satisfying, in theory, to have a few « name-able » parts (ears,mouth,…), [Lades et al., 1993] have shown that rising the number of parts leads to better performance, even though they lose their meaning.
Following that trend, and after the success of the key point detectors [Schmid and Mohr, 1997; Lowe, 2004b], which, in theory, localize interesting and repeat-able points in images, several authors [Berg et al., 2005a; Leordeanu and Hebert, 2005a] have successfully developed algorithms that match hundreds of interest points from prototypes to test images in order to measure their similarity. Then, each test image is assigned to the category of its closest prototype.
In parallel, researchers (e.g. [Fergus et al., 2007; Felzenszwalb et al., 2008b; Leordeanu et al., 2007]) have used machine learning techniques to learn deformable models from training data. These algorithms try to produce models which are similar to training images of the same category and dissimilar to others. The most successful approach is [Felzenszwalb et al., 2008b] (see Figure 1.8) that nowadays produce state-of-the art results on standard detection datasets.
Figure 1.8: [Felzenszwalb et al., 2008b] have produced a state-of-the-art object detector based on a deformable models trained from data. First column: learned HOG root model. Second column: learned parts. Third row: learned deformation field for each part. Last column: detection example, red rectangle for root, blue for parts.
In the previous section, we have described why we want to align images. In this section, we review graph matching, which is a practical method to perform alignment, and which is the main theme of this thesis.
Graph-matching techniques represent an image with a graph. Typically, the nodes of this graph represent local regions of the image, and its edges represent the geometric relationships between nodes ( more concrete examples are described later).
The alignment of two images is reduced to the matching of two graphs. Let us define a node match as a pair of nodes one from each graph, and a graph match as a set of node matches. So instead of finding a generic alignment, one « just » has to find a graph match. Since the graph is a simplified representation of the image, this reduces the complexity of the problem.
Table of contents :
1.2 Motivation and aim of study
1.2.1 Building a category-level object detector
1.2.2 Aligning to build a similarity measure
1.2.3 A direct approach to the misalignment problem
1.2.4 Aligning rigid objects
1.2.5 Aligning deformable objects
1.3 Graph matching
1.4 Objective and contributions of the thesis
2 A tensorial formulation for hyper-graph matching
2.1.1 Motivation and goals
2.1.2 Problem statement
2.1.3 Organization and Contributions
2.2 Previous work
2.2.1 Historical review of literature
2.2.2 Spectral matching
2.2.3 Power iterations for eigenvalue problems
2.3 Proposed approach
2.3.1 Tensor formulation of hypergraph matching
2.3.2 Tensor power iterations
2.3.3 Tensor power iterations for unit-norm rows
2.3.4 Merging potentials of different orders
2.3.5 `1-norm constraint for rows
2.3.6 Building tensors for computer vision
2.4.1 Separable similarity measure
2.5.1 Synthetic data
2.5.2 House dataset
2.5.3 Natural images
2.5.4 3D object matching
2.5.5 Potentials of different orders
3 Graph matching for image categorization
3.1.1 Motivation and goals
3.1.2 Organization and contributions
3.2 Previous work
3.2.1 Review of literature
3.2.2 Caputo’s method
3.3 Proposed approach
3.3.1 Image representation
3.3.2 Matching two images
3.3.3 A kernel for image comparison
3.4.1 Ishikawa’s method
3.4.2 Proposed method: Curve expansion
3.5.1 Running time
3.5.2 Image matching
3.5.3 Image classification
4 Image alignment for object detection
4.1.1 Motivation and goals
4.2 Proposed approach
4.2.1 Image Model
4.2.2 Efficient Similarity Computation
4.3 Proof of concept: automated object discovery
4.4 Learning a detector
4.4.1 Latent SVMs
4.4.2 Hybrid method: Latent SVM and exemplar SVM
4.5 Implementation and Results
5 Conclusion and perspectives
5.2 Future work
5.2.1 Detection experiments
5.2.2 Aspects and full object model
5.2.3 Joint Alignment of multiple images
5.3 Other work
A Power iterations for unit norm rows