CNN features for 3D objects and images


Domain gap between synthetic and real images

Leveraging 3D models enables retrieval on real images when the query object is very specific and annotated photographs are unavailable or hard to obtain. It also makes it possible to artificially extend an existing dataset, for example to obtain a diverse dataset with balanced orientations for each class. There is, however, a considerable visual difference between synthetically generated images and natural images: the former usually lack texture and context, whereas the latter are usually rich in visual detail. Figure 1.4 illustrates this difference. One way to reduce the gap would be to create a realistic 3D scene for each object. Such scene creation would be very time-consuming, as it usually requires not only good-quality textures but also a full scene model and a realistic lighting model. CNN features are able to extract both low-level and high-level information from images, but it is unclear whether they can be used directly across such disparate domains, or whether substantial modifications to these features are needed.

Handling diversity and ambiguity

Predicting the orientation of a whole object category in real images is a difficult task for a variety of reasons:
1. It requires different levels of invariance for different properties. On the one hand, the prediction must be invariant to illumination, texture and intra-class variability. On the other hand, it must be discriminative enough to identify small angle perturbations, which barely change the image, as can be seen in Figure 1.5.
2. The pose of a rigid object instance or category, while well defined for completely asymmetric classes, is usually ill-defined when symmetries are involved. Consider a square table, for example: turning it by 90 degrees does not affect its geometry. As the orientation of an object is a continuous quantity, it is natural to express pose estimation as a regression problem. There is a fundamental difficulty with this formulation, though, as it cannot represent ambiguities in the prediction well.
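The difficulty with regression can be illustrated with a small sketch. It is a toy setting, not the method of this work: the annotations, the helper names `best_constant_l2` and `bin_histogram`, and the choice of 8 bins are all assumptions made for the example. For a square table whose four indistinguishable orientations are labelled equally often, the single value that minimizes the squared error is their average, which matches none of the valid poses, whereas a distribution over orientation bins keeps all four modes.

```python
from collections import Counter

# Hypothetical annotations for a square table: the four orientations
# (in degrees) look identical, so each one is labelled equally often.
angles = [0.0, 90.0, 180.0, 270.0] * 25


def best_constant_l2(targets):
    """Value minimizing the mean squared error: the arithmetic mean."""
    return sum(targets) / len(targets)


def bin_histogram(targets, n_bins=8):
    """Empirical distribution of the targets over n_bins equal angle bins."""
    width = 360.0 / n_bins
    counts = Counter(int(t // width) % n_bins for t in targets)
    return [counts.get(b, 0) / len(targets) for b in range(n_bins)]


regression_prediction = best_constant_l2(angles)
class_probabilities = bin_histogram(angles)

print(regression_prediction)  # a single angle that is none of the valid poses
print(class_probabilities)    # one peak per equivalent orientation
```

The sketch treats angles as plain numbers and so ignores their circular nature, which is a separate issue from the ambiguity shown here.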

Origins of Artificial Neural Networks

Artificial Neural Networks are a family of parametric models with a specific hierarchical structure. The structure is a combination of linear functions followed by non-linearities, which allows the model to learn complex non-linear functions in a compact manner. In this section, we give a brief overview of the mathematical models from which artificial neural networks originated.
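As a minimal sketch of this structure, the snippet below chains two linear functions, each followed by a non-linearity. The weights, the layer sizes and the choice of ReLU and sigmoid are arbitrary assumptions made for illustration; they are not taken from any model discussed here.

```python
import math


def linear(weights, bias, x):
    """Linear function: one weighted sum plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise non-linearity max(0, a)."""
    return [max(0.0, a) for a in v]


def sigmoid(v):
    """Element-wise non-linearity 1 / (1 + exp(-a))."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def network(x):
    """Two stacked (linear function, non-linearity) pairs with made-up weights."""
    W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]
    W2, b2 = [[1.0, -2.0]], [0.0]
    hidden = relu(linear(W1, b1, x))
    return sigmoid(linear(W2, b2, hidden))


output = network([1.0, 1.0])
print(output)  # a single value in (0, 1)
```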

Multi-layer neural networks

Despite the initial success of the perceptron in identifying digits in small images, its representational power is very limited. Indeed, it can only learn predictions that are linearly separable in the input space, which is rarely the case. Several extensions were proposed to overcome this limitation. In particular, networks that contain internal representations (also called hidden layers) which are non-linear with respect to the input data have more expressive power. Unfortunately, the delta rule explained before does not apply in such situations, as it was specifically tailored for the case where there are no hidden layers, so other learning techniques are needed. One early example is the Neocognitron [38], which stacked several layers of linear functions followed by non-linearities, and performed learning with an unsupervised approach based on self-similarity between the input elements and the weights of the model. Although such an approach makes it possible to learn networks with hidden layers, there is no explicit constraint ensuring that the hidden layers learn an appropriate mapping. As we will see later in this section, it is possible to extend the delta rule to work for such multi-layer neural networks [97, 70]. Before that, we first introduce a sub-category of multi-layer networks called feed-forward neural networks, a commonly employed architecture for several tasks.
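The limitation to linearly separable problems can be seen on the XOR function, a standard textbook case that is used here as an assumed example. In the sketch below, a single threshold unit trained with a perceptron-style error-correction update (a stand-in for the delta rule mentioned above, not necessarily its exact form) fits AND but never fits XOR, while a network with one hidden layer and hand-set weights represents XOR exactly. All function names and weights are hypothetical.

```python
def step(a):
    """Threshold non-linearity."""
    return 1 if a > 0 else 0


def train_perceptron(samples, epochs=100, lr=1.0):
    """Train one threshold unit; return (weights, bias, errors in last epoch)."""
    w, b = [0.0, 0.0], 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            delta = target - step(w[0] * x1 + w[1] * x2 + b)
            if delta != 0:
                errors += 1
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
        if errors == 0:  # every sample classified correctly
            break
    return w, b, errors


def xor_with_hidden_layer(x1, x2):
    """Hand-set two-layer network: XOR = OR and not AND."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_samples = [(x, x[0] & x[1]) for x in inputs]
xor_samples = [(x, x[0] ^ x[1]) for x in inputs]

and_errors = train_perceptron(and_samples)[2]
xor_errors = train_perceptron(xor_samples)[2]
xor_outputs = [xor_with_hidden_layer(*x) for x in inputs]

print(and_errors)   # 0: AND is linearly separable
print(xor_errors)   # stays above 0: no single unit separates XOR
print(xor_outputs)  # [0, 1, 1, 0]
```

The hidden-layer weights are set by hand precisely because the single-unit update rule gives no way to learn them, which is the gap the extension of the delta rule addresses.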

Table of contents:

1 Introduction
1.1 Objectives
1.2 Motivation
1.3 Challenges
1.3.1 Computational challenges
1.3.2 Domain gap between synthetic and real images
1.3.3 Handling diversity and ambiguity
1.4 Contributions
1.4.1 Publications
1.4.2 Software contributions
1.5 Thesis outline
2 Background
2.1 Machine Learning Framework and Notations
2.1.1 Machine Learning
2.1.2 Supervised Learning framework
2.2 Artificial Neural Networks
2.2.1 Origins of Artificial Neural Networks
2.2.2 Multi-layer neural networks
2.2.3 Convolutional Neural Networks
2.3 Object detection
2.3.1 Classical view
2.3.2 CNN-based object detection
2.4 Pose estimation
2.4.1 Contour-based alignment
2.4.2 Part-based alignment
2.4.3 Category pose estimation
3 Preliminary studies
3.1 Optimizing memory use in CNNs
3.1.1 Overview
3.1.2 Computation graph construction from containers
3.1.3 Selecting reusable buffers
3.1.4 Results
3.1.5 Conclusion
3.2 CNN features for 3D objects and images
3.2.1 Similarity measure
3.2.2 Feature representation
3.2.3 Aspect ratio filtering
3.2.4 Results
3.2.5 Conclusion
3.3 Multi-view 3D model retrieval
3.3.1 Method
3.3.2 Qualitative results
3.4 Conclusion
4 Detection
4.1 Introduction
4.1.1 Related Work
4.1.2 Overview
4.2 Adapting from real to rendered views
4.2.1 Adaptation
4.2.2 Similarity
4.2.3 Training data details
4.2.4 Implementation details
4.3 Exemplar detection with CNNs
4.3.1 Exemplar-detection pipeline
4.3.2 CNN implementation
4.4 Experiments
4.4.1 Detection
4.4.2 Algorithm analysis
4.4.3 Evaluation of the retrieved pose
4.4.4 Computational run time
4.5 Conclusion
5 Pose Estimation
5.1 Introduction
5.2 Overview
5.3 Approaches for viewpoint estimation
5.3.1 Viewpoint estimation as regression
5.3.2 Viewpoint estimation as classification
5.4 Joint detection and pose estimation
5.4.1 Joint model with regression
5.4.2 Joint model with classification
5.5 Experiments
5.5.1 Training details
5.5.2 Results
5.6 Conclusion
6 Discussion
6.1 Contributions
6.2 Perspectives
6.2.1 Improving memory optimization in training mode
6.2.2 Object compositing
6.2.3 Retrieval with millions of objects
6.2.4 Additional constraints in multi-view instance retrieval
6.2.5 Metric learning for domain adaptation