Computer vision, image and video understanding


With the goal of improving image and video understanding, the computer vision field challenges itself with increasingly difficult proxy tasks. Solving these tasks pushes visual representations forward.
In the past two decades tremendous progress has been made. Object recognition is one of the tasks that helped the computer vision field grow. It aims at identifying objects in an image. The main difficulty of this task lies in the diversity of objects: they vary in scale, shape, and color depending on the illumination and the viewpoint of the scene. For a given object, for example a chair, many types of chairs exist with large variations in appearance. Creating a model able to generalize and detect any type of chair requires a semantic understanding of the concept of a chair, i.e. an object one can sit on. However, since the only goal of classification is to predict the presence or absence of an object in an image, a model can mostly rely on context and textures instead of truly grasping the concept behind the object. A wooden object in a living room is more likely to be a chair than a car; nonetheless we would like to be able to detect a car even if it is in the middle of a living room.
Nevertheless, promising results were obtained on the classification task, and the field moved toward more challenging tasks extending simple classification with localization. For example, the goal of object detection is to detect and localize an object with a bounding box. Finer localization was also tackled, trying to segment the object or even its individual instances. Focusing on localization tasks helped improve visual representations, which must capture high-level concepts while preserving low-level, pixel-wise accuracy.

Extending computer vision using semantic

With the growing performance of models on pure computer vision tasks and a first glimpse of their ability to capture semantics, it is a natural next step to strengthen the connection between the visual and semantic domains. This can be done by bridging the gap between visual and natural language tasks. The most direct example of such a task is captioning, which aims at producing a short textual description of a given visual scene. In that form, the textual caption can be seen as a high-level representation of the image, with natural language as the encoding. Producing such captions requires a strong understanding of the visual modality, including the detection of the objects in the scene, their attributes, and their relations with each other.
Another task connecting the visual and textual modalities is Visual Question Answering (VQA), whose goal is to answer any natural language question about any type of image. It requires high-level understanding of both the visual and textual modalities as well as advanced reasoning: answering a question might require multiple steps of reasoning, each involving different parts of the image as well as high-level concepts, object relationships, and even common sense.
Lastly, and this is the approach of interest in this thesis, we focus our attention on joint semantic and visual representations. With the availability of multimodal data, it is now possible to directly align the visual domain with the semantic domain. Given a set of visual data where each image is paired with a caption describing its content, joint semantic and visual representation aims at creating a common representation space where images and texts are comparable. Using the properties of such a space to measure similarity between elements of different modalities opens up multimodal retrieval: using a natural language query describing a scene to retrieve the picture of the corresponding scene from a large database. While retrieval is the most obvious application of a joint multimodal space, it can also serve as a basis for any application requiring visual and semantic information, such as captioning, visual grounding of text, text-guided image generation, or object relationship detection.
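To make the retrieval mechanism concrete, below is a minimal sketch, assuming image and caption embeddings have already been produced by some embedding model and L2-normalized; the names, shapes, and data are purely illustrative, not taken from the thesis code.

```python
# Minimal sketch of cross-modal retrieval in a joint embedding space.
# With L2-normalized vectors, the dot product equals cosine similarity.
import numpy as np

def retrieve(query_embedding, image_embeddings, k=5):
    """Return the indices of the k images closest to a text query."""
    scores = image_embeddings @ query_embedding   # (N,) similarity scores
    return np.argsort(-scores)[:k]                # indices of the top-k images

# Toy usage: 1000 images and one caption embedded in a 512-d joint space.
rng = np.random.default_rng(0)
images = rng.normal(size=(1000, 512))
images /= np.linalg.norm(images, axis=1, keepdims=True)
caption = rng.normal(size=512)
caption /= np.linalg.norm(caption)
print(retrieve(caption, images))
```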

Contribution

In this thesis, we aim to further advance image representation and understanding. Revolving around Visual Semantic Embedding (VSE) approaches, we explore several directions. First, we present the relevant background in Chapter 2, covering image and textual representations and existing multimodal approaches. Then, in Chapter 3, we propose novel architectures further improving the retrieval capability of VSE. In Chapter 4 we extend VSE models to novel applications and leverage embedding models to visually ground semantic concepts. Finally, in Chapter 5 we delve into the learning process and in particular the loss function.
This thesis is based on the work published in the following articles:
• Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. “Finding beans in burgers: Deep semantic-visual embedding with localization.” In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3984–3993
• Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. “SoDeep: a Sorting Deep net to learn ranking loss surrogates.” In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 10792–10801
• Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. “Semantic-Visual Embedding with Distributed Self-Attention.” In: arXiv preprint. 2020

Industrial context

This thesis has been completed within Technicolor and Interdigital. Technicolor is a French corporation that provides services and products for the communication, media and entertainment industries. Interdigital is a mobile technology research and development company which acquired Technicolor Research & Innovation Activity in 2019.
In the context of movie production, a tremendous amount of multimedia content is produced daily. One motivation behind this thesis is to be able to process this content so it can easily be retrieved later. An effective retrieval system requires a fine-grained visual representation capturing discriminative features; it also needs to be intuitive to use. Recent deep learning systems based on visual-semantic embeddings combine both of these properties. For Technicolor/Interdigital, further exploring multi-modality interaction is of high interest and is studied in this thesis.
In this thesis we aim at learning VSE models, and the supervision comes as aligned pairs of images and their textual descriptions. One of the goals is to solve a retrieval task: given a query and a dataset, the VSE model needs to produce a ranking while being trained only with pairs, making the setting close to weakly supervised.
In that sense it shares similarities with metric and representation learning, in particular with respect to the objective function. Several methods have been proposed to learn such metrics. In pairwise approaches, [95] minimizes the distance within pairs of similar training examples under a constraint on the distance between dissimilar ones. This learning process has been extended to kernel functions as in [65]. Other methods consider triplets or quadruplets of images, which are easy to generate from classification training datasets, to express richer relative constraints among groups of similar and dissimilar examples [8, 31, 91].
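As an illustration of such relative constraints, here is a hedged sketch of a triplet objective using PyTorch's built-in nn.TripletMarginLoss; the embedding dimension, batch size, and margin are arbitrary choices, not the settings used in the cited works.

```python
# Sketch of a triplet-based metric learning objective.
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.2)

# Anchor, positive (similar), and negative (dissimilar) embeddings; batch of 32.
anchor = torch.randn(32, 128, requires_grad=True)
positive = torch.randn(32, 128)
negative = torch.randn(32, 128)

# The loss pushes d(anchor, positive) below d(anchor, negative) by the margin.
loss = triplet_loss(anchor, positive, negative)
loss.backward()   # gradients flow back to the anchor embeddings
print(loss.item())
```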

Neural networks

In a feedforward neural network, the input data is progressively transformed by an alternating sequence of projection layers and non-linear activation functions until a final projection is produced. A feedforward network with $n$ layers can be formally represented as follows:
$f_w(x) = f_{w_n}(f_{w_{n-1}}(\dots f_{w_2}(f_{w_1}(x))))$ .  (2.4)
Stacking a large number of such layers produces “deep” architectures, which is the origin of the name Deep Neural Network (DNN) and of the term deep learning. A DNN being differentiable, its parameters can be learned using gradient-based methods. Each parameter is updated using the gradient of the objective function with respect to that parameter, computed with the backpropagation algorithm [78]. The size of each step taken in the gradient direction is weighted by a learning rate $\eta$, resulting in the following update of the parameters: $w \leftarrow w - \eta \, \nabla_w J(w)$ .  (2.5)
In practice, computing the exact gradient is too expensive because it requires evaluating the model on every example of the dataset. To overcome this problem, methods such as Stochastic Gradient Descent (SGD) [73, 84] or Adam [50] estimate the gradient from a randomly sampled subset of the data called a minibatch.
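The following minimal sketch ties equations (2.4) and (2.5) together: a small feedforward network trained with minibatch SGD in PyTorch. The layer sizes, learning rate, and synthetic data are placeholders, not settings used in this thesis.

```python
# A small feedforward network (Eq. 2.4) trained with minibatch SGD (Eq. 2.5).
import torch
import torch.nn as nn

# Alternating projections and non-linearities, as in Eq. (2.4).
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # lr plays the role of eta

for step in range(100):
    # A randomly sampled minibatch (here synthetic data, batch size 32).
    x = torch.randn(32, 64)
    y = torch.randint(0, 10, (32,))

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()    # gradients computed by backpropagation
    optimizer.step()   # w <- w - eta * grad J(w)
```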
Learning deep models from scratch is a challenging task. To obtain a model that generalizes well to unseen data, the ratio between the number of parameters $|w|$ and the number of data samples $N$ needs to be balanced. Training very large neural networks such as VGG-19 (144 million parameters, 19.6 billion FLOPs) or ResNet-152 (60 million parameters, 11.3 billion FLOPs) requires a large amount of data.
Fortunately, it has been shown that the features learned by such networks can easily be reused on different tasks [99]. It is therefore common to take a neural network pretrained on a large dataset and transfer it to a new task or dataset, often only training additional adaptation layers.
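A possible PyTorch sketch of this transfer strategy is shown below: it reuses an ImageNet-pretrained ResNet from torchvision, replaces its classifier with a new adaptation layer for a hypothetical 10-class task, and also anticipates the finetuning variant discussed in the next paragraph. The target task and the learning rates are illustrative assumptions.

```python
# Sketch of transfer learning: pretrained backbone plus a new adaptation layer.
import torch
import torch.nn as nn
import torchvision

# ImageNet-pretrained backbone; the original 1000-way classifier is removed.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()

# New task-specific adaptation layer (hypothetical 10-class target task).
adaptation = nn.Linear(2048, 10)

# Option 1: keep the backbone frozen and train only the adaptation layer.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(adaptation.parameters(), lr=1e-2)

# Option 2: finetune end to end, with a smaller learning rate on the backbone
# so as to preserve the low-level processing it has already learned.
for p in backbone.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD([
    {"params": backbone.parameters(), "lr": 1e-4},
    {"params": adaptation.parameters(), "lr": 1e-2},
])
```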
Pushing the adaptation one step further, it is now quite common to “finetune” the parameters of the pretrained network. In other words, when a large pretrained network is used inside a new model, its weights can be updated in an end-to-end manner, usually with a smaller learning rate $\eta$ in order to preserve the low-level processing already learned.

universal approximator
While it is proven that multilayer feedforward networks are universal approximators [14, 40], architecture design is still empirically driven. Indeed, the universal approximation capability of neural networks does not consider their learnability properties. While multiple architectures could potentially reach the same optimal approximation given a proper combination of parameters, converging to this optimal solution in a very large parameter space is still a challenge.
Deep learning mostly relies on gradient-descent optimization; since it is applied to non-convex functions, convergence to the global minimum is not guaranteed. That is where the choice of neural network architecture comes in. Different architectures lead to different landscapes of the parameter space and can improve learnability in multiple ways: they can change convergence speed, stability, and robustness to initialization, or help to converge to a better local minimum.
However, modeling these characteristics theoretically is still an open problem, and most approaches are only evaluated empirically. For this reason the deep learning field, with its fast-paced progress, has mostly adopted an empirical paradigm.

Mono-modal representation

Visual Semantic Embedding models aim to learn a joint visual and semantic representation. Before exploring multimodal interaction, let's first look at existing approaches to generate visual and textual representations.

Computer vision and image representation

The goal of computer vision is to capture high-level understanding from digital images. In this section we quickly introduce traditional computer vision approaches and cover their modern deep learning counterparts more extensively.
traditional computer vision
Most computer vision tasks require transforming the low-level information of the visual domain into a compact high-level vector representation. The main challenge resides in projecting the pixel space of images into a smaller vector space while preserving the content information. Multiple handcrafted methods exist, such as appearance-based methods representing an image by its edge, gradient, or color histograms.
In the early 2000s, dictionary learning methods such as the bag-of-visual-words model [64] were popular. Local features such as SIFT [62] were first extracted, then encoded as a function of a dictionary of visual words, and finally the visual codes were aggregated into a single vector representation. In the context of object classification, one would typically use this single vector representation to separately train a classifier, be it a linear classifier, a Support Vector Machine (SVM) [13], or a MultiLayer Perceptron (MLP).
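As a rough sketch of such a pipeline, the snippet below combines OpenCV SIFT features, a k-means visual dictionary, and a linear SVM. The image paths, labels, and dictionary size are placeholders, and this is only an illustration of the general recipe, not a reproduction of [64].

```python
# Hedged sketch of a bag-of-visual-words pipeline:
# SIFT features -> k-means dictionary -> histogram encoding -> linear SVM.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()

def sift_descriptors(path):
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(image, None)
    return desc if desc is not None else np.zeros((0, 128), np.float32)

# 1) Build a visual dictionary from descriptors of the training images.
train_paths, train_labels = ["img0.jpg", "img1.jpg"], [0, 1]   # placeholders
all_desc = np.vstack([sift_descriptors(p) for p in train_paths])
dictionary = KMeans(n_clusters=128, n_init=4).fit(all_desc)    # 128 visual words

# 2) Encode each image as a normalized histogram over the visual words.
def encode(path):
    words = dictionary.predict(sift_descriptors(path))
    hist, _ = np.histogram(words, bins=np.arange(129))
    return hist / max(hist.sum(), 1)

features = np.stack([encode(p) for p in train_paths])

# 3) Train a separate classifier on the fixed features.
classifier = LinearSVC().fit(features, train_labels)
```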
In this framework, classification is a two-step process: features are first extracted, either by a handcrafted method or learned with an unsupervised method, and a classifier is then trained on these features.

computer vision with deep learning
With deep learning models, it is now common to learn visual representations from scratch in an end-to-end manner. Low-level processing is learned at the same time as higher-level reasoning. For example, in a Convolutional Neural Network (CNN), the first few layers are learned filters resembling Gabor filters and color blobs, while deeper layers capture more specialized features such as shapes or objects [54].
The convolution layer [56] was introduced to handle visual data: the learnable weights are shared by sliding a single kernel over the visual space, greatly reducing the total number of parameters, allowing the processing of images of any size, and making the model invariant to translation.
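A small example of this weight sharing, assuming a PyTorch convolution layer with illustrative channel counts: the same kernels are applied to inputs of any spatial size, so the parameter count is independent of the image resolution.

```python
# Weight sharing in a convolution layer: one set of kernels, any input size.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 3*3*3*64 + 64 = 1792 parameters

small = torch.randn(1, 3, 32, 32)
large = torch.randn(1, 3, 224, 224)
print(conv(small).shape, conv(large).shape)        # same layer handles both sizes
```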
AlexNet [54] is a CNN consisting of 5 convolutional layers. It uses the Rectified Linear Unit (ReLU) as its non-linear activation function, and its final layer is a linear projection outputting the scores for the 1000 classes of the ImageNet classification challenge.
When it was proposed in 2012, it nearly halved the top-5 classification error, reaching 15.3% compared to the 26% of the best non-deep approaches. The top-5 error dropped to 7.3% in a later edition of the challenge with the introduction of VGG [82] and its deeper architectures of 13 and 16 convolutional layers followed by 3 fully connected layers.
Finally, the next big improvement in the ImageNet challenge came from the introduction of the ResNet architecture by He et al. [37]: assuming that it is easier to learn a residual mapping than an unreferenced mapping, this approach uses skip connections to force the network to learn a residual. It allowed for the training of much deeper architectures, with neural networks of up to 200 layers. ResNet achieves a top-5 classification error of 3.6% on ImageNet. Diagrams of the VGG-19 and ResNet-34 architectures can be found in Figure 2.1.
In a residual network, the output of the n-th residual layer can be expressed as follows: $x_{n+1} = f_{w_n}(x_n) + x_n$ ,  (2.6) with $x_n$ and $x_{n+1}$ the input and output of layer $n$; the layer only needs to learn the residual mapping $f_{w_n}(x_n)$.
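Equation (2.6) translates directly into a small module. The sketch below is a simplified residual block that keeps the number of channels fixed, unlike the downsampling blocks used in the actual ResNet architecture.

```python
# Simplified residual block implementing x_{n+1} = f_w(x_n) + x_n.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # The residual mapping f_w: two convolutions with batch norm.
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)   # skip connection adds the input back

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)          # torch.Size([1, 64, 56, 56])
```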
With the top-5 classification error becoming extremely low, the focus of the field moved toward more difficult localization tasks. For those tasks, one key architectural innovation was the use of fully convolutional networks [61], obtained by getting rid of the final linear layers. Fully convolutional networks are completely independent of the resolution of the input image and preserve spatial information all the way to the final output.
Localization approaches will be detailed further in Section 2.5. We only mention here a recent popular visual representation closely linked to localization models, called bottom-up features [1]. They are produced by a Faster R-CNN [75] trained on an object recognition task. Instead of producing features straight from a convolutional network with a regular “grid-like” coverage of the entire image, they rely on a region proposal network extracting bounding boxes containing potential objects. Convolutional features are then extracted for each of these bounding boxes to form the final image representation. This helps the image representation focus on the important parts of the image and capture more details by using a higher resolution on individual bounding boxes.
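The sketch below gives a rough approximation of this idea using off-the-shelf torchvision models: a Faster R-CNN proposes regions and a separate ResNet encodes each region crop. The actual bottom-up features of [1] are extracted inside the detector itself, so this is only an illustration of a region-based representation, not a reimplementation.

```python
# Hedged sketch of region-based image features: detect boxes, encode each region.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT").eval()
# ResNet-50 without its final classifier, used as a region encoder.
encoder = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights="DEFAULT").children())[:-1]).eval()

image = torch.rand(3, 480, 640)   # placeholder image tensor with values in [0, 1]
with torch.no_grad():
    boxes = detector([image])[0]["boxes"][:36]   # keep the top detected regions
    # Pool a fixed-size patch from each region, then encode it with the CNN.
    regions = roi_align(image.unsqueeze(0), [boxes], output_size=(224, 224))
    region_features = encoder(regions).flatten(1)   # (num_regions, 2048)
```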
As architecture design is still a very active field, it would be impossible to make an exhaustive list of architectures; we chose to present those most commonly used for feature extraction and image representation. In this thesis we build on top of these approaches, mostly using ResNet and Faster R-CNN as the backbones of our image representation pipeline.


Computer vision datasets

While important progress has been made on deep learning architectures, it was only made possible by the emergence of large-scale databases of richly annotated images. We introduce below some of the major computer vision datasets; Figure 2.2 contains example images from three of them.
Figure 2.2: Examples of images and annotations from different datasets. Left: ImageNet, center: MS-COCO, right: Visual Genome. The illustrations are taken from [79], [60], and [53].
• ImageNet, proposed by [16], consists of 14.2 million images and more than 20,000 classes. However, the ImageNet Large-Scale Visual Recognition Challenge only uses a subset of 1.2 million images and 1000 classes. The annotation consists of one class label per image for image classification, and also contains bounding boxes for object detection. It consists mainly of simple scenes, usually focused on a single object.
• MS-COCO [60] contains 120,000 images of various types of scenes with a higher complexity than ImageNet. The annotation is also richer but covers a smaller number of classes: its 80 classes are annotated with bounding boxes, segmentation masks, and instance segmentation masks. Every image is also annotated with 5 short textual captions describing the scene.
• Flickr-30K [100] contains 30,000 images. Similarly to MS-COCO, the annotation consists of 5 short textual captions per image describing the scene.
• Visual Genome [53] contains 108,000 images. It shares some images with MS-COCO and extends their annotation with visual questions and answers, object attributes, and relationships between objects.
Our work leverages all of the datasets listed above. ImageNet is used for pretraining the visual representation, while Flickr-30K and MS-COCO provide the direct supervision to align the visual and textual domains. Finally, Visual Genome is only used for the evaluation of the localization tasks.

Natural language processing and text representation

Natural language processing shares a similar goal with computer vision: capturing high-level understanding of digital text. The early days of Natural Language Processing (NLP) adopted a rule-based paradigm, using hand-written grammars and heuristic rules [93]. In this section we focus on the recent statistical learning paradigm and the use of deep learning models to derive meaningful vector representations of any given text.
As opposed to computer vision, in language a single word already encodes high-level semantics, while a single pixel is meaningless. Text also has a sequential structure: in a sentence, the order of the words carries important information. The methods used to compute textual representations therefore need to take these differences into account and differ from the ones used in computer vision.

word and sentence representation
Learning textual representations is often unsupervised, relying only on large corpora of text with no extra annotation. Word and sentence representations are learned using sequence models trained to predict the context of a given word or sentence. For example, the Word2vec (W2v) word representation proposed by Mikolov et al. [66] uses the skip-gram architecture. With the intuition that words used in similar contexts should have similar representations, given the representation of a word, the model aims at predicting the surrounding words in a text. After seeing the same word in many different contexts, the model converges toward a representation of the word that captures some of its original semantics.
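As an illustration, a skip-gram Word2vec model can be trained with the gensim library as sketched below (sg=1 selects the skip-gram architecture); the toy corpus and hyperparameters are placeholders, since real models are trained on corpora of billions of tokens.

```python
# Hedged sketch of learning skip-gram word embeddings with gensim.
from gensim.models import Word2Vec

corpus = [
    ["a", "man", "is", "sitting", "on", "a", "chair"],
    ["a", "woman", "is", "sitting", "on", "a", "bench"],
    ["a", "car", "is", "parked", "on", "the", "street"],
]
# sg=1: skip-gram; vector_size and window are illustrative hyperparameters.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, sg=1, min_count=1)

vector = model.wv["chair"]                    # 50-d representation of "chair"
print(model.wv.most_similar("chair", topn=3)) # nearest words in embedding space
```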
Similarly, the sentence representation model skip-thought was introduced by Kiros et al. [52]. It uses an encoder-decoder architecture: a sentence is encoded into a representation space, and two decoders then try to predict the previous and following sentences of the original text.
Text processing pipelines often consist of multiple blocks that can be pretrained independently. Word representations are used as the basis for sentence representations, which can later be used to derive representations of larger documents. This hierarchical structure makes transfer learning easier, with the possibility to reuse each block, and avoids the need for very large corpora of annotated data for every task. The pretraining of these blocks is often done in an unsupervised manner, using only the context in which a word or a sentence appears.

Table of contents :

1 introduction 
1.1 Context
1.2 Computer vision, image and video understanding
1.3 Extending computer vision using semantic
1.4 Contribution
1.5 Industrial context
2 literature review 
2.1 Statistical learning
2.1.1 Supervised learning
2.1.2 Loss functions
2.1.3 Neural networks
2.2 Mono-modal representation
2.2.1 Computer vision and image representation
2.2.2 Computer vision datasets
2.2.3 Natural language processing and text representation
2.3 Multi-modal representation
2.3.1 Multimodal fusion
2.3.2 Visual semantic embeddings
2.4 Attention mechanism
2.5 Localization
2.6 Positioning
3 visual semantic embedding 
3.1 Introduction
3.2 Visual Semantic Embedding
3.2.1 Textual path
3.2.2 BEAN Visual path
3.2.3 SMILE visual path
3.2.4 Learning and loss function
3.2.5 Re-ranking
3.3 Retrieval experiments
3.3.1 Training
3.3.2 Cross-modal retrieval
3.3.3 Discussion
3.4 Ablation and model understanding
3.4.1 BEAN: Changing pooling
3.4.2 SMILE: Impact of self-attention
3.4.3 Further analysis
3.5 Conclusion
4 application to localization 
4.1 Introduction
4.2 Localization from visual semantic embedding
4.2.1 BEAN: Weakly supervised localization
4.2.2 SMILE: Object region to localization using Visual Semantic Embedding (VSE)
4.3 Experiments
4.3.1 The pointing game
4.3.2 Further analysis
4.4 Conclusion
5 ranking loss function 
5.1 Introduction
5.2 Related works
5.3 SoDeep approach
5.3.1 Learning a sorting proxy
5.3.2 SoDeep Training and Analysis
5.4 Differentiable Sorter based loss functions
5.4.1 Spearman correlation
5.4.2 Mean Average Precision (mAP)
5.4.3 Recall at K
5.5 Experimental Results
5.5.1 Spearman Correlation: Predicting Media Memorability
5.5.2 Mean Average precision: Image classification
5.5.3 Recall@K: Cross-modal Retrieval
5.6 Discussion
5.7 Conclusion
6 general conclusion 
6.1 Summary of contributions
6.2 Perspectives and future work
bibliography 
