Natural language processing and text representation
Natural language processing shares a similar goal with computer vision: capturing a high-level understanding of its input, here digital text. In its early days, Natural Language Processing (NLP) adopted a rule-based paradigm, using hand-written grammars and heuristic rules.
In this section we focus on the more recent statistical learning paradigm and on the use of deep learning models to derive meaningful vector representations of any given text. Unlike in computer vision, a single word in language already encodes high-level semantics, whereas a single pixel is meaningless. Text also has a sequential structure: in a sentence, the order of the words carries important information. The methods used to compute textual representations must therefore take these differences into account and depart from the ones used in computer vision.
Word and sentence representation
Learning textual representations is often unsupervised and relies only on large corpora of text with no extra annotation. Word and sentence representations are learned using sequence models trained to predict the context of a given word or sentence. For example, the Word2vec (W2v) word representation proposed by Mikolov et al. uses the skip-gram architecture. Following the intuition that words used in similar contexts should have similar representations, the model aims, given the representation of a word, at predicting the surrounding words in a text. After seeing the same word in many different contexts, the model converges toward a representation of the word that captures some of its semantics. Similarly, the sentence representation model skip-thought was introduced by Kiros et al. It uses an encoder-decoder architecture: a sentence is encoded into a representation space, and two decoders then try to predict the previous and the following sentence of the original text.
Text processing pipelines often consist of multiple blocks that can be pre-trained independently. Word representations are used as the basis for sentence representations, which can later be used to derive representations of large documents. This hierarchical structure makes transfer learning easier: each block can be reused, which avoids the need for very large corpora of annotated data for every task. The pre-training of these blocks is often done in an unsupervised manner, using only the context in which a word or a sentence appears.
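As an illustration of the skip-gram idea described above, the following minimal sketch shows how (center word, context word) training pairs could be extracted from a sentence. The function name `skipgram_pairs` and the `window` parameter are hypothetical choices made for this example; the sketch only covers pair extraction and not the training of the embedding itself.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A skip-gram model would then be trained to predict each context word from the representation of its center word, so that words appearing in similar contexts end up with similar vectors.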
Although visual and textual representations share the goal of finding a rich representation, we have seen that very different approaches are required to capture the specificities of each modality efficiently. The textual modality has a strong structure and a limited vocabulary, while the visual modality, with its very large input space, is extremely diverse.
Combining the visual and textual modalities is key to solving certain tasks and can more generally help create better mono-modal representations. Visual information can help disambiguate textual representations, while textual representations can add semantic and common-sense knowledge to visual representations.
Indeed, creating interactions between visual and textual concepts while learning both representations can help CNNs reduce their bias toward texture and take advantage of context and relations between objects. Conversely, grounding language in the visual world might help language representations deal with ambiguities and world modeling, by associating objects more easily with their common disambiguating relationships and references.
Visual Semantic Embedding
The overall structure of the proposed approach, shown in Figure 3.1, follows the dual-path encoding architecture of Kiros, Salakhutdinov, and Zemel, shown in Figure 2.4. We first explain its specifics before turning to its training with a cross-modal triplet ranking loss.
In this section we introduce two new architectures. Both share the same textual pipeline and differ only in their visual path. The first is spatially supervised and based on a Faster R-CNN object detector (see Figure 2.6). The second relies only on a ResNet (see Figure 2.1) pretrained on classification; it can easily be trained end to end and preserves spatial information until the joint embedding.
SMILE embedding with distributed self-attention
Our main contribution consists in a vectorial, distributed self-attention mechanism in the visual path. Lightweight and efficient, it is designed to process multi-region features. By predicting multidimensional attention scores instead of just scalars, it allows for a finer re-weighting of the different regions of interest and, as a consequence, produces a richer image embedding. The benefit of this novel attention mechanism shows first in text-image matching, but also in visual grounding. As an additional, stand-alone improvement of our system, we introduce a simple yet effective asymmetric re-ranking (RR) technique that takes full advantage of the multi-modality for improved retrieval performance.
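To make the difference between scalar and multidimensional attention scores concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the SMILE implementation: the function names are hypothetical, the scores are passed in directly rather than predicted by a network, and the normalization is assumed to be a softmax over regions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_attention_pool(features, scores):
    """features: R regions x D dims; scores: one scalar per region.
    Every dimension of a region receives the same weight."""
    w = softmax(scores)
    dim = len(features[0])
    return [sum(w[r] * features[r][d] for r in range(len(features)))
            for d in range(dim)]


def vector_attention_pool(features, scores):
    """features: R x D; scores: R x D (one score per region and dimension).
    The softmax is taken over regions independently for each dimension
    (an assumption of this sketch)."""
    dim = len(features[0])
    pooled = []
    for d in range(dim):
        w = softmax([row[d] for row in scores])
        pooled.append(sum(w[r] * features[r][d] for r in range(len(features))))
    return pooled


# Two regions, two feature dimensions (toy values).
features = [[1.0, 4.0], [3.0, 0.0]]
uniform = scalar_attention_pool(features, [0.0, 0.0])
# Dimension 0 attends to region 0, dimension 1 attends to region 1.
selective = vector_attention_pool(features, [[10.0, -10.0], [-10.0, 10.0]])
print(uniform, selective)
```

With scalar scores a region is either emphasized or suppressed as a whole, whereas with vector scores each embedding dimension can draw on a different region, which is the finer re-weighting described above.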
The BEAN and SMILE models are both compared to similar existing architectures. Our models are quantitatively evaluated on a cross-modal retrieval task. Given a query image (resp. a caption), the aim is to retrieve the corresponding captions (resp. image). Since MS-COCO and Flickr-30K contain 5 captions per image, recall at r (“R@r”) for caption retrieval is computed based on whether at least one of the correct captions is among the first r retrieved ones. For MS-COCO, the evaluation is performed 5 times on 1000-image subsets of the test set and the results are averaged (5-fold 1k evaluation protocol).
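The caption-retrieval recall described above can be sketched as follows. This is a minimal illustration, assuming a precomputed similarity matrix in which the captions of image i occupy consecutive columns; the function name and the toy values are hypothetical, and the toy example uses 2 captions per image instead of 5 to stay short.

```python
def caption_recall_at_r(sim, captions_per_image, r):
    """sim[i][j] is the similarity between image i and caption j.
    Captions of image i are assumed to sit at columns
    i * captions_per_image ... (i + 1) * captions_per_image - 1.
    A query counts as a hit if at least one correct caption is in the top r."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        correct = set(range(i * captions_per_image,
                            (i + 1) * captions_per_image))
        if correct.intersection(ranked[:r]):
            hits += 1
    return hits / len(sim)


# Toy example: 2 images, 2 captions each.
sim = [
    [0.9, 0.1, 0.5, 0.2],  # image 0: best caption is one of its own
    [0.8, 0.7, 0.3, 0.6],  # image 1: first correct caption is ranked third
]
r_at_1 = caption_recall_at_r(sim, captions_per_image=2, r=1)
r_at_3 = caption_recall_at_r(sim, captions_per_image=2, r=3)
print(r_at_1, r_at_3)
```

Under the 5-fold 1k protocol, this quantity would be computed on each 1000-image subset and the five values averaged.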
Results for BEAN are reported in Table 3.1. We compare our model with similar methods. For caption retrieval, we surpass VSE++ by (5.2%, 0.9%) on (R@1, R@10) in absolute terms, and by (3.9%, 2.0%) for image retrieval. Three other methods are also available online: 2-Way Net, LayerNorm and Embedding network. These three methods are based on a VGG network, while VSE++, which uses a ResNet, reports much stronger performance. We consistently outperform all similar models, especially in terms of R@1.
The most significant improvement comes from the use of hard negatives in the loss: without them, recall scores are significantly lower (R@1 for caption retrieval: -20.3%, for image retrieval: -16.3%). Note that in , the test images are scaled such that the smaller dimension is 256 and centrally cropped to 224 × 224. Our best results are obtained with a different strategy: images are resized to 400 × 400 at inference irrespective of their size and aspect ratio, which our fully convolutional visual pipeline allows. When using the scale-and-crop protocol instead, the recalls of our system are reduced by approximately 1.4% on average over the two tasks, remaining above VSE++ but by a smaller margin. For completeness we tried our strategy with VSE++, but it proved counterproductive in this case.
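The role of hard negatives in a triplet ranking loss can be sketched as below. The exact formulation used for training is not given in this passage, so this is an assumption: a bidirectional hinge loss over an in-batch similarity matrix, where the hard-negative variant keeps only the most violating negative per query and the alternative sums over all negatives. The function name, the margin value, and the toy similarities are hypothetical.

```python
def triplet_ranking_loss(sim, margin=0.2, hard_negatives=True):
    """sim[i][j]: similarity between image i and caption j in a batch;
    matching pairs lie on the diagonal. Requires at least 2 pairs."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Non-matching captions act as negatives for image i.
        cap_costs = [max(0.0, margin - pos + sim[i][j])
                     for j in range(n) if j != i]
        # Non-matching images act as negatives for caption i.
        img_costs = [max(0.0, margin - pos + sim[j][i])
                     for j in range(n) if j != i]
        if hard_negatives:
            # Only the hardest (most violating) negative contributes.
            total += max(cap_costs) + max(img_costs)
        else:
            # Every violating negative contributes.
            total += sum(cap_costs) + sum(img_costs)
    return total


sim = [
    [0.90, 0.80, 0.85],
    [0.20, 0.70, 0.30],
    [0.30, 0.10, 0.60],
]
loss_hard = triplet_ranking_loss(sim, margin=0.2, hard_negatives=True)
loss_sum = triplet_ranking_loss(sim, margin=0.2, hard_negatives=False)
print(loss_hard, loss_sum)
```

In the summed variant, many easy negatives with small violations can dominate the gradient, whereas the hard-negative variant concentrates the update on the single most confusing candidate for each query.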
Ablation and model understanding
To better understand both proposed models, we conduct additional experiments and ablation studies to identify which parts of the models contribute to the overall performance.
BEAN: Changing pooling
One of the key elements of the proposed BEAN architecture is the final pooling layer, adapted from Weldon. To see how much this choice contributes to the performance of the model, we replaced it with Global Average Pooling (GAP). With this single modification, the model is trained following exactly the same procedure as the original one. The results are worse: for caption retrieval (resp. image retrieval), R@1 drops by 5.3% (resp. 4.7%), and accuracy in the pointing game drops by 1.1%.
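The contrast between the two pooling strategies can be illustrated on a single spatial activation map. The sketch below is an assumption about the general shape of a Weldon-style selective pooling (averaging the k highest and the k lowest activations); the precise layer used in BEAN may differ, and the function names and toy map are hypothetical.

```python
def global_average_pool(activation_map):
    """Average every spatial location of a 2D activation map."""
    values = [v for row in activation_map for v in row]
    return sum(values) / len(values)


def weldon_style_pool(activation_map, k=1):
    """Mean of the k highest plus mean of the k lowest activations
    (assumed form of a Weldon-style pooling for this sketch)."""
    values = sorted(v for row in activation_map for v in row)
    top = values[-k:]
    bottom = values[:k]
    return sum(top) / k + sum(bottom) / k


# A map with one strongly activated location and one negative location.
activation_map = [
    [0.0, 0.1, 0.0],
    [0.1, 5.0, 0.1],
    [0.0, 0.1, -1.0],
]
gap = global_average_pool(activation_map)
selective = weldon_style_pool(activation_map, k=1)
print(gap, selective)
```

GAP dilutes a localized response across all locations, while the selective pooling keeps the response tied to a few specific locations, which is consistent with its use for localization in the pointing game.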
SMILE: Impact of self-attention
To measure the impact of the self-attention in SMILE, we evaluate our model against two baselines. First, we visualize its effect using the visual grounding capability of the model, then we use cross-modal retrieval to measure performance differences.
Table of contents :
1.2 Computer vision, image and video understanding
1.3 Extending computer vision using semantic
1.5 Industrial context
2 literature review
2.1 Statistical learning
2.1.1 Supervised learning
2.1.2 Loss functions
2.1.3 Neural networks
2.2 Mono-modal representation
2.2.1 Computer vision and image representation
2.2.2 Computer vision datasets
2.2.3 Natural language processing and text representation
2.3 Multi-modal representation
2.3.1 Multimodal fusion
2.3.2 Visual semantic embeddings
2.4 Attention mechanism
3 visual semantic embedding
3.2 Visual Semantic Embedding
3.2.1 Textual path
3.2.2 BEAN Visual path
3.2.3 SMILE visual path
3.2.4 Learning and loss function
3.3 Retrieval experiments
3.3.2 Cross-modal retrieval
3.4 Ablation and model understanding
3.4.1 BEAN: Changing pooling
3.4.2 SMILE: Impact of self-attention
3.4.3 Further analysis
4 application to localization
4.2 Localization from visual semantic embedding
4.2.1 BEAN: Weakly supervised localization
4.2.2 SMILE: Object region to localization using Visual Semantic Embedding (VSE)
4.3.1 The pointing game
4.3.2 Further analysis
5 ranking loss function
5.2 Related works
5.3 SoDeep approach
5.3.1 Learning a sorting proxy
5.3.2 SoDeep Training and Analysis
5.4 Differentiable Sorter based loss functions
5.4.1 Spearman correlation
5.4.2 Mean Average Precision (mAP)
5.4.3 Recall at K
5.5 Experimental Results
5.5.1 Spearman Correlation: Predicting Media Memorability
5.5.2 Mean Average precision: Image classification
5.5.3 Recall@K: Cross-modal Retrieval
6 general conclusion
6.1 Summary of contributions
6.2 Perspectives and future work