

Visual Question Answering

This trend of interlinking image and text modalities within a joint learning process, whether for alignment or data generation, has opened perspectives on deeper image and language understanding problems for ML researchers. In this thesis, we focus our efforts on what is among the most challenging vision and language applications: Visual Question Answering (VQA). It consists in building systems capable of answering any natural language question about any image (see Figure 1.4). VQA involves a high-level understanding of multiple modalities, in a context where neither image nor text can be considered independently of the other. The system should model the relations between an image and a textual question, and extract meaningful interactions that can help provide an answer. Interestingly, VQA falls within the broader field of human-machine interaction, and is a first step towards vision-based interactive systems.
The VQA problem was formulated for the first time, to our knowledge, in (Malinowski et al. 2014a). As noted later in (Malinowski et al. 2014b; Antol et al. 2015), answering questions about images involves addressing multiple non-trivial issues. The system should be able to understand a large and varied set of semantic concepts, from both perceptual and linguistic modalities. Moreover, as the number of understood concepts grows, semantic boundaries between some of those concepts may become ambiguous and fuzzy, which should be accounted for in the model. Besides, a VQA model is also expected to have commonsense knowledge about the world. This capacity may be necessary to answer questions about what use to make of an object. Other complex types of questions require high-level scene understanding or visual reasoning capacities (Johnson et al. 2017a). They include object detection, fine-grained recognition, attribute identification, visual relationship recognition, counting, comparing objects with respect to certain characteristics, or even performing logical operations. In the example shown in Figure 1.4, the question “Is the lady with the blue fur wearing glasses?” calls for these types of visual reasoning capacities. Finally, the problem of quantifying the performance of VQA models is not straightforward. As the goal is to mimic human response, it is necessary to deal with ambiguities, which can stem from many phenomena that are inherent to human judgement. For all these reasons, answering questions about images constitutes a major challenge for researchers in ML, and more generally in AI.
Most of the research conducted in VQA involves DL techniques, for their effectiveness and their ability to leverage large quantities of data. For each modality, representations are provided by powerful models, which may have been previously trained to understand the semantics behind the data they encode (see Section 2.2). For the image representation, a ConvNet provides a vector (or possibly many) that contains information about the image content, the different objects that are depicted in the picture, and the attributes each one carries. As for the question representation, a recurrent model reads the sentence and computes another vector that incorporates information about the words and their contexts. Designing a VQA model consists in finding the structures that are appropriate to understand each modality with respect to the other, learn the relevant interactions between image and question representations, and predict the correct answer.

Contributions

In this thesis, we tackle the problem of VQA from a DL perspective. We first address the issue of learning a multi-modal fusion module, central to VQA, that merges vectors by extracting their relevant correlations. In particular, we focus our efforts on the powerful solution provided by bilinear models, and study them from a tensor viewpoint. This constitutes the work we present in Chapter 3 and Chapter 4. At a higher level, answering questions about images requires more than simple multi-modal fusions. The different objects, their visual appearance, how they interact with each other, the spatial layout in which they are arranged, etc., should be understood by the model. In Chapter 5, we explore modeling the structure in the representation of the visual scene, and thus mimic some type of visual reasoning within the VQA system architecture itself.
This thesis is based on the material published in the following papers:
Hedi Ben-Younes*, Rémi Cadène*, Nicolas Thome, and Matthieu Cord (2017). “MUTAN: Multimodal Tucker Fusion for Visual Question Answering”. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV);
Hedi Ben-Younes, Rémi Cadène, Nicolas Thome, and Matthieu Cord (2019). “BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection”. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI);
Hedi Ben-Younes*, Rémi Cadène*, Nicolas Thome, and Matthieu Cord (2019). “MUREL: Multimodal Relational Reasoning for Visual Question Answering”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Industrial context

This thesis was carried out in collaboration with Heuritech, a French startup specialized in social media analysis for the fashion industry. It develops systems that extract information from user-generated posts, and aggregates this information into interactive dashboards for brands and retailers. Automatic image understanding is at the core of the company’s technology, and constitutes its principal research focus.
Most of the social media content that is relevant for the fashion industry is visual. It mainly consists of images, posted by influencers or ordinary users, that focus on garments. Analyzing these pictures involves detecting the fashion-related objects, understanding their nature, finding their attributes (such as color or texture), and sometimes even identifying the exact brand and model name of the product. DL systems are an appealing approach for their performance, robustness and flexibility. This is why the research effort on visual recognition at Heuritech is mostly directed towards DL.
Even if the visual modality provides the most direct information for fashion-related content, viewing a social media post only as a picture may be restrictive. A caption is often associated with the image, with user-defined hashtags. Other users can show interest through likes and comments. Sometimes metadata can be retrieved, such as the geolocation or the time at which the picture was taken. All these other signals may carry information in the light of which the image content could be understood. For Heuritech, exploring and developing methods for merging several modalities within ML systems is of high interest, and has been studied in the work of this thesis.

RELATED WORKS

As we stated in the introduction, deep multi-modal representations have recently been developed for numerous purposes. The work of this thesis is focused on Visual Question Answering (VQA), which consists in designing and training Machine Learning (ML) models to answer any free-form question about any natural image. Architectures for VQA usually follow a generic template, depicted in Figure 2.1. Mono-modal encoders first provide high-level representations of visual and linguistic data. They constitute input modules to the actual VQA system, designed to fuse both modalities and reason about their interactions in order to provide an answer.
Learning to fuse both image and question representations to predict the answer is actually the core of Deep Learning (DL)-based VQA systems. Two functional components can be distinguished from each other, as they operate at different architectural levels. The final system performance heavily depends on these two components, and together they account for almost all of the research on VQA models. First, the multi-modal fusion component operates at the vector level. It aims at learning an elementary function that takes two vectors as input and provides an output that extracts the relevant interactions between its inputs. The second component is the architectural design itself, which is often referred to as visual reasoning. It expresses the high-level capacities of the model, and conveys the inductive biases that the model can exploit using the training data. For instance, some models are designed to focus their attention on a subset of the image before predicting the answer. Others iterate multiple times over the image to refine their internal understanding of the scene, or can benefit from pairwise relations between objects, etc.
In this chapter, we review the related work that is relevant to studying and building VQA systems. First, in Section 2.1 we give the generic setup for training a VQA architecture. As visual and linguistic representations constitute elementary modules used in all the VQA architectures, we review their design and training in Section 2.2. Then, we relate how the crucial problem of multi-modal fusion for VQA is tackled in Section 2.3. In Section 2.4, we review the different model structures, and how they induce behaviours that are akin to some types of reasoning processes. The different datasets that we use throughout this thesis are presented in Section 2.5. Finally, we present our contributions and position them with respect to the existing literature in Section 2.6.

VQA architecture

In Figure 2.1 we show the architecture of a classical VQA system. An image representation (blue rectangle) is provided by a deep visual feature extractor that yields one or many vectors. In parallel, a textual representation (red rectangle) is the output of a language model that goes through the question. Then, both these representations are merged (green rectangle), with possibly complex strategies based on multi-modal fusion and high-level reasoning schemes such as iterative processing or visual attention mechanisms. Finally, a prediction module (gray rectangle) provides its estimation of the answer to the question. The modules that compose the VQA system are usually designed to be end-to-end trainable on a dataset $\mathcal{D} = \{(v_i, q_i, a_i)\}_{i=1:N}$. It contains ground-truth data where the question $q_i$ on the image $v_i$ has answer $a_i \in \mathcal{A}$.
A VQA model can be seen as a parametric function $f_\Theta$ that takes (image, question) pairs as input and yields an answer prediction. Using the training data $\mathcal{D}$, we can define an empirical loss function that quantifies how far the predictions of $f_\Theta$ are from the true answers:
$$\mathcal{L}(\Theta, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} l(f_\Theta(v_i, q_i), a_i) \tag{2.1}$$
where $l$ measures the difference between the model prediction $f_\Theta(v_i, q_i)$ and the ground truth $a_i$.
As a result of the free-form answer annotation process, answers in $\mathcal{A}$ are possibly composed of multiple words. For this reason, early attempts (Malinowski et al. 2016; Gao et al. 2015) model the answer space as sentences, and learn to sequentially decode each word of the true answer. However, the most widely adopted framework to represent the answer space is classification. In this setup, the scope of possible answers is fixed, each answer corresponds to a class, and the model computes a probability distribution over the set of classes given an image/question pair. As proposed in (M. Ren et al. 2015; Ma et al. 2016), the classes are obtained by taking the most frequent answers in the training set, regardless of whether they contain one or multiple words.
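The construction of the answer classes described above can be sketched in a few lines. This is only an illustration: the function name `build_answer_vocabulary`, the toy answer list and the number of classes are hypothetical, and real systems typically normalize answers (casing, punctuation) before counting.

```python
from collections import Counter

def build_answer_vocabulary(train_answers, num_classes):
    """Map the `num_classes` most frequent training answers to class indices."""
    counts = Counter(train_answers)
    # most_common returns (answer, count) pairs by decreasing frequency.
    classes = [answer for answer, _ in counts.most_common(num_classes)]
    return {answer: index for index, answer in enumerate(classes)}

# Toy training answers (hypothetical); multi-word answers are single classes.
train_answers = ["yes", "no", "yes", "2", "yes",
                 "on the table", "no", "on the table", "red"]
answer_to_class = build_answer_vocabulary(train_answers, num_classes=3)
```

Answers that fall outside the retained classes (here "2" and "red") cannot be predicted by the classifier, which is the price paid for a fixed answer space.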
Following this setup, the VQA model outputs a probability distribution over possible answers $f_\Theta(v, q) \in [0, 1]^{|\mathcal{A}|}$, where each coordinate contains the estimated probability of the corresponding answer. To train the model, we use the cross-entropy loss function defined as:
$$l(f_\Theta(v_i, q_i), a_i) = -\log f_\Theta(v_i, q_i)[a_i] \tag{2.2}$$
The goal of the training stage is to find the parameters $\Theta^\star$ that minimize the empirical loss $\mathcal{L}(\Theta, \mathcal{D})$:
$$\Theta^\star = \arg\min_{\Theta} \mathcal{L}(\Theta, \mathcal{D}) \tag{2.3}$$
In order to prevent the network from overfitting the training dataset $\mathcal{D}$, we use an external validation set $\mathcal{V}$ to apply the early-stopping strategy. It consists in learning on the training set until the empirical loss stops decreasing on $\mathcal{V}$. This technique, widely used in DL, acts as a regularizer.
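As a minimal sketch of the two ingredients above, the following code computes the cross-entropy of a predicted distribution and selects the epoch retained by early stopping from a list of validation losses. The averaging over examples, the `patience` parameter and all numerical values are assumptions made for the illustration, not settings taken from this thesis.

```python
import math

def cross_entropy(probabilities, answer_index):
    """Per-example loss: minus the log-probability given to the true answer."""
    return -math.log(probabilities[answer_index])

def empirical_loss(predictions, answers):
    """Average of the per-example losses over a dataset."""
    losses = [cross_entropy(p, a) for p, a in zip(predictions, answers)]
    return sum(losses) / len(losses)

def early_stopping_epoch(validation_losses, patience=1):
    """Return the epoch with the best validation loss seen before stopping.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs (hypothetical stopping criterion).
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Two toy predictions over three answer classes, with their true classes.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
answers = [0, 1]
loss = empirical_loss(predictions, answers)

# Validation loss per epoch: it stops decreasing after epoch 2.
best = early_stopping_epoch([0.9, 0.6, 0.5, 0.55, 0.7], patience=1)
```

In practice the parameters saved at the selected epoch are the ones kept for evaluation.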

Mono-modal representations

We aim at building models that are able to answer questions about images. Thus, a VQA model takes as input an image and a question, which are processed by mono-modal modules. The nature of these visual and textual representations has a direct impact on the design and performance of a VQA system.

Image representation

Since the success of DL methods at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al. 2015) in 2012, these architectures have kept improving the state of the art on a large part of Computer Vision (CV) tasks: image classification, instance and semantic retrieval, semantic segmentation, object detection, and others. At the very basis of DL is the feedforward neural network. This model progressively maps raw inputs (e.g. image pixels) to outputs (e.g. a distribution over classes) through multiple transformation layers, usually consisting in an affine projection followed by a non-linear activation function. A model that stacks many layers one after the other is referred to as a Deep Neural Network (DNN), and is able to perform highly non-linear complex mappings. The transformation performed by each layer is parametrized, and the optimal parameter values are obtained by minimizing a problem-dependent loss function. A DNN being differentiable, optimizing its parameters is usually done by gradient-based methods such as Stochastic Gradient Descent (SGD). Computing the gradient of the loss function with respect to each parameter of the DNN is almost exclusively done with the backpropagation algorithm (LeCun et al. 1989).
Convolutional Neural Networks (ConvNets) (Fukushima 1980; LeCun et al. 1989; Krizhevsky et al. 2012) are a special kind of neural network where the linear operation within each layer is a convolution, which makes these architectures well suited to process image data. Indeed, convolutions provide an effective way to share parameters between local feature extractors that go through the whole image, taking into account spatial coherence of the visual content. Similarly to classical DNNs, deep ConvNets stack multiple convolution layers separated by non-linear activation functions such as the Rectified Linear Unit (ReLU) (Krizhevsky et al. 2012). Local pooling operations may also be integrated in the architecture, which makes the representations invariant to local perturbations, and provides control over their spatial size. As these models typically contain a huge number of parameters, large datasets are required to train them. This is why these deep ConvNets are usually trained on ImageNet (Russakovsky et al. 2015), a dataset of 1 million images, each manually assigned to a label from a vocabulary of a thousand classes.
Collecting and labelling data is costly, which may limit the size of available data for a given task. However, a network trained on ImageNet (Russakovsky et al. 2015) has been shown to provide image representations that are generic enough to transfer to other tasks (Razavian et al. 2014; Azizpour et al. 2016). This property makes it possible to pre-train a ConvNet on ImageNet and slightly modify its weights (fine-tuning) to adapt it to a new dataset, for which we have less labelled data.
In many VQA architectures, the image is represented using a pre-trained network as a feature extractor. Each image is presented at the input of the network, and the forward pass is computed up until the penultimate layer, which outputs an internal representation that the ConvNet constructs for this image. This vector is used to characterize the image, and will be passed to the question answering system. In early VQA works, this single vector approach was very popular for its simplicity (Malinowski et al. 2016; Gao et al. 2015; M. Ren et al. 2015; Ma et al. 2016; Kim et al. 2016).
Incorporating spatial information. Unfortunately, information about spatial layout is hardly reachable from this single vector approach. Many questions in VQA may involve a fine understanding of the scene, and require manipulating spatial concepts such as on top of, left, right, etc. This is why modern VQA systems use more than a single vector to represent the image. One fairly simple technique is the Fully Convolutional Network (FCN) approach (Long et al. 2015; He et al. 2016). Instead of yielding a single vector that represents the whole image, an FCN preserves the spatial information throughout the network and provides a set of spatially-grounded representation vectors organized in a 2-d grid (see the left image in Figure 2.2). All the vectors are computed simultaneously, in a single forward pass. As has been done in (Long et al. 2015), one can easily transform a regular ConvNet into an FCN by increasing the size of input images and reshaping all the linear projection matrices into $1 \times 1$ convolutions. These FCN features have been extensively used in VQA as bags of vectors (Z. Yang et al. 2016; Fukui et al. 2016; Kim et al. 2017; Z. Yu et al. 2017). In rarer cases, the grid structure in the representation is leveraged by the VQA model to increase the dependence of the output on the spatial layout of the image (Xiong et al. 2016; Z. Chen et al. 2017).
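To make the link between a linear projection and a convolution with a kernel of spatial size one concrete, the sketch below applies the same projection at every location of a feature grid and flattens the result into a bag of vectors. It uses NumPy with random values; the dimensions (8 input channels, a 14 by 14 grid, 5 output dimensions) are arbitrary assumptions, far smaller than in a real ConvNet.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, height, width, out_dim = 8, 14, 14, 5

# Convolutional feature map of one image: one vector per grid location.
feature_map = rng.normal(size=(channels, height, width))

# Weights and bias of a linear projection (hypothetical values).
W = rng.normal(size=(out_dim, channels))
b = rng.normal(size=out_dim)

# Convolution with a kernel of spatial size one: the same linear
# projection is applied independently at every location of the grid.
grid = np.einsum("oc,chw->ohw", W, feature_map) + b[:, None, None]

# Bag of vectors: the grid is flattened into height * width vectors.
bag = grid.reshape(out_dim, height * width).T
```

Here `bag` contains 196 vectors, one per cell of the grid, and each of them equals the linear projection of the corresponding local feature vector.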
Since 2017, the standard in visual representations for VQA has been the bottom-up features (Anderson et al. 2018). As illustrated in Figure 2.2, the fixed-grid structure is replaced by a set of object-focused regions. The bottom-up mechanism, based on Faster-RCNN (S. Ren et al. 2015), proposes image regions and associates each one with a representation vector. This model is trained on a separate dataset to detect objects, and to predict their class as well as their attributes such as color, texture, size, etc. Objects are detected in two stages. First, a small network called Region Proposal Network (RPN) is slid over convolutional features at an intermediate level of the ConvNet, and predicts a class-agnostic objectness score for several anchor boxes. After a step of Non-Maximum Suppression (NMS) with an Intersection over Union (IoU) threshold, the top boxes are kept to be used as input for the second stage. In this stage, features that correspond to each region are extracted with a method called Region of Interest (RoI) pooling, and these vectors are used to classify each box proposal. In both stages, a refinement of the bounding box coordinates is also learnt. At the time of writing, these features are the ones that provide the best results in VQA (Y. Zhang et al. 2018; Jiang et al. 2018; Kim et al. 2018). Moreover, using these representations is also time-efficient: the typical number of regions per image is 36 for the bottom-up features, whereas it can reach 196 for FCN grids of size $14 \times 14$.
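The NMS step mentioned above can be illustrated with a small greedy implementation. This is a generic sketch of the procedure, not the code used in Faster-RCNN: the box format `(x1, y1, x2, y2)`, the threshold value and the example boxes are assumptions, and boxes are supposed to have a non-zero area.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

def non_maximum_suppression(boxes, scores, iou_threshold=0.5, top_k=None):
    """Greedily keep the best-scored boxes, dropping near-duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # A box survives only if it does not overlap too much with a
        # better-scored box that has already been kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept[:top_k]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores, iou_threshold=0.5)
```

The second box overlaps the first one with an IoU of 81/119, which is above the threshold, so only the first and third boxes are kept.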
In our work, we propose several deep models based on ConvNet representations for their simplicity and compactness, and on FCN features to compare against leading methods (Chapter 3). As the work on bottom-up features (Anderson et al. 2018) was published in 2017, we use them in our contributions that came after and that are presented in Chapter 4 and Chapter 5.


Textual embedding

In classical VQA systems, a sentence encoder provides an algebraic representation of the question. This representation should encode precise and fine-grained semantic information about questions of variable length. Multiple models exist for such encoders, with different complexity and expressivity. Their choice and design hold a critical position in question understanding, and in the final performance of the VQA model.
To manipulate texts in natural language, we first need to define the atomic linguistic element. We could consider characters, words, bi-grams of words, etc. In the context of VQA, the usual atomic linguistic unit is the word. Before representing arbitrarily long sentences, we need to define how words can be processed by ML models.
Word representation. The simplest way to represent a word is by its one-hot encoding. Given a finite list of words that constitute a vocabulary $\mathcal{W}$, each word $w$ is assigned to an integer index $i_w$. The one-hot encoding of a word $w$ in the vocabulary $\mathcal{W}$ is a binary vector $v_w$ whose size is the same as $\mathcal{W}$, and where the $k$-th dimension is defined as follows:
$$v_w[k] = \begin{cases} 1 & \text{if } k = i_w \\ 0 & \text{otherwise} \end{cases} \tag{2.4}$$
This very high-dimensional vector is usually substituted by the more compact word embeddings, which provide a learnable representation of words. Each word $w$ is assigned to a vector of parameters $x_w \in \mathbb{R}^d$, referred to as the embedding of $w$. The dimension $d$ is a hyperparameter, whose typical value is between 50 and 500. These vectors are initialized randomly, or using pre-trained models such as Word2Vec (Mikolov et al. 2013) or GloVe (Pennington et al. 2014). Depending on the task on which these vectors have been trained, semantic and syntactic properties of words can be captured in their associated embedding. In particular, the Euclidean distance between two embedding vectors reflects some type of semantic similarity between their associated words.
Bag of words. One of the simplest ways to represent an arbitrarily long sentence as a fixed-size vector is to view the sentence as a bag of words. The first step is to tokenize the text into a list of elementary language units (in our case, they are words): $q = [w_1, \dots, w_T]$. Then the bag of words representation simply averages the word embeddings:
$$\frac{1}{T} \sum_{t=1}^{T} x_{w_t} \tag{2.5}$$
This representation has been used in early works of VQA (Antol et al. 2015; H. Xu et al. 2016; Zhou et al. 2015). In (Shih et al. 2016), a variant of this model separates different parts of information by splitting the question into 4 bins: the first bin contains the first two words of the question, the second bin contains the nominal subject, the third is composed of all other noun words, and the last one is made of all the remaining words. Word vectors are averaged within each bin, and all the 4 vectors are concatenated.
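The word and sentence representations described so far can be sketched with NumPy as follows. The five-word vocabulary, the embedding dimension and the random embedding values are assumptions made for the illustration; in a real system the embeddings would be learned or loaded from a pre-trained model.

```python
import numpy as np

vocabulary = ["what", "color", "is", "the", "cat"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def one_hot(word):
    """Binary vector of the size of the vocabulary, with a single 1."""
    v = np.zeros(len(vocabulary))
    v[word_to_index[word]] = 1.0
    return v

rng = np.random.default_rng(0)
d = 4
# Embedding matrix: the row of index i_w stores the embedding x_w.
embeddings = rng.normal(size=(len(vocabulary), d))

def bag_of_words(question):
    """Average the embeddings of the words of a whitespace-tokenized question."""
    tokens = question.lower().split()
    return np.mean([embeddings[word_to_index[w]] for w in tokens], axis=0)

q1 = bag_of_words("what color is the cat")
q2 = bag_of_words("the cat is what color")
```

Multiplying a one-hot vector by the embedding matrix selects the corresponding row, which is why embeddings are usually implemented as a table lookup. The two questions above receive exactly the same vector, since averaging ignores word order.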
Recurrent networks. While these bag of words models are easy to implement, their effectiveness is limited by the fact that word order is not taken into account. More elaborate models are required to learn a fixed-dimensional representation of variable-length sequences. Recurrent Neural Networks (RNNs) (Elman 1990; Bodén 2001) have been developed to model time dependencies in sequences. In particular, they have been used to represent sentences (Mikolov et al. 2010) as they provide order-dependent representations. These networks operate over an input space, e.g. the word vectors, and an internal state that summarizes what has been processed by the network so far. Given a sequence of word vectors $[x_1, \dots, x_T]$, the RNN iteratively updates its internal hidden state $h$ using a simple transformation:
$$h_t = f(W_{h \to h} h_{t-1} + W_{x \to h} x_t) \tag{2.6}$$
where $f$ is a non-linear activation function. Additionally, an output layer can provide predictions for each timestep $y_t = g(W_{h \to y} h_t)$, where $g$ is a problem-dependent activation function. The parameters of an RNN are trainable end-to-end with backpropagation. The output vectors $[y_1, \dots, y_T]$ can be used to calculate a loss with respect to some ground-truth value, or they can also be forwarded to another neural network. In Figure 2.3, we show different possible types of input-output designs summarized in (Karpathy 2015).
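A minimal NumPy sketch of the recurrence (2.6) is given below. It assumes a zero initial hidden state and a tanh activation, neither of which is specified in the text, and uses small random weights and inputs in place of trained parameters and real word vectors.

```python
import numpy as np

def rnn_encode(inputs, W_hh, W_xh, f=np.tanh):
    """Apply the recurrence (2.6) and return the hidden state of every timestep."""
    h = np.zeros(W_hh.shape[0])  # assumed initial state
    states = []
    for x_t in inputs:
        h = f(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 3, 5
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
W_xh = rng.normal(scale=0.5, size=(d_h, d_in))
words = rng.normal(size=(T, d_in))  # stand-in for word vectors

states = rnn_encode(words, W_hh, W_xh)
reversed_states = rnn_encode(words[::-1], W_hh, W_xh)
```

Unlike an average of word vectors, the final hidden state changes when the same words are read in reverse order.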
In practice, the classical RNN exhibits some problems regarding the propagation of gradients during learning, and seems to be unable to handle long-term dependencies. These phenomena, referred to as vanishing and exploding gradients, have been studied in (Bengio et al. 1994; Pascanu et al. 2013). To circumvent these issues, more elaborate recurrent models have been developed. In particular, the Long Short-Term Memory (LSTM) network was proposed in (Hochreiter et al. 1997a), and made popular by (Greff et al. 2015; Christopher Olah 2015). The core idea of this model is the cell state $c_t$, which stores information from previous timesteps. The network can choose to remove information from the cell, update its value using the input, and output its content towards the hidden state $h_t$. Mathematically, three gating operators are computed as functions of the input $x_t$ and the previous hidden state $h_{t-1}$: the input gate $i_t$, the forget gate $f_t$ and the output gate $o_t$:
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{2.7}$$
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{2.8}$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{2.9}$$
where $\sigma$ is the sigmoid function, whose output is in $[0, 1]$. The network proposes a cell vector $\tilde{c}_t$ in the form
$$\tilde{c}_t = f(W_c [h_{t-1}, x_t] + b_c) \tag{2.10}$$
and this vector is used to update the network’s cell and compute $c_t$ following the equation
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{2.11}$$
where $\odot$ denotes the element-wise product. Intuitively, $f_t$ selects the components of $c_{t-1}$ that should be kept ($f_t$ close to 1) and those that should be forgotten ($f_t$ close to 0). Similarly, $i_t$ chooses the components of $\tilde{c}_t$ that should be passed on to $c_t$. Finally, following the same gating mechanism, the internal hidden state $h_t$ is updated by
$$h_t = o_t \odot \tanh(c_t) \tag{2.12}$$
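The update equations (2.7) to (2.12) translate almost line by line into NumPy, as in the sketch below. The parameter dictionary, the small random initialization, the zero biases and the choice of tanh for the activation $f$ are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params, f=np.tanh):
    """One LSTM update following equations (2.7) to (2.12)."""
    concat = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])  # input gate
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])  # forget gate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])  # output gate
    c_tilde = f(params["W_c"] @ concat + params["b_c"])    # proposed cell
    c_t = f_t * c_prev + i_t * c_tilde                     # cell update
    h_t = o_t * np.tanh(c_t)                               # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = {name: rng.normal(scale=0.1, size=(d_h, d_h + d_in))
          for name in ("W_i", "W_f", "W_o", "W_c")}
params.update({name: np.zeros(d_h) for name in ("b_i", "b_f", "b_o", "b_c")})

# Encode a sequence of five stand-in word vectors.
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, params)
```

Pushing the forget gate towards 1 and the input gate towards 0, for instance with large biases of opposite signs, leaves the cell state essentially unchanged from one timestep to the next, which is the mechanism that lets information survive over long sequences.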
Other recurrent models have been developed using similar gating processes. Among them, the Gated Recurrent Unit (GRU) (Cho et al. 2014; Chung et al. 2014) is one of the most popular, certainly because it performs as well as LSTMs with fewer parameters. In this simplified model, the cell state $c$ is removed and the input and forget gates are merged into a single update gate. The equations of the GRU are as follows:
$$z_t = \sigma(W_z [h_{t-1}, x_t]) \tag{2.13}$$
$$r_t = \sigma(W_r [h_{t-1}, x_t]) \tag{2.14}$$
$$\tilde{h}_t = f(W_h [r_t \odot h_{t-1}, x_t]) \tag{2.15}$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \tag{2.16}$$
where $\odot$ denotes the element-wise product.
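The GRU equations (2.13) to (2.16) can be sketched in NumPy as follows. The candidate state uses its own weight matrix, named `W_h` here as in the usual formulation of the GRU. The tanh activation, the zero initial state, the absence of biases and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h, f=np.tanh):
    """One GRU update following equations (2.13) to (2.16)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)  # update gate
    r_t = sigmoid(W_r @ concat)  # reset gate
    h_tilde = f(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde

def gru_encode(inputs, W_z, W_r, W_h):
    """Read a sequence and return the last hidden state."""
    h = np.zeros(W_z.shape[0])  # assumed initial state
    for x_t in inputs:
        h = gru_step(x_t, h, W_z, W_r, W_h)
    return h

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W_z, W_r, W_h = (rng.normal(scale=0.5, size=(d_h, d_h + d_in)) for _ in range(3))

question_vector = gru_encode(rng.normal(size=(6, d_in)), W_z, W_r, W_h)
```

Each new hidden state is a gated interpolation between the previous state and the candidate, so no separate cell state is needed.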
The vast majority of VQA models use either LSTM or GRU to encode variable-length sentences in a vectorial form. They keep only the last hidden state, or they use more elaborate models to aggregate the list of all output vectors, one for each timestep (Z. Yu et al. 2017; Z. Yu et al. 2018). As we saw in Section 2.2.1, image models can be pre-trained on the ImageNet dataset to learn to extract relevant visual feature vectors. Similar pre-training schemes exist for language models. In particular, the Skip-thought encoder (Kiros et al. 2015) learns the weights of a GRU neural network using a large quantity of unlabelled textual data. As illustrated in Figure 2.4, a GRU first encodes a sentence into a single vector, which is supposed to contain all the information about the sentence. This vector is then fed to a second GRU that will try to decode the previous sentence and the next sentence occurring in the text. This self-supervised model, inspired by Word2Vec (Mikolov et al. 2013), offers an effective pre-training scheme for sentence representations, and is often used to encode the question in VQA models (Kim et al. 2017; Z. Yu et al. 2017).
In this thesis, the question is systematically represented using the last internal state of a GRU network pre-trained on the Skip-thought task. Our models could surely benefit from the very recent advances in sentence representation such as ELMo (Peters et al. 2018) or Transformer-based architectures (Devlin et al. 2018; Radford et al. 2019).

Multi-modal fusion

Fusion in VQA

Multi-modal fusion is a critical component of VQA systems. Whether it constitutes the whole system or a sub-part of it, we often need to build a learnable function that takes as input two vectors and outputs a single representation. Moreover, this representation is required to account for complex interactions between both modalities. This is why the VQA task has been a fertile playground for researchers on multi-modal fusion. More formally, given the question embedding $q \in \mathbb{R}^{d_q}$ and an image representation $v \in \mathbb{R}^{d_v}$, how do we design a learnable function $f$, parametrized by $\theta$, that provides an output $y \in \mathbb{R}^{d_o}$ such that $y = f(q, v; \theta)$?
Early works have modeled interactions between multiple modalities with first-order models. The IMG+BOW model in (M. Ren et al. 2015) is the first to use a concatenation to merge a global image representation with a question embedding, obtained by summing all the learned word embeddings from the question. In (Shih et al. 2016), (image, question, answer) triplets are scored in an attentional framework. Each local feature is given a score corresponding to its similarity with textual features. These scores are used to weight region multimodal embeddings, obtained from a concatenation between the region’s visual features and the textual embeddings. The hierarchical co-attention network (J. Lu et al. 2016), after extracting multiple textual and visual features, merges them with concatenations and sums. To improve the expressive power in the multi-modal fusion, (Jabri et al. 2016) place a succession of fully-connected layers behind the concatenation of textual and visual features.
However, most of the recent work on multi-modal fusion is focused on bilinear models, as they are a core component of many state-of-the-art VQA models. They express each coordinate in the output as a function of pairwise products between dimensions of the two input vectors. In (Fukui et al. 2016), the authors use the fact that a bilinear model can be seen as a linear model whose inputs are all the possible products between dimensions of $q$ and $v$: $y = W [q \otimes v]$, where $\otimes$ is the outer product, and $[\cdot]$ corresponds to the vectorization operation. The Multimodal Compact Bilinear (MCB) pooling is introduced to make the model tractable, and the calculation of the outer product is avoided thanks to count-sketching techniques (Charikar et al. 2002). However, the best performing methods tackle this complexity issue from a tensor analysis viewpoint. In the recent Multimodal Low-rank Bilinear (MLB) pooling (Kim et al. 2017), the number of parameters in the bilinear model is controlled using the concept of tensor rank. The interaction between image and text vectors is written as a Hadamard product (or element-wise multiplication) between vectors: $y = W_o \left( (W_q q) \odot (W_v v) \right)$. More recently, the Multi-modal Factorized Bilinear (MFB) fusion (Z. Yu et al. 2017) writes each output dimension as a scalar function of the shape $y[k] = q^\top W_k v$, and reduces the model complexity by constraining the rank of each matrix $W_k$.
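The relation between these formulations can be checked numerically. The NumPy sketch below builds a full bilinear model, verifies that it equals a linear model applied to the vectorized outer product, and then shows that a low-rank fusion written with a Hadamard product is itself a bilinear model whose tensor is a sum of `rank` rank-one terms. All dimensions and the random values are assumptions, and the non-linearities used in published models are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_v, d_o, rank = 6, 5, 4, 3
q = rng.normal(size=d_q)  # question embedding
v = rng.normal(size=d_v)  # image representation

# Full bilinear model: y[k] = q^T T[:, :, k] v.
T = rng.normal(size=(d_q, d_v, d_o))
y_full = np.einsum("i,ijk,j->k", q, T, v)

# Same model seen as a linear map on the vectorized outer product.
W = T.reshape(d_q * d_v, d_o).T
y_vec = W @ np.outer(q, v).ravel()

# Low-rank fusion: project both inputs, multiply element-wise, project again.
W_q = rng.normal(size=(rank, d_q))
W_v = rng.normal(size=(rank, d_v))
W_o = rng.normal(size=(d_o, rank))
y_low_rank = W_o @ ((W_q @ q) * (W_v @ v))

# Equivalent bilinear tensor, built as a sum of `rank` rank-one terms.
T_low_rank = np.einsum("ri,rj,kr->ijk", W_q, W_v, W_o)
y_check = np.einsum("i,ijk,j->k", q, T_low_rank, v)

n_params_full = T.size
n_params_low_rank = W_q.size + W_v.size + W_o.size
```

With these toy dimensions the full tensor has 120 parameters against 45 for the low-rank version, and the gap widens quickly for realistic input sizes, where the full tensor becomes intractable.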
As we can see, the task of VQA provides an attractive application for developing effective and efficient multi-modal fusions. In the following, we review other significant works that use bilinear models and tensor structures for purposes other than VQA.

Bilinear models

In all the aforementioned contributions, the tensors we manipulate correspond to parameters of a model that we want to learn using standard DL tools, such as SGD optimizers and back-propagation. In this context, we focus on the representation power and computational complexity reduction. It is worth mentioning that a long line of work on tensor structures aims at reducing tensors that correspond to data. In these cases, inferring an interpretable structure through the decomposition is often desired. In the next paragraph, we briefly review some of these contributions.
Tensor decompositions in data analysis. In the last century, multi-way tensor analysis has been an active field of research. In many problems where the data is acquired directly as different views, multilinear algebra provides an efficient framework for analyzing and understanding the complex underlying phenomena. In fluorescence spectroscopy analysis (Andersen et al. 2003), a low-rank model is used to understand complex chemometrics data, where different samples are measured at several emission wavelengths for several excitation wavelengths, thus forming a three-way array. Tensor decompositions are also used to analyze Electroencephalography (EEG) data, in the form of a three-way array of size time samples $\times$ frequency $\times$ channel (Miwakeichi et al. 2004). An extensive review on tensor analysis for chemometrics is provided in (Bro 2006), which covers other applications such as kinetics, magnetic resonance or chromatography. Blind source separation of statistically independent signals can be solved using a low-rank decomposition (Comon 2014), whereas more complex models can match the structure of correlated sources (Cichocki et al. 2015). In (Goovaerts et al. 2015), a low-rank model is used to detect irregular heartbeats in a three-way array of size channels $\times$ time steps $\times$ heartbeats. Tensor decompositions have also been used as data compression tools, as in (Wang et al. 2008) where a video is compressed using a Tucker model (that we present later in Chapter 3) on the third-order tensor of size height $\times$ width $\times$ frames. The image classification problem has also been viewed through the lens of tensor decompositions. In (Vasilescu et al. 2002), the tensor framework is used to analyse face images of different persons under varying viewpoints, expressions and illuminations. Later, (S. Yan et al. 2007) propose a tensor-based multilinear discriminant analysis algorithm to classify face images.
Finally, web data analysis has also benefitted from the development of tensor decompositions, as these data are often intrinsically multi-modal (Evrim Acar et al. 2005; Bader et al. 2007; Sun et al. 2005; G. Kolda et al. 2005). More details about the above references can be found in (Cichocki et al. 2015; Mørup 2011; E. Acar et al. 2009).
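To make the idea of compressing a height × width × frames array with a Tucker model more concrete, here is a minimal NumPy sketch. It uses a truncated higher-order SVD (HOSVD), which is one common way of computing a Tucker decomposition and is not necessarily the procedure used in (Wang et al. 2008). The function names, the tensor sizes and the ranks are all illustrative assumptions, and the "video" is a synthetic tensor built to have exactly low multilinear rank.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: rows index the chosen mode, columns index all the others."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode` (mode-n product)."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def truncated_hosvd(tensor, ranks):
    """Return a Tucker core and one orthonormal factor matrix per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding span that mode's subspace.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Expand a Tucker core back to the full tensor."""
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out

# Synthetic stand-in for a video (assumption): height=24, width=32, frames=10,
# generated from a small random core so that its multilinear rank is (3, 3, 2).
rng = np.random.default_rng(0)
true_core = rng.standard_normal((3, 3, 2))
true_factors = [rng.standard_normal((24, 3)),
                rng.standard_normal((32, 3)),
                rng.standard_normal((10, 2))]
video = reconstruct(true_core, true_factors)

core, factors = truncated_hosvd(video, ranks=(3, 3, 2))
approx = reconstruct(core, factors)

stored_values = core.size + sum(f.size for f in factors)
relative_error = np.linalg.norm(video - approx) / np.linalg.norm(video)
print(f"stored {stored_values} values instead of {video.size}")
print(f"relative reconstruction error: {relative_error:.2e}")
```

On real video data the multilinear rank is not exactly low, so the chosen ranks trade storage against reconstruction error rather than giving a lossless result as in this synthetic case.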
Compressing learning modules. An important family of work uses multi-linear algebra and tensor structures to reduce the complexity of DL models. In (Lebedev et al. 2014), tensor decompositions are used to simplify and compress convolution kernels. The convolution filter banks of a trained ConvNet are decomposed into their canonical components using a low-rank approximation. After this architectural change, the network remains differentiable, which allows fine-tuning after the parameter compression. This technique yields a substantial speed-up with only a small drop in performance. In (Y. Yang et al. 2017a), an application of tensor decompositions for efficient parameter sharing in multi-task learning is presented. The weight matrices of different task-specific networks are considered as elements of a decomposition of the same underlying tensor, which allows an implicit sharing of trainable parameters and improves performance over training multiple independent single-task networks. More recently, (Ye et al. 2018) simplify the large linear projections Wx contained in an LSTM network (Hochreiter et al. 1997b) by reshaping the matrix W and the input vector x into multi-way arrays W and X. The tensor W is then compressed using the block-term decomposition, introduced in (De Lathauwer 2008). Compared to the classical LSTM, the compressed architecture converges faster, to a more accurate model, and with fewer parameters. In Chapter 4, we develop a multi-modal fusion module based on this tensor decomposition. All these works on compressing deep architectures through tensor reduction confirm the intuition that modern DL models (ConvNets, LSTMs and multi-task networks) are over-parametrized. Tensor decompositions constitute a promising research path towards lighter and more efficient models.
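As a rough illustration of why structuring a large linear projection reduces its parameter count, the sketch below replaces a weight matrix by two thin factors. Note the substitution: it uses a plain truncated SVD of the matrix, which is a much simpler technique than the block-term decomposition of (Ye et al. 2018) or the canonical decomposition of (Lebedev et al. 2014). The dimensions, the rank and the helper name `low_rank_factors` are assumptions chosen for the example, and the weight matrix is synthetic with exactly low rank.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Factor `weight` (d_out x d_in) into a (d_out x rank) and a (rank x d_in) matrix."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale the kept left singular vectors by the singular values
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 128, 8

# Hypothetical weight matrix, built as a product of thin matrices so its rank is 8.
weight = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))
x = rng.standard_normal(d_in)

a, b = low_rank_factors(weight, rank)

y_full = weight @ x            # original projection
y_compressed = a @ (b @ x)     # two small projections instead of one large one

full_params = weight.size
compressed_params = a.size + b.size
print(f"parameters: {full_params} -> {compressed_params}")
print("outputs match:", np.allclose(y_full, y_compressed))
```

Because both factors are ordinary matrices, the compressed projection stays differentiable and could be fine-tuned with standard back-propagation, which is the property the works above rely on. For a trained network whose weights are only approximately low-rank, the truncation would introduce an approximation error that fine-tuning is meant to recover.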

Table of contents:

1 Introduction
1.1 Context
1.1.1 Joint image/text understanding
1.1.2 Visual Question Answering
1.2 Contributions
1.3 Industrial context
2 Related Works
2.1 VQA architecture
2.2 Mono-modal representations
2.2.1 Image representation
2.2.2 Textual embedding
2.3 Multi-modal fusion
2.3.1 Fusion in Visual Question Answering (VQA)
2.3.2 Bilinear models
2.4 Towards visual reasoning
2.4.1 Visual attention
2.4.2 Image/question attention
2.4.3 Exploiting relations between regions
2.4.4 Composing neural architectures
2.5 Datasets for VQA
2.6 Outline and contributions
3 MUTAN: Multimodal Tucker Fusion for VQA
3.1 Introduction
3.2 Bilinear models
3.2.1 Tucker decomposition
3.2.2 Multimodal Tucker Fusion
3.2.3 MUTAN fusion
3.2.4 Model unification and discussion
3.2.5 MUTAN architecture
3.3 Experiments
3.3.1 Comparison with leading methods
3.3.2 Further analysis
3.4 Conclusion
4 BLOCK: Bilinear Superdiagonal Fusion for VQA and VRD
4.1 Introduction
4.2 BLOCK fusion model
4.2.1 BLOCK model
4.3 BLOCK fusion for VQA task
4.3.1 VQA architecture
4.3.2 Fusion analysis
4.3.3 Comparison to leading VQA methods
4.4 BLOCK fusion for VRD task
4.4.1 VRD Architecture
4.4.2 Fusion analysis
4.4.3 Comparison to leading VRD methods
4.5 Conclusion
5 MuRel: Multimodal Relational Reasoning for VQA
5.1 Introduction
5.2 MuRel approach
5.2.1 MuRel cell
5.2.2 MuRel network
5.3 Experiments
5.3.1 Experimental setup
5.3.2 Qualitative results
5.3.3 Model validation
5.3.4 State of the art comparison
5.4 Conclusion
6 Conclusion
6.1 Summary of Contributions
6.2 Perspectives for Future Work
Bibliography