In this section we review work that is closely related to this thesis. We first discuss previous methods for modeling text and images. This includes image captioning, image retrieval based on textual queries, and weakly-supervised methods using text to supervise visual models. Next, we describe approaches using textual descriptions of video data, and discuss some works on modeling the temporal aspect of videos. We conclude the chapter with a brief overview of the literature related to the models and optimization techniques that we will present in this thesis: discriminative clustering and the Frank-Wolfe optimization algorithm.
Images and text
During the last decade much work has been dedicated to the joint modeling of images and their textual descriptions. The ultimate goal is to represent both types of media in a shared space where their relations can be modeled and analyzed. Many approaches have been proposed and we describe some of them here.
Captioning as machine translation. One of the early attempts at modeling images and text was presented in [Duygulu et al., 2002]. The idea was to use a machine translation model to align parts of images to words from image captions. The authors propose to decompose images into regions and to represent each region by an entry in a dictionary. Given symbolic representations for words and image regions, the alignment is phrased as an IBM Model 2 [Brown et al., 1990]. This model contains two distributions: the probability of an image region given a word and the probability of aligning a word to a given image region. As in the classical translation model, the optimal assignment is recovered using an expectation-maximization (EM) algorithm. A broader set of such translation-inspired models is presented in [Barnard et al., 2003]. In addition to the aforementioned IBM model, the authors describe a multi-modal extension of latent Dirichlet allocation (LDA) and an extension of a hierarchical clustering model, using text and visual appearance.
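To make the translation analogy concrete, the following is a minimal sketch of EM for estimating the probability of a region token given a word. It is a deliberate simplification: it follows IBM Model 1 (uniform alignment prior) rather than the Model 2 mentioned above, and the region tokens, captions and function name are hypothetical toy values, not data from the cited work.

```python
from collections import defaultdict

def train_translation_table(pairs, n_iters=20):
    """EM for t(region | word) with a uniform alignment prior (IBM Model 1 style).

    pairs: list of (caption_words, region_tokens); every image is assumed
    to contain at least one region token.
    """
    words = {w for ws, _ in pairs for w in ws}
    regions = {r for _, rs in pairs for r in rs}
    # Uniform initialisation of t(region | word).
    t = {w: {r: 1.0 / len(regions) for r in regions} for w in words}
    for _ in range(n_iters):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: soft-assign each region to the words of its caption.
        for ws, rs in pairs:
            for r in rs:
                z = sum(t[w][r] for w in ws)
                for w in ws:
                    p = t[w][r] / z
                    count[w][r] += p
                    total[w] += p
        # M-step: renormalise the expected counts per word.
        for w in words:
            for r in regions:
                t[w][r] = count[w][r] / total[w]
    return t

# Toy corpus: each image is a bag of region tokens with a bag of caption words.
pairs = [
    (["sun", "sky"], ["r_yellow", "r_blue"]),
    (["sky", "sea"], ["r_blue", "r_dark"]),
    (["sun", "sea"], ["r_yellow", "r_dark"]),
]
table = train_translation_table(pairs)
best_region = {w: max(table[w], key=table[w].get) for w in table}
```

On this toy corpus, co-occurrence statistics alone are enough for each word to concentrate its probability mass on one region token, even though no word-region pair is ever labeled directly.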
The work of Duygulu et al. was extended a few years later in [Gupta and Davis, 2008], taking into account relationships between nouns. The previously used dataset is extended so as to contain not only a set of nouns for each image but also the spatial relations between them. A more complex probabilistic model that includes relationships between pairs of nouns is introduced and optimized using the EM algorithm. Quantitative results on the extended dataset demonstrate significant improvements in image labeling performance.
Fixed templates. Another line of work proposes to match images and sentences in a common semantic space. Farhadi et al. consider a semantic space composed of fixed templates: triplets of labels of the form <o,a,s> (object, action and scene). They model these triplets as a conditional random field (CRF) with unary potentials defined as functions of object detectors (for images) or words. The CRF is trained using a dataset composed of images with an associated textual description and a ground truth <o,a,s> triplet. In the experimental section, the authors show that the common semantic representation can be used to annotate images by picking sentences that correspond to the same inferred triplet.
The template representation is further extended by Kulkarni et al. Instead of including a single <o,a,s> triplet per image, the authors propose to create a node per object detection. Each detection is further associated with an attribute node, and each pair of objects is associated with a preposition node. Unary potentials for this model come from standard object detectors, attribute and prepositional classifiers. The pairwise features are computed on a held-out set of data. Once the CRF has been trained, for a given test image one can perform inference and obtain the most likely pattern. The authors propose to generate captions using the words from this pattern and filling in “function words” using a language model. That way, a template <<white,cloud>,in,<blue,sky>> would be transformed into “There is a white cloud in the blue sky”.
Retrieval and CCA. Much of the work on joint representations of text and images makes use of canonical correlation analysis (CCA) [Hotelling, 1936]. CCA consists in finding linear transformations of two variables such that their correlation is maximized. Performing CCA on two modalities yields two projections that bring the two data sources into a common space. This has been explored for images and text by Hardoon et al., who propose to use a kernelized version of CCA. Sentences are described using term frequencies and standard visual features are computed on images. After learning the two projections, the authors propose to retrieve the closest image given a textual description.
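As an illustration of the linear (non-kernelized) case, the sketch below computes CCA projections by whitening each view and taking an SVD of the whitened cross-covariance. It assumes NumPy is available; the function name `cca`, the small ridge term `reg` and the synthetic two-view data are illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def cca(X, Y, dim, reg=1e-6):
    """Linear CCA: returns projections Wx, Wy and the top canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularised covariance of each view and the cross-covariance.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root via a symmetric eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :dim], Ky @ Vt.T[:, :dim], s[:dim]

# Synthetic two-view data sharing a single latent factor z.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
Y = np.hstack([rng.normal(size=(500, 1)), 2 * z + 0.1 * rng.normal(size=(500, 1))])
Wx, Wy, corr = cca(X, Y, dim=1)
# Both views projected into the common one-dimensional space.
u = (X - X.mean(axis=0)) @ Wx
v = (Y - Y.mean(axis=0)) @ Wy
```

Once both modalities are projected into the common space, retrieval reduces to a nearest-neighbor search between the projected query and the projected database items.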
Ordonez et al. propose to return the caption of an image that is most similar to the query image. The proposed model does not use CCA but relies on the construction of a very large captioned image database. Given a query image, a set of similar images in the database is found based on scene features, such as GIST, of heavily sub-sampled thumbnails. The set of retrieved images is re-ranked using various scores computed from object detections, attributes, people and scene features. The caption of the highest ranked image is returned as a result.
CCA-based image captioning methods are discussed in general by Hodosh et al. The authors introduce a large benchmark dataset and compare different variants of the model, using various feature representations. The evaluation metrics for image captioning are also discussed and compared to human judgment. The topics discussed in this paper are particularly interesting given how hard this task is to evaluate.
Extensions to the kernel CCA-based models are proposed in [Gong et al., 2014a,b]. Gong et al. [2014a] propose a multi-view variant where the CCA is computed between three modalities. In that work, the modalities include images, tags that one can typically find in meta data on the web, and image classes. The empirical evaluation includes experiments on several datasets with various retrieval schemes: from tag to image and from image to tag. Gong et al. [2014b] propose to help the image-text embedding using a large but weakly annotated set of images. The main dataset is Flickr30K, where each image is associated with a precise description in natural language. This work investigates whether using the much larger but imprecise Flickr1M and SBU1M datasets can enhance the quality of the embeddings. Indeed, images from these datasets contain titles, tags and very imprecise textual descriptions. The authors demonstrate empirically a moderate improvement in performance when additional data is used.
Not directly related to image captioning (with complete sentences), but also formulated as a retrieval problem, Weston et al. [2011] propose to learn a model for annotation retrieval given a query image. Images are annotated by a single tag, but the datasets used to train the model are “web scale”: they contain 10M training images. The model is trained using a ranking loss, learning an embedding of visual features and annotations such that the correct label is ranked as high as possible. The embeddings learnt that way for the annotations exhibit some semantic information, as nearest neighbor queries amongst annotation embeddings show interesting similarities.
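The idea of a ranking loss over a joint embedding can be sketched as follows. This is a plain pairwise hinge form, which may differ from the exact loss used in the cited work; the embeddings, label names and margin are hypothetical toy values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ranking_loss(image_vec, label_vecs, correct, margin=1.0):
    """Sum of hinge penalties over wrong labels that score within `margin` of the correct one."""
    s_pos = dot(image_vec, label_vecs[correct])
    return sum(
        max(0.0, margin - s_pos + dot(image_vec, vec))
        for name, vec in label_vecs.items()
        if name != correct
    )

# Toy embeddings: the image and the labels live in the same 2-D space.
image_vec = [1.0, 0.0]
label_vecs = {"dog": [2.0, 0.0], "cat": [0.5, 1.0], "car": [-1.0, 0.0]}

loss_if_dog = ranking_loss(image_vec, label_vecs, "dog")  # correct label already ranked first
loss_if_cat = ranking_loss(image_vec, label_vecs, "cat")  # "dog" outranks the correct label
```

The loss is zero when the correct annotation beats every other annotation by at least the margin, and grows with each violating annotation, which is what pushes the correct label towards the top of the ranking during training.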
Deep models. Recently, there has also been a significant amount of work on joint models for images and text using neural networks. Many caption generation (or retrieval) variants have been proposed in 2014 and 2015, some of which we will describe here. A significant part of these still relies on a ranking loss and formulates captioning as retrieval, but some propose to learn a caption generation model. Image captioning can either be evaluated by a retrieval metric or by comparing the single proposed caption with the ground-truth one. The advantages of the two approaches are discussed in Hodosh et al., with an indication that the ranking measures provide more reliable numbers. We will now describe in detail some of the proposed deep captioning models.
Closely related to [Weston et al., 2011], the work of Frome et al. introduces a ranking-based image annotation model. It has the advantage of considering non-linear transformations of the image representation and being able to generalize to annotations that have never been seen in the training set. Annotation embeddings are initialized using word2vec representations [Mikolov et al., 2013], and images are represented using a deep convolutional neural network. The similarity between annotations and images is measured using a trainable bilinear form. At first, only the bilinear form is optimized, while in the second step of the optimization, the loss is also propagated to the image representation.
While the previously described work computes embeddings for single words or phrases, similar models have been developed for whole sentences [Socher et al., 2014]. In this article, the authors propose to compute a sentence representation and use a ranking loss between images and sentences. A global image representation is used and the sentence is described using a dependency tree recurrent neural network (DT-RNN). In a classical RNN, the nodes of the network correspond to words that are connected as a chain. In this work, the authors propose to create one node per word and connect them following an automatically computed dependency tree. The sentence representation is computed at the root. As for other works, the whole model is trained using a ranking loss and compared to previous work, including CCA-based models.
The previously described model measures the similarity between a global image representation and a whole sentence. Another ranking-based model has been proposed in [Karpathy et al., 2014], where fragments of an image (object detections) are aligned to fragments of a sentence (dependency relations). The similarity between the image and a given sentence is measured using a latent alignment cost. All parameters of the model, including the alignment and fragment embeddings, are trained as before using a ranking loss. The great advantage of this model is that the alignment is modeled explicitly, which allows interesting interpretations. This model is extended and simplified in [Karpathy and Fei-Fei, 2014], where sentence fragments correspond to words. Words are embedded using a bidirectional recurrent neural network, which allows at test time not only to retrieve the closest caption but to actually generate one.
A very simple method, using global image representations and capable of generating new captions, is proposed in [Vinyals et al., 2015]. The work described in the previous paragraph makes use of object detections and tries to resolve the alignment of words to parts of the image. In this article, a global image representation obtained from a very large CNN is fed as the first input of a recurrent neural network for text generation. The RNN takes as first input a linear transformation of the image representation and then subsequently the successive words of the ground-truth caption. At test time, a novel sentence can be sampled by feeding the output of the image representation and then performing a beam search on the RNN outputs. The proposed model provides good performance while only having access to a global representation of images. The authors evaluate this approach using the BLEU score, but also report some ranking experiments. Captions of the training set are ranked by assigning them a probability given the query image. While the ranking results are encouraging, the authors argue that ranking is not the correct way to evaluate image captioning, hence opposing the conclusions of Hodosh et al.
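The beam search used at test time can be illustrated independently of any neural network. In the sketch below, a hypothetical lookup table `TABLE` stands in for the next-word distribution that an image-conditioned RNN would output; the vocabulary, probabilities and function names are toy assumptions.

```python
import math

def beam_search(next_word_probs, beam_size=2, max_len=5, end="</s>"):
    """Keep the `beam_size` most likely partial sentences at every step."""
    beams = [(0.0, [])]  # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            for word, p in next_word_probs(tuple(words)).items():
                cand = (logp + math.log(p), words + [word])
                # Completed sentences leave the beam; the others compete to stay.
                (finished if word == end else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]

# Toy stand-in for the decoder: the distribution depends only on the last word.
TABLE = {
    None: {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}

def toy_model(prefix):
    return TABLE[prefix[-1] if prefix else None]

greedy_caption = beam_search(toy_model, beam_size=1)
beam_caption = beam_search(toy_model, beam_size=2)
```

With a beam of size one the search is greedy and commits to the locally most likely first word, whereas a beam of size two recovers a sentence with higher overall probability. This is the reason for preferring beam search over greedy decoding.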
An alternative, phrase-based approach has been proposed by Lebret et al. Phrases are defined as groups of words that express a specific idea, and are here classified into noun phrases, prepositional phrases and verb phrases. The authors propose to learn a joint model of images and phrases using a maximum likelihood criterion. The model assumes that the conditional probabilities of phrases from the same sentence, conditioned on the image, are independent. Then, the conditional probability of a given phrase is defined using a softmax of a bilinear product between a phrase embedding and an image representation. Images are represented using the output of a CNN while phrases are represented by the average of precomputed word embeddings. After optimizing the negative log-likelihood, at test time, a caption is generated by fitting a language model to the most likely phrases. A final re-ranking step, exploiting visual information, is used to choose the closest sentence. Please note that this method generates novel captions and does not make use of a ranking loss.
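The softmax over bilinear scores can be sketched in a few lines. All values here are hypothetical: the word embeddings, the matrix `W` and the image vector are toy numbers, and the function names are illustrative rather than taken from the cited paper.

```python
import math

def phrase_embedding(words, word_vecs):
    """Average of the word embeddings of a phrase."""
    dim = len(next(iter(word_vecs.values())))
    return [sum(word_vecs[w][k] for w in words) / len(words) for k in range(dim)]

def phrase_probabilities(image_vec, phrase_vecs, W):
    """p(phrase | image): softmax over phrases of the bilinear score phrase^T W image."""
    Wi = [sum(w * x for w, x in zip(row, image_vec)) for row in W]
    scores = {p: sum(a * b for a, b in zip(v, Wi)) for p, v in phrase_vecs.items()}
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {p: math.exp(s - m) for p, s in scores.items()}
    z = sum(exps.values())
    return {p: e / z for p, e in exps.items()}

# Toy 2-D word embeddings and a 3-D image representation.
word_vecs = {"white": [1.0, 0.0], "cloud": [1.0, 1.0], "blue": [0.0, 1.0], "sea": [-1.0, 1.0]}
phrase_vecs = {
    "white cloud": phrase_embedding(["white", "cloud"], word_vecs),
    "blue sea": phrase_embedding(["blue", "sea"], word_vecs),
}
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps image space (3-D) to phrase space (2-D)
image_vec = [2.0, 0.0, 1.0]
probs = phrase_probabilities(image_vec, phrase_vecs, W)
```

In a trained model, `W` and the embeddings would be learned by minimizing the negative log-likelihood of the phrases observed with each image. Here they are fixed by hand only to show how the probabilities are formed.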
Image captioning and text/image retrieval is a very interesting area of research that is currently lacking a proper task definition and suitable evaluation metrics. In the following section we will discuss how text and image captions can be used as weak supervision, for instance for face recognition.
Text as supervision
Textual information has extensively been used as a form of weak supervision to train visual models. This has especially been exploited in the context of person recognition in news photos. Indeed, news pictures are usually associated with captions which potentially describe the people pictured in the image. Berg et al. [2004] were among the first to cope with this rich, free, yet imprecise data. The authors propose to detect faces and rectify them based on facial landmark locations to match a canonical pose. Pixel values of the image are used as features and are projected to a lower dimensional space using kernel principal component analysis (kPCA). Using images with unambiguous labels (portraits with a single name) the authors propose to perform linear discriminant analysis. In this lower dimensional space, a modified k-means clustering procedure is carried out to obtain face clusters associated with names. This work has been extended to a probabilistic formulation in [Berg et al., 2005], using a generative model of faces. The authors have also proposed to model how likely a name is to appear in the picture, conditioned on textual context information. The proposed model is optimized using expectation maximization and provides better results as compared to [Berg et al., 2004].
Related to the work described above, Luo et al. propose to extend person identification to action recognition. The goal is now not only to recognize celebrities, but also what they are doing. This article has relations to the previously described work on modeling relations between objects in images and text [Gupta and Davis, 2008]. In a similar spirit, the authors propose a complete probabilistic model of person identities and related actions. The parameters of the appearance model and the text to image assignments are obtained using the EM algorithm. Empirical evaluation shows that modeling identities and actions jointly works better than only modeling identities. This work is quite similar in spirit to what we will present in Chapter 4.
Wang and Mori consider a model that classifies objects in images using latent attributes. The model does not take natural language as input but works on a “structured” label set, and makes use of richer labels. The model is cast in the latent support vector machine (L-SVM) framework with a potential function composed of several terms: an object class model, an attribute model, a class conditioned attribute model, an attribute co-occurrence term and a class-attribute co-occurrence term. When the latent variables are unknown in the training phase, the resulting optimization problem is non-convex. On the datasets used in the experiments, the attributes are available at training time but the authors show that using observed latent variables during training degrades the performance. The authors propose a non-convex cutting plane optimization algorithm and show experiments on two object-attribute datasets, demonstrating significant improvement over contemporary baselines.
Video and text
Text has also been extensively used with video data. First, movie scripts provide a rich source of annotations for action recognition and person identification in movies. This kind of supervision is weak as it does not provide precise sample-label correspondences. Several models have been proposed to cope with this kind of uncertainty, often yielding good models with only a light supervisory burden. Following the attempts at image captioning, many approaches have been proposed for video captioning. These rely on well annotated video corpora, some of which we will describe here. The closest to our work are methods focusing on the task of aligning videos with corresponding textual descriptions. Some of these methods will be described below.
Similar to the work on captions and faces in news, subtitles and movie scripts have been used together with the video signal to train action and character models. Movie scripts provide a precise description of what is happening on screen. Just like the movie, they are divided into scenes, each specifying its setting. They include two kinds of textual information: dialogues and scene descriptions. Dialogues contain the name of the character that is speaking as well as the pronounced words. In between these dialogues, the script contains scene descriptions which describe how the characters behave and the environment in which they evolve. These scripts however do not contain any precise timing indications and are therefore unaligned with the actual movie. Most of them are shooting scripts, written before the shooting, and the final movie can differ (because of the actors’ performances, editing, etc.).
Thanks to the expansion of collaborative subtitle sharing, textual transcripts of the monologues are freely available on the web. These precisely temporally aligned transcriptions provide a reliable description of the characters’ words. One can match the text found in the subtitles with dialogues from the scripts in order to roughly align the script in time. Exploiting this data as a form of supervision was first proposed in Everingham et al. and gave rise to many subsequent models. We will describe some of the most important ones here.
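A rough version of this script-to-subtitle matching can be sketched with standard string matching. The sketch below is an assumption-laden simplification: it matches each script dialogue to its most similar subtitle independently (ignoring temporal order), uses `difflib` similarity with an arbitrary threshold, and assigns to each scene description the gap between the neighboring matched dialogues. The function name and the toy script are hypothetical.

```python
from difflib import SequenceMatcher

def align_script(script, subtitles, min_ratio=0.6):
    """Transfer subtitle timestamps to script entries.

    script: list of (kind, text) with kind in {"dialogue", "description"}.
    subtitles: list of (start, end, text) in seconds.
    Returns one (start, end) interval, or None, per script entry.
    """
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    times = []
    for kind, text in script:
        if kind != "dialogue":
            times.append(None)
            continue
        best = max(subtitles, key=lambda s: similarity(text, s[2]))
        ok = similarity(text, best[2]) >= min_ratio
        times.append((best[0], best[1]) if ok else None)

    # A scene description is placed between its nearest matched dialogues.
    for i, (kind, _) in enumerate(script):
        if kind != "description":
            continue
        prev = next((times[j] for j in range(i - 1, -1, -1)
                     if script[j][0] == "dialogue" and times[j]), None)
        nxt = next((times[j] for j in range(i + 1, len(script))
                    if script[j][0] == "dialogue" and times[j]), None)
        if prev and nxt:
            times[i] = (prev[1], nxt[0])
    return times

script = [
    ("dialogue", "Where did you put the keys?"),
    ("description", "She walks to the counter and looks around."),
    ("dialogue", "They are on the kitchen table, I think."),
]
subtitles = [
    (10.0, 12.0, "Where did you put the keys?"),
    (15.5, 18.0, "They are on the kitchen table. I think."),
    (30.0, 31.0, "Did you lock the front door?"),
]
intervals = align_script(script, subtitles)
```

The interval obtained for a scene description is only a coarse bound, which is precisely why this kind of supervision is considered weak.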
Face recognition. Everingham et al. propose to link subtitles to scripts in order to recover the identities of the speakers. Time-stamped subtitles with speaker identities can then be used as reliable labels to recognize people. In this early work, frontal faces are detected and tracked. Then facial landmarks are localized and described using either SIFT or raw pixel values. The link between subtitles and video is made by detecting lip motion to decide which character is speaking at that given moment. Using conservative thresholds, a reliable set of face tracks is assigned the correct character identity. The other tracks in the video are assigned to an identity using a simple probabilistic model.
This work is further extended in [Sivic et al., 2009] by introducing several improvements. First of all, a profile face detector is added, improving the face detection coverage. Better face descriptors are used, and a model is learned using a kernel SVM instead of simply relying on a distance in the feature space. The kernel is a mixture of different kernels, one per facial landmark, and, for each landmark, the closest descriptors in the two tracks are used to compute the kernel entry. The weights of the mixture are learned and the whole pipeline provides much better performance. However, the use of textual information is identical to the previous work, and relies on heuristic matching of monologue speakers to lip motion.
An interesting alternative is proposed by Cour et al. [2009]. Instead of relying on the explicit modeling of speakers to assign labels to face tracks, the authors propose to formulate the problem as an ambiguously labeled one. In the proposed model a data sample can be assigned to multiple labels, as subtitles often correspond to a dialogue. The authors propose a loss that is suitable for this kind of ambiguous setting together with a convex surrogate that leads to a tractable learning algorithm. The method is evaluated on images with captions and on the TV series “Lost”. This way of handling ambiguous script information in [Cour et al., 2009] is related to our contribution described in Chapter 4.
A method very similar in spirit to the contribution described in Chapter 4 is presented in [Ramanathan et al., 2014]. While building upon our weakly-supervised character recognition model, the authors add interesting extensions. This work not only uses character mentions in the script to generate constraints, but also includes a model for co-reference resolution in the text. Indeed, characters are not always named using their full name and the joint text-and-video model improves both face recognition and co-reference resolution. The authors demonstrate experimentally that both problems actually benefit from this joint model and evaluate this improvement along the iterations of the algorithm.
The works described above only use scripts to recognize characters. In our work, we also investigate scripts as a source of supervision for training action models.
In the following section we will review some related publications that used movies and script data as a source of supervision of some sort.
Action recognition. Movies with associated scripts were first used in [Laptev et al., 2008] to automatically build an action recognition dataset. The authors propose to train a text classifier that predicts whether the script mentions an action or not. Eight different actions are considered and the classifier is trained on an annotated set of 12 movie scripts. Retrieving relevant scene descriptions using this classifier works much better than when using simple regular expression matching. The video corresponding to the classified scene description is used as a training sample for learning an action classifier. Using the raw piece of video is compared with using a cleaned-up version where the temporal boundaries have been corrected by an annotator.
This work is extended in [Marszalek et al., 2009], adding several actions and better exploiting textual data. Apart from learning action models, this paper shows how to train scene classifiers using script-based supervision and evaluates both. A richer model of actions given a scene context is proposed where the influence of scenes on actions is either obtained from scripts (by counting) or trained. The action recognition dataset obtained in this paper constitutes the Hollywood2 dataset, which is a well-known action recognition benchmark.
The previously described methods provide accurately labeled video clips but the temporal bounds are often imprecise. Duchenne et al. address this problem by trying to automatically adjust these bounds based on a discriminative cost. Given the imprecise temporal bounds (extended by a given amount), the model selects the temporal window inside that is most discriminative. The problem is formulated as discriminative clustering, based on the hinge loss. The optimization is carried out by a coordinate descent approach, iterating between learning the optimal model and picking the most discriminative window. This cleanup process is evaluated for the end task of action detection in movies for two classes of interest. The authors show that the cleaned-up windows provide better training samples than the raw ones but still worse than the ground-truth ones. This work is related to what we will describe in Chapter 5, where a list of actions is aligned to a video sequence.
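The alternation between learning a model and picking the most discriminative window can be sketched on toy data. This is only an illustration of the coordinate descent idea: the classifier is a linear model trained with a basic subgradient method on the regularized hinge loss, the features are hypothetical 2-D vectors, and the function names, initialization (middle window) and hyper-parameters are assumptions rather than the cited authors' setup.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_hinge(X, y, epochs=200, lr=0.1, lam=0.01):
    """Linear classifier fitted by subgradient descent on the regularised hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            violated = t * (dot(w, x) + b) < 1.0
            for k in range(len(w)):
                grad = lam * w[k] - (t * x[k] if violated else 0.0)
                w[k] -= lr * grad
            if violated:
                b += lr * t
    return w, b

def refine_windows(clips, negatives, n_rounds=5):
    """clips: one list of candidate window features per weakly labeled clip."""
    chosen = [len(c) // 2 for c in clips]  # start from the middle window
    for _ in range(n_rounds):
        # Step 1: learn a model from the currently selected windows.
        X = [c[i] for c, i in zip(clips, chosen)] + negatives
        y = [1] * len(clips) + [-1] * len(negatives)
        w, b = train_hinge(X, y)
        # Step 2: pick, in each clip, the window the model scores highest.
        new = [max(range(len(c)), key=lambda i: dot(w, c[i]) + b) for c in clips]
        if new == chosen:
            break
        chosen = new
    return chosen

# Toy features: the first dimension is high on action windows, the second on background.
clips = [
    [[0.1, 0.9], [1.0, 0.1], [0.2, 0.8]],
    [[0.9, 0.0], [0.2, 0.9], [0.1, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9]],
    [[0.1, 0.8], [0.2, 1.0], [1.0, 0.2]],
]
negatives = [[0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
selected = refine_windows(clips, negatives)
```

Even though two of the four clips start from a background-like window, the model learned in the first round is already good enough to move the selection to the action window in every clip, after which the procedure stops changing.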
Several datasets with video descriptions have been released and used for building joint video and text models. In this section, we describe those that contain curated textual descriptions, written on purpose or manually corrected in some way. Regneri et al. present a video dataset designed to work on grounding textual descriptions. The point is to discover the “meaning” of sentences by looking at the corresponding visual data. The main contribution of this paper is the dataset, composed of 212 HD videos, each described by 20 different annotators. The videos correspond to a restricted cooking setup, where simple recipes are prepared in front of a static camera. All videos come from a larger cooking dataset that is described in detail in [Rohrbach et al., 2012a]. Descriptions are obtained using Amazon Mechanical Turk. Annotators were asked to describe the content of the video in 5 to 15 steps, each step being one sentence. Additional annotations were then added to measure the similarity between descriptions. Overall the dataset is interesting but of limited size, if one wants to train language models on it.
Another dataset of videos with associated textual descriptions was introduced in [Rohrbach et al., 2015]. The dataset is composed of 94 movies, out of which 55 are provided with audio descriptions and 50 with movie scripts. A total of 11 movies are provided with both the audio description and the script, which allows the authors to compare the quality of the two kinds of textual descriptions. Scripts are obtained from the web, aligned using subtitles as explained earlier, and then the alignment is corrected manually. Audio descriptions are provided for the visually impaired and describe precisely what is happening on screen. The authors propose to transcribe them to text using a crowd-sourcing audio transcription service. The dataset is composed of 68337 video clips with a textual description, yielding in total 74 hours of video with more than 650k words. This dataset is much larger and contains much more challenging visual content than the restricted cooking setup. Some of the works described in the following section make use of the datasets that we have presented here.