Descriptors for action recognition
When it comes to action recognition, descriptors based purely on spatial appearance are no longer informative enough. It becomes necessary to use descriptors that capture motion information, or that blend spatial and motion information together (spatio-temporal descriptors).
One of the first descriptors applied to action recognition was the Motion History Image (MHI) [Bobick 2001], which labels each pixel as having or not having motion (or records how recently each pixel experienced motion); template matching is then used to recognize the action. However, the method cannot be applied in situations with camera motion or cluttered scenes, being sensitive to parasitic movement and to occlusion. Other methods recognize actions by detecting and representing the motion of human body parts, such as [Brendel 2010] and [Tran 2012].
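The core MHI update can be sketched as follows; this is a minimal numpy version, where the decay constant `tau` and the motion threshold `thresh` are illustrative values rather than those of [Bobick 2001]:

```python
import numpy as np

def update_mhi(mhi, frame, prev_frame, tau=10, thresh=25):
    """One step of a simplified Motion History Image update.

    Pixels whose intensity changed by more than `thresh` between frames
    are set to the maximal timestamp `tau`; all others decay by 1.
    """
    motion = np.abs(frame.astype(int) - prev_frame.astype(int)) > thresh
    return np.where(motion, tau, np.maximum(mhi - 1, 0))

# Toy sequence: a bright 2x2 block moving one pixel to the right per frame.
frames = [np.zeros((5, 5), dtype=np.uint8) for _ in range(3)]
for t, f in enumerate(frames):
    f[1:3, t:t + 2] = 255

mhi = np.zeros((5, 5), dtype=int)
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, cur, prev)
# Recently moved pixels hold the highest value; older motion has decayed.
```

The resulting image encodes "where and how recently" motion occurred, which is what the template-matching stage then compares against.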
Although not specifically dedicated to action recognition, we can also mention the work of [Tanase 2013], which extends the Bag of Words model by separating local features into two categories: features belonging to the (static) background and features corresponding to foreground objects in motion. Two histograms of visual words are thus constructed, one for static features and one for moving features, thereby separating information corresponding to the static and to the moving parts of a video. The authors then choose to concatenate the two histograms to form the video descriptor, but other strategies for exploiting these two types of information could be envisaged. For example, the BoW histogram of moving features can be used to detect objects that are usually in motion, while the other BoW histogram can be used for objects that are normally static. The results from the histogram of static features can then be considered as context information, and can be used to reinforce the results from the moving features histogram.
An interesting approach for action recognition is presented in [Rosales 1999]. Objects of interest (persons) are segmented using a continuously-updated background model. The object bounding boxes are then tracked across frames using Extended Kalman Filters, with adaptations that allow predicting and detecting occlusions (so that tracking is not interrupted by a short occlusion). Tracking makes it possible to align each object across frames and to construct object-centric representations using Motion History Images, from which the action can be recognized. The system is interesting because it employs feedback loops that improve processing at lower stages based on results from higher stages, treating in a unified manner the problems of tracking, trajectory estimation and action recognition. However, although the approach is well suited for video surveillance contexts with a fixed camera and uncluttered scenes, it would not work in TRECVid SIN, where the setting is too diverse and uncontrolled.
Descriptors for video content
In general, methods that try to characterize video volumes as a whole are affected by occlusion and clutter (something passing in front of the action of interest changes its appearance). Local approaches, on the other hand, describe only small pieces of video (features that are local in space and time) instead of large video volumes. They then use an aggregation strategy, such as the Bag of Words model, to construct the description of a larger video volume from its small parts. As in the case of purely spatial descriptors, the BoW model ignores spatial and temporal relations. There exist models that also encode spatial and temporal relations, such as spatio-temporal pyramidal representations (but these impose a rigid definition of the space-time division) [Laptev 2008], or Actom Sequence Models that encode the temporal succession of action elements (action atoms, actoms) [Gaidon 2011].
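The BoW aggregation step itself is simple: assign each local descriptor to its nearest codeword and count. A minimal sketch, with a hypothetical 2-D descriptor space and a 3-word codebook (real descriptors are of course much higher-dimensional):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    return the normalized histogram of assignments (the BoW vector)."""
    # Pairwise squared distances, shape (n_descriptors, n_words).
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Hypothetical codebook (e.g. from k-means) and four local descriptors.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
descs = np.array([[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.0, 0.2]])
h = bow_histogram(descs, codebook)   # two descriptors fall on word 0
```

The histogram `h` is the fixed-size video (or volume) representation fed to the classifier, regardless of how many local features were extracted.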
There is a high diversity of spatio-temporal descriptors, but it can be noted that many of them describe the spatial appearance component with the aid of descriptors based on Histograms of Oriented Gradients (HOG), SIFT being a good example of a HOG-based descriptor. As for the motion component, the optical flow is often used, which indicates the direction of motion at every pixel and can serve to construct descriptors such as Histograms of Optical Flow (HOF). Motion can also be represented over longer time intervals by tracking points across many frames and constructing trajectories. Spatial appearance and motion can also be described at the same time, as with HOG-3D descriptors based on gradient orientations in 3D (space-time) [Kläser 2012]. We will give some examples of spatio-temporal descriptors below, concentrating on local representations, as these are more appropriate for the diverse, unconstrained TRECVid context.
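As an illustration of the motion component, a magnitude-weighted orientation histogram of the optical flow (a basic HOF, without the spatial cell grid used in practice) can be sketched as:

```python
import numpy as np

def hof(flow, n_bins=8):
    """Histogram of Optical Flow orientations, magnitude-weighted.

    `flow` has shape (H, W, 2), holding (dx, dy) per pixel; the number
    of bins is an illustrative choice.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2 * np.pi)            # angles in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    return hist / (hist.sum() + 1e-12)

# A field where every pixel moves right by 1 pixel: all mass in bin 0.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
h = hof(flow)
```

Magnitude weighting makes strong motions dominate the histogram, which is usually the desired behaviour for action description.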
Spatio-temporal interest points
Some approaches detect local features that are distinctive not only in space, but also in time, spatio-temporal interest points, and then describe the spatio-temporal neighborhoods of these features [Laptev 2003, Ke 2005, Dollár 2005, Niebles 2008].
For example, in [Laptev 2003], spatio-temporal interest points are detected using an extension of the Harris corner detector to 3 dimensions (2D space + time). This yields features that are spatial corners and, at the same time, experience non-constant motion, such as an abrupt change in motion direction. A spatio-temporal cuboid, such as the ones in Figure 2.5, is then described with one or more descriptors, and the results are fed into a Bag of Words model for action recognition.
The approach is extended in [Laptev 2007] to make it invariant to the local constant-velocity component of motion: a spatio-temporal cuboid looks different when the acceleration occurs around zero local motion than when it occurs while the spatial corner undergoes uniform translation (the uniform translation skews the spatio-temporal neighborhood). This brings robustness to camera motion and to uniform object translation, at the cost of losing discriminative power in simpler scenarios without camera motion.
For describing spatio-temporal cuboids, the following types of descriptors were proposed in [Laptev 2007]:
• N-jets and multi-scale N-jets, which are spatio-temporal Gaussian derivatives of the cuboid up to order N;
• histograms of first-order partial derivatives (intensity gradients in the spatio-temporal domain);
• histograms of optical flow.
Histogram descriptors were explored both in a position-independent way (a single histogram for the entire cuboid) and in position-dependent ways (the cuboid was divided according to a spatio-temporal grid, and histograms were computed on the elements of the grid and then concatenated). Principal Component Analysis was also optionally used for dimensionality reduction. In tests on the KTH dataset, the ranking of the descriptors varied depending on whether position-dependent or position-independent histograms were used and on whether PCA was applied, but generally, histograms of spatio-temporal gradients and of optical flow performed better than N-jets. Also, position-dependent histograms performed better than position-independent ones, because they described the cuboids in more detail and were thus more discriminative [Laptev 2007].
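The position-dependent variant can be sketched as follows; the grid shape and the number of bins are illustrative choices, and the cuboid is assumed to already contain quantized values (e.g. gradient-orientation bin indices):

```python
import numpy as np

def grid_histograms(cuboid_values, grid=(2, 2, 2), n_bins=4):
    """Position-dependent description: split a (T, H, W) cuboid of
    quantized values into a space-time grid and concatenate one
    histogram per cell, so each cell's content keeps its position."""
    T, H, W = cuboid_values.shape
    gt, gh, gw = grid
    parts = []
    for ts in np.array_split(np.arange(T), gt):
        for hs in np.array_split(np.arange(H), gh):
            for ws in np.array_split(np.arange(W), gw):
                cell = cuboid_values[np.ix_(ts, hs, ws)]
                parts.append(np.bincount(cell.ravel(), minlength=n_bins))
    return np.concatenate(parts)

# Random quantized cuboid, 4 frames of 6x6 pixels with values in {0..3}.
cuboid = np.random.default_rng(0).integers(0, 4, size=(4, 6, 6))
desc = grid_histograms(cuboid)   # 2*2*2 cells x 4 bins = 32 dimensions
```

The position-independent variant is the special case `grid=(1, 1, 1)`, which explains why it is less discriminative: all spatial and temporal layout inside the cuboid is collapsed into one histogram.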
Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF) were used to describe cuboids extracted from Hollywood movies in [Laptev 2008]. HOF performed better than HOG, but a combination of the two was shown to outperform both.
The Motion Scale Invariant Feature Transform (MoSIFT) is a detector and descriptor for local video features that combines spatial appearance and motion information. The classical 2D SIFT detector is used to detect spatial features in the video frames. Afterwards, only spatial features that also experience significant optical flow are kept, discarding features that do not have enough motion.
For the description step, spatial appearance is described using the classical SIFT descriptor. But SIFT is made from histograms of oriented gradients, and the optical flow at a pixel also has a magnitude and an orientation, just like the intensity gradient. Therefore, a SIFT-like descriptor can be constructed from the optical flow field in the same manner as it is constructed from the image intensity gradient field. The static appearance part is adapted for rotation invariance, but the motion part is not, because it is important to keep the motion direction unaltered: it constitutes an important cue for action recognition. The spatial appearance SIFT vector and the motion SIFT-like vector are concatenated to produce the 256-dimensional MoSIFT feature descriptor. A BoW strategy can then be used to aggregate the local features.
The descriptor was shown to outperform approaches based on spatio-temporal interest points on the KTH dataset, and it also outperformed 3D Histograms of Oriented Gradients on the TRECVid 2008 Surveillance Event Detection task [Chen 2009].
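The construction of the 256-dimensional vector can be sketched by applying the same SIFT-like grid-of-histograms operator to the gradient field and to the flow field, then concatenating. This simplified version omits SIFT's Gaussian weighting, trilinear interpolation and the rotation normalization applied to the appearance part:

```python
import numpy as np

def sift_like(dx, dy, cells=4, bins=8):
    """SIFT-like vector: a cells x cells grid of `bins`-bin orientation
    histograms, magnitude-weighted (4*4*8 = 128 dimensions)."""
    H, W = dx.shape
    ang = np.arctan2(dy, dx) % (2 * np.pi)
    mag = np.hypot(dx, dy)
    b = (ang / (2 * np.pi) * bins).astype(int) % bins
    out = []
    for rs in np.array_split(np.arange(H), cells):
        for cs in np.array_split(np.arange(W), cells):
            idx = np.ix_(rs, cs)
            out.append(np.bincount(b[idx].ravel(),
                                   weights=mag[idx].ravel(),
                                   minlength=bins))
    v = np.concatenate(out)
    return v / (np.linalg.norm(v) + 1e-12)

# Toy 16x16 patch: (gx, gy) are image gradients, (u, v) the optical flow.
rng = np.random.default_rng(1)
gx, gy = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
u, v = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
mosift = np.concatenate([sift_like(gx, gy), sift_like(u, v)])  # 256 dims
```

The same operator is reused on two different vector fields, which is exactly the insight behind MoSIFT's design.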
Trajectories of tracked points
Trajectories contain important information about the motion in a video. Object centroids can be tracked and their motion described, although this does not provide much information for action recognition. Tracking body parts can give more information, as many human actions are characterized by a succession of body-part positions. Alternatively, dense or sparse trajectories, not necessarily of body parts, can be constructed and described. Tracking local features (from a dense grid or sparse) is advantageous in unconstrained scenarios, because such features are less sensitive to occlusion, viewpoint variations, variability of the objects/persons performing the actions and variability of context.
In [Vrigkas 2013], dense optical flow is computed on every frame of the video, from which motion curves (trajectories) are extracted. Motion curves belonging to the background are eliminated, based on whether the total optical flow along the curve is large enough (insufficient motion characterizes a background feature). Trajectories of varying lengths are allowed, and the Longest Common Sub-Sequence is used to compare two trajectories. The approach worked very well on the KTH dataset, with an accuracy of 96.71%.
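The Longest Common Sub-Sequence comparison between two trajectories can be sketched with a standard dynamic program; the matching threshold `eps` is an illustrative value:

```python
def lcss(traj_a, traj_b, eps=0.5):
    """Longest Common Sub-Sequence length between two trajectories
    (lists of (x, y) points); two points 'match' when both coordinates
    differ by less than `eps`. Handles trajectories of unequal length."""
    n, m = len(traj_a), len(traj_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (ax, ay), (bx, by) = traj_a[i - 1], traj_b[j - 1]
            if abs(ax - bx) < eps and abs(ay - by) < eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

# Two similar trajectories with one outlier point in the second.
a = [(0, 0), (1, 0), (2, 0), (3, 0)]
b = [(0.1, 0), (1.1, 0), (5, 5), (3.1, 0)]
score = lcss(a, b)   # 3 of the 4 points match in order
```

The LCSS count is typically normalized by the length of the shorter trajectory to obtain a similarity in [0, 1]; unlike Euclidean comparison, it tolerates gaps and outlier points.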
Computing dense optical flow fields is computationally expensive, but computing optical flow for a small set of keypoints is much faster. Therefore, [Matikainen 2009] proposes to detect features with the Good Features To Track detector [Shi 1994b] and track them across frames using a classical Kanade-Lucas-Tomasi (KLT) tracker [Birchfield 2007]. These trajectory elements, called trajectons, are described using concatenated vectors of spatial derivatives (displacements in x and y from one frame to the next), to which an affine model of the local deformation along the trajectory can be added. The model was made robust neither to spatial nor to temporal scale variations, and the fact that a motion can be captured starting from different moments was dealt with by considering the same trajectory several times, with shifted starting and ending moments. The trajectories are fed into a BoW model, and Support Vector Machines with linear kernels are used for classification (LIBSVM, [Chang 2001]).
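The basic trajecton descriptor (without the optional affine model) reduces to concatenating frame-to-frame displacements; a minimal sketch:

```python
import numpy as np

def trajecton_descriptor(points):
    """Describe a trajectory (a list of (x, y) positions, one per frame)
    as the concatenated frame-to-frame displacements
    (dx1, dy1, dx2, dy2, ...), in the spirit of [Matikainen 2009]."""
    p = np.asarray(points, dtype=float)
    disp = np.diff(p, axis=0)      # shape (L-1, 2): per-frame displacements
    return disp.ravel()

# A KLT-tracked point over 4 frames, drifting right and slightly down.
track = [(10, 20), (11, 20), (12, 21), (13, 21)]
d = trajecton_descriptor(track)
```

Because the descriptor encodes displacements rather than absolute positions, it is invariant to where in the frame the trajectory occurs, but not to its spatial or temporal scale, matching the limitation noted above.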
Trajectons were again used in [Wu 2011], where dense trajectories of points are extracted. This time, camera motion is dealt with by decomposing trajectories into their camera-induced component and their object- (person-) induced component, without the need to align video frames. The approach gave 95.7% precision on the KTH dataset.
In [Wang 2011], dense trajectories are constructed by tracking points from a dense grid via dense optical flow fields. A fixed length of 15 frames is used for all trajectories (called tracklets), because the authors noted that representing trajectories at multiple temporal scales does not improve their results. The shape of a trajectory is encoded with a normalized vector of displacements. Additionally, trajectory-aligned descriptors are also computed: the local spatial appearance around a tracked point is represented with a Histogram of Oriented Gradients (HOG) averaged across the 15 frames, while local motion around the tracked point is represented with a Histogram of Optical Flow (HOF). A third trajectory-aligned descriptor is the Motion Boundary Histogram (MBH): spatial derivatives of the horizontal and vertical components of the optical flow are computed, and histograms of orientations are then constructed for these derivatives. Because MBHs do not characterize the optical flow itself but the relative motion between adjacent pixels, they are robust to camera motion. Dense trajectories have an advantage over tracking sparse points, because many more features are fed into the model, which is one of the reasons why the approach performs well on a variety of action recognition datasets (94.2% on KTH) [Wang 2011].
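The MBH construction can be sketched as follows; the key point is that the spatial derivatives of a constant flow field vanish, so a uniform camera translation contributes nothing to the histogram:

```python
import numpy as np

def mbh(flow, n_bins=8):
    """Motion Boundary Histograms: orientation histograms of the spatial
    derivatives of each optical-flow component (u, then v). A single
    global histogram per component is used here for simplicity; the
    original descriptor computes them on a grid of cells."""
    hists = []
    for comp in (flow[..., 0], flow[..., 1]):        # u component, then v
        gy, gx = np.gradient(comp)                    # spatial derivatives
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        b = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        hists.append(np.bincount(b.ravel(), weights=mag.ravel(),
                                 minlength=n_bins))
    return np.concatenate(hists)    # MBHx then MBHy, 2 * n_bins dims

# Pure camera translation: uniform flow, so the MBH carries no weight.
flow = np.ones((8, 8, 2))
h = mbh(flow)
```

Only motion *boundaries* (where the flow changes between adjacent pixels, i.e. at moving-object contours) contribute, which is exactly why the descriptor is robust to camera motion.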
Dense trajectories and trajectory descriptors similar to those of [Wang 2011] are used in [Jiang 2012], with the following differences: camera motion compensation is done by clustering motion patterns and describing trajectories relative to the three most important motion patterns, and relations between trajectories are encoded by considering trajectory pairs and describing the relative positions and motions of the members of each pair with respect to each other.
Instead of using dense trajectories, [Ballas 2011] employs a Difference of Gaussians detector to detect sparse points in frames, tracking being performed by matching SIFT descriptors of keypoints from consecutive frames. Trajectories are described using histograms of motion directions (first-order statistics), Markov Stationary Features (second-order statistics) and histograms of acceleration directions (for robustness to the uniform-translation component of motion). Replacing displacement vectors with histograms of displacements gives robustness to the exact moment at which an action begins. Spatial appearance is also represented, using the SIFT descriptor averaged along the trajectory of the tracked point.
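The first- and second-order direction statistics can be sketched as follows (simple normalized histograms, leaving out the Markov Stationary Features part):

```python
import numpy as np

def direction_histograms(points, n_bins=8):
    """For one trajectory, return a histogram of displacement directions
    (first-order statistics) and a histogram of acceleration directions
    (differences of displacements), each normalized to sum to 1."""
    p = np.asarray(points, dtype=float)

    def hist_of(vecs):
        ang = np.arctan2(vecs[:, 1], vecs[:, 0]) % (2 * np.pi)
        b = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        h = np.bincount(b, minlength=n_bins).astype(float)
        return h / (h.sum() + 1e-12)

    disp = np.diff(p, axis=0)        # per-frame displacements
    acc = np.diff(disp, axis=0)      # changes of displacement
    return hist_of(disp), hist_of(acc)

# Uniform rightward translation: all displacement mass falls in bin 0.
track = [(t, 0.0) for t in range(5)]
h_disp, h_acc = direction_histograms(track)
```

Because the histogram is built over all displacements regardless of their order, the description no longer depends on exactly when within the trajectory the action starts, which is the robustness property mentioned above.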