State-of-the-art on Analyzing Landmark Sequences for Human Behavior Understanding 


Human landmarks

Several human-related Computer Vision problems can be approached by first detecting and tracking landmarks from visual data. A relevant example of this is given by the estimated 3D location of the joints of the skeleton in depth streams [106], and their use in action and daily activity recognition [123, 131, 59]. In this case, for each frame of the depth video, a set of 3D joints is detected at the articulations of the human body, forming a 3D skeleton. In Fig. 2.1, we show an example of a tracked skeleton in a depth video provided by a Kinect V2 sensor. Hence, the problem of analyzing human body motion in a depth video can be efficiently reduced to studying the motion of the 3D skeleton along the video. More sophisticated solutions for automatic tracking of the 3D skeleton do exist, such as the IR body markers used in MoCap systems, but they are costly and time-consuming. These systems provide a large number of joints with high temporal resolution and accurate estimations (see Fig. 2.1). Recently, advances in human pose estimation methods from RGB videos have also made the tracking of 2D/3D skeletons in RGB videos possible and have shown impressive performance [113, 5, 18].
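A tracked skeleton sequence can be viewed as a simple numeric array, one set of 3D joint coordinates per frame. The sketch below illustrates this representation and a basic motion statistic derived from it; the joint count (25, as in Kinect V2) and the random data are illustrative, not output of an actual tracker.

```python
import numpy as np

# A skeleton sequence reduces a depth video to a (T, J, 3) array:
# T frames, J joints, 3D coordinates per joint. Random data stands in
# for real tracker output here.
num_frames, num_joints = 60, 25
rng = np.random.default_rng(0)
sequence = rng.standard_normal((num_frames, num_joints, 3))  # (T, J, 3)

# Per-joint displacements between consecutive frames approximate the
# body motion that landmark-based methods analyze.
displacements = np.linalg.norm(np.diff(sequence, axis=0), axis=-1)  # (T-1, J)
print(sequence.shape, displacements.shape)
```

Studying body motion then amounts to analyzing this low-dimensional array rather than the full depth stream.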
Another relevant example of human landmark tracking is represented by the face, for which several approaches have been proposed for fiducial point detection and tracking in video [8, 137, 23, 21]. These methods detect a set of 2D key points localized at relevant positions of the human face. For instance, several methods opt for detecting landmark points around the eyes, eyebrows, nose, and mouth. Other systems consider additional landmarks around the chin. In the left panel of Fig. 2.2, we show some examples of 2D facial landmark estimations. One can note that such estimations could lead to distortions in the analysis due to large pose variations. To overcome this problem, some works have tried to estimate the 3D locations of these landmark points from RGB videos alone [21, 114]. Examples of these 3D estimations are illustrated in the right panel of Fig. 2.2. It is important to note that, in addition to their impressive performance, most of these methods are real-time solutions for tracking human landmarks.

Why landmark sequences for human behavior understanding?

In this thesis, we focus on designing effective landmark-based solutions for several human behavior understanding tasks. One motivation for this choice is the recent impressive advances in human landmark tracking. As mentioned above, landmark detection and tracking methods for human faces and bodies have recently become reliable and accurate. They are robust to the illumination changes that occur in RGB images and, in some cases, robust to occlusions (see the woman wearing sunglasses in the left panel of Fig. 2.2). By considering the tracked landmarks instead of the original images, we take advantage of the robustness of tracking methods to these classical Computer Vision problems and expect the same robustness from our landmark-based solutions.
Furthermore, considering only tracked landmarks reduces the complexity of the visual data. Instead of using the large number of pixels in each frame of the original video, which could make the analysis computationally intense, landmark trackers provide a compact summary of each frame as a small set of relevant 2D/3D points (typically between 15 and 90 points). Hence, landmark-based solutions are expected to be more efficient and less computationally expensive than other solutions, which makes them more suitable for real-time applications.
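A back-of-the-envelope calculation makes this data reduction concrete. The frame resolution below is an illustrative assumption; the landmark count uses the upper bound quoted above.

```python
# Rough data-reduction estimate: raw pixels per frame vs. tracked
# landmarks. The VGA resolution is an assumed, illustrative value;
# trackers typically return between 15 and 90 points.
width, height = 640, 480
pixels_per_frame = width * height      # 307,200 intensity values per frame
landmarks_per_frame = 90 * 3           # 90 3D points -> 270 values
reduction = pixels_per_frame / landmarks_per_frame
print(round(reduction))                # a more than thousand-fold reduction
```

Even in the worst case (90 landmarks), a frame is summarized by a few hundred values instead of hundreds of thousands of pixels.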

Feed-forward neural networks

In [65], the authors proposed a neural network architecture called Deep Temporal Geometry Network (DTGN) for facial expression recognition from 2D facial landmark sequences. The facial landmarks are first normalized and then concatenated over time to form a single vector representation, which is fed to a neural network. The architecture of DTGN consists of Fully Connected (FC) layers followed by a softmax output layer.
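The DTGN-style pipeline can be sketched as follows: per-frame landmark normalization, concatenation over time into one vector, and a forward pass through FC layers ending in a softmax. All layer sizes and the random weights below are illustrative placeholders, not the published model; mean-centering stands in for the paper's normalization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, C = 10, 49, 7                  # frames, landmarks per frame, classes

seq = rng.standard_normal((T, L, 2))                 # raw 2D landmarks
# Normalization (simplified): center each frame on its mean landmark.
seq = seq - seq.mean(axis=1, keepdims=True)
x = seq.reshape(-1)                                  # concatenate over time

def fc(v, w, b):
    """Fully connected layer with ReLU activation."""
    return np.maximum(w @ v + b, 0.0)

# Randomly initialized FC layers; sizes are arbitrary for illustration.
w1, b1 = 0.01 * rng.standard_normal((100, x.size)), np.zeros(100)
w2, b2 = 0.01 * rng.standard_normal((C, 100)), np.zeros(C)
logits = w2 @ fc(x, w1, b1) + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # softmax over classes
print(probs.shape)
```

The key design point is that temporal information is encoded purely by the ordering of frames in the concatenated vector, which keeps the architecture simple at the cost of requiring fixed-length sequences.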
In the context of 3D action recognition, the authors in [38] proposed to use Convolutional Neural Networks (CNNs). Specifically, the three coordinates of all skeleton joints in each frame were separately concatenated according to their physical connections. A matrix was then generated by arranging the representations of all frames in chronological order, then quantized and normalized into an image. The obtained image represented the skeletal sequence and was finally fed into a hierarchical spatial-temporal adaptive filter banks model for representation learning and recognition. CNNs were also investigated for 3D action recognition in [69], but in a different way. The authors generated three clips corresponding to the three channels of the cylindrical coordinates of a skeleton sequence. A deep CNN model and a temporal mean pooling layer were used to extract a compact representation from each frame of the clips. The output CNN representations of the three clips at the same timestep were concatenated, resulting in a set of feature vectors. Another neural network (FC layers and softmax) was applied to these feature vectors for action classification. Dibeklioglu et al. [35] tackled the problem of measuring depression severity level from 2D facial landmark sequences. They used Stacked Denoising Auto-Encoders (SDAE) to encode the static observations of 2D facial landmark sequences. By doing so, the authors obtained a more discriminative low-dimensional feature representation of the static facial landmarks. They exploited this representation to derive motion features such as velocities and accelerations. Deep auto-encoders were also explored for 3D action recognition. For instance, they were used in [22] to encode the dynamics of skeletal sequences. In this work, three different temporal encoder structures were proposed (i.e., symmetric, time-scale, and hierarchy encoding), designed to capture different spatial-temporal patterns.
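The image-encoding idea shared by the CNN-based methods above can be sketched in a few lines: each frame's joint coordinates form one row, frames are stacked in chronological order, and the resulting matrix is normalized and quantized into an 8-bit image that a standard CNN can consume. The sizes and random data are illustrative, and the global min-max normalization is a simplified stand-in for the quantization used in the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)
T, J = 40, 20                         # frames, joints (illustrative)
skeleton = rng.standard_normal((T, J, 3))

# One row per frame: joint coordinates flattened in a fixed order so
# that rows share the same spatial layout across time.
rows = skeleton.reshape(T, J * 3)

# Min-max normalize and quantize into an 8-bit grayscale image whose
# vertical axis is time and horizontal axis is joint/coordinate index.
lo, hi = rows.min(), rows.max()
image = np.round(255 * (rows - lo) / (hi - lo)).astype(np.uint8)
print(image.shape)
```

Once a skeleton sequence is in this image form, off-the-shelf 2D convolutional architectures can learn spatio-temporal patterns directly, since temporal structure is laid out along one image axis.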


Table of contents:

1 Introduction 
1.1 Motivation and challenges
1.2 Thesis contributions
1.3 Organization of the manuscript
2 State-of-the-art on Analyzing Landmark Sequences for Human Behavior Understanding 
2.1 Introduction
2.2 Human behavior understanding
2.2.1 Terminology
2.2.2 Applications
2.3 Human landmarks
2.3.1 Why landmark sequences for human behavior understanding?
2.3.2 Challenges
2.4 Temporal modeling and classification of landmark sequences
2.4.1 Probabilistic methods
2.4.2 Kernel methods
2.4.3 Deep learning methods
2.4.3.1 Feed-forward neural networks
2.4.3.2 Recurrent neural networks
2.4.4 Riemannian methods
2.4.4.1 Landmark sequences as points on Riemannian manifolds
2.4.4.2 Landmark sequences as trajectories on Riemannian manifolds
2.4.4.3 Classification on Riemannian manifolds
2.5 Conclusion
3 Novel Geometric Framework on Gram Matrix Trajectories for Emotion and Activity Recognition 
3.1 Introduction
3.2 Gram matrix for shape representation
3.3 Riemannian geometry of the space of Gram matrices
3.3.1 Mathematical preliminaries
3.3.1.1 Grassmann manifold
3.3.1.2 Riemannian manifold of positive definite matrices
3.3.2 Riemannian manifold of positive semi-definite matrices of fixed rank
3.3.2.1 Tangent space and Riemannian metric
3.3.2.2 Pseudo-geodesics and closeness in S+(d; n)
3.3.3 Affine-invariant and spatial covariance information of Gram matrices
3.4 Gram matrix trajectories for temporal modeling of landmark sequences
3.4.1 Rate-invariant comparison of Gram matrix trajectories
3.4.2 Adaptive re-sampling
3.5 Classification of Gram matrix trajectories
3.5.1 Pairwise proximity function SVM
3.5.2 K-Nearest neighbor
3.6 Experimental evaluation
3.6.1 3D action recognition
3.6.1.1 Datasets
3.6.1.2 Experimental settings and parameters
3.6.1.3 Results and discussion
3.6.2 3D emotion recognition from body movements
3.6.2.1 Dataset
3.6.2.2 Experimental settings and parameters
3.6.2.3 Results and discussion
3.6.3 2D facial expression recognition
3.6.3.1 Datasets
3.6.3.2 Experimental settings and parameters
3.6.3.3 Results and discussion
3.7 Conclusion
4 Barycentric Representation of Facial Landmarks for Expression Recognition and Depression Severity Level Assessment 
4.1 Introduction
4.2 Affine-invariant shape representation using barycentric coordinates
4.2.1 Relationship with the conventional Grassmannian representation
4.3 Metric learning on barycentric representation for expression recognition in unconstrained environments
4.3.1 Facial expression classification
4.3.2 Experimental results
4.3.2.1 Results and discussions
4.4 Facial and head movements analysis for depression severity level assessment
4.4.1 Facial movements analysis using barycentric coordinates
4.4.2 Head movements analysis in Lie algebra
4.4.3 Kinematic features and Fisher vector encoding
4.4.3.1 Kinematic features
4.4.3.2 Fisher vector encoding
4.4.4 Assessment of depression severity level
4.4.5 Experimental evaluation
4.4.5.1 Dataset
4.4.5.2 Results
4.4.6 Interpretation and discussion
4.5 Conclusion
5 Conclusion and Future study 
5.1 Conclusions and limitations
5.2 Towards geometry guided deep covariance descriptors for facial expression recognition
5.3 Future works
Bibliography

