Experimental results on the Online Dynamic Hand Gesture (Online DHG) dataset


Acquisition systems of depth images and 3D skeletal data

Analyzing and understanding a real-world scene observed by a camera is the main goal of many computer vision systems. Standard cameras provide only 2D information, which limits scene analysis and understanding. Recovering the full 3D information about the observed scene has therefore become an important challenge in computer vision. In a traditional stereo-vision system, two cameras observe the same scene from different points of view; by comparing the two images, a disparity map is computed, from which the relief of the scene can be estimated.
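For illustration only (this sketch is not taken from any of the cited works), the usual pinhole relation between disparity and depth for a rectified stereo pair can be written in a few lines; the focal length in pixels and the baseline in meters are assumed to be known from calibration:

    import numpy as np

    def depth_from_disparity(disparity_px, focal_px, baseline_m, eps=1e-6):
        # Depth (meters) of each pixel of a rectified stereo pair: Z = f * B / d.
        # Pixels with near-zero disparity (no reliable match) are left at depth 0.
        disparity_px = np.asarray(disparity_px, dtype=np.float64)
        depth = np.zeros_like(disparity_px)
        valid = disparity_px > eps
        depth[valid] = focal_px * baseline_m / disparity_px[valid]
        return depth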
For the task of human motion analysis, extracting 3D information about the human pose is a challenge that has attracted many researchers. Motion capture systems are able to capture an accurate human pose and track it over time using markers attached to the body. Motion capture data have been widely used in industry, for example in animation and video games. However, these systems present some disadvantages. First, the high cost of this technology limits its usage. Second, they require the subject to wear physical markers in order to estimate the 3D pose.

In addition to standard RGB images, depth sensors provide a depth map giving, for each pixel, the corresponding distance to the sensor. The 3D information of the scene can be estimated from such depth maps. Two types of technology are used in these depth sensors:
- Structured light: a visible or invisible known pattern is projected onto the scene. A sensor analyzes the distortion of the pattern on the objects it hits and estimates the distance of each point of the pattern.
- Time of flight: a light signal is emitted into the scene. Knowing the speed of light, a receiver computes the distance to the object from the time elapsed between the emission of the signal and its reception.

Depth sensors like the Microsoft Kinect 1 [61] or the Asus Xtion PRO LIVE employ the structured-light technique, while the newer Microsoft Kinect 2 uses time of flight. These acquisition devices have stimulated the development of various promising applications. A recent review of Kinect-based computer vision applications can be found in [47].

In 2011, Shotton et al. [134] proposed a real-time method to accurately predict the 3D positions of 20 body joints from a single depth image, without using any temporal information. The human pose can thus be represented as a 3D humanoid skeleton. Such RGB-D sensors provide, for each frame, the 2D color image of the scene, its corresponding depth map and a body skeleton representing the subject's pose. An example is illustrated in Figure 2.7.
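As a purely illustrative sketch (not part of the thesis pipeline), the per-pixel distances of such a depth map can be back-projected to 3D camera coordinates with a pinhole model; the intrinsic parameters fx, fy, cx, cy are assumed to come from the sensor calibration:

    import numpy as np

    def backproject_depth(depth_m, fx, fy, cx, cy):
        # Back-project an (H, W) depth map in meters to an (H, W, 3) array of
        # (X, Y, Z) points expressed in the camera coordinate frame.
        h, w = depth_m.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth_m.astype(np.float64)
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.stack([x, y, z], axis=-1)

Such a back-projection is the standard way a depth map is turned into a 3D point cloud.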

Related work on hand pose estimation

Accurate hand pose estimation is an important requirement for many Human-Computer Interaction or Augmented Reality tasks, and has attracted a lot of attention in the Computer Vision research community.

Hand pose estimation from RGB images

A significant amount of work has dealt with hand pose estimation from RGB images. These approaches can be divided into two categories: model-based approaches and appearance-based approaches [176]. Model-based approaches generate hand pose hypotheses and evaluate them against the input image. Heap et al. [48] proposed to fit a 3D hand model, whose mesh is constructed via Principal Component Analysis (PCA) from training samples, to the observed surface of the hand. Real-time tracking is achieved by finding the closest, possibly deformed, model matching the image. Henia et al. [50] used a two-step minimization algorithm for model-based hand tracking. They proposed a new dissimilarity function and a minimization process that operates in two steps: the first one provides the global parameters of the hand, i.e. the position and orientation of the palm, whereas the second step gives the local parameters of the hand, i.e. the finger joint angles. However, these methods are unable to handle the occlusion problem. Appearance-based methods directly use the information contained in the images. They do not rely on an explicit prior model of the hand, but rather seek to extract the region of interest of the hand in the image. Bretzner et al. [13] used color features to recognize hand shapes: the hand is described as one large blob feature for the palm, with smaller blob features representing the fingers. This became a very popular method, but it has some drawbacks, such as skin-color detection being very sensitive to lighting conditions. We refer the reader to Garg et al. [39] for an overview of RGB-based hand pose estimation approaches.
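As a minimal, purely illustrative sketch of such an appearance-based step (the threshold values are hypothetical and not taken from [13]), a rough skin-color segmentation can be written with OpenCV; its sensitivity to lighting is precisely the drawback mentioned above:

    import cv2
    import numpy as np

    def hand_region_mask(bgr_image):
        # Rough skin-color segmentation in HSV space. The bounds below are
        # illustrative only and are highly sensitive to lighting and skin tone.
        hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
        lower = np.array([0, 40, 60], dtype=np.uint8)
        upper = np.array([25, 180, 255], dtype=np.uint8)
        mask = cv2.inRange(hsv, lower, upper)
        # Morphological opening removes small speckles so that the largest
        # remaining blob roughly corresponds to the palm region.
        kernel = np.ones((5, 5), np.uint8)
        return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)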


Hand pose estimation from depth images

The hand pose estimation community has grown rapidly in recent years. The introduction of commodity depth sensors and the multitude of potential applications have stimulated new advances. However, it is still challenging to achieve efficient and robust estimation because of the large variability of hand poses, severe self-occlusions and the self-similarity between fingers in the depth image.
A) Tracking based hand pose estimation
We focus our analysis on single-frame methods. However, for completeness, we mention Oikonomidis et al. [106], who proposed a tracking approach which therefore needs a ground-truth initialization. They formulated the challenging problem of 3D tracking of hand articulations as an optimization problem that minimizes the differences between hypothesized 3D hand model instances and the actual visual observations. The optimization was performed with a stochastic method called Particle Swarm Optimization (PSO) [60]. Figure 2.11 illustrates their pipeline: they first extract the region of interest of the hand from a depth image and then fit a 3D hand model using PSO. For the image at step t, the model is initialized using the final solution found for the image at the previous step t-1.
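To make the role of PSO concrete, the following generic sketch (not the authors' implementation) minimizes an arbitrary objective function; in their setting, the objective would score the discrepancy between a rendered hand pose hypothesis and the observed depth map, and the search space would be the hand pose parameters:

    import numpy as np

    def pso_minimize(objective, dim, n_particles=30, n_iters=50,
                     bounds=(-1.0, 1.0), w=0.7, c1=1.5, c2=1.5, seed=0):
        # Generic Particle Swarm Optimization: each particle keeps its personal
        # best position, and all particles are attracted towards the global best.
        rng = np.random.default_rng(seed)
        lo, hi = bounds
        pos = rng.uniform(lo, hi, size=(n_particles, dim))
        vel = np.zeros_like(pos)
        pbest = pos.copy()
        pbest_val = np.array([objective(p) for p in pos])
        gbest = pbest[np.argmin(pbest_val)].copy()
        for _ in range(n_iters):
            r1 = rng.random((n_particles, dim))
            r2 = rng.random((n_particles, dim))
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = np.clip(pos + vel, lo, hi)
            vals = np.array([objective(p) for p in pos])
            improved = vals < pbest_val
            pbest[improved] = pos[improved]
            pbest_val[improved] = vals[improved]
            gbest = pbest[np.argmin(pbest_val)].copy()
        return gbest, float(pbest_val.min())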

Table of contents

List of Figures
1 Introduction 
1.1 Thesis Contributions
1.2 Thesis outline
2 Literature overview 
2.1 Introduction
2.1.1 Hand gesture understanding problem
2.1.2 Applications
2.2 Acquisition systems of depth images and 3D skeletal data
2.3 Datasets for hand pose estimation
2.4 Related work on hand pose estimation
2.4.1 Hand pose estimation from RGB images
2.4.2 Hand pose estimation from depth images
2.5 Datasets for hand gesture recognition
2.6 Related works on hand gesture recognition
2.6.1 Pre-processing steps for hand localization
2.6.2 Spatial features extraction
2.6.3 Temporal modeling
2.6.4 Classification
2.6.5 Deep learning approaches
2.7 Discussion and conclusion
3 Heterogeneous hand gesture recognition using 3D skeletal features 
3.1 Introduction
3.1.1 Challenges
3.1.2 Overview of the proposed method
3.1.3 Motivations
3.2 The Dynamic Hand Gesture dataset (DHG-14/28)
3.2.1 Overview and protocol
3.2.2 Gesture types included
3.2.3 DHG-14/28 challenges
3.3 Hand gesture recognition using skeletal data
3.4 Features extraction from skeletal sequences
3.5 Features representation
3.6 Temporal modeling
3.7 Classification process
3.8 Experimental results
3.8.1 Experimental settings
3.8.2 Hand Gesture Recognition Results
3.8.3 Latency analysis and computation time
3.8.4 Influence of the upstream hand pose estimation step on hand gesture recognition
3.9 Conclusion
4 Recent deep learning approaches in Computer Vision 
4.1 Introduction
4.1.1 Different pipelines in Computer Vision: handcrafted versus deep learning approaches
4.1.2 Feature extraction
4.1.3 Pros and cons
4.2 Where does deep learning come from and why is it such a hot topic right now?
4.2.1 History
4.2.2 Perceptrons and biological neurons similarities
4.2.3 Why only now?
4.3 Technical keys to understand Deep Learning
4.3.1 The multilayer perceptrons
4.3.2 Training a feedforward neural network
4.4 Technical details of deep learning elements
4.4.1 Softmax function
4.4.2 Cross-entropy cost function
4.4.3 Convolutional Neural Network
4.4.4 Recurrent Neural Networks
4.5 Conclusion
5 Dynamic hand gestures using a deep learning approach
5.1 Introduction
5.1.1 Challenges
5.1.2 Overview of the proposed framework
5.1.3 Motivations
5.2 Deep extraction of hand posture and shape features
5.2.1 Formulation of hand pose estimation problem
5.2.2 Hand pose estimation dataset
5.2.3 Pre-processing step
5.2.4 Network model for predicting 3D joints locations
5.2.5 CNN training procedure
5.3 Temporal features learning on hand posture sequences
5.4 Temporal features learning on hand shape sequences
5.5 Training procedure
5.6 Two-stream RNN fusion
5.7 Experimental results on the NVIDIA Hand Gesture dataset
5.7.1 Dataset
5.7.2 Implementation details
5.7.3 Offline recognition analysis
5.7.4 Online detection of continuous hand gestures
5.8 Experimental results on the Online Dynamic Hand Gesture (Online DHG) dataset
5.8.1 Dataset
5.8.2 Offline recognition analysis
5.8.3 Online recognition analysis
5.9 Conclusion
6 Conclusion 
6.1 Summary
6.2 Future works
Bibliography
