Performance for the recognition of the hand in vertical motion: speed variation recognition

Head pose estimation

In surveillance systems, knowledge of head pose provides an important cue for higher-level behavioural analysis, and the focus of an individual’s attention often indicates their desired destination [60]. In addition to contributing to robust face recognition in multi-view analysis, which remains difficult under pose variation [82], pose estimation can also be considered a sub-problem of the general area of intention detection, as it is useful for inferring nonverbal signals related to attention and intention. This makes head pose estimation solutions a key component for HCI [83]. Existing head pose estimation methods can be grouped into:
– Model-based methods (within which we also classify Active Appearance-based methods) [84]
– Appearance-based methods [85]
– Feature-based approaches [86], [87], [88], [89] within which we also classify appearance-based subspace methods.
Appearance-based techniques use the whole sub-image containing the face while model-based approaches use a geometric model.

Model-based solutions

Several works on head pose measurement in low-resolution video use labelled training examples to train various types of classifiers, such as neural networks [90], [86], [91], support vector machines [92], or nearest-neighbour and tree-based classifiers [93], [94], [95]. Other approaches model the head as an ellipsoid and either learn a texture from training data [96] or fit a re-projected head image to find a relative rotation [97]. In [98], a 2D ellipse is used to approximate the head position in the image. The head position is obtained using colour histograms or image gradients. However, lighting changes and different skin colours result in tracking failures. Another drawback of such an approach is its inability to report head orientation. In [99], partial orientation information, such as tilt or yaw, is available. However, the accuracy of those systems is low (up to 15 degrees of error in estimating rotation).
Recently, model-based approaches such as the bunch graph approach, PCA, eigenfaces and Active Appearance Models (AAMs) have received considerable interest. AAMs [100] are nonlinear parametric models derived from linear transformations of a shape model and an appearance model. A neural network can also be trained to distinguish between different persons or between poses of one person’s face [101]. The system proposed in [102] uses neural networks on each camera view to estimate head orientation in either direction. To fuse the multiple views, a Bayesian filter is applied both to diffuse prior estimates (temporal propagation) and to search for the most coherent match of overlapping single-view hypotheses over all the included sensors.
Unsupervised approaches such as eigenfaces [103], [104] learn the subspace for recognition via Principal Component Analysis (PCA) [105] of the face manifold, while supervised approaches such as Fisherfaces [106] learn the metric for recognition from labelled data via Linear Discriminant Analysis (LDA). Linear approaches to head pose estimation are found in [107], [108], [109]. It must be noted that PCA/LDA approaches to head pose estimation are limited by the non-linearity of the underlying manifold structure and its richness in local variations. In recent years, non-linear methods for modelling high-dimensional non-linear data, such as Locally Linear Embedding (LLE) [110] and Graph Laplacian [111], have performed very well at finding manifold structure by embedding a graph structure of the data derived from affinity modelling. When the problem space is large, a kernel method [112], [113] is employed, and in other cases where complexity is an issue, a piece-wise linear subspace/metric learning method [114] is developed to map out the global nonlinear structure for head pose estimation. Template matching is another popular method for estimating head pose: the best template is found via a nearest-neighbour algorithm, and the pose associated with that template is selected as the best pose. Advanced template matching can be performed using Gabor wavelets and PCA or Support Vector Machines, but these approaches tend to be sensitive to alignment and depend on the identity of the person [82].
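The nearest-neighbour template matching described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the 3x3 toy patches, the pose labels and the helper names (`normalise`, `correlation`, `estimate_pose`) are assumptions made for illustration, and normalised correlation is just one plausible similarity measure.

```python
import math


def normalise(patch):
    """Return a zero-mean, unit-norm copy of a flattened grey-level patch."""
    mean = sum(patch) / len(patch)
    centred = [p - mean for p in patch]
    norm = math.sqrt(sum(c * c for c in centred))
    if norm == 0.0:
        return centred  # flat patch: correlation with anything is 0
    return [c / norm for c in centred]


def correlation(a, b):
    """Normalised cross-correlation between two patches of equal size."""
    return sum(x * y for x, y in zip(normalise(a), normalise(b)))


def estimate_pose(patch, templates):
    """Pick the pose label of the template most similar to the patch.

    templates is a list of (pose_label, flattened_patch) pairs.
    """
    best_label, _ = max(templates, key=lambda t: correlation(patch, t[1]))
    return best_label


# Hypothetical 3x3 templates (flattened row by row): the bright column
# stands in for where the face appears brightest under each pose.
TEMPLATES = [
    ("left",    [9, 1, 1, 9, 1, 1, 9, 1, 1]),
    ("frontal", [1, 9, 1, 1, 9, 1, 1, 9, 1]),
    ("right",   [1, 1, 9, 1, 1, 9, 1, 1, 9]),
]

# A noisy observation that most resembles the "left" template.
QUERY = [8, 2, 1, 9, 1, 2, 7, 1, 1]

print(estimate_pose(QUERY, TEMPLATES))
```

Because the comparison is made pixel by pixel, a small shift of the face inside the patch changes the scores considerably, which is consistent with the sensitivity to alignment noted in the text.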
More accurate systems use 3D geometrical models to represent the head as a rigid body. In [115], Yang and Zhang use a rough triangular mesh with semantic information. Stereo is used to obtain 3D information, which is matched against the known model. A major shortcoming of this method is the amount of time needed to create a precise model. Recent approaches use a cylinder to approximate both the underlying head geometry and texture [116], [117]. Since a cylinder is only a rough approximation of head geometry, those methods suffer from inaccuracies in estimating rotation and have difficulty differentiating between small rotations and translations. In [118], an approach to 3D head pose estimation from a monocular sequence is proposed. To estimate the head pose accurately and simply, the algorithm relies on the geometry of the individual face and a projective model, without the need for any 3D face model or special markers on the user’s face. Another 3D solution to head pose estimation is presented in [119], where the system relies on a novel 3D sensor that generates a dense range image of the scene. In [82], a novel discriminative feature is introduced which is efficient for pose estimation. The representation is based on the Local Gabor Binary Pattern (LGBP) and encodes the orientation information of the multi-view face images into an enhanced feature histogram. A Support Vector Machine (SVM) classifier with a Radial Basis Function (RBF) kernel is used to estimate poses. The aim of the work in [83] is to develop a new vision-based method which can estimate the 3D head pose with high accuracy through adaptive control of the diffusion factors in a motion model of the user’s head used in particle filtering.

Appearance and feature-based techniques

Appearance-based approaches use filtering and image segmentation techniques to extract information from the image. Some typical appearance-based techniques include optical flow algorithms as well as edge detectors such as Gabor wavelets [120]. Filtering and segmentation resulting from appearance-based methods play a significant role in head pose estimation, but it must be noted that few head pose estimation algorithms are known to be exclusively appearance-based, as they require the step of recognising the pose. Among the exceptions is the Gabor head pose estimation described in [87], where the weights of a Gabor wavelet network directly represent the orientation of the face. Its disadvantage, however, is the computational effort involved, which is very user specific. Brolly et al. [121] used a NURBS surface with texture to synthesise both appearance and pose, but could not report pose accuracy since ground truth was unavailable. Several researchers [122], [123] introduced the notion of an extended superquadric surface, or Fourier-synthesised representation of a surface, which possesses a high degree of flexibility to encompass the face structure. They use model-induced optical flow to define a pose error function. The use of a parameterised surface enables them to resolve ambiguities caused by self-occlusion.
The majority of feature-based algorithms use the eyes as features since they are easy to detect due to their prominent appearance. The nostrils are also used as features; however, they become invisible as soon as the user tilts his head downwards. The mouth is also easy to find except when covered by a moustache or a beard. Several authors use a set of these features to estimate a 3D head orientation. In [62], the authors address the problem of estimating head pose over a wide range of angles from low-resolution images. Faces are detected using chrominance-based features. Grey-level normalised face images serve as input for linear auto-associative memories. One memory is computed for each pose using the Widrow-Hoff learning rule. Head pose is classified with a winner-takes-all process. Fitzpatrick [124] demonstrates a feature-based approach to head pose estimation without manual initialisation. For feature detection and tracking, the cheapest paths across the face region are found, where the cost of a path depends on the darkness of the pixels it crosses. The paths will therefore avoid dark regions, and a pair of avoided regions is assumed to be the pair of eyes. The algorithm is thus dependent on the visibility of the eyes. Head pose is then determined based mainly on the head outline and the eye position. Gorodnichy [125] demonstrates a way to track the tip of the nose by exploiting its resemblance to a sphere with diffuse reflection. This template is then searched for in the image. This approach does not estimate the head pose, but simply tracks the nose tip across the video images, and therefore a pose recognition task has to be added. In [61], a novel approach to estimating head pose from monocular images is presented, which first roughly classifies the pose as frontal, left profile, or right profile. Subsequently, classifiers trained with AdaBoost using Haar-like features detect distinctive facial features such as the nose tip and the eyes. Based on the positions of these features, a neural network finally estimates the three continuous rotation angles used to model the head pose.
Appearance-based subspace methods that treat the whole face as a feature vector in some statistical subspace have recently become popular. They avoid the difficulties of local face feature detection and face modelling. However, in the subspace, the distribution of face appearances under variable pose and illumination is always a highly non-linear, non-convex and possibly twisted manifold, which is very difficult to analyse directly [126]. Murase and Nayar [127] make a parametric description of this nonlinear manifold to estimate pose in a single PCA subspace. Pentland et al. [128] construct view-based subspaces to detect faces and estimate pose. The same idea is used in [85] to estimate head poses in the Independent Subspace Analysis (ISA) subspace. Some approaches solve this problem with kernel-based methods such as Support Vector Regression (SVR) [129] and Kernel Principal Component Analysis (KPCA) [113].
As shown in this section, the area of head pose estimation is rich and at the same time open to interesting new avenues of investigation. The solutions proposed in this work, however, use two model-based approaches, namely template matching and PCA, and a symmetry-based approach, which can be classified among appearance-based approaches. The solutions presented in this section focus mostly on single-frame head pose estimation rather than on the way in which these positions vary. This variation is, however, an important component of the intent recognition solution proposed in this work. The next section focuses on hand gesture recognition.

Hand gesture recognition

Hand gesture recognition from video images is of considerable interest as a means of providing simple and intuitive man-machine interfaces. Possible applications range from replacing the mouse as a pointing device to virtual reality, communication with the deaf and Human-Computer Interaction (HCI). M.W. Krueger [130] first proposed gesture-based interaction as a new form of HCI in the mid-1970s, and the field has since attracted growing interest in making HCI as natural as possible [131]. Much human visual behaviour can be understood in terms of the global motion of the hands. Such behaviours include most communicative gestures [132], [133] as well as movements performed in order to control and manipulate physical or virtual objects [134], [135], [136], [137]. Hand gestures and poses are not only extensively employed in human non-verbal communication [138], but are also used to complement verbal communication, as they are co-expressive and complementary channels of a single human language system [139], [140], [141]. The primary goal of any automated gesture recognition system is to create an interface that is natural for humans to operate or to use when communicating with a computerised device [142], [143]. There are three main categories of hand gesture analysis approaches [144]:
– Glove-based analysis [145]
– Vision-based analysis, which can be divided into model-based [146] and state-based [147] approaches
– Analysis of drawing gestures
There are also solutions that approach the problem from a neuroscience point of view [148].
Glove-based approaches have several drawbacks, including the fact that they hinder the ease and naturalness with which the user can interact with the computer-controlled environment; they also require long calibration and setup procedures [143]. The non-intrusive property of vision-based approaches makes them more suitable, rendering them probably the most natural way of constructing a human-computer gesture interface, as they do not require any additional devices (e.g. gloves) and can be implemented with off-the-shelf devices (e.g. webcams) [149]. Yet it is also the most difficult type of approach to implement in a satisfactory manner.
There are two main approaches to hand pose estimation. The first approach is full Degree Of Freedom (DOF) hand pose estimation, which targets all the kinematic parameters (i.e., joint angles, hand position or orientation) of the skeleton of the hand, leading to a full reconstruction of hand motion [143]. The second consists of “partial pose estimation” methods that can be viewed as extensions of appearance-based systems that capture the 3D motion of specific parts of the hand such as the fingertip(s) or the palm. These systems rely on appearance-specific 2D image analysis to enable simple, low DOF tasks such as pointing or navigation. 3D hand models offer a way of more elaborate modelling of hand gestures but lead to computational hurdles that have not been overcome given the real-time requirements of HCI. Appearance-based models lead to computationally efficient “purposive” approaches that work well under constrained situations but seem to lack the generality desirable for HCI [150].
There is an increasing number of vision-based gesture recognition methods in the literature. Baudel and Beaudouin-Lafon [151], Cipolla et al. [152], and Davis and Shah [153] all describe systems based on the use of a passive “data glove” with markers that can be tracked relatively easily between frames. The system in [152] recovers 3D structure from image sequences but does not attempt to classify gestures. Davis and Shah [154] propose a model-based approach that uses a finite state machine to model four qualitatively distinct phases of a generic gesture. Hand shapes are described by a list of vectors and then matched with the stored vector models. Darrell and Pentland [155] propose a space-time gesture recognition method. Signs are represented using sets of view models, and then matched to stored gesture patterns using Dynamic Time Warping (DTW). Cui and Weng [156] developed a system based on a segmentation scheme which can recognise 28 different gestures in front of complex backgrounds. In [157], Ohknishi and Nishikawa propose a new technique for the description and recognition of human gestures. The proposed method is based on the rate of change of gesture motion direction, which is estimated using optical flow from monocular motion images. Nagaya et al. [158] propose a method to recognise gestures using an approximate shape of gesture trajectories in a pattern space defined by the inner product between patterns on continuous frame images. Heap and Hogg [159] present a method for tracking a hand using a deformable model, which also works in the presence of complex backgrounds. The deformable model describes one hand posture and certain variations of it and is not aimed at recognising different postures. Zhu and Yuille [160] developed a statistical framework, called the Flexible Object Recognition and Modelling System (FORMS), that uses PCA and stochastic shape grammars to represent and recognise the shapes of animated objects. Rehg and Kanade [161] describe a system that does not require special markers. They use a 3D articulated hand model that they fit to stereo data but do not attempt gesture recognition. Blake et al. [162] describe a tracking system based on a real-time “snake” that can deal with arbitrary pose, but treats the hand as a rigid object.
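Dynamic Time Warping, mentioned above for matching observed signs to stored gesture patterns, can be sketched in a few lines. This is a generic textbook formulation rather than the method of [155]: each frame is reduced here to a single hypothetical scalar feature (for example, hand height), and the gesture names and sequences are invented for illustration.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic-programming DTW cost between two scalar sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame repeated
                cost[i][j - 1],      # seq_b advances, seq_a frame repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify_gesture(observed, stored):
    """Return the name of the stored pattern with the lowest DTW cost."""
    return min(stored, key=lambda name: dtw_distance(observed, stored[name]))


# Hypothetical stored gesture patterns (one scalar feature per frame).
STORED = {
    "raise": [0, 1, 2, 3, 4],
    "wave":  [0, 2, 0, 2, 0],
}

# The same "raise" gesture performed more slowly, so some frames repeat.
OBSERVED = [0, 0, 1, 2, 2, 3, 4, 4]

print(classify_gesture(OBSERVED, STORED))
```

The warping step is what lets a gesture performed at a different speed still align with its stored pattern, which a frame-by-frame comparison of sequences with different lengths could not do.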
An important application of hand gesture recognition is sign language understanding [132]. In [163], a large set of isolated signs from a real sign language is recognised with some success using a low-end instrumented glove and two machine learning techniques:
– Instance-Based Learning (IBL)
– Decision-tree learning.
Simple features were extracted from the instrumented gloves, namely the distance, energy and time of each sign. Instrumented gloves have several advantages, the most important of which are cost, processing power and the fact that the data extracted from a glove are concise and accurate. On the other hand, gloves are an encumbrance to the user, and today’s most convenient solutions are required to be non-intrusive. In addition to instrumented gloves, early approaches to the hand gesture recognition problem in a robot control context involved the use of markers on the fingertips [164]. Again, the inconvenience of placing markers on the user’s hand makes this solution less suited in practice. Liang et al. [165] developed a gesture recognition system for Taiwanese Sign Language (TSL) using a Data-Glove to capture the flexion of 10 finger joints, the roll of the palm and other 3D motion information. In [166], [167] and [168], two visual systems based on Hidden Markov Models (HMMs) are presented for recognising sentence-level continuous American Sign Language (ASL) using a single camera to track the user’s unadorned hands. To segment each hand initially, the algorithm scans the image until it finds a pixel of the appropriate colour, determined by an a priori model of skin colour. Given this pixel as a seed, the region is grown by checking the eight nearest neighbours for the appropriate colour. Each pixel checked is considered to be part of the hand. The tracking stage of the system does not attempt a fine description of hand shape, concentrating instead on the evolution of the gesture through time. In [169], a gesture recognition method for Japanese sign language is presented that makes use of the computational model called Parallel Distributed Processing (PDP) and a recurrent neural network for recognition. Huang et al. [170] use a 3D neural network method to develop a TSL recognition system that identifies 15 different gestures. Lockton et al. [171] propose a real-time gesture recognition system which can recognise 46 gestures including the ASL letter-spelling alphabet and digits. The gestures are “static gestures”, in which the hand does not move.
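The seed-and-grow hand segmentation described above (scan for a skin-coloured pixel, then grow the region through the eight nearest neighbours) can be sketched as follows. This is an illustration under stated assumptions, not the code of the cited systems: `is_skin` is a toy stand-in for the a priori skin colour model, the image is a small hypothetical RGB grid, and the sketch adds only those neighbours that pass the colour test, which is one possible reading of the description.

```python
from collections import deque


def is_skin(pixel):
    """Toy stand-in for an a priori skin colour model (assumption)."""
    r, g, b = pixel
    return r > 150 and r > g and r > b


def find_seed(image):
    """Scan row by row and return the first skin-coloured (row, col)."""
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if is_skin(pixel):
                return (y, x)
    return None


def grow_region(image, seed):
    """Grow a region from the seed through 8-connected skin pixels."""
    height, width = len(image), len(image[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy, dx) == (0, 0):
                    continue  # skip the pixel itself
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and (ny, nx) not in region and is_skin(image[ny][nx]):
                    region.add((ny, nx))
                    queue.append((ny, nx))
    return region


S = (200, 120, 100)  # hypothetical skin-like colour
B = (20, 20, 20)     # hypothetical background colour
IMAGE = [
    [B, B, B, B, B],
    [B, S, S, B, B],
    [B, B, S, B, S],  # the last S is not connected to the hand region
    [B, B, B, B, B],
]

seed = find_seed(IMAGE)
print(sorted(grow_region(IMAGE, seed)))
```

Growing from a single seed keeps the isolated skin-coloured pixel at the right edge out of the hand region, which a plain per-pixel colour threshold would not.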

Table of contents:

Chapter 1
Introduction
1.1 Problem Statement
1.2 Motivation and objectives
1.3 Sub-problems
1.4 Assumptions
1.5 Scope
1.6 Contribution
1.7 Outline
Chapter 2
Literature Survey
2.1 Introduction
2.2 Intention detection
2.3 Robotic wheelchairs
2.4 Head pose estimation
2.4.1 Model-based solutions
2.4.2 Appearance and feature-based techniques
2.5 Hand gesture recognition
2.6 Conclusion
Chapter 3
Head-Based Intent Recognition
3.1 Introduction
3.2 Pre-processing steps: face detection and tracking
3.2.1 Histogram-based skin colour detection
3.2.2 Adaboost-based skin colour detection
3.2.3 Face detection and localisation
3.2.3.1 Erosion
3.2.3.2 Dilation and connected component labelling
3.2.3.3 Principal Component Analysis
3.3 Recognition of head-based direction intent
3.3.1 Symmetry-based Approach
3.3.2 Centre of Gravity (COG) of the Symmetry Curve
3.3.3 Linear Regression on the Symmetry Curve
3.3.4 Single frame head pose classification
3.3.5 Head rotation detection: Head-based direction intent recognition
3.4 Recognition of head-based speed variation intent
3.5 Adaboost for head-based direction and speed variation recognition
3.5.1 Adaboost face detection
3.5.2 Camshift tracking
3.5.3 Nose template matching
3.6 Conclusion
Chapter 4
Hand-Based Intent Recognition
4.1 Introduction
4.2 Pre-processing steps: Hand detection and tracking
4.3 Recognition of hand-based direction intent
4.3.1 Vertical symmetry-based direction intent recognition
4.3.2 Artificial Neural Networks (Multilayer Perceptron)
4.3.3 Support Vector Machines
4.3.4 K-means clustering
4.3.5 Hand rotation detection: Direction intent recognition
4.3.6 Template-matching-based direction intent recognition
4.4 Recognition of hand-based speed variation intent
4.4.1 Template Matching-based speed variation recognition
4.4.2 Speed variation recognition based on ellipse shaped mask
4.5 Histogram of oriented gradient (HOG) for hand-based speed variation recognition
4.6 Conclusion
Chapter 5
Results and Discussion
5.1 Introduction
5.2 Head-based intent recognition
5.2.1 Performance for the recognition of the head in rotation: direction recognition
5.2.2 Performance for the recognition of the head in vertical motion: speed variation recognition
5.3 Hand-based intent recognition
5.3.1 Performance for the recognition of the hand in rotation: direction recognition
5.3.2 Performance for the recognition of the hand in vertical motion: speed variation recognition
5.4 Extrapolation for data efficiency
5.5 Concluding remarks
Chapter 6
Conclusion
6.1 Summary of contributions
6.2 Concluding remarks
6.3 Future work