Mouth structure segmentation using K-Means and Gaussian mixtures


Lip segmentation based on pixel color classification

Pixel-based segmentation encompasses pixel classification using region color distribution models. Commonly, those models are linear and non-linear approximations of the region separation boundaries, reducing the classification problem to comparing the input feature vectors against a set of thresholds. The thresholds define region boundaries in the feature space, constituting by themselves models of the regions' color distribution.
Thresholds can be set statically, taking into account prior knowledge about the contents of the image, or they can be adjusted dynamically to achieve a proper segmentation. Dynamic threshold selection relies on either local statistics, as in adaptive thresholding [5], or global statistics usually derived from the image histogram, as in Otsu's technique [12]. In most cases, these techniques maximize the interclass variance while minimizing the intraclass variance.
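Otsu's criterion can be stated in a few lines. The sketch below is a minimal NumPy implementation, assuming an 8-bit gray-scale input; the two-level synthetic image is only a stand-in for a real picture.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level t maximizing the interclass variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()                  # gray-level probabilities
    omega = np.cumsum(p)                   # class-0 mass up to level t
    mu = np.cumsum(p * np.arange(256))     # cumulative first moment
    mu_T = mu[-1]                          # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_T * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b2)))

# Synthetic two-class image: dark "background" and bright "foreground"
img = np.concatenate([np.full(500, 40), np.full(500, 200)]).astype(np.uint8)
t = otsu_threshold(img)                    # pixels with value > t are foreground
```

Maximizing the interclass variance is equivalent to minimizing the intraclass variance, which is why a single one-dimensional search over t suffices.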
Successful pixel-color-based lip segmentation requires a proper input representation of the data. Thereupon, several approaches for finding appropriate color representations are treated.

Color representations used in mouth structure segmentation

The first step in solving a segmentation task is to find an adequate input representation that helps highlight the existing differences among regions. Those representations can be classified in two categories. The first is composed of general-purpose color transformations that prove helpful in the specific application; the second comprises color transformations specifically designed for the application through linear and non-linear transformations. In this Section, some approaches explored in both categories are treated briefly. The lip region and skin are very similar in terms of color [13]; for this reason, many different color transformations have been developed.
Some linear transformations of the RGB color space have led to fair results in terms of color separability between lips and skin. For example, Guan [14, 15] and Moran & Pinto [16] made use of the Discrete Hartley Transform (DHT) in order to improve the color representation of the lip region. The C3 component of the DHT properly highlights the lip area for subjects with pale skin and no beard, as shown in Figure 2.3d. Chiou & Hwang [17] used the Karhunen-Loève Transform in order to find the best color projection for linear dimension reduction.
Hue-based transformations are also used in lip segmentation. The pseudo-hue transformation, proposed in the work of Hurlbert & Poggio [18], exhibits the difference between lips and skin under controlled conditions, as seen in Figure 2.3b. Pseudo hue focuses on the relation between the red and green information of each pixel, and is defined as pH = R/(R + G). It leads to a result very similar to the one achieved using the hue transformation, but its calculation requires less computation, and it was therefore the preferred alternative in some work [19, 1, 20]. In particular, a normalized version of this representation was used in the development of a new color transformation, called the Chromatic Curve Map [1]. As with hue, pseudo hue cannot properly separate the lips' color from beards and shadows. Pseudo hue may also generate unstable results when the image has been acquired under low illumination, or in dark areas; this effect is mainly due to a lower signal-to-noise ratio (SNR).
Chromaticity representations, such as those standardized by the CIE (Commission Internationale de l'Eclairage), have been used in order to remove the influence of varying lighting conditions, so that the lip region can be described in a uniform color space. Using a chromatic representation of color, the green values of lip pixels are higher than those of skin pixels. Ma et al. [21] reported that in situations with strong or dim lighting, skin and lip pixels are better separated in chromatic space than in the pseudo-hue plane. In order to address changes in lighting, in [22] the color distribution of each facial organ was modeled in the chromaticity color space. Chrominance transformations (YCbCr, YCbCg) have also been used in lip segmentation [23].
The use of pixel-based, perceptual non-linear transformations, which present better color constancy over small intensity variations, was a major trend in the late 90s. Two well-known perceptual color transformations, presented by the CIE, are CIEL*a*b* and CIELu'v'.
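Since pseudo hue is defined purely per pixel, it is cheap to compute. The sketch below assumes an 8-bit RGB image and adds a small epsilon to guard the division on black pixels, a detail the definition pH = R/(R + G) leaves open.

```python
import numpy as np

def pseudo_hue(rgb):
    """Pseudo hue pH = R / (R + G), computed per pixel.

    `rgb` is an H x W x 3 array; the epsilon guards against
    division by zero on black pixels.
    """
    rgb = rgb.astype(np.float64)
    r, g = rgb[..., 0], rgb[..., 1]
    return r / (r + g + 1e-12)

# Toy 1x2 image: a reddish "lip" pixel and a more balanced "skin" pixel
img = np.array([[[180, 80, 90], [170, 140, 120]]], dtype=np.uint8)
ph = pseudo_hue(img)
# The reddish pixel yields a higher pseudo-hue value than the skin pixel
```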
The principle behind these transformations is the compensation of the natural logarithmic behavior of the sensor. Work like [24, 25, 26] made use of these color representations in order to facilitate the lip segmentation process in images. As in Y'CbCr, the L*a*b* and Lu'v' representations theoretically isolate the effects of lighting and color in separate components.
In Gomez et al. [27], a combination of three different color representations is used. The resulting space enables the algorithm to be more selective, leading to a decrease in the segmentation of spurious regions. The authors clip the region of interest (RoI) in order to discard the nostril region, as shown in Figure 2.2.
One major disadvantage of pixel-based techniques is the lack of connectivity or shape constraints in their methods. In order to deal with that problem, Lucey et al. [28] proposed a segmentation algorithm based on a dynamic binarization technique that takes local image information into account. The first step in Lucey's method is to represent the image in a constrained version of the R/G ratio. Then, an entropy function that measures the uncertainty between classes (background and lips) is minimized in terms of membership function parameters. After that, a second calculation based on neighboring region information is used to relax the threshold selection. Despite the threshold adaptation, later post-processing is needed in order to eliminate spurious regions.
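Lucey's exact entropy function depends on his membership-function parameterization, which is not reproduced here. As a stand-in, the sketch below applies the standard Kapur maximum-entropy criterion to a synthetic, bimodal "R/G ratio" histogram, showing how an entropy measure over the two induced classes can drive threshold selection; both the data and the criterion are illustrative, not the cited paper's method.

```python
import numpy as np

def max_entropy_threshold(values, bins=128):
    """Kapur-style threshold: maximize the sum of the entropies
    of the two classes induced by the cut point t."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    P = np.cumsum(p)                          # class-0 mass up to bin t
    h = np.where(p > 0, -p * np.log(p), 0.0)  # per-bin entropy terms
    H = np.cumsum(h)
    HT = H[-1]
    valid = (P > 0) & (P < 1)
    psi = np.full(bins, -np.inf)
    psi[valid] = (np.log(P[valid] * (1 - P[valid]))
                  + H[valid] / P[valid]
                  + (HT - H[valid]) / (1 - P[valid]))
    return edges[int(np.argmax(psi)) + 1]

# Synthetic bimodal ratio data: background around 0.8, lips around 1.4
rng = np.random.default_rng(0)
ratios = np.concatenate([rng.normal(0.8, 0.05, 2000),
                         rng.normal(1.4, 0.05, 2000)])
thr = max_entropy_threshold(ratios)           # falls between the two modes
```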

Mouth segmentation derived from contour extraction

The lip can be interpreted as a deformable object, whose shape or contours can be approximated by one or many parametric curves. Then, it might seem evident that one must first look for the mouth contour before trying to segment its inner structures. In this Section, some techniques used for outer lip contour approximation in lip segmentation are briefly discussed.
Polynomials, notably between second and fifth degree, can be used to approximate the lip contour. For instance, in Stillittano & Caplier [51], four cubics are used to represent the mouth contour starting from a series of detected contour points: two in the upper lip and two in the lower lip. The key points are extracted using the Jumping Snakes technique presented by Eveno et al. [1, 20, 2]. The authors reported some issues due to the presence of gums or tongue.
Nevertheless, as stated in [52], low-order polynomials (up to fourth degree) are not suitable for anthropometric applications since they lack mapping capabilities for certain features; on the other hand, high-order polynomials may exhibit undesired behavior in ill-conditioned zones. For that reason, most of the work using polynomials to approximate the lip contour is intended for lip-reading applications.
Werda et al. [53] adjust the outer contour of the mouth using three quadratic functions: two for the upper lip and one for the lower lip. The adjustment is performed after a binarization process in the RnGnBn color space, where the lighting effect is reduced. Each component of the RGB space is transformed to the new RnGnBn space by An = 255*A/Y, with Y the intensity value. The final representation contains a strong geometric parametric model, whose parameters enable the contour to be deformed into a constrained set of possible shapes. In [54], three or four parabolas are used to extract mouth features for the closed and the open mouth, respectively. Rao & Mersereau [55] used linear operators to find the horizontal contour of the lips, and then approximated those contours with two parabolas. Delmas et al. [56] extract the inner and outer lip contours from the first frame of a video sequence using two quartics and three parabolas. The polynomials are defined by the corners and vertical extrema of the mouth, which are found using [57].
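The RnGnBn mapping can be sketched directly from the formula An = 255*A/Y. Here Y is taken as the mean of the RGB channels, which is one plausible reading of "the intensity value" (a weighted luma would be another); that assumption is the only thing added beyond the formula.

```python
import numpy as np

def rgb_to_rngnbn(rgb):
    """Normalized RnGnBn: An = 255 * A / Y for each channel A,
    with Y the pixel intensity (here the RGB mean)."""
    rgb = rgb.astype(np.float64)
    Y = rgb.mean(axis=-1, keepdims=True)
    return 255.0 * rgb / np.maximum(Y, 1e-12)   # epsilon guards black pixels

# A pixel and the same pixel at half brightness
a = np.array([[[120, 60, 30]]], dtype=np.uint8)
b = np.array([[[60, 30, 15]]], dtype=np.uint8)
```

The two test pixels differ only by a brightness factor and map to the same RnGnBn triplet, which is precisely the lighting invariance the binarization step relies on.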
Active contours, or snakes, are computer-generated curves that move within an image to find object boundaries: in this case, the inner and outer contours of the mouth. They are often used in computer vision and image analysis to detect and locate objects, and to describe their shape. An active contour can be defined as a curve v(u, t) = (x(u, t), y(u, t)), u ∈ [0, 1], with t being the temporal position of the point in the sequence, that moves in the space of the image [58]. The evolution of the curve is controlled by the energy function in (2.11).
Eac = ∫₀¹ [Eint(v(u)) + Eim(v(u)) + Eext(v(u))] du (2.11)
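A minimal sketch of the curve evolution, keeping only the internal (elasticity) part of (2.11): each semi-implicit step smooths and contracts a closed polygonal snake. The image and external terms, which in a full snake would halt the contraction on the lip boundary, are deliberately omitted here.

```python
import numpy as np

n = 40                                     # snake control points
u = np.linspace(0, 2 * np.pi, n, endpoint=False)
pts = np.stack([32 + 10 * np.cos(u), 32 + 10 * np.sin(u)], axis=1)

# Circulant second-difference matrix D for a closed curve; v'Dv is the
# discrete elasticity energy, so each implicit step below applies the
# internal force of the snake.
I = np.eye(n)
D = 2 * I - np.roll(I, 1, axis=0) - np.roll(I, -1, axis=0)
step = np.linalg.inv(I + 0.5 * D)          # time step x elasticity weight = 0.5

for _ in range(25):
    pts = step @ pts                       # Eim and Eext omitted in this sketch

radius = np.hypot(pts[:, 0] - 32, pts[:, 1] - 32)
# The curve stays a circle about its centroid but contracts smoothly
```

The implicit update is unconditionally stable, which is why snake implementations favor it over explicit gradient descent on the internal term.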


Other techniques used in shape or region lip segmentation

In the work by Goswami et al. [81], an automatic lip segmentation method based on two different statistical estimators is presented: a Minimum Covariance Determinant estimator and a non-robust estimator. Both estimators are used to model skin color in images. The lip region is found as the largest non-skin connected region. The authors report a significant improvement over the results in [20]. This method relies on the assumption that the skin region can be detected more easily than the lips. Mpiperis et al. [82] introduced an algorithm which classifies lip color features using the Maximum Likelihood criterion, assuming Gaussian probability distributions for the colors of the skin and the lips. They also compensate for gestures by using a geodesic face representation. Lie et al. [83] use a set of morphological image operations on temporal difference images in order to highlight the lip area.
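The Maximum Likelihood rule with one Gaussian per class reduces to comparing class log-likelihoods. The sketch below uses made-up two-dimensional color features for "skin" and "lips" (not data from the cited work) to show the mechanics of the decision rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training colors (e.g. chromaticity-style features)
skin = rng.normal([0.55, 0.35], 0.02, size=(500, 2))
lips = rng.normal([0.65, 0.30], 0.02, size=(500, 2))

def fit_gaussian(X):
    """ML estimates of mean and covariance for one class."""
    return X.mean(axis=0), np.cov(X.T)

def log_likelihood(X, mu, cov):
    """Per-sample Gaussian log-density."""
    d = X - mu
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", d, inv, d)
    return -0.5 * (maha + np.log(np.linalg.det(cov))
                   + X.shape[1] * np.log(2 * np.pi))

params = [fit_gaussian(skin), fit_gaussian(lips)]    # class 0, class 1
test_px = np.array([[0.64, 0.30], [0.55, 0.36]])
labels = np.argmax([log_likelihood(test_px, m, c) for m, c in params], axis=0)
# labels: 1 = lips, 0 = skin
```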
Artificial intelligence has also been used in lip segmentation. Mitsukura et al. [84, 85] use two previously trained feed-forward neural networks to model skin and lip color. Shape constraints are included in the weights of the lip detection network. Once a mouth candidate is detected, a skin test is performed in its neighborhood using the skin detector network. After that, the lip detection network is used to select the mouth region. In later work, the same authors presented a second scheme [86] based on evolutionary computation for lip modeling.

Active Shape Models (ASMs) and Active Appearance Models (AAMs)

ASMs are statistical shape models of objects, which iteratively deform to fit an example of the object in a new image. They do not conform to what one may interpret as a segmentation technique, but they are nevertheless widely used in object detection and classification in images. The goal when using ASMs is to approximate a set of points in the image (usually provided by the acquired object's contour information) by a point distribution model, composed of the mean shape of the object model x̄ plus a linear combination of the main modes of variation of the shape P, as shown in (2.19).
x = x̄ + P b (2.19)
b is a vector of weights related to each of the main modes. The matrix P is obtained from a set of training shapes of the object, as the t main eigenvectors of the covariance of the shapes' point positions. x̄ is represented in an object frame (normalized in scale and rotation), and thus the measured data should be adjusted in order to be approximated by such a model. Further information on how ASMs are trained can be found in [87, 88].
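Equation (2.19) can be exercised on synthetic data: training shapes are generated from a base contour plus one hypothetical "mouth opening" mode, P is taken as the leading eigenvectors of the shapes' covariance, and a new shape is projected and reconstructed as x̄ + P b. Everything here is illustrative; no real landmark data is used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set: 30 aligned shapes of 8 (x, y) landmarks,
# stored flattened as (x_0..x_7, y_0..y_7).
u = np.linspace(0, 2 * np.pi, 8, endpoint=False)
base = np.concatenate([np.cos(u), 0.6 * np.sin(u)])   # base contour
mode = np.concatenate([np.zeros(8), np.sin(u)])       # "mouth opening" mode
shapes = np.array([base + rng.normal(0, 0.5) * mode for _ in range(30)])

x_bar = shapes.mean(axis=0)                           # mean shape
cov = np.cov(shapes.T)
evals, evecs = np.linalg.eigh(cov)
P = evecs[:, ::-1][:, :2]                             # t = 2 main modes

# Approximate a new shape: b = P' (x - x_bar), x_hat = x_bar + P b
x_new = base + 0.4 * mode
b = P.T @ (x_new - x_bar)
x_hat = x_bar + P @ b
```

Because the synthetic shapes vary along a single mode, the two-mode model reconstructs the new shape essentially exactly; with real landmarks, the residual measures how much variation the retained modes miss.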
In the work of Caplier [89], a method for automatic lip detection and tracking is presented. The method makes use of an automatic landmark initialization, previously presented in [90]. Those landmarks serve to select and adapt an Active Shape Model (ASM) which describes the mouth gesture. Kalman filtering is used to speed up the algorithm's convergence through time. In Shdaifat et al. [91], an ASM is used for lip detection and tracking in video sequences; the lip boundaries are modeled by five Bézier curves. In [92], a modified ASM algorithm is employed to search for the mouth contour. The modification consists in using both local gray intensity and texture information at each landmark point. In cases where landmark points may be incorrect (for example, when the lip boundary is not clear), it is better to characterize the distribution of the shape parameter b by a Gaussian mixture rather than by a single Gaussian. Jang et al. [93] developed a method for locating lips based on an ASM, using a Gaussian mixture to represent the distribution of the shape parameter. In Jiang et al. [94], a mixture of a deterministic particle filter model and a stochastic ASM model is used in order to improve convergence and accuracy in lip tracking. Jian et al. [95] used an approach called the radial vector, which is similar to ASMs but with an implicit labeling of model data. The authors trained their model using particle filtering.
AAMs are a generalization of the ASM approach, which includes not only shape information in the statistical model, but also texture information [96]. The basic formulation of the AAM can be seen in (2.20).
x = x̄ + Qs c,  g = ḡ + Qg c (2.20)

Table of contents:

1 Introduction
2 Mouth structure segmentation in images
2.1 Previous work
2.1.1 Lip segmentation based on pixel color classification
2.1.2 Mouth segmentation derived from contour extraction
2.1.3 Shape or region constrained methods for lip segmentation
2.1.4 Performance measurement in mouth segmentation tasks
2.2 Experimental workflow
2.2.1 Database description
2.2.2 Error, quality and performance measurement
2.2.3 Ground truth establishment
2.3 Summary
3 Pixel color classification for mouth segmentation
3.1 Color representations for mouth segmentation
3.1.1 Discriminant analysis of commonly-used color representations
3.1.2 Effect of image pre-processing in FLDA
3.1.3 Case study: the normalized a component
3.2 Gaussian mixtures in color distribution modeling
3.2.1 The K-Means algorithm
3.2.2 Gaussian mixture model estimation using Expectation-Maximization
3.2.3 Case study: color distribution modeling of natural images
3.2.4 Mouth structure segmentation using K-Means and Gaussian mixtures
3.3 Summary
4 A perceptual approach to segmentation refinement
4.1 Segmentation refinement using perceptual arrays
4.2 Special cases and infinite behavior of the refiner
4.3 Unsupervised natural image segmentation refinement
4.3.1 Refiner parameter set-up
4.3.2 Pixel color classification tuning
4.4 Mouth structures segmentation refinement
4.5 Summary
5 Texture in mouth structure segmentation
5.1 Low-level texture description
5.2 High-level texture description
5.2.1 Integration scale
5.2.2 Scale-based features for image segmentation
5.3 Scale-based image filtering for mouth structure classification
5.4 Texture features in mouth structure classification
5.5 Automatic scale-based refiner parameter estimation
5.6 Summary
6 An active contour based alternative for RoI clipping
6.1 Upper lip contour approximation
6.2 Lower lip contour approximation
6.3 Automatic parameter selection
6.4 Tests and results
6.5 Summary
7 Automatic mouth gesture detection
7.1 Problem statement
7.1.1 Motivation
7.1.2 Previous work
7.1.3 Limitations and constraints
7.2 Acquisition system set-up
7.3 Mouth structure segmentation
7.3.1 Pre-processing and initial RoI clipping
7.3.2 Mouth segmentation through pixel classification
7.3.3 Label refinement
7.3.4 Texture-based mouth/background segmentation
7.3.5 Region trimming using convex hulls
7.4 Mouth gesture classification
7.4.1 Region feature selection
7.4.2 Gesture classification
7.4.3 Gesture detection stabilization
7.5 Summary
8 Conclusion
9 Open issues and future work
