Human behavior understanding problem



The problem of human behavior understanding can be initially defined as follows: given a set of known sequences of different human behaviors, which of them is performed during an observed test sequence? The problem can then be extended to the analysis of a long unknown motion sequence, in which different behaviors are performed successively and should be recognized and localized in time by the system. Figure 2.1 illustrates this problem. Human behavior understanding has been widely investigated in computer vision from 2D images or 2D videos taken from standard RGB cameras ([75, 49, 84, 77]). However, most of these methods suffer from limitations inherent to 2D videos, such as sensitivity to color and illumination changes, background clutter and occlusions. Since the recent release of RGB-D sensors, like Microsoft Kinect [56] or Asus Xtion PRO LIVE [7], new opportunities have emerged in the field of human motion analysis and understanding. Hence, many research groups have investigated data provided by such cameras in order to benefit from their advantages over RGB cameras. Indeed, depth data better capture the 3D structure of the scene, making it easy to perform background subtraction and to detect people in the scene. In addition, the technology behind such depth sensors is more robust to light variations and even allows working in complete darkness.
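The depth-based background subtraction mentioned above can be sketched in a few lines of pure Python. This is a minimal illustration, not the thesis's method: the `segment_foreground` helper, the depth values and the 500 mm threshold are all assumptions chosen for the example.

```python
def segment_foreground(depth, background, thresh=500):
    """Label a pixel as foreground when its depth (in mm) deviates from a
    pre-captured empty-scene depth map by more than `thresh`.
    A reading of 0 marks an invalid sensor measurement and is kept as
    background."""
    mask = []
    for row_d, row_b in zip(depth, background):
        mask.append([d != 0 and abs(d - b) > thresh
                     for d, b in zip(row_d, row_b)])
    return mask

# Empty scene: a wall roughly 3 m from the sensor.
background = [[3000, 3000, 3000],
              [3000, 3000, 3000]]
# Same scene with a person standing about 1.5 m away.
depth = [[3000, 1500, 1480],
         [3000, 1520, 0]]       # 0 = no reading (e.g. specular surface)

print(segment_foreground(depth, background))
# → [[False, True, True], [False, True, False]]
```

Because the comparison is made in metric depth rather than color, the mask is unaffected by illumination changes, which is precisely the advantage over RGB-based subtraction noted above.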
However, the task of understanding human behaviors remains difficult due to the complex nature of human motion. What further complicates the task is the necessity of being robust to execution speed and to geometric transformations, such as the size of the subject, their position in the scene and their orientation with respect to the sensor. Additionally, in some contexts, human behaviors involve interactions with objects. While such interactions can help to differentiate similar human motions, they also add challenges, like occlusions of body parts.
Moreover, the main challenge for human behavior understanding systems is online capability. Indeed, a system able to analyze human motion in an online way is very important for two main reasons. First, it enables low latency, making the interaction with the system more natural. Second, it allows the processing of very long motion sequences in which different behaviors are performed successively.

Behavior terminology

Before going into more detail, an initial definition of behavior terminology is necessary. In this thesis, we identify three main types of human behaviors: human gestures, human actions and human activities. Each type of behavior is characterized by a different degree of motion complexity, degree of human-object interaction and duration of the behavior. This is summarized in Table 2.1. However, we note that in the state-of-the-art, the boundaries between these three terminologies are often blurred, as one behavior can lie between two behavior types. For instance, a simple action performed with one arm can be regarded as a gesture. Conversely, an action performed with an object can be viewed as an activity.

RGB-D and 3D data

Analyzing and understanding a real-world scene observed by an acquisition system is the main goal of many computer vision systems. However, standard cameras only provide 2D information about the scene. The lack of the third dimension results in some limitations when analyzing and understanding the 3D scene. Hence, obtaining full 3D information about the observed scene became an important challenge in computer vision.
In order to face this challenge, research groups have tried to imitate human vision. The human perception of scene relief is formed in the brain, which interprets the two planar images from the two eyes so as to build one single image in three dimensions. This process of reproducing relief perception from two planar images is called stereoscopic vision. It has been widely investigated in computer vision as a way to obtain the 3D information of a scene. By using two cameras observing the same scene from slightly different points of view, a computer can compare the two images to compute a disparity image and estimate the relief of the scene. For instance, this technique is currently used to create 3D movies.
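The disparity computation described above can be illustrated with a minimal block-matching sketch over a single scanline. This is a toy example under stated assumptions: the `disparity_1d` helper, the synthetic intensity rows, the window size and the disparity search range are all illustrative, not taken from the thesis.

```python
def disparity_1d(left, right, window=1, max_disp=4):
    """Block-matching stereo on one scanline: for each pixel of the left
    row, find the horizontal shift d that minimizes the sum of squared
    differences (SSD) between small windows of the two rows."""
    n = len(left)
    clamp = lambda i: min(max(i, 0), n - 1)  # replicate border pixels
    disparities = []
    for x in range(n):
        best_ssd, best_d = float("inf"), 0
        for d in range(min(max_disp, x) + 1):
            ssd = sum((left[clamp(x + k)] - right[clamp(x + k - d)]) ** 2
                      for k in range(-window, window + 1))
            if ssd < best_ssd:
                best_ssd, best_d = ssd, d
        disparities.append(best_d)
    return disparities

# Synthetic scanlines: the bright peak appears 2 pixels further left in
# the right image, i.e. a true disparity of 2 over the peak.
left  = [10, 10, 50, 90, 50, 10, 10, 10]
right = [50, 90, 50, 10, 10, 10, 10, 10]
print(disparity_1d(left, right))
# → [0, 1, 2, 2, 2, 2, 0, 0]
```

The disparity of 2 is recovered over the textured peak, while the flat, textureless regions are ambiguous and default to 0. Depth then follows from the standard stereo relation Z = f·B / d, where f is the focal length and B the baseline between the two cameras.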
For the task of human motion analysis, obtaining 3D information about the human pose is also a challenge that has attracted many researchers. Motion capture systems, like those from Vicon [90], are able to accurately capture the human pose and track it over time, resulting in high-resolution data that include markers representing the human pose. Motion capture data have been widely used in industry, for instance in animation and video games. In addition, many datasets have been released providing such data for different human actions in different contexts, like the Carnegie Mellon University Motion Capture database [18]. However, these systems present some disadvantages. First, the cost of such technology may limit its use. Second, it requires the subject to wear physical markers so that the 3D pose can be estimated. As a result, this technology is not convenient for the general public. More recently, new depth sensors have been released, like Microsoft Kinect [56] or Asus Xtion PRO LIVE [7]. Figure 2.2 shows pictures of these devices.


Table of contents:

List of Figures
1 Introduction 
1.1 Thesis contributions
1.2 Thesis organization
2 State-of-the-art 
2.1 Introduction
2.1.1 Human behavior understanding problem
2.1.2 Behavior terminology
2.1.3 Applications
2.2 RGB-D and 3D data
2.3 Datasets
2.4 Related work
2.4.1 Action recognition
2.4.2 Activity recognition
2.4.3 Online detection
2.5 Conclusion
3 Shape Analysis 
3.1 Introduction
3.1.1 Motivation of shape analysis
3.1.2 Riemannian shape analysis framework
3.2 Mathematical framework
3.2.1 Representation of shapes
3.2.2 Elastic distance
3.2.3 Tangent space
3.3 Statistics on the shape space
3.3.1 Mean shape
3.3.2 Standard deviation on the shape space
3.3.3 K-means on the shape space
3.3.4 Learning distribution on the shape space
3.4 Conclusion
4 Action Recognition by Shape Analysis of Motion Trajectories
4.1 Introduction
4.1.1 Constraints
4.1.2 Overview of our approach
4.1.3 Motivation
4.2 Shape analysis of motion trajectories
4.2.1 Spatio-temporal representation of human motion
4.2.2 Invariance to geometric transformation
4.2.3 Body part representation
4.2.4 Trajectory shape representation
4.2.5 Trajectory shape analysis
4.3 Action recognition
4.3.1 KNN classification
4.3.2 Average trajectories
4.3.3 Body part-based classification
4.4 Experimental evaluation
4.4.1 Action recognition analysis
4.4.2 Representation and invariance analysis
4.4.3 Latency analysis
4.5 Conclusion
5 Behavior Understanding by Motion Unit Decomposition
5.1 Introduction
5.1.1 Challenges
5.1.2 Overview of the approach
5.1.3 Motivations
5.2 Segmentation of motion sequences
5.2.1 Pose representation
5.2.2 Motion segmentation
5.3 Segment description
5.3.1 Human motion description
5.3.2 Depth appearance
5.3.3 Vocabulary of exemplar MUs
5.4 Detection of repetitions and cycles
5.4.1 Detection of periodic movement
5.4.2 Action segmentation
5.4.3 Repetitions removal
5.5 Experimental evaluation of periodic movement detection
5.5.1 Action segmentation
5.5.2 Action recognition
5.6 Modeling complex motion sequences
5.6.1 Dynamic naive Bayesian classifier
5.6.2 Learning
5.6.3 Classification
5.7 Experimental evaluation of modeling for recognition
5.7.1 Gesture recognition
5.7.2 Activity recognition
5.7.3 Online detection of actions/activities
5.8 Conclusion
6 Conclusion 
6.1 Summary
6.2 Limitations and future work
Bibliography

