Current Recognition Models
Regarding object recognition, most state-of-the-art works rely on recognition models that have previously been trained on a dataset of large-scale examples before performing the actual recognition task. According to differences in their learning methods, most recognition models can be broadly divided into static models and dynamic models. As the terms suggest, the main difference between the two is whether the model can be updated after the training process: a static model is learned from labeled training data and is not updated afterwards, whereas a dynamic model can update itself through certain learning approaches using new, or even unlabeled, data.
Although many recognition models have been presented in the literature and proven effective, each kind of model has its own advantages and limitations. Among static models, the bag-of-words (BoW) model (also known as bag-of-visual-words, bag-of-features or bag-of-visual-features in other literature) is the most thoroughly researched. The BoW model was originally motivated by the texture recognition problem, in which the representation of a particular texture is characterized by a histogram of image textons serving as the feature vector of the texture class (Cula and Dana, 2001). The BoW model was subsequently applied to object recognition by (Sivic and Zisserman, 2003) and (Sivic and Zisserman, 2006) under the term 'bag-of-visual-words' (BoVW) and has exhibited good performance.
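The BoW pipeline described above can be sketched in a few lines: local descriptors from training images are clustered into a codebook of k "visual words", and each image is then represented by the normalized histogram of word occurrences. The following minimal numpy sketch uses random toy descriptors in place of real local features; all names and parameter values are illustrative, not those of the cited works.

```python
import numpy as np

def build_codebook(descriptors, k, iters=10, seed=0):
    """Cluster local descriptors into k visual words with plain k-means."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Represent one image as a normalized histogram of visual-word counts."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# toy 'image': 200 random 8-D local descriptors
rng = np.random.default_rng(1)
desc = rng.normal(size=(200, 8))
centers = build_codebook(desc, k=16)
h = bow_histogram(desc, centers)
```

The resulting fixed-length vector `h` is what a static classifier would be trained on, regardless of how many local features each image produced.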
Saliency Detection Models
From the auditory saliency detection point of view, inspired by research on visual saliency maps, (Kayser et al., 2005) first proposed an auditory saliency map for salient sound detection. The model operates on the spectrogram of the input sound signal: spectrum-based auditory saliency maps are obtained by applying the center-surround difference operator (see (Itti and Koch, 2001)), thereby transforming auditory saliency into image saliency for further analysis. Although experimental results have shown that this model can find a salient natural sound among background noise, only visual saliency features from the image of the sound are considered, while the information of the auditory signal itself is not taken into account. A second model was proposed by (Kalinli and Narayanan, 2007) as an improvement on Kayser's work: two additional features, orientation and pitch, are included, and a biologically inspired nonlinear local normalization algorithm is used for multi-scale feature integration. Its performance was tested on read speech signals and proven more accurate, with 75.9% and 78.1% accuracy in detecting prominent syllables and words, respectively. However, in Kalinli's work the sound sources are selected from a broadcast read-speech database, and no environmental sound tracks from the real world are used. A third model was proposed by (Duangudom and Anderson, 2007), in which spectro-temporal receptive field models and adaptive inhibition are applied to form the saliency map. Experimental validations consisting of simple examples show good prediction performance, but the model has still not been verified on real environmental sound data, which limits its application in industrial manufacturing.
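The center-surround idea underlying these models can be illustrated compactly: the spectrogram is smoothed at a fine (center) scale and a coarse (surround) scale, and the rectified difference highlights time-frequency regions that stand out from their neighborhood. The sketch below uses simple box smoothing instead of the Gaussian pyramids of (Itti and Koch, 2001); the filter sizes and the toy spectrogram are illustrative assumptions.

```python
import numpy as np

def smooth(x, size):
    """Separable box smoothing of a 2-D map (rows, then columns)."""
    k = np.ones(size) / size
    x = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, x)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, x)

def center_surround(spec, center=3, surround=15):
    """Center-surround difference: fine-scale smoothing minus coarse-scale
    smoothing, half-wave rectified, as in image-style saliency maps."""
    return np.maximum(smooth(spec, center) - smooth(spec, surround), 0.0)

# toy spectrogram: background noise plus one salient time-frequency blob
rng = np.random.default_rng(0)
spec = rng.normal(0.0, 0.1, size=(64, 128))
spec[30:34, 60:64] += 2.0          # the 'salient' event
sal = center_surround(spec)
```

In the resulting map, the injected blob receives a much higher saliency value than the noise background, which is exactly the contrast such models exploit to pick out a salient sound.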
Recently, (Kim et al., 2014) considered Bark-frequency loudness based optimal filtering for auditory salience detection and investigated the collection of salience annotations for auditory data, using linear discrimination. Although the experiments showed 68.0% accuracy, the sound signals used for validation were collected from meeting-room recordings, meaning that only indoor environments were considered. (Kaya and Elhilali, 2012) proposed a temporal saliency map approach for salient sound detection based on five simple features, in which only saliency features from the time domain are considered. Although test results show that this method is superior to Kayser's work, the lack of frequency contrast and other auditory information limits its application. (Dennis et al., 2013b) proposed a salient sound detection approach based on keypoints of local spectrogram features for further sound event recognition, in which overlapping effects and noise conditions are considered. Although the experimental results show a clear detection output on multiple sound sources, only simple sound examples are used for the experimental tests, and real environmental sound signals are not included.
Environmental Audio Information Perception
So far, much research has been devoted to developing approaches that can automatically analyze and recognize sound signals. Currently, these works are mostly based on the characteristics, or rather the features, of sound. The sound features used in early works often refer to descriptors that can be calculated from sound data in the time or frequency domain, such as root mean square, zero-crossing rate, band energy ratio and spectral centroid (Widmer et al., 2005). Although such features can be easily derived from sound data, they remain low-level features that require sophisticated approaches to process and cannot represent human acoustic characteristics for higher-level fusion work. According to (Cowling and Sitte, 2003), sound signals from the natural world can be either stationary or non-stationary; a more general way of categorizing sound features, proposed by (Chachada and Kuo, 2013), is to divide them into two groups according to differences in their extraction methods, as follows: 1) Stationary features:
Stationary features include both temporal and spectral features, such as the Zero-Crossing Rate (ZCR), Short-Time Energy (STE), Sub-band Energy Ratio and Spectral Flux (Mitrovic et al., 2010), which are frequently used because they are easy to compute and can be concatenated with other features. Meanwhile, as a computational analogue of the human auditory perception system, the Mel Frequency Cepstrum Coefficient (MFCC) is used in most human-voice related audio signal processing scenarios, such as speech and music recognition. Other widely used cepstral features include Linear Predictive Cepstral Coding (LPCC), Homomorphic Cepstral Coefficients (HCC), Bark-Frequency Cepstral Coefficients (BFCC), and the first and second derivatives of MFCC (∆MFCC and ∆∆MFCC).
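To make three of the stationary features above concrete, a minimal numpy sketch of their per-frame computation is given below; the sampling rate and tone frequencies are illustrative choices, not values from the cited works.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent-sample pairs whose sign differs (ZCR)."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def short_time_energy(frame):
    """Mean squared amplitude of the frame (STE)."""
    return np.mean(frame ** 2)

def spectral_centroid(frame, fs):
    """Magnitude-weighted mean frequency of the frame's spectrum."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * mag) / np.sum(mag)

fs = 8000
t = np.arange(0, 0.032, 1.0 / fs)     # one 32 ms frame (256 samples)
low = np.sin(2 * np.pi * 250 * t)     # 250 Hz tone
high = np.sin(2 * np.pi * 2000 * t)   # 2 kHz tone
```

As expected, the higher-frequency tone yields a higher ZCR and a spectral centroid near 2 kHz, while both pure tones have the same short-time energy; it is exactly this cheapness and interpretability that makes such descriptors easy to concatenate with other features.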
2) Non-stationary features:
In most research works, non-stationary features refer to: a) time-frequency features derived from the time-frequency domain of the audio signal, namely spectrograms generated by the Short-Time Fourier Transform (STFT) and scalograms generated by the Discrete Wavelet Transform (DWT) or Continuous Wavelet Transform (CWT); b) sparse-representation based features, extracted by the Matching Pursuit (MP) or Orthogonal Matching Pursuit (OMP) approach using the atoms of an over-complete dictionary built from a wide variety of signal bases ((Chu et al., 2008), (Chu et al., 2009) and (Zubair et al., 2013)); c) other time-frequency features, such as pitch range (Uzkent et al., 2012).
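The STFT spectrogram of category a) can be sketched simply: the signal is cut into overlapping windowed frames, and the log magnitude of each frame's Fourier transform forms one column of the time-frequency image. Frame length and hop size below are illustrative choices.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude STFT spectrogram with a Hann window:
    rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-10).T

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
# toy non-stationary signal: 500 Hz in the first half, 1.5 kHz in the second
sig = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t),
               np.sin(2 * np.pi * 1500 * t))
spec = spectrogram(sig)
```

The frequency change over time, which no single stationary descriptor captures, appears directly as the dominant bin jumping from the 500 Hz row to the 1.5 kHz row halfway through the image; this visual structure is what the spectrogram-as-image methods discussed later exploit.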
Recently, many works, such as (Mukherjee et al., 2013), (Chia Ai et al., 2012), (Chauhan et al., 2013) and (Samal et al., 2014), have extracted sound features using computational models for further recognition processing, with MFCC and LPCC frequently used to extract human acoustic features from sound signals. In (Nirjon et al., 2013), a novel approach combining the sparse Fast Fourier Transform proposed by (Hassanieh et al., 2012) with the MFCC feature is presented for extracting a highly accurate and sparse acoustic feature suitable for mobile devices; experimental results prove the approach efficient, and it can be expected to run in real time at a high sampling rate. In the work of (Murty and Yegnanarayana, 2006), a speaker recognition method based on the fusion of residual phase information and MFCC using an auto-associative neural network (AANN) model (described in (Yegnanarayana and Kishore, 2002)) is demonstrated and shown to improve the performance of conventional speaker recognition systems; experimental results show that the equal error rate obtained with MFCC or residual phase individually is improved significantly. Similar research is carried out in (Nakagawa et al., 2012), in which phase information, extracted by normalizing the phase variation according to the frame position of the input speech, is combined with MFCC to accomplish speaker identification and verification. Moreover, (Hossan et al., 2010) propose an acoustic feature based on the fusion of an improved Discrete Cosine Transform (DCT) and MFCC, with a Gaussian Mixture Model (GMM) classifier for speaker verification; experimental results show better identification accuracy even with fewer features, along with reduced computational time.
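Since MFCC recurs throughout these works, its standard single-frame computation (power spectrum, triangular mel filterbank, logarithm, DCT) can be sketched as follows. The filter and coefficient counts are common but illustrative choices, and real implementations add pre-emphasis, framing and windowing around this core.

```python
import numpy as np

def mfcc_frame(frame, fs, n_filters=20, n_coeffs=12):
    """MFCC of one frame: power spectrum -> triangular mel filterbank
    -> log -> DCT-II (all computed explicitly with numpy)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # mel scale and its inverse, used to place filter edges
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    fbank = np.zeros(n_filters)
    for j in range(n_filters):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        up = (freqs - lo) / (mid - lo)       # rising slope of the triangle
        down = (hi - freqs) / (hi - mid)     # falling slope of the triangle
        tri = np.clip(np.minimum(up, down), 0.0, None)
        fbank[j] = np.sum(tri * power)
    logbank = np.log(fbank + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(logbank * np.cos(np.pi * k * (n + 0.5) / n_filters))
                     for k in range(n_coeffs)])

fs = 8000
t = np.arange(0, 0.032, 1.0 / fs)
frame = np.sin(2 * np.pi * 440 * t)   # toy frame: a 440 Hz tone
c = mfcc_frame(frame, fs)
```

The mel spacing compresses high frequencies the way human hearing does, and the DCT decorrelates the log energies, which is why a short vector of coefficients suffices as input to the GMM, HMM or neural-network classifiers used in the works above.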
Furthermore, a saliency-based audio feature fusion approach for audio event detection and summarization is presented in (Zlatintsi et al., 2012), where local audio features are fused by linear, adaptive and nonlinear fusion schemes to create a unified saliency curve for audio event detection. The events combined with audio saliency features for audio summarization, however, are manually segmented.
Nevertheless, although research on human-voice related sound signal processing using MFCC and other acoustic features has proved successful, the work of (Allen, 1994) shows that there is little biological evidence for frame-based features like MFCC, and that the human auditory system may instead rely on the partial recognition of features that are local and uncoupled across frequency, which suggests that local spectrogram features could be used to simulate the human recognition process. In other words, the spectrogram-based spectrum image can be seen as a visual representation of the sound signal, and prominent image regions can thus be processed as image features of the sound for audio classification and recognition.
Environmental Sound Recognition
Similar to musical signals, audio signals collected from the environment can also be seen as a combination of foreground and background sounds (Raj et al., 2007). In most perception scenarios, foreground sounds are composed of signals related to specific acoustic events or objects, while background sounds are more commonly generated by the acoustic scene. Therefore, research on non-speech environmental sound recognition can be divided into two categories: a) event-based sounds and b) scene-based sounds.
Although research on sound scene recognition (see (Su et al., 2011) as an example) is important for understanding the environment, scene-based sounds can only be used to recognize the scene and cannot provide comprehensive information for artificial awareness to achieve the perception task. Therefore, the recognition of event-based sounds is more suitable for perceiving what is happening in the environment and makes artificial awareness possible.
Current research on environmental sound recognition can be found in many application scenarios, such as audio retrieval ((Lallemand et al., 2012) and (Mondal et al., 2012)), robot navigation (Chu et al., 2006), surveillance (Sitte and Willets, 2007), home automation (Wang et al., 2008) and machine awareness (Wang et al., 2013a). Although the above-mentioned works prove that environmental sound recognition is essential to human life as well as artificial awareness, most approaches still focus on theoretical-level recognition based on combinations of the previously mentioned audio features.
Recently, works such as (Souli and Lachiri, 2011), (Souli and Lachiri, 2012a) and (Souli and Lachiri, 2012b) applied non-linear visual features extracted from log-Gabor filtered spectrograms, together with support vector machines, to recognize environmental sounds. Note that, although experimental results show good performance of the proposed methods, the respective advantages and shortcomings of the log-Gabor filtered spectrogram and the raw spectrogram still need to be researched. Another spectrum-image based application is sound-event recognition, as demonstrated in (Kalinli et al., 2009) and (Janvier et al., 2012); both are motivated by the saliency-based human auditory attention mechanism and fuse an auditory image representation in the spectro-temporal domain with other low-level sound features. The experimental test data are collected from actual audio events occurring in the real world, and the results show better performance of the proposed approaches with fewer computational resources. In (Lin et al., 2012), a saliency-maximized spectrogram based acoustic event detection approach is proposed, in which the spectrogram is used as a visual representation so that audio event detection can operate on visually salient patterns; the visualization is implemented by a function that transforms the original mixed spectrogram to maximize the mutual information between the label sequence of target events and the estimated visual saliency of spectrogram features. Although the results of experiments involving human subjects indicate that the proposed method outperforms others, computational speed is its primary concern, and the automatic extraction of visual saliency features of audio events is not further investigated, which limits its use in robotic applications.
Audio-Visual Fusion for Heterogeneous Information of Sound and Image
Considering the information fusion of artificial machines based on sound and image signals in recent years, although many works have made progress toward autonomous object or event recognition for environment perception, most still rely on homogeneous features collected from multiple sensors that provide non-heterogeneous information for further fusion, such as visual and infrared images, or data from sonar and laser radar, to name a few (see (Han et al., 2013b) and (Pérez Grassi et al., 2011) as examples). Clearly, although image and sound signals can each provide various information about the surrounding environment on their own, both have limitations and shortcomings compared with each other. For example, a visual image, generated by the reflection of light, is distinguishable because it provides an intuitive and unique representation of objects, and its description is vivid and comprehensive thanks to the complexity of color, contrast and shape; however, it is very sensitive to obstacles, masks and lighting levels. At the same time, sound, as a wave-form signal, can provide extra information about the distance and location of the sound source beyond what visual information offers, and it is robust to obstacles compared with images. However, auditory data are not visualized information and require sophisticated computational models to process, which leads to distortion when noise exists. Consequently, the fusion of heterogeneous sound and image information has still not been widely researched, due to the lack of computational models and state-of-the-art techniques.
Currently, one popular application in the audio-visual fusion field is speaker identification and tracking, in which auditory or acoustic information is used only as low-level features or supplementary information. For example, a series of works in (Fisher et al., 2000) and (Fisher and Darrell, 2004) describe joint statistical models that represent the joint distribution of visual and auditory signals, presenting a maximally informative joint subspace learning approach based on the concept of entropy for multimedia signal analysis. This approach is purely a signal-level fusion technique: nonparametric statistical density modeling is used to represent the complex joint densities of the projected signals and to characterize the mutual information between heterogeneous signals, thus indicating whether the heterogeneous signals belong to a common source, or more specifically, a single user. The method is tested on simulated as well as real-world data, and the experimental results show that significant audio signal enhancement and video source localization can be achieved with the suggested approach. Another hybrid approach is presented in (Chu et al., 2004), in which a multi-stage fusion framework combines the advantages of both feature fusion and decision fusion by exploiting their complementarity; information from the auditory and visual domains is fused at the decision level using a multi-stream Hidden Markov Model (HMM). An indoor audio-visual speech database consisting of full-face frontal video with uniform background and lighting is used for the recognition experiments, and the results show that the proposed fusion framework outperforms conventional fusion systems in human voice recognition. Nonetheless, typical natural environmental sounds are not considered in this approach.
Considering application scenarios in robotics, (Ruesch et al., 2008) introduce a multi-modal saliency based artificial attention system, in which the saliency principle is used to simulate the bottom-up attention pattern of human beings and acts as the bridge between sound and image information. The system combines spatial visual and acoustic saliency maps on a single egocentric map, aggregating the different sensory modalities on a continuous spherical surface to yield a final egocentric, heterogeneous saliency map; the gaze direction of the robot is then adjusted toward the most salient location to explore the environment. Although real-world experiments show that the robot's gaze shifts automatically and naturally toward salient audio-visual events, the acoustic saliency information is used merely as the location of the sound event rather than as semantic knowledge indicating what is happening in the environment, and the learning of new objects or events is not presented. In (Li et al., 2012), a two-stage audio-visual fusion approach for active speaker localization is presented: in the first stage, audio-visual fusion is applied to reject false lip movements, and a Gaussian fusion method is then proposed to integrate the estimates from the audio and visual modalities. The approach is tested in a human-machine interaction scenario with a human-like head as the experimental platform; the results show significantly increased accuracy and robustness for speaker localization compared with either modality alone. However, the fusion is carried out only at the decision (or rather knowledge) level instead of the feature level, and the recognition of natural sounds from the surrounding environment is still not involved.
In general, it should be emphasized that, although some research on the fusion of audio-visual information has been conducted at the feature level or the decision level (i.e. knowledge level), the fusion methodologies developed are mostly based on the complementarity of heterogeneous information. In these works, sound information is used as local or extra information to be fused with the dominant image information in some application scenarios, and as global knowledge information in others; substantial fusion works or practical approaches that regard heterogeneous information sources as equally important for intelligent environment perception remain under-researched. To the best of my knowledge, no concrete work has yet been accomplished that realizes artificial awareness and intelligent perception for machines in complex environments based on the autonomous fusion of heterogeneous information.
Table of contents:
CHAPTER 1. GENERAL INTRODUCTION
1.2. BIOLOGICAL BACKGROUND
1.2.1. Human Perception System
1.2.2. Selective Attention Mechanism
1.3. MOTIVATION AND OBJECTIVES
1.5. ORGANIZATION OF THESIS
CHAPTER 2. SALIENT ENVIRONMENTAL INFORMATION PERCEPTION – STATE OF THE ART
2.2. AUTONOMOUS OBJECT PERCEPTION USING VISUAL SALIENCY
2.2.2. Object Detection using Visual Saliency
2.2.2.1. Basic Visual Saliency Model
2.2.2.2. State of the Art Methods
2.2.3. Autonomous Object Recognition and Classification
2.2.3.1. Image Feature Acquisition
2.2.3.1.1. Local point feature
2.2.3.1.2. Statistical Feature
2.2.3.2. Current Recognition Models
2.3. SALIENT ENVIRONMENT SOUND PERCEPTION
2.3.1. General Introduction
2.3.2. Auditory Saliency Detection
2.3.2.1. Saliency Detection Models
2.3.3. Environmental Audio Information Perception
2.3.3.1. Feature Extraction of Audio Signal
2.3.3.2. Environmental Sound Recognition
2.4. AUDIO-VISUAL FUSION FOR HETEROGENEOUS INFORMATION OF SOUND AND IMAGE
CHAPTER 3. THE DETECTION AND CLASSIFICATION OF ENVIRONMENTAL SOUND BASED ON AUDITORY SALIENCY FOR ARTIFICIAL AWARENESS
3.2. OVERVIEW OF THE APPROACH
3.3. HETEROGENEOUS SALIENCY FEATURES CALCULATION
3.3.1. Background Noise Estimation
3.3.1.1. Shannon Entropy
3.3.1.2. Short-term Shannon Entropy
3.3.2. Temporal Saliency Feature Extraction
3.3.2.1. MFCC based Saliency Calculation
3.3.2.2. Computational IOR Model for Feature Verification
3.3.3. Spectral Saliency Feature Extraction
3.3.3.1. Power Spectral Density
3.3.3.2. PSD based Saliency Calculation
3.3.4. Image Saliency Detection from Spectrogram
3.3.4.1. Log-Scale Spectrogram
3.3.4.2. Image Saliency Calculation based on Opponent Color Space
3.3.5. Heterogeneous Saliency Feature Fusion
3.4. MULTI-SCALE FEATURE BASED SALIENT ENVIRONMENTAL SOUND RECOGNITION
3.4.1. General Introduction
3.4.2. Multi-Scale Feature Selection
3.4.2.1. Fuzzy Set Theory
3.4.2.2. Fuzzy Vector based Feature Extraction
3.4.2.3. Acoustic Features Calculation
3.4.3. Classification Approach
3.5.1. Validation of Salient Environmental Sound Detection
3.5.1.1. Data Setup
3.5.1.2. Experimental Protocol
3.5.1.3. Verification Results and Discussion
3.5.2. Experiments of Real Environmental Sound Recognition
3.5.2.1. Experiment Setup
3.5.2.2. Experimental Protocol
3.5.2.3. Recognition Results
CHAPTER 4. SALIENT INFORMATION BASED AUTONOMOUS ENVIRONMENTAL OBJECT DETECTION AND CLASSIFICATION
4.2. OVERVIEW OF THE APPROACH
4.3. SPARSE REPRESENTATION BASED SALIENT ENVIRONMENTAL OBJECT DETECTION
4.3.1. Image Feature Extraction
4.3.1.1. Gabor Filter
4.3.1.2. 2-D Gabor Feature Extraction
4.3.2. Visual Saliency Detection
4.3.3. Sparse Representation based Foreground Objectness Detection
4.3.3.1. Background Dictionary Learning
4.3.3.2. Foreground Object Detection based on Representation Error
4.3.4. Fusion Discrimination
4.4. SALIENT FOREGROUND ENVIRONMENTAL OBJECT CLASSIFICATION
4.4.1. General Introduction
4.4.2. Object Feature Extraction
4.4.3. Model Training
4.5. SIMULATION EXPERIMENTS
4.5.1. Experiment Setup
4.5.1.1. Data Setup
4.5.1.2. Experimental Protocol
4.5.2. Experiments Result and Discussion
CHAPTER 5. HETEROGENEOUS INFORMATION FUSION FRAMEWORK FOR HUMAN-LIKE PERCEPTION OF COMPLEX ENVIRONMENT
5.2. PROPOSAL OF FRAMEWORK
5.3. PROBABILITY TOPIC MODEL BASED HETEROGENEOUS INFORMATION REPRESENTATION
5.3.1. Probability Topic Model
5.3.2. Information Probability Model
5.3.3. Heterogeneous Information Modeling
5.3.4. Scene Information Probability Model
5.4. HETEROGENEOUS INFORMATION FUSION BASED ENVIRONMENT PERCEPTION
5.4.2. Negative Property based Complex Scene Modeling
5.4.3. Heterogeneous Model Fusion based Perception
5.5. EXPERIMENTAL VALIDATION
5.5.1. Experiment Setup
5.5.2. Experiments Result and Discussion