Representation based on variants of Gaussian filterings


An overview of representing DTs based on Gaussian-filtered features

Filter-bank approaches, which were applied early on to texture analysis [31], have shown promising performance in DT recognition because they mitigate the negative influence of noise and other factors on video encoding. A general diagram of the filtering-based analysis of a given video is shown in Figure 1.5; it comprises the following three main stages for video representation.
1. Applying robust filters to a given video V to reduce the negative impact of the problems of DT encoding. This process serves as a pre-processing step of the video analysis and yields the n filtered outcomes of V, indexed by i = 1, ..., n.
2. For each filtered outcome i, a histogram h(i) representing it is formed by using a local operator to capture spatio-temporal features.
3. Structuring a final descriptor for DT representation by simply concatenating the complementary features of all obtained histograms {h(i)}.
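The three stages above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the descriptors proposed in the thesis: the filter bank is a plain separable 3D Gaussian at a few hypothetical scales, and the local operator is a simple 6-neighbor binary pattern standing in for CAIP, LRP, or CHILOP. All function names are made up for this example.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    """Normalized 1D Gaussian kernel (the truncation radius is an assumption)."""
    radius = max(1, int(truncate * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_axis(vol, k, axis):
    """Convolve along one axis, replicating border values so the shape is kept."""
    r = len(k) // 2
    pad = [(0, 0)] * vol.ndim
    pad[axis] = (r, r)
    padded = np.pad(vol, pad, mode="edge")
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), axis, padded)

def gaussian_filter_video(video, sigma):
    """Stage 1: separable 3D Gaussian filtering of a (T, H, W) video."""
    k = gaussian_kernel_1d(sigma)
    out = np.asarray(video, dtype=float)
    for axis in range(3):
        out = smooth_axis(out, k, axis)
    return out

def local_binary_histogram(vol):
    """Stage 2: histogram of a 6-neighbor binary code (a stand-in local operator)."""
    T, H, W = vol.shape
    center = vol[1:-1, 1:-1, 1:-1]
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dt, dy, dx) in enumerate(offsets):
        neighbor = vol[1 + dt:T - 1 + dt, 1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes |= (neighbor >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=64).astype(float)
    return hist / hist.sum()

def describe_video(video, sigmas=(0.5, 1.0)):
    """Stage 3: concatenate the histograms of all filtered outcomes."""
    hists = [local_binary_histogram(gaussian_filter_video(video, s)) for s in sigmas]
    return np.concatenate(hists)
```

With two scales, `describe_video` returns a 128-dimensional vector (two normalized 64-bin histograms), which would then be passed to a classifier.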
The performance of the entire framework depends mainly on both the filtering and the local encoding. With this in mind, we propose in this thesis robust filters as well as discriminative local operators. For the filtering, we address the Gaussian kernel, its high-order gradients, and other Gaussian-based variants in order to thoroughly investigate the benefits of robust Gaussian-filtered features for DT representation (refer to Chapter 6). For the local operators that capture features from the filtered outcomes, we have inherited and developed the following ones (refer to Chapter 3 for further details).
– CAIP, a crucial adaptation of completed local binary patterns, for efficiently dealing with the issue of close-to-zero pixels caused by the bipolar features of Gaussian-filtered outcomes.
– LRP, based on local neighbors sampled on a Rubik's cube centered at a voxel, in order to enrich informative structures and improve the discrimination power.
– CHILOP, exploiting pairs of adjacent supporting areas in a completed context of analysis to capture hierarchical local patterns for DT representation.
Our proposals thus take advantage of both filter-based and local-feature-based properties. The proposed filters are applied directly to the video. Contrary to several existing methods, ours are non-learned filters, which reduces the computational cost of the pre-processing, while the obtained responses are robust against the well-known problems in DT representation. In addition, our operators enhance the performance compared to the conventional ones. Indeed, experiments in DT recognition have shown that our proposals perform very well compared to all non-deep-learning models, while being close to deep-learning approaches. Furthermore, most of them are potential solutions for mobile applications in practice.

Our main contributions

Following our general concepts for DT representation mentioned in Sections 1.2 and 1.4, our main contributions can be summarized as follows.
Contribution #1: Novel local operators are proposed for different contexts of video encoding: xLVP, xLDP, CAIP, LRP, and CHILOP (refer to Chapter 3). Some of them are applied to representing DTs of raw videos, as published in the conference papers [C1, C3]. Others are used to encode motion points of dense trajectories extracted from a given video (refer to Chapter 4) and filtered responses computed by the filterings (refer to Chapters 5 and 6).
Contribution #2: The thesis introduces a new approach for DT representation that captures spatio-temporal features of motion points along the paths of dense trajectories extracted from a given video (refer to Chapter 4). These contributions have been published in the conference paper [C4] and the journal article [J2].
Contribution #3: Motivated by the moment-image model for textural image description, we propose a novel model, named moment volumes, whose responses are computed on local spherical supporting regions instead of circle-based ones. It has been shown to be more adaptive for video analysis than the moment-image model. In addition, the moment-image model is also investigated on three orthogonal planes of a video to obtain mean and variance filtered images for DT representation (refer to Chapter 5). These contributions have been published in the journal articles [J1, J5].
Contribution #4: The Gaussian kernel is addressed for video filtering to obtain robust Gaussian-filtered outcomes for DT representation. Firstly, we investigate the benefits of single-scale standard deviations, published in the conference papers [C2, C5]. The complementary components of multi-scale Gaussian filterings are then proposed to capture richer information. The obtained responses are encoded by our proposed descriptors, LRP (see Section 6.5) and CHILOP (see Section 6.4); the LRP-based results have been published in the journal article [J4], while the CHILOP-based ones are in minor revision as the journal article [S2]. In another aspect, the difference of Gaussians (DoG) is also investigated in this thesis for DT encoding. Our CAIP operator is used to deal with the close-to-zero problem caused by decomposing the DoG responses into bipolar components. In addition, the bipolar DoG-filtered features are integrated with the Gaussian-filtered ones in multi-scale analysis to form a discriminative descriptor (see Section 6.6). This result is under review as the journal article [S4].
Contribution #5: The thesis also takes partial derivatives of the Gaussian kernel into account for video filtering. We investigate the effect of these filtering kernels in high-order and multi-scale analysis to enrich robust patterns for DT representation (see Section 6.8). This improvement has been published in the journal article [J3]. In particular, we propose a novel filtering kernel based on the difference of high-order Gaussian gradients (DoDG), which yields more robust responses (see Section 6.9). Using the DoDG responses for local DT encoding allows us to structure prominent descriptors of small dimension, which are expected to be a solution for mobile applications and embedded sensor systems in practice (see Section 6.3). The contribution of DoDG is under review as the journal article [S1]. In another aspect, the Gaussian gradients are thoroughly discussed in terms of their separate bipolar-based features (under review as the journal article [S5]) and their oriented magnitudes, which are under minor revision as the journal article [S3] (see Section 6.7).
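To make the DoG filtering and the bipolar decomposition mentioned in Contribution #4 more concrete, here is a minimal 1D sketch. It assumes that the DoG kernel is the difference of two normalized Gaussians with standard deviations `sigma_small < sigma_large`, and that "bipolar components" means the positive and negative parts of the response; both are assumptions made for illustration (the thesis defines its own kernels in Chapter 6), and the function names are hypothetical.

```python
import numpy as np

def gaussian_1d(sigma, radius):
    """Normalized 1D Gaussian sampled on [-radius, radius]."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def dog_1d(sigma_small, sigma_large):
    """Difference of two Gaussians on a common support (3 * sigma_large, assumed)."""
    radius = int(np.ceil(3.0 * sigma_large))
    return gaussian_1d(sigma_small, radius) - gaussian_1d(sigma_large, radius)

def bipolar_split(response):
    """Split a signed response into non-negative positive and negative parts."""
    positive = np.maximum(response, 0.0)
    negative = np.maximum(-response, 0.0)
    return positive, negative

# Usage: filter a toy 1D signal and split the signed response.
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
kernel = dog_1d(1.0, 2.0)
response = np.convolve(signal, kernel, mode="same")
pos, neg = bipolar_split(response)
```

Because the DoG kernel sums to zero, many responses in smooth regions are close to zero, which is the situation that the close-to-zero handling of CAIP is meant to address.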

Analyzing DTs based on optical flow

Thanks to their efficient computation and natural way of encoding videos, optical-flow-based methods for DT representation have obtained remarkable performance [34–36]. To shape and trace the path of a motion in a sequence, Peh et al. [34] aggregated spatio-temporal textures formed by the magnitudes and directions of the normal flow, which are essential for identifying motion types. Péteri et al. [35] presented a qualitative approach based on the normal vector field and criteria of videos to describe DT features. In another work, these authors combined the normal flow with filtering regularity to capture the revealing properties of DTs [36]. Meanwhile, Lu et al. [37] utilized the velocity and acceleration properties estimated by a structure tensor to form a spatio-temporal multi-resolution histogram. As discussed by Rivera et al. [38], because they assume brightness constancy and local smoothness, optical-flow-based methods are usually considered unsuitable for stochastic DTs in reality. Moreover, only the motion features of DTs are encoded, while their textures and appearances are not taken into account.
To address those problems, we propose to take advantage of the characteristics of both optical-flow-based and local-feature-based methods by i) exploiting Features of Directional Trajectory (FDT) together with Motion Angle Patterns (MAP) to capture local characteristics and angle information of motion points along the paths of dense trajectories of a DT sequence [C4]; and ii) using our discriminative operator, xLVP (see Section 3.4), to capture local features from motion points along their dense trajectories extracted from a given video [J2] (see Chapter 4 for further details).

Linear Dynamical Systems (LDS)

Doretto et al. [39] laid the foundation for model-based methods with a typical model of Linear Dynamical System (LDS). Consider a given DT video $V$ with a set of $(T+1)$ frames $F_V = \{y_0, y_1, \ldots, y_T\}$, $y_t \in \mathbb{R}^m$. In general, the evolution of an LDS is usually presented as
$$\begin{cases} x_{t+1} = A x_t + v_t \\ y_t = C x_t + w_t \end{cases}$$
where $y_t \in \mathbb{R}^m$ and $x_t \in \mathbb{R}^n$ denote the observation and its hidden “state” with initial condition $x_0 \in \mathbb{R}^n$; $v_t \in \mathbb{R}^n$ and $w_t \in \mathbb{R}^m$ are independent and identically distributed sequences drawn from known distributions; $A \in \mathbb{R}^{n \times n}$ and $C \in \mathbb{R}^{m \times n}$ are the system parameter matrices to be estimated.
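The recursion above can be simulated directly. The sketch below generates the observations $y_0, \ldots, y_T$ from given $A$, $C$, and $x_0$; it assumes zero-mean Gaussian noise for both $v_t$ and $w_t$ (the text only requires known distributions), and the function name and its parameters are hypothetical.

```python
import numpy as np

def simulate_lds(A, C, x0, T, state_noise_std=0.0, obs_noise_std=0.0, seed=0):
    """Generate y_0..y_T from x_{t+1} = A x_t + v_t and y_t = C x_t + w_t.

    Gaussian noise is an assumption of this sketch, not part of the model definition.
    """
    A = np.asarray(A, dtype=float)
    C = np.asarray(C, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], C.shape[0]
    x = np.asarray(x0, dtype=float)
    Y = np.empty((T + 1, m))
    for t in range(T + 1):
        w = obs_noise_std * rng.standard_normal(m)    # observation noise w_t
        Y[t] = C @ x + w                              # y_t = C x_t + w_t
        v = state_noise_std * rng.standard_normal(n)  # state noise v_t
        x = A @ x + v                                 # x_{t+1} = A x_t + v_t
    return Y

# Usage: a noise-free 2-state system observed through a single output.
A = 0.5 * np.eye(2)
C = np.array([[1.0, 1.0]])
Y = simulate_lds(A, C, x0=[1.0, 1.0], T=3)
```

In DT modeling, each $y_t$ would be a vectorized frame and the task is the reverse one: estimating $A$ and $C$ from the observed frames.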

Modeling DTs based on LDS

Inspired by the typical LDS model, many works have applied it to DT representation for recognition tasks as well as for other problems in computer vision. Saisan et al. [5] modeled the “state” noise $v_t$ and the observation noise $w_t$ as zero-mean Gaussian noise for representing DTs. Chan et al. [40] utilized kernel PCA (Principal Component Analysis) to model the LDS’s observation matrix $C$ as a non-linear function in order to capture the characteristics of dynamic features in complex motions, such as chaotic motions (e.g., turbulent water) and camera motions (e.g., panning, zooming, and rotations). Later, to capture the motions of objects in sequences, they presented a model of DT mixtures (DTMs) based on the LDS concept. The outputs are then fed into a hierarchical expectation-maximization algorithm (HEM-DTM) in order to categorize DTMs into k clusters for DT description [41]. Also based on the LDS model, Wang et al. [42] combined it with a bag-of-words (BoW) method to extract chaotic features in videos, while Ravichandran et al. [43] relied on bag-of-systems (BoS) to form the corresponding spatio-temporal patterns. To speed up the use of BoS codebooks, Mumtaz et al. [44] proposed the BoS Tree, in which a bottom-up hierarchy is constructed for indexing the codewords. Recently, Wei et al. [45] combined the LDS model with sparse coding to develop a joint dictionary learning framework for modeling DT sequences. In terms of effectiveness, the model-based methods have usually achieved modest results on DT recognition, their major drawback being that their encoding mostly concentrates on the spatial-appearance-based characteristics of DTs rather than the dynamic-based ones [5]. Furthermore, efforts to take dynamic features into account can make the models more complex [43].

A brief of fractal analysis

Fractal analysis is built on the concept of fractal dimension, which was first proposed by Mandelbrot [46] as a measurement of the power law existing in many natural phenomena. For a non-empty bounded subset $E \subset \mathbb{R}^n$, let $N_r(E)$ be the smallest number of sets of diameter $r$ that can cover $E$. The fractal dimension of $E$ is defined as follows [47]:
$$\dim(E) = \lim_{r \to 0} \frac{\log N_r(E)}{-\log r} \qquad (2.3)$$
In practical implementations, one can consider the space as a mesh of boxes of size r, called the r-mesh boxes, and count the boxes occupied by the point set [48]. Owing to this computation, the above fractal dimension is referred to as the box-counting dimension. Further transformations as well as specific instances can be found in [46–48].
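The box-counting procedure can be illustrated on a 2D binary mask. The sketch below counts the occupied r-mesh boxes for a few box sizes and estimates the dimension as the slope of log N_r against log(1/r). Replacing the limit r → 0 by a least-squares fit over a finite set of sizes is an assumption of this example, as are the function names and the default sizes.

```python
import numpy as np

def box_count(mask, r):
    """Number of r x r mesh boxes containing at least one occupied pixel."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    H = -(-h // r) * r  # round the height up to a multiple of r
    W = -(-w // r) * r  # round the width up to a multiple of r
    padded = np.zeros((H, W), dtype=bool)
    padded[:h, :w] = mask
    blocks = padded.reshape(H // r, r, W // r, r)
    return int(blocks.any(axis=(1, 3)).sum())

def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16)):
    """Slope of log N_r versus log(1/r); the mask must be non-empty."""
    counts = [box_count(mask, r) for r in sizes]
    log_inv_r = np.log(1.0 / np.asarray(sizes, dtype=float))
    slope, _ = np.polyfit(log_inv_r, np.log(counts), 1)
    return float(slope)

# Usage: a filled square behaves like a 2D set, a straight line like a 1D set.
square = np.ones((64, 64), dtype=bool)
line = np.zeros((64, 64), dtype=bool)
line[10, :] = True
dim_square = box_counting_dimension(square)
dim_line = box_counting_dimension(line)
```

For these two toy sets the estimates are 2 and 1 up to floating-point error; natural textures typically give non-integer values in between.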

DT representation based on fractal analysis

Given a gray-scale DT sequence, Xu et al. [49, 50] introduced the Dynamic Fractal Spectrum (DFS), based on a fractal analysis of the following integrated measures: pixel intensity, temporal brightness gradient, normal flow, Laplacian, and principal curvature, which are computed on a 3D cube centered at a voxel with different values of spatial and temporal radii. An extension of DFS was also proposed as the Multi-Fractal Spectrum (MFS) [51], combining the capture of stochastic self-similarities with the analysis of fractal patterns of DT sequences. However, only spectral information is considered in those works, while the spatial domain has received less attention. Ji et al. [52] addressed this drawback by embedding spatial appearance analysis into MFS by means of wavelet coefficients to form the Wavelet-based MFS (WMFS) for representing DTs more effectively. From another viewpoint, Quan et al. [53] relied on the concept of lacunarity, a specialized aspect of fractal geometry that measures how patterns fill space, to propose the Spatio-Temporal Lacunarity Spectrum (STLS) descriptor, in which lacunarity-based features are captured by applying lacunarity analysis to local binary patterns in DT slices. In terms of effectiveness in DT recognition, experiments have shown that the geometry-based methods mainly perform well on simple datasets, e.g., UCLA [5], but not on more challenging ones, e.g., DynTex [54] and DynTex++ [55]. This may be due to the lack of temporal information involved in their encodings.

Learning-based methods

Learning-based methods have grown into promising approaches owing to their noteworthy performance in DT recognition. In general, they fall into two trends: the first is based on deep learning techniques, and the other on dictionary learning. Hereafter, we review how these are applied to learn features for DT representation.

Deep-learning-based techniques

In the 1990s, LeCun et al. [56, 57] first proposed a Convolutional Neural Network (CNN) for handwritten digit recognition. However, it was not until 2012 that CNNs became popular in computer vision, when Krizhevsky et al. [1] introduced AlexNet, whose learning model is very similar to the architecture of LeNet but deeper and bigger, with convolutional layers stacked on top of each other (see Figure 2.1 for a general graphical architecture of AlexNet). This popularization is partly thanks to the development of computer hardware with high computational performance. After that, many deep learning models based on the CNN architecture have been proposed for different applications in computer vision. The most common ones include ZF Net [58], GoogLeNet [59], VGGNet [60], and ResNet [61]. For learning DTs, Qi et al. [62] adopted AlexNet [1] as a feature extractor to extract mid-level patterns from each frame of a given sequence, and then formed a DT descriptor, named Transferred ConvNet Features (TCoF), by concatenating the first- and second-order statistics over the mid-level features. Andrearczyk et al. [63] applied AlexNet [1] and GoogLeNet [59] to video analysis to extract DT features (DT-CNN) from three orthogonal planes of a given video. Meanwhile, Arashloo et al. [64] adopted PCANet [65], a CNN-based model using PCA-learned filters for the convolution process, to construct a multi-layer convolutional architecture operating on three orthogonal planes of a DT video (PCANet-TOP). More recently, a deep dual descriptor [66] relies on characteristics of “key frames” and “key segments” to learn static and dynamic features. Besides, Hadji et al. [4] composed a new challenging large-scale dataset, named DTDB (see Section 2.8.4 for details). They then implemented several deep learning methods for learning DTs on DTDB: Convolutional 3D (C3D) [67], RGB/Flow Stream [68], and Marginalized Spatio-temporal Oriented Energy (MSOE) in two learning streams (MSOE-two-Stream) [4].
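As a rough illustration of how first- and second-order statistics over per-frame features can be concatenated into one descriptor, consider the sketch below. It assumes that the per-frame features are already available as a (T, d) array (in TCoF they come from a pretrained CNN, which is not reproduced here) and that the statistics are the mean vector and the upper triangle of the covariance matrix; the exact form used in [62] may differ, and the function name is hypothetical.

```python
import numpy as np

def first_second_order_descriptor(frame_features):
    """Concatenate the mean and the upper-triangular covariance of (T, d) features.

    Requires T >= 2 frames; the choice of covariance as the second-order
    statistic is an assumption of this sketch.
    """
    X = np.asarray(frame_features, dtype=float)
    mean = X.mean(axis=0)                        # first-order statistics, length d
    cov = np.atleast_2d(np.cov(X, rowvar=False)) # second-order statistics, d x d
    upper = cov[np.triu_indices(X.shape[1])]     # keep each covariance entry once
    return np.concatenate([mean, upper])

# Usage with random stand-in features for 10 frames and d = 4.
rng = np.random.default_rng(0)
features = rng.random((10, 4))
descriptor = first_second_order_descriptor(features)
```

For d-dimensional frame features the descriptor has d + d(d + 1)/2 entries, so in practice the feature dimension is usually reduced before the covariance is computed.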

Table of contents:

Author’s publications
Résumé
Abstract
Acknowledgment
List of Figures
List of Tables
1 Introduction
1.1 Dynamic textures: definition, challenges, and applications
1.2 An overview of representing DTs based on dense trajectories
1.3 An overview of representing DTs based on moment-based features
1.4 An overview of representing DTs based on Gaussian-filtered features
1.5 Our main contributions
1.6 Outline of thesis
2 Literature review
2.1 Introduction
2.2 Optical-flow-based methods
2.2.1 A brief of optical-flow concept
2.2.2 Analyzing DTs based on optical flow
2.3 Model-based methods
2.3.1 Linear Dynamical Systems (LDS)
2.3.2 Modeling DTs based on LDS
2.4 Geometry-based methods
2.4.1 A brief of fractal analysis
2.4.2 DT representation based on fractal analysis
2.5 Learning-based methods
2.5.1 Deep-learning-based techniques
2.5.2 Dictionary-learning-based techniques
2.6 Filter-based methods
2.6.1 DT description based on learned filters
2.6.2 DT description based on non-learned filters
2.7 Local-feature-based methods
2.7.1 A brief of LBP
2.7.2 A completed model of LBP (CLBP)
2.7.3 Completed local structure patterns (CLSP), a variant of CLBP
2.7.4 LBP-based variants for textural image description
2.7.5 LBP-based variants for DT representation
2.8 Datasets and protocols for evaluations of DT recognition
2.8.1 UCLA dataset
2.8.2 DynTex dataset
2.8.3 DynTex++ dataset
2.8.4 DTDB dataset
2.9 Classifiers for evaluating DT representation
3 Proposed variants of LBP-based operators
3.1 Introduction
3.2 Completed AdaptIve Patterns (CAIP)
3.3 Some extensions of Local Derivative Patterns (xLDP)
3.3.1 Local Derivative Patterns
3.3.2 Adaptive directional thresholds
3.3.3 Completed model of LDP
3.3.4 Assessing our proposed extensions of LDP
3.4 Some extensions of local vector patterns (xLVP)
3.4.1 Local Vector Patterns
3.4.2 Adaptive directional vector thresholds
3.4.3 A completed model of LVP
3.5 Local Rubik-based Patterns (LRP)
3.5.1 Complemented components
3.5.2 Construction of LRP patterns
3.6 Completed HIerarchical LOcal Patterns (CHILOP)
3.6.1 Construction of CHILOP
3.6.2 A particular degeneration of CHILOP into CLBP
3.6.3 Beneficial properties of CHILOP operator
3.7 Summary
4 Representation based on dense trajectories
4.1 Introduction
4.2 Dense trajectories
4.3 Beneficial properties of dense trajectories
4.3.1 Directional features of a beam trajectory
4.3.2 Spatio-temporal features of motion points
4.4 Directional dense trajectory patterns for DT representation
4.4.1 Proposed DDTP descriptor
4.4.2 Computational complexity of DDTP descriptor
4.5 Experiments and evaluations
4.5.1 Experimental settings
4.5.2 Experimental results
4.5.2.1 Recognition on UCLA dataset
4.5.2.2 Recognition on DynTex dataset
4.5.2.3 Recognition on DynTex++ dataset
4.5.3 Global discussion
4.6 Summary
5 Representation based on moment models
5.1 Introduction
5.2 Moment models
5.2.1 Moment images
5.2.2 A novel moment volumes
5.2.3 Advantages of moment volume model
5.3 DT representation based on moment images
5.4 DT representation based on moment volumes
5.4.1 Proposed momental directional descriptor
5.4.2 Enhancing the performance with max-pooling features
5.5 Experiments and evaluations
5.5.1 Experimental settings
5.5.2 Assessment of effectiveness of moment models
5.5.3 Experimental results of MDP-based descriptors
5.5.3.1 Recognition on UCLA dataset
5.5.3.2 Recognition on DynTex dataset
5.5.3.3 Recognition on DynTex++ dataset
5.5.3.4 Assessing the proposed components: Recognition with MDP-B and LDP-TOP
5.5.3.5 Assessing impact of max-pooling features: Recognition with EMDP descriptor
5.5.4 Global discussion
5.6 Summary
6 Representation based on variants of Gaussian filterings
6.1 Introduction
6.1.1 Motivation
6.1.2 A brief of our contributions
6.2 Gaussian-based filtering kernels
6.2.1 A conventional Gaussian filtering
6.2.2 Gradients of a Gaussian filtering kernel
6.3 A novel kernel based on difference of Gaussian gradients
6.3.1 Definition of a novel DoDG kernel
6.3.2 Beneficial properties of DoDG compared to DoG
6.4 Representation based on completed hierarchical Gaussian features
6.4.1 Construction of Gaussian-filtered CHILOP descriptor
6.4.2 Experiments and evaluations
6.4.2.1 Parameters for experimental implementation
6.4.2.2 Assessments of CHILOP’s performances
6.5 Representation based on RUbik Blurred-Invariant Gaussian features
6.5.1 Benefits of Gaussian-based filterings
6.5.2 Construction of RUBIG descriptor
6.5.3 Experiments and evaluations
6.5.3.1 Parameters for experimental implementation
6.5.3.2 Assessments of RUBIG’s performances
6.6 Representation based on Gaussian-filtered CAIP features
6.6.1 Completed sets of Gaussian-based filtered outcomes
6.6.2 Beneficial properties of filtered outcomes 2D/3D
6.6.3 DT description based on complementary filtered outcomes 2D/3D
6.6.4 Experiments and evaluations
6.6.4.1 Parameters for experimental implementation
6.6.4.2 Assessments of DoG-based features compared to those of FoSIG and V-BIG
6.6.4.3 Assessments of LOGIC2D/3D’s performances
6.7 Representation based on oriented magnitudes of Gaussian gradients
6.7.1 Oriented magnitudes of Gaussian gradients
6.7.2 DT representation based on oriented magnitudes
6.7.3 Experiments and evaluations
6.7.3.1 Parameters for experimental implementation
6.7.3.2 Assessments of effectiveness of decomposing models
6.7.3.3 Assessments of MSIOMFk;D4 and MSVOMFk;D4
6.8 Representation based on Gaussian-gradient features
6.8.1 High-order Gaussian-gradient Filtered Components
6.8.2 DT Representation Based on 2D/3D H; Components
6.8.3 Experiments and evaluations
6.8.3.1 Parameters for experimental implementation
6.8.3.2 Assessments of High-order Gaussian-gradient Descriptors
6.8.3.3 Comprehensive Comparison to Non-Gaussian-gradients
6.9 Representation based on DoDG-filtered features
6.9.1 Construction of DoDG-filtered descriptors
6.9.2 Experiments and evaluations
6.9.2.1 Parameters for experimental implementation
6.9.2.2 Assessments of DoDG-based descriptors
6.9.2.3 Comprehensive comparison to DoG-based descriptors
6.10 Comprehensive evaluations in comparison with existing methods
6.10.1 Benefits of Gaussian-based filterings
6.10.1.1 Robustness to the well-known issues of DT description
6.10.1.2 Rich and discriminative features of Gaussian-gradient-based filterings
6.10.2 Complexity of our proposed descriptors
6.10.3 Comprehensive discussions of DT classification on different datasets
6.10.3.1 Classification on UCLA
6.10.3.2 Classification on DynTex
6.10.3.3 Classification on DynTex++
6.10.3.4 Classification on DTDB dataset
6.11 Global discussions
6.11.1 Further evaluations for Gaussian-gradient-based descriptors
6.11.2 Evaluating appropriation of our proposals for real applications
6.12 Summary
7 Conclusions and perspectives
7.1 Conclusions
7.2 Perspectives
Bibliography