
## Analyzing DTs based on optical flow

Taking advantage of efficient computation and a natural way of encoding video, optical-flow-based methods for DT representation have obtained remarkable performance [34–36]. To shape and trace the path of a motion in a sequence, Peh et al. [34] aggregated spatio-temporal textures formed by the magnitudes and directions of the normal flow, which are essential to identify motion types. Péteri et al. [35] presented a qualitative approach based on the normal vector field and criteria of videos to describe DT features. In another work, these authors combined the normal flow with filtering regularity to capture the revealing properties of DTs [36]. Meanwhile, Lu et al. [37] utilized the velocity and acceleration properties estimated by a structure tensor to form a spatio-temporal multi-resolution histogram. As discussed by Rivera et al. [38], because of their assumptions of brightness constancy and local smoothness, optical-flow-based methods are usually considered unsuitable for the stochastic DTs found in reality. Moreover, only the motion features of DTs are encoded, while their texture and appearance are disregarded.
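As an illustration of the normal-flow computation these methods build upon, the following sketch (our own simplification, not code from any cited work) derives the normal flow magnitude per pixel from the brightness-constancy equation; the function name and the use of central differences are assumptions of this example.

```python
# A sketch (not code from [34-36]) of the normal flow magnitude
# u_n = -I_t / |grad I|, derived from the brightness-constancy
# equation I_x*u + I_y*v + I_t = 0, using central differences in space.

def normal_flow(frame0, frame1, eps=1e-6):
    """frame0, frame1: consecutive gray-scale frames as 2D lists."""
    h, w = len(frame0), len(frame0[0])
    flow = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (frame0[y][x + 1] - frame0[y][x - 1]) / 2.0  # spatial gradient
            iy = (frame0[y + 1][x] - frame0[y - 1][x]) / 2.0
            it = float(frame1[y][x] - frame0[y][x])           # temporal gradient
            grad = (ix * ix + iy * iy) ** 0.5
            flow[y][x] = -it / (grad + eps)  # flow along the gradient direction
    return flow
```

For a ramp image translating one pixel per frame, the recovered normal flow at interior pixels is close to 1.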

To address those problems, we have proposed to take advantage of the profitable characteristics of both optical-flow-based and local-feature-based features by: i) exploiting Features of Directional Trajectory (FDT) together with Motion Angle Patterns (MAP) to address the local characteristics and angle information of motion points along the paths of dense trajectories of a DT sequence [C4]; ii) using our discriminative operator xLVP (see Section 3.4), proposed for capturing local features from motion points along with their dense trajectories extracted from a given video [J2] (see Chapter 4 for further presentation).

### Linear Dynamical Systems (LDS)

Doretto et al. [39] laid the foundation for model-based methods with a typical model of Linear Dynamical System (LDS). Let a given DT video V be a set of (T + 1) frames F_V = {y_0, y_1, ..., y_T}, with y_t ∈ R^m. The evolution of an LDS is usually presented as

x_{t+1} = A x_t + v_t
y_t = C x_t + w_t          (2.2)

where y_t ∈ R^m and x_t ∈ R^n denote the observation and its hidden “state” with initial condition x_0 ∈ R^n; v_t ∈ R^n and w_t ∈ R^m are independent and identically distributed noise sequences drawn from known distributions; and A ∈ R^{n×n} and C ∈ R^{m×n} are the system matrices to be estimated.
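The generative process of Eq. (2.2) can be sketched as follows; the dimensions (n = 2 hidden states, m = 3 observed pixels), the matrices, and the Gaussian i.i.d. noise level are illustrative assumptions:

```python
# A minimal sketch of the LDS of Eq. (2.2): x_{t+1} = A x_t + v_t,
# y_t = C x_t + w_t, with small illustrative dimensions and Gaussian noise.
import random

def simulate_lds(A, C, x0, T, noise=0.01, seed=0):
    """Generate T observations y_t from the LDS with i.i.d. Gaussian noise."""
    rng = random.Random(seed)
    n, m = len(A), len(C)
    x, ys = list(x0), []
    for _ in range(T):
        # observation: y_t = C x_t + w_t
        y = [sum(C[i][j] * x[j] for j in range(n)) + rng.gauss(0, noise)
             for i in range(m)]
        ys.append(y)
        # state update: x_{t+1} = A x_t + v_t
        x = [sum(A[i][j] * x[j] for j in range(n)) + rng.gauss(0, noise)
             for i in range(n)]
    return ys

A = [[0.9, 0.1], [0.0, 0.8]]               # stable transition matrix (n x n)
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # observation matrix (m x n)
ys = simulate_lds(A, C, x0=[1.0, 1.0], T=50)
```

Since the eigenvalues of A lie inside the unit circle, the synthetic observations decay toward the noise floor, mimicking a stationary appearance process.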

#### Modeling DTs based on LDS

Inspired by the typical LDS model, many works have taken it into account for DT representation in recognition tasks as well as in other problems of computer vision. Saisan et al. [5] modeled the “state” noise v_t and the observation noise w_t with zero-mean Gaussian distributions for representing DTs. Chan et al. [40] utilized kernel-PCA (Principal Component Analysis) to model the LDS's observation matrix C as a non-linear function in order to apprehend the characteristics of dynamic features in complex motions, such as chaotic motions (e.g., turbulent water) and camera motions (e.g., panning, zooming, and rotations). Later, to capture the motions of objects in sequences, they presented a model of DT mixtures (DTMs) based on the LDS concept. The outputs are then fed into a hierarchical expectation-maximization algorithm (HEM-DTM) in order to categorize DTMs into k clusters for DT description [41]. Also based on the LDS model, Wang et al. [42] combined it with a bag-of-words (BoW) method to extract chaotic features in videos, while Ravichandran et al. [43] relied on bag-of-systems (BoS) to form the corresponding spatio-temporal patterns. To speed up the construction of BoS codebooks, Mumtaz et al. [44] proposed the BoS Tree, in which a bottom-up hierarchy is constructed to index the codewords. Recently, Wei et al. [45] combined the LDS model with the sparse coding technique to develop a joint dictionary learning framework for modeling DT sequences. In terms of efficiency, the model-based methods have usually achieved modest results on DT recognition because their major drawback is that their encoding mostly concentrates on the spatial-appearance-based characteristics of DTs rather than the dynamic-based ones [5]. Furthermore, efforts to take dynamic features into account can make the models more complex [43].
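As a hint of how such models are fitted, the sketch below shows only the least-squares estimation of the transition A from an already-recovered state sequence, reduced to n = 1 where A is a scalar; the full procedure in [5] recovers the states via an SVD of the frame matrix, which is omitted here as an out-of-scope step.

```python
# Hypothetical one-dimensional sketch of LDS system identification:
# given recovered states x_t, the transition A is the least-squares
# solution of x_{t+1} ~ A x_t (for n = 1, A reduces to a scalar).

def estimate_transition(states):
    """Least-squares fit of a scalar A in x_{t+1} = A * x_t."""
    num = sum(states[t + 1] * states[t] for t in range(len(states) - 1))
    den = sum(states[t] ** 2 for t in range(len(states) - 1))
    return num / den
```

On a noise-free trajectory generated with a known A, the estimator recovers that value exactly; with observation noise, the fit degrades gracefully in the least-squares sense.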

#### DT representation based on fractal analysis

Given a gray-scale DT sequence, Xu et al. [49, 50] introduced the Dynamic Fractal Spectrum (DFS) based on a fractal analysis of the following integrated measures: pixel intensity, temporal brightness gradient, normal flow, Laplacian, and principal curvature, which are computed over a 3D cube centered at a voxel with different values of spatial and temporal radii. An extension of DFS was proposed as the Multi-Fractal Spectrum (MFS) [51], combining the capture of stochastic self-similarities with the analysis of fractal patterns of DT sequences. However, only spectral information is considered in those works, while the spatial domain is largely disregarded. Ji et al. [52] addressed this drawback by embedding spatial appearance analysis into MFS through wavelet coefficients to form the Wavelet-based MFS (WMFS), representing DTs more effectively. From another viewpoint, Quan et al. [53] exploited the concept of lacunarity, a specialized aspect of fractal geometry measuring how patterns fill space, to propose the Spatio-Temporal Lacunarity Spectrum (STLS) descriptor, where lacunarity-based features are captured by applying lacunarity analysis to local binary patterns in DT slices. In terms of effectiveness in DT recognition, experiments have shown that the geometry-based methods principally perform well on simple datasets, e.g., UCLA [5], but not on the more challenging ones, e.g., DynTex [54] and DynTex++ [55]. This may be due to the lack of temporal information involved in their encodings.
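The box-counting estimate of fractal dimension underlying these spectra can be sketched as follows; this is a generic illustration, not the DFS/MFS computation itself, and the two-point slope fit over the extreme box sizes is a simplification:

```python
# Illustrative box-counting sketch: the fractal dimension is estimated
# from how the number of occupied boxes N(s) scales with box size s,
# D ~ -d(log N) / d(log s).
import math

def box_count(points, size):
    """Count boxes of side `size` containing at least one (x, y) point."""
    boxes = {(int(px // size), int(py // size)) for px, py in points}
    return len(boxes)

def fractal_dimension(points, sizes=(1, 2, 4, 8)):
    counts = [box_count(points, s) for s in sizes]
    # slope of log N(s) vs log(1/s), fitted on the two extreme sizes
    return (math.log(counts[0]) - math.log(counts[-1])) / (
        math.log(sizes[-1]) - math.log(sizes[0]))
```

A filled 16 x 16 grid of points yields dimension 2, and a straight line of points yields dimension 1, matching the intuition that the measure captures how a pattern fills space.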

### Learning-based methods

Learning-based methods have grown into promising approaches thanks to their noteworthy performance in DT recognition. In general, they are usually arranged into two trends: the first is based on deep learning techniques; the second on dictionary learning. Hereafter, we review how these are applied to learn features for DT representation.

#### Deep-learning-based techniques

In the 1990s, LeCun et al. [56, 57] first proposed a Convolutional Neural Network (CNN) for hand-written digit recognition. However, CNNs were not popularized in computer vision until 2012, when Krizhevsky et al. [1] introduced AlexNet, whose learning model is very similar to the architecture of LeNet but deeper, bigger, and with featured convolutional layers (see Figure 2.1 for a general graphical architecture of AlexNet). This popularization is partly thanks to the development of computer hardware architectures with high computational performance. Since then, many deep learning models based on the CNN architecture have been proposed to solve different applications in computer vision, the most common of which include ZF Net [58], GoogLeNet [59], VGGNet [60], ResNet [61], etc.

For learning DTs, Qi et al. [62] adopted AlexNet [1] as a feature extractor to extract mid-level patterns from each frame of a given sequence, then formed a corresponding DT descriptor, named Transferred ConvNet Features (TCoF), by concatenating the first- and second-order statistics over the mid-level features. Andrearczyk et al. [63] took AlexNet [1] and GoogLeNet [59] into account for video analysis, extracting DT features (DT-CNN) from three orthogonal planes of a given video. Meanwhile, Arashloo et al. [64] adopted PCANet [65], a CNN-based model using PCA-learned filters for the convolving process, to construct a multi-layer convolutional architecture involving the three orthogonal planes of a DT video (PCANet-TOP). More recently, a deep dual descriptor [66] exploits the characteristics of “key frames” and “key segments” to learn static and dynamic features. Besides, Hadji et al. [4] composed a new challenging large-scale dataset, named DTDB (see Section 2.8.4 for its detailed description). They then implemented several deep learning methods for learning DTs on DTDB: Convolutional 3D (C3D) [67], RGB/Flow Stream [68], and Marginalized Spatio-temporal Oriented Energy in two learning streams (MSOE-two-Stream) [4].
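The three orthogonal planes used by TOP-style approaches such as DT-CNN and PCANet-TOP can be sketched as slices through the video volume; the list-of-lists layout and the choice of center slices are assumptions of this example (in practice every slice position, or a sampling of them, may be encoded):

```python
# A minimal sketch, assuming a video stored as a T x H x W nested list:
# the XY plane captures spatial appearance, while the XT and YT planes
# capture motion along the horizontal and vertical axes over time.

def three_orthogonal_planes(video):
    """video[t][y][x] -> (XY, XT, YT) center planes."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    xy = video[T // 2]                                           # spatial slice
    xt = [[video[t][H // 2][x] for x in range(W)] for t in range(T)]
    yt = [[video[t][y][W // 2] for y in range(H)] for t in range(T)]
    return xy, xt, yt
```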

Although the deep-learning-based approaches have often obtained significant performance for DT recognition, most of them require complicated algorithms for learning a huge number of parameters in deep neural network architectures. For instance, as implemented by Hadji et al. [4] for DT classification on the recent large-scale DTDB [4] dataset, 80M learned parameters are taken by C3D [67], while 88M are taken by the Two-Stream [68] and MSOE-two-Stream [4] networks. Besides, DT-CNN [63] requires different sets of parameters to be fed into the AlexNet and GoogLeNet frameworks, which have 61M and 6.8M learned parameters respectively for learning DT features on different DT datasets. This is one of the crucial barriers to bringing deep-learning-based methods into real applications on mobile devices and embedded sensor systems, which strictly require tiny resources for their functions. In this work, our proposed frameworks mitigate those shortcomings with low computational complexity, making them potential candidates for mobile implementations in practice. Indeed, our proposals just utilize simple operators (refer to Chapter 3) to extract spatio-temporal local features for DT representation from two main aspects: based on dense trajectories (refer to Chapter 4) and based on filtered outcomes (refer to Chapters 5 and 6). Experiments have proved that our corresponding proposed descriptors have small dimensions (e.g., HoGF [J3], DoDGF [S1], etc.) while their performance is close to that of the deep-learning-based approaches.

#### Dictionary-learning-based techniques

Another trend for DT representation relies on dictionary-learning-based techniques using sparse representation to learn DT features. In general, sparse coding can be briefly presented as follows. Let D ∈ R^{n×K} be an over-complete dictionary matrix containing K atoms {d_1, d_2, ..., d_K}. It is assumed that a vector y ∈ R^n can be represented as a sparse linear combination of these atoms: either exactly as y = Dx, or approximately as y ≈ Dx so that ‖y − Dx‖_p ≤ ε, where typical norms used for measuring the deviation are the l_p-norms. Therein, x ∈ R^K is the coefficient vector of y. Motivated by sparse representation, Quan et al. [69] adopted K-SVD [70], an algorithm for training dictionaries, to model a DT sequence by a set of space-time elements with a certain distribution, where local DT features are structured from a dictionary learned via sparse representation from a set of local DT patches of the video, known as atoms. However, it is difficult to perform multi-scale analysis as done in some of the geometry-based approaches [51, 52]. Learning multiple dictionaries with different atom sizes is possible, but it is hard to make the computing models efficient. On the other side, Quan et al. [71] introduced an equiangular kernel to learn a dictionary with optimal mutual coherence in a computationally feasible way. In terms of effectiveness for DT representation, experiments have shown that the dictionary-learning-based approaches perform well in recognizing DTs on simple datasets (e.g., UCLA [5]), but not on the more complex ones (e.g., DynTex [54], DynTex++ [55]). In the meantime, our proposals, based on simple frameworks, significantly improve the discrimination power of DT descriptors and obtain much better rates in DT recognition.
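A greedy sparse-coding step can be sketched as follows; this is plain matching pursuit over a fixed dictionary, a deliberate simplification of the K-SVD pipeline of [69, 70] (which also updates the atoms), with a hypothetical function name:

```python
# An illustrative matching-pursuit sketch: greedily pick the atom most
# correlated with the residual and subtract its projection, yielding a
# sparse code x such that y ~ D x.

def matching_pursuit(y, D, n_atoms=2):
    """D: list of K unit-norm atoms (each a list of n floats)."""
    n, K = len(y), len(D)
    x = [0.0] * K
    residual = list(y)
    for _ in range(n_atoms):
        # correlation of each atom with the current residual
        corrs = [sum(D[k][i] * residual[i] for i in range(n)) for k in range(K)]
        k = max(range(K), key=lambda j: abs(corrs[j]))
        x[k] += corrs[k]
        residual = [residual[i] - corrs[k] * D[k][i] for i in range(n)]
    return x, residual
```

With an orthonormal dictionary the greedy selection is exact; over-complete dictionaries are where the sparsity constraint becomes meaningful.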

### Filter-based methods

As mentioned in Section 1.4, filtering is one of the crucial solutions to reduce noise and other factors which negatively impact DT representation (see Figure 1.5 for a general view of DT encoding based on filtering). In general, the filter-based methods have evinced their efficiency in DT recognition performance. Experiments have illustrated that the former filter-based approaches perform well on DT datasets with simple motions (e.g., UCLA [5]), while they either retain several limitations or have not been verified on challenging datasets (e.g., DynTex [54], DynTex++ [55]). Addressing this issue, our proposals in this thesis thoroughly deal with these negative influences to form local-based descriptors with significantly improved discrimination power. We can briefly arrange the filter-based methods into two categories as follows.

#### DT description based on learned filters

The main idea of this stream is to encode the filtered elements of a given video, extracted by various learned filters. Arashloo et al. [72] exploited binarized statistical image features (BSIF) [73] to produce learned BSIF filters. To this end, they used a generative model of Independent Component Analysis (ICA) to present a given image patch I through a vector r of unknown random variables and a feature matrix W as follows: I = Wr. The obtained BSIF filters were then applied to the orthogonal plane-images of a given video in order to form the spatio-temporal BSIF-TOP descriptor for single-scale analysis of BSIF filters, and MBSIF-TOP for the multi-scale one. In another approach, Zhao et al. [74] applied the CLBP concept [3] to encode the filtered responses produced by L learned filters. Accordingly, let W = [w_1, w_2, ..., w_L] be L vectorized 3D filters, learned by different techniques: principal component analysis (PCA) in [64, 75], ICA in [72, 73], sparse filtering in [76], and k-means clustering in [77]. For a zero-mean vector v_k computed from a k × k × k cube of a video, L filter responses can be obtained by applying W to v_k as follows: r_k = W^T v_k. After that, CLBP-based components can be located to encode these filtered outputs r_k in order to structure B3DF descriptors for DT representation.
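The filtering-and-binarization step shared by BSIF-style descriptors can be sketched as below; the filters are stubbed with fixed vectors since the ICA/PCA learning itself is out of scope, so this is an assumed illustration rather than the cited implementations:

```python
# A hypothetical sketch of the BSIF-style encoding step: L (learned)
# vectorized filters are applied to a zero-mean vectorized patch v, and
# the sign of each response r = W v contributes one bit of the code.

def filter_code(W, v):
    """W: list of L vectorized filters; v: vectorized zero-mean patch.
    Returns the integer code formed by thresholding each response at zero."""
    code = 0
    for bit, w in enumerate(W):
        r = sum(wi * vi for wi, vi in zip(w, v))  # one filter response
        if r > 0:
            code |= 1 << bit
    return code
```

With L filters, each patch is thus mapped to one of 2^L codes, and the descriptor is the histogram of these codes over the image or plane.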

#### DT description based on non-learned filters

In contrast to the methods based on learned filters, the main idea of this stream is to encode filtered elements extracted by non-learning-based filters. This clearly saves computational cost, since no learning process is involved in the filterings. To the best of our knowledge, although there have been many efforts using non-learned filters for textural image description (e.g., MRELBP [78], SBP [2], RAMBP [79], etc.), they have rarely been used for DT representation until our recent proposals. Specifically, Rivera et al. [38] proposed to use a Kirsch compass mask [80] for calculating the spatial directional response of a pre-defined neighborhood in eight different directions. Accordingly, for a given video V, each instance of the 2D/3D Kirsch mask M_k is convolved with sub-regions of V's plane-images for the 2D filtering, and with V for the 3D one, in order to obtain Kirsch-based responses as follows: V_k = V * M_k^{2D/3D}. These outputs are then adapted to a graph model in order to capture spatio-temporal features of the directional number transitional graph (DNG) for DT description. In this thesis, we propose to exploit different kinds of potential filterings applied to V in order to achieve filtered outcomes that are robust against the well-known problems of local DT encoding: filtering models of moment images [J1] and volumes [J5] (refer to Chapter 5 for further presentation), and filterings based on Gaussian kernels [C2, C5, J4, S2, S4] and their derivations [J3, S1, S3, S5] (refer to Chapter 6 for further details). Experiments on the DT recognition issue have validated that our proposed descriptors perform very well compared to the state of the art. Some of them, with very simple computation and small dimension, achieve rates close to those of deep-learning approaches. More significantly, ours can be expected to be an appealing solution for mobile applications in practice.
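A single Kirsch compass response can be sketched as follows; the 5/-3 weights shown are one common form of the east-direction mask, while the function name and per-pixel formulation are assumptions of this example (the full directional response repeats this for all eight rotations of the mask):

```python
# An illustrative sketch of one 3x3 Kirsch compass mask response at a
# single pixel of a gray-scale image; the mask weights sum to zero, so
# flat regions produce a zero response.

KIRSCH_EAST = [[-3, -3, 5],
               [-3,  0, 5],
               [-3, -3, 5]]

def kirsch_response(image, y, x, mask=KIRSCH_EAST):
    """Correlate a 3x3 mask with the neighborhood centered at (y, x)."""
    return sum(mask[i][j] * image[y - 1 + i][x - 1 + j]
               for i in range(3) for j in range(3))
```

A vertical step edge yields a strong positive response for the east mask, while a uniform region yields zero, which is what makes the eight rotated responses usable as a directional code.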

**Table of contents:**

Author’s publications

Résumé

Abstract

Acknowledgment

List of Figures

List of Tables

**1 Introduction**

1.1 Dynamic textures: definition, challenges, and applications

1.2 An overview of representing DTs based on dense trajectories

1.3 An overview of representing DTs based on moment-based features

1.4 An overview of representing DTs based on Gaussian-filtered features

1.5 Our main contributions

1.6 Outline of thesis

**2 Literature review**

2.1 Introduction

2.2 Optical-flow-based methods

2.2.1 A brief of optical-flow concept

2.2.2 Analyzing DTs based on optical flow

2.3 Model-based methods

2.3.1 Linear Dynamical Systems (LDS)

2.3.2 Modeling DTs based on LDS

2.4 Geometry-based methods

2.4.1 A brief of fractal analysis

2.4.2 DT representation based on fractal analysis

2.5 Learning-based methods

2.5.1 Deep-learning-based techniques

2.5.2 Dictionary-learning-based techniques

2.6 Filter-based methods

2.6.1 DT description based on learned filters

2.6.2 DT description based on non-learned filters

2.7 Local-feature-based methods

2.7.1 A brief of LBP

2.7.2 A completed model of LBP (CLBP)

2.7.3 Completed local structure patterns (CLSP), a variant of CLBP

2.7.4 LBP-based variants for textural image description

2.7.5 LBP-based variants for DT representation

2.8 Datasets and protocols for evaluations of DT recognition

2.8.1 UCLA dataset

2.8.2 DynTex dataset

2.8.3 DynTex++ dataset

2.8.4 DTDB dataset

2.9 Classifiers for evaluating DT representation

**3 Proposed variants of LBP-based operators**

3.1 Introduction

3.2 Completed AdaptIve Patterns (CAIP)

3.3 Some extensions of Local Derivative Patterns (xLDP)

3.3.1 Local Derivative Patterns

3.3.2 Adaptative directional thresholds

3.3.3 Completed model of LDP

3.3.4 Assessing our proposed extensions of LDP

3.4 Some extensions of local vector patterns (xLVP)

3.4.1 Local Vector Patterns

3.4.2 Adaptive directional vector thresholds

3.4.3 A completed model of LVP

3.5 Local Rubik-based Patterns (LRP)

3.5.1 Complemented components

3.5.2 Construction of LRP patterns

3.6 Completed HIerarchical LOcal Patterns (CHILOP)

3.6.1 Construction of CHILOP

3.6.2 A particular degeneration of CHILOP into CLBP

3.6.3 Beneficial properties of CHILOP operator

3.7 Summary

**4 Representation based on dense trajectories**

4.1 Introduction

4.2 Dense trajectories

4.3 Beneficial properties of dense trajectories

4.3.1 Directional features of a beam trajectory

4.3.2 Spatio-temporal features of motion points

4.4 Directional dense trajectory patterns for DT representation

4.4.1 Proposed DDTP descriptor

4.4.2 Computational complexity of DDTP descriptor

4.5 Experiments and evaluations

4.5.1 Experimental settings

4.5.2 Experimental results

4.5.2.1 Recognition on UCLA dataset

4.5.2.2 Recognition on DynTex dataset

4.5.2.3 Recognition on DynTex++ dataset

4.5.3 Global discussion

4.6 Summary

**5 Representation based on moment models**

5.1 Introduction

5.2 Moment models

5.2.1 Moment images

5.2.2 A novel moment volumes

5.2.3 Advantages of moment volume model

5.3 DT representation based on moment images

5.4 DT representation based on moment volumes

5.4.1 Proposed momental directional descriptor

5.4.2 Enhancing the performance with max-pooling features

5.5 Experiments and evaluations

5.5.1 Experimental settings

5.5.2 Assessment of effectiveness of moment models

5.5.3 Experimental results of MDP-based descriptors

5.5.3.1 Recognition on UCLA dataset

5.5.3.2 Recognition on DynTex dataset

5.5.3.3 Recognition on Dyntex++ dataset

5.5.3.4 Assessing the proposed components: Recognition with MDP-B and LDP-TOP

5.5.3.5 Assessing impact of max-pooling features: Recognition with EMDP descriptor

5.5.4 Global discussion

5.6 Summary

**6 Representation based on variants of Gaussian filterings**

6.1 Introduction

6.1.1 Motivation

6.1.2 A brief of our contributions

6.2 Gaussian-based filtering kernels

6.2.1 A conventional Gaussian filtering

6.2.2 Gradients of a Gaussian filtering kernel

6.3 A novel kernel based on difference of Gaussian gradients

6.3.1 Definition of a novel DoDG kernel

6.3.2 Beneficial properties of DoDG compared to DoG

6.4 Representation based on completed hierarchical Gaussian features

6.4.1 Construction of Gaussian-filtered CHILOP descriptor

6.4.2 Experiments and evaluations

6.4.2.1 Parameters for experimental implementation

6.4.2.2 Assessments of CHILOP’s performances

6.5 Representation based on RUbik Blurred-Invariant Gaussian features

6.5.1 Benefits of Gaussian-based filterings

6.5.2 Construction of RUBIG descriptor

6.5.3 Experiments and evaluations

6.5.3.1 Parameters for experimental implementation

6.5.3.2 Assessments of RUBIG’s performances

6.6 Representation based on Gaussian-filtered CAIP features

6.6.1 Completed sets of Gaussian-based filtered outcomes

6.6.2 Beneficial properties of filtered outcomes 2D/3D

6.6.3 DT description based on complementary filtered outcomes 2D/3D

6.6.4 Experiments and evaluations

6.6.4.1 Parameters for experimental implementation

6.6.4.2 Assessments of DoG-based features compared to those of FoSIG and V-BIG

6.6.4.3 Assessments of LOGIC2D/3D’s performances

6.7 Representation based on oriented magnitudes of Gaussian gradients

6.7.1 Oriented magnitudes of Gaussian gradients

6.7.2 DT representation based on oriented magnitudes

6.7.3 Experiments and evaluations

6.7.3.1 Parameters for experimental implementation

6.7.3.2 Assessments of effectiveness of decomposing models

6.7.3.3 Assessments of MSIOMFk,D4 and MSVOMFk,D4

6.8 Representation based on Gaussian-gradient features

6.8.1 High-order Gaussian-gradient Filtered Components

6.8.2 DT Representation Based on 2D/3D H Components

6.8.3 Experiments and evaluations

6.8.3.1 Parameters for experimental implementation

6.8.3.2 Assessments of High-order Gaussian-gradient Descriptors

6.8.3.3 Comprehensive Comparison to Non-Gaussian-gradients

6.9 Representation based on DoDG-filtered features

6.9.1 Construction of DoDG-filtered descriptors

6.9.2 Experiments and evaluations

6.9.2.1 Parameters for experimental implementation

6.9.2.2 Assessments of DoDG-based descriptors

6.9.2.3 Comprehensive comparison to DoG-based descriptors

6.10 Comprehensive evaluations in comparison with existing methods

6.10.1 Benefits of Gaussian-based filterings

6.10.1.1 Robustness to the well-known issues of DT description

6.10.1.2 Rich and discriminative features of Gaussian-gradient-based filterings

6.10.2 Complexity of our proposed descriptors

6.10.3 Comprehensive discussions of DT classification on different datasets

6.10.3.1 Classification on UCLA

6.10.3.2 Classification on DynTex

6.10.3.3 Classification on DynTex++

6.10.3.4 Classification on DTDB dataset

6.11 Global discussions

6.11.1 Further evaluations for Gaussian-gradient-based descriptors

6.11.2 Evaluating appropriation of our proposals for real applications

6.12 Summary

**7 Conclusions and perspectives**

7.1 Conclusions

7.2 Perspectives

**Bibliography**