Impact of time-frequency representations, DNN architectures, and DNN training data


Deep neural networks (DNNs)

Studies in the last decade have shown that deep neural networks (DNNs), including recurrent neural networks (RNNs), can model complex functions and perform well on various tasks [Deng, 2014; LeCun et al., 2015; Schmidhuber, 2015; Juang, 2016; Goodfellow et al., 2016], including audio signal processing [Yu & Deng, 2011; Wang, 2017] and ASR [Hinton et al., 2012; Yu & Deng, 2015]. At the beginning of 2015, when our study started, DNNs had also been applied to single-channel speech separation and music separation, especially singing voice separation [Weninger et al., 2014; Huang et al., 2015]. There were also studies on exploiting the available multichannel data, but they were limited to extracting representations, also known as features, from this data to estimate a single-channel separation filter [Jiang et al., 2014; Araki et al., 2015]. As a result, these studies did not fully exploit the benefits of multichannel data in the way that multichannel filtering does. Since then, there have been more studies on using DNNs for both single-channel and multichannel separation, although the single-channel case is still more popular. For the multichannel case, there have been studies on employing DNNs for beamforming [Kumatani et al., 2012]. The DNNs are used either (1) to implicitly estimate a time-invariant beamformer within a joint training framework for robust ASR [Xiao et al., 2016; Sainath et al., 2017] or (2) to estimate a single-channel filter, known as a mask, for each channel, with the estimated filters then used to derive a time-invariant beamformer [Heymann et al., 2016; Erdogan et al., 2016; Wang et al., 2017]. Both approaches have been shown to perform well for speech enhancement. However, time-invariant filtering might not be suitable for mixtures with many sources whose mixing and stationarity cannot be assumed, such as a song with its vocals and different musical instruments.
In our study, DNNs are used to estimate the spectral and spatial parameters of each source, and the estimated parameters are then used to derive a time-varying multichannel filter. Further discussion of these studies and the positioning of our study with respect to them is presented in the next chapter.

Automatic speech recognition (ASR)

Figure 2.2 shows the block diagram of a typical ASR system with optional enhancement blocks [Rabiner & Juang, 1993; Rabiner & Schafer, 2007; Gales & Young, 2008; Virtanen et al., 2012]. The basic flow involves the extraction of feature vectors from the input speech signal (in the feature extraction step) and the search for the most likely word sequence given these features (in the decoding step). The decoding step can be expressed as
$\hat{w} = \arg\max_{w} P(w \mid v)$, (2.7)
where $\hat{w}$ and $w$ are the estimated and the hypothesized word sequences; $v$ is the sequence of feature vectors; and $P(w \mid v)$ is the probability of a word sequence $w$ given the sequence of feature vectors $v$. Following Bayes' rule, this formula leads to
$\hat{w} = \arg\max_{w} P(v \mid w) P(w)$, (2.8)
where the likelihood $P(v \mid w)$ is computed from an acoustic model and the prior $P(w)$ is computed from a language model. The language model models word sequences and the acoustic model typically models small speech sound units called phonemes. In order to link the acoustic model and the language model, a lexicon, also known as a pronunciation dictionary, is used. The lexicon contains a list of words and the variations of pronunciation, expressed in phonemes, of these words.
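The decoding rule can be illustrated with a toy sketch. It assumes that log-domain scores from an acoustic model and a language model are already available for a small list of candidate word sequences; the candidate list, the score values, and the function name `decode` are all hypothetical and serve only to show how the two terms combine.

```python
import math


def decode(hypotheses, acoustic_loglik, lm_logprob):
    """Pick the word sequence w maximizing log P(v|w) + log P(w)."""
    return max(hypotheses, key=lambda w: acoustic_loglik[w] + lm_logprob[w])


# Hypothetical candidates and scores (illustrative values, not real model outputs).
hypotheses = ["recognize speech", "wreck a nice beach", "recognise peach"]
acoustic_loglik = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,  # slightly better acoustic match
    "recognise peach": -14.0,
}
lm_logprob = {
    "recognize speech": math.log(0.6),
    "wreck a nice beach": math.log(0.1),  # unlikely under the language model
    "recognise peach": math.log(0.3),
}

best = decode(hypotheses, acoustic_loglik, lm_logprob)
acoustic_only = max(hypotheses, key=lambda w: acoustic_loglik[w])
```

In this toy setting the acoustic score alone favors a different hypothesis than the combined score, which is the role the prior plays in the Bayes decomposition. A real decoder searches a very large hypothesis space rather than enumerating a list.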
There exist various advanced techniques, including enhancement techniques, for different components of a robust ASR system [Virtanen et al., 2012; Li et al., 2014].
Signal related techniques work in the time or time-frequency domain on the single-channel or multichannel observed noisy speech [Benesty et al., 2005]. In general, these techniques are known as speech enhancement techniques, although they may involve noise reduction, dereverberation, or echo cancellation. The techniques include spectral subtraction [Boll, 1979], Wiener filtering [Lim & Oppenheim, 1979], non-negative matrix factorization (NMF) [Lee & Seung, 1999, 2000], and beamforming with post-filtering [Brandstein & Ward, 2001; Benesty et al., 2008]. Some of these methods are further discussed later in this chapter.
Feature related techniques include robust features, feature normalization, and feature enhancement. The most commonly used basic features are mel-frequency cepstral coefficients [Davis & Mermelstein, 1980] and perceptual linear predictive coefficients [Hermansky, 1990]. In order to compensate for noise, normalization techniques can be applied. These techniques may include cepstral mean normalization [Atal, 1974], cepstral mean and variance normalization [Viikki & Laurila, 1997], linear discriminant analysis [Izenman, 2008], maximum likelihood linear transform [Gopinath, 1998], and feature-space maximum likelihood linear regression [Gales, 1998]. Features can also be improved by enhancement techniques, including SPLICE [Deng et al., 2001], ALGONQUIN [Frey et al., 2001], and various neural network based methods [Ishii et al., 2013; Wöllmer et al., 2013; Nugraha et al., 2014; Himawan et al., 2015; Fujimoto & Nakatani, 2016]. Finally, there exist various robust features, including some which are motivated by human auditory properties [Stern & Morgan, 2012].
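As a concrete instance of feature normalization, the following minimal sketch applies per-utterance cepstral mean and variance normalization. It assumes the features are given as a list of frames with a fixed number of coefficients per frame; the function name `cmvn`, the small constant `eps`, and the toy values are illustrative assumptions, not taken from the cited work.

```python
import math


def cmvn(features, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance.

    features: list of frames, each frame a list of D coefficients.
    eps: small constant (an assumption) guarding against division by zero.
    """
    num_frames = len(features)
    dim = len(features[0])
    # Per-dimension mean over all frames of the utterance.
    means = [sum(frame[d] for frame in features) / num_frames for d in range(dim)]
    # Per-dimension standard deviation over all frames of the utterance.
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in features) / num_frames)
        for d in range(dim)
    ]
    return [
        [(frame[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for frame in features
    ]


# Toy "cepstral" features: 3 frames, 2 coefficients each (illustrative values).
feats = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0]]
normalized = cmvn(feats)
```

Dropping the division by the standard deviation gives plain cepstral mean normalization.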

Time-frequency representation

Audio source separation methods typically operate in the time-frequency domain, in which the temporal and spectral characteristics of sound can be jointly represented. Sounds tend to be sparsely distributed in this domain.
The most commonly used time-frequency representation is the short-time Fourier transform (STFT) [Allen, 1977; Smith, 2011; Virtanen et al., 2017]. Other representations include the Mel scale [Stevens et al., 1937] and the equivalent rectangular bandwidth (ERB) scale [Glasberg & Moore, 1990] representations. These two representations use different perceptually-motivated nonlinear frequency scales. In this section, we only describe the STFT representation.
STFT analysis refers to the computation of the time-frequency representation from the time-domain waveform. It is done by creating overlapping frames along the waveform and applying the discrete Fourier transform to each frame. Given channel $i$ of the time-domain mixture $x_i(t)$, the signal of frame index $n \in \{0, 1, \dots, N-1\}$ is expressed as
$x_i(t, n) = x_i(t + nH)\, h_a(t), \quad t \in \{0, 1, \dots, T-1\}$, (2.10)
where $N$ is the number of frames, $H$ the hop size between adjacent frames, $T$ the frame length, and $h_a(t)$ the analysis window, such as a Hamming or Hanning function. The application of the discrete Fourier transform to each frame results in the time-frequency representation
$x_i(f, n) = \sum_{t=0}^{T-1} x_i(t, n)\, e^{-\jmath 2\pi t f / F'}, \quad f \in \{0, 1, \dots, F = \lceil F'/2 \rceil, \dots, F'-1\}$. (2.11)
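The framing and per-frame transform described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: a Hann analysis window, the standard DFT sign convention (negative exponent), a DFT size that defaults to the frame length, and no padding at the signal edges. The function name `stft` and its parameter names are hypothetical, and the direct summation is chosen for readability, not speed.

```python
import cmath
import math


def stft(x, frame_len, hop, n_fft=None):
    """Return spec[n][f]: the DFT of windowed frame n at frequency bin f."""
    n_fft = n_fft or frame_len  # DFT size (assumed equal to frame length by default)
    # Hann analysis window, one common choice for the analysis window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    num_frames = 1 + (len(x) - frame_len) // hop
    spec = []
    for n in range(num_frames):
        # Windowed frame: signal shifted by n hops, multiplied by the window.
        frame = [x[t + n * hop] * window[t] for t in range(frame_len)]
        # Direct DFT of the frame over all n_fft bins.
        spec.append([
            sum(frame[t] * cmath.exp(-2j * math.pi * t * f / n_fft) for t in range(frame_len))
            for f in range(n_fft)
        ])
    return spec


# Toy input: a sinusoid whose frequency falls exactly on bin 4 of a 16-point DFT.
signal = [math.cos(2 * math.pi * 4 * t / 16) for t in range(64)]
spec = stft(signal, frame_len=16, hop=8)
peak_bin = max(range(9), key=lambda f: abs(spec[0][f]))
```

For a real-valued input, the bins above the midpoint mirror the lower ones, so in practice only the lower half is usually kept.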


State-of-the-art single-channel audio source separation

This section presents essential single-channel audio source separation methods, including time-frequency masking, NMF, and various DNN based approaches. Since we consider the single-channel case, i.e., $I = 1$, the index $i$ is omitted from the notation for conciseness.
There are notable methods beyond the ones discussed here, such as the factorial hidden Markov model, which relies on a statistical model of the sources [Roweis, 2003; Ozerov et al., 2009]. Each source is modeled by a GMM-HMM trained on the corresponding source data. The models are then used for estimating a time-frequency mask (see Section 2.4.1).

DNN based single-channel audio source separation

In this subsection, we present the basics of DNNs and review some single-channel source separation techniques employing DNNs. Most of the techniques use DNNs in the context of time-frequency masking (Section 2.4.1) by estimating a mask directly or estimating source spectra, from which a mask can be derived. A few others use DNNs in the context of NMF (Section 2.4.2).

Basics of DNNs

An artificial neuron is a computational model inspired by the biological neuron. Artificial neurons can be interconnected to form an artificial neural network. Hereafter, both the terms neuron and neural network refer to the artificial ones. There are three aspects in designing a neural network: the neuron, the architecture, and the learning [Rojas, 1996, chap. 1].
The neuron aspect describes how the inputs are processed. It typically follows the McCulloch-Pitts model [McCulloch & Pitts, 1943] as shown by Figure 2.3. Mathematically, it can be expressed as
$h = \phi\left(\sum_{n=1}^{3} w_n x_n + b\right)$, (2.23)
which says that the output $h$ is obtained by applying a non-linear activation function $\phi$ to the affine transformation of the inputs $x_n$, $n \in \{1, 2, 3\}$, given the neuron parameters, namely the weights $w_n$, $n \in \{1, 2, 3\}$, and possibly the bias $b$. The bias may be used to provide an activation threshold such that when the weighted sum of inputs falls below the threshold set by the bias, the neuron is not activated.
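The neuron computation can be written as a few lines of Python. The sketch below assumes a sigmoid activation as one possible choice; the function names and the numeric values for inputs, weights, and bias are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used here as one example of a non-linear activation."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Apply the activation to the affine transformation of the inputs."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Three inputs, as in the neuron model described above (illustrative values).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = -0.3
h = neuron(x, w, b, sigmoid)  # weighted sum is about 0, so h is about 0.5
```

Passing the activation as an argument keeps the affine part separate from the non-linearity, so other activation functions can be swapped in without changing `neuron`.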
In the past, the sigmoid, $\mathrm{sigm}(z) = (1 + e^{-z})^{-1}$, and the hyperbolic tangent, $\tanh(z) = (1 - e^{-2z})(1 + e^{-2z})^{-1}$, were the prominent non-linear activation functions. Recently, various non-linear functions have been studied, such as the hard sigmoid [Gulcehre et al., 2016], which is a piece-wise linear approximation of the sigmoid designed for faster computation and implemented as $\mathrm{hsig}(z) = \max(0, \min(1, sz + 0.5))$, where $s$ is a slope parameter; the rectifier [Nair & Hinton, 2010], $\mathrm{rect}(z) = \max(0, z)$; and the softplus [Dugas et al., 2000], $\mathrm{sofp}(z) = \ln(1 + e^{z})$, which can be seen as a smooth approximation of the rectifier. Neurons with the rectifier function are also known as rectified linear units. Additionally, there exist other types of neurons which do not have any parameters, such as the ones for computing the mean of the inputs (known as average pooling), the ones for taking the maximum value among the inputs (known as max pooling), and the ones for multiplying the inputs, as in sum-product networks [Poon & Domingos, 2011].
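The activation functions listed above translate directly into code. The following sketch uses the same names as the text; the default slope `s=0.2` for the hard sigmoid is an assumption chosen for illustration, since the text leaves the slope as a free parameter.

```python
import math


def sigm(z):
    """Sigmoid: (1 + exp(-z))^-1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Hyperbolic tangent written as (1 - exp(-2z)) / (1 + exp(-2z))."""
    return (1.0 - math.exp(-2.0 * z)) / (1.0 + math.exp(-2.0 * z))


def hsig(z, s=0.2):
    """Hard sigmoid: piece-wise linear approximation; slope s is an assumed default."""
    return max(0.0, min(1.0, s * z + 0.5))


def rect(z):
    """Rectifier: max(0, z)."""
    return max(0.0, z)


def sofp(z):
    """Softplus: ln(1 + exp(z)), a smooth approximation of the rectifier."""
    return math.log(1.0 + math.exp(z))


# Softplus approaches the rectifier for large positive inputs.
gap_at_20 = sofp(20.0) - rect(20.0)
```

These direct forms mirror the formulas and are not numerically hardened: `math.exp` overflows for inputs of very large magnitude, so practical implementations usually use rearranged, stable variants.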

Table of contents:

1 Introduction
1.1 Motivation
1.1.1 Audio source separation
1.1.2 Speech and music separations
1.1.3 Single-channel and multichannel separation
1.1.4 Deep neural networks (DNNs)
1.2 Objectives and scope
1.3 Contributions and organization of the thesis
2 Background
2.1 Audio source separation
2.1.1 Sources and mixture
2.1.2 Source separation
2.2 Automatic speech recognition (ASR)
2.3 Time-frequency representation
2.4 State-of-the-art single-channel audio source separation
2.4.1 Time-frequency masking
2.4.2 Non-negative matrix factorization (NMF)
2.4.3 DNN based single-channel audio source separation
2.4.3.1 Basics of DNNs
2.4.3.2 DNN based separation techniques
2.5 State-of-the-art multichannel audio source separation
2.5.1 Beamforming
2.5.2 Expectation-maximization (EM) based multichannel audio source separation framework
2.5.2.1 Multichannel Gaussian model
2.5.2.2 General iterative EM framework
2.5.3 DNN based multichannel audio source separation techniques
2.5.3.1 Utilizing multichannel features for estimating a single-channel mask
2.5.3.2 Estimating intermediate variables for deriving a multichannel filter
2.5.3.3 Directly estimating a multichannel filter
2.5.3.4 Summary
2.6 Positioning of our study
3 Estimation of spectral parameters with DNNs
3.1 Research questions
3.2 Iterative framework with spectral DNNs
3.3 Experimental settings
3.3.1 Task and dataset
3.3.2 An overview of the speech enhancement system
3.3.3 DNN spectral models
3.3.3.1 Architecture
3.3.3.2 Inputs and outputs
3.3.3.3 Training criterion
3.3.3.4 Training algorithm
3.3.3.5 Training data
3.4 Source spectra estimation
3.5 Impact of spatial parameter updates
3.6 Impact of spectral parameter updates
3.7 Comparison to NMF based iterative EM algorithm
3.7.1 Source separation performance
3.7.2 Speech recognition performance
3.8 Impact of environment mismatches
3.9 Summary
4 On improving DNN spectral models
4.1 Research questions
4.2 Cost functions for spectral DNN
4.2.1 General-purpose cost functions
4.2.2 Task-oriented cost functions
4.3 Impact of the cost function
4.3.1 Experimental settings
4.3.2 Source separation performance
4.3.3 Speech recognition performance
4.4 Impact of time-frequency representations, DNN architectures, and DNN training data
4.4.1 Experimental settings
4.4.1.1 Time-frequency representations
4.4.1.2 DNN architectures and inputs
4.4.1.3 DNN training criterion, algorithm, and data
4.4.1.4 Multichannel filtering
4.4.2 Discussions
4.5 Impact of a multichannel task-oriented cost function
4.5.1 Experimental settings
4.5.1.1 Task and dataset
4.5.1.2 An overview of the singing-voice separation system
4.5.1.3 DNN spectral models
4.5.2 Discussions
4.5.2.1 Task-oriented cost function
4.5.2.2 Comparison with the state of the art
4.5.2.3 Data augmentation
4.6 Summary
5 Estimation of spatial parameters with DNNs
5.1 Research questions
5.2 Weighted spatial parameter updates
5.3 Iterative framework with spectral and spatial DNN
5.4 Experimental settings
5.4.1 Task and dataset
5.4.2 An overview of the speech enhancement system
5.4.3 DNN spectral models
5.4.3.1 Architecture, inputs, and outputs
5.4.3.2 Training criterion, algorithm, and data
5.4.4 DNN spatial models
5.4.4.1 Architecture, input, and outputs
5.4.4.2 Training algorithm and data
5.4.5 Design choices for the DNN spatial models
5.4.5.1 Cost functions
5.4.5.2 Architectures and input variants
5.5 Estimation of the oracle source spatial covariance matrices
5.6 Spatial parameter estimation with DNN
5.7 Impact of different spatial DNN architectures
5.8 Impact of different spatial DNN cost functions
5.9 Comparison with GEV-BAN beamforming
5.10 Summary
6 Conclusions and perspectives
6.1 Conclusions
6.2 Perspectives
Bibliography