ABE with memory inclusion using semi-supervised stacked autoencoders

ABE approaches based on source-filter model

The model-based ABE algorithms use a priori knowledge about the characteristics of speech signals and the human speech production mechanism. Since the beginning of the nineties, most ABE algorithms have exploited the classical source-filter model [80] of speech production, in which a NB speech signal is represented by an excitation source and a vocal tract filter. The frequency content of these two components can be extended through independent processing before a WB signal is resynthesised. The extension of NB speech is thus divided into two tasks: (1) estimation of the HB or WB spectral envelope from input NB features via some form of estimation technique, and (2) generation of HB or WB excitation components via some form of time-domain non-linear processing, spectral translation or spectral shifting. The HB component is usually parametrised with some form of linear prediction (LP) coefficients, whereas the NB component is parametrised by a variety of static and/or dynamic features.
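A minimal Python sketch of this decomposition is given below, assuming an utterance-level LP analysis with librosa; the file name, sampling rate and LP order are illustrative assumptions rather than the setup of any particular ABE system.

```python
# Source-filter decomposition: an LP filter A(z) models the vocal tract
# (spectral envelope); inverse filtering with A(z) yields the excitation
# (residual), and filtering the residual with 1/A(z) resynthesises the
# signal. In ABE, envelope and excitation are then extended independently.
import numpy as np
import librosa
from scipy.signal import lfilter

nb_speech, fs = librosa.load("speech_nb.wav", sr=8000)  # hypothetical NB input

order = 10                               # typical LP order for 8 kHz speech
a = librosa.lpc(nb_speech, order=order)  # a[0] == 1; vocal tract filter coefficients

excitation = lfilter(a, [1.0], nb_speech)   # inverse filter A(z) -> residual
resynth = lfilter([1.0], a, excitation)     # synthesis filter 1/A(z)

print(np.max(np.abs(resynth - nb_speech)))  # ~0: the decomposition is lossless
```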

Extension of spectral envelope

In practice, most approaches focus on the extension of the spectral envelope since it has the dominant impact on speech quality. Different techniques such as linear and codebook mapping, Gaussian mixture models (GMMs), hidden Markov models (HMMs) and deep neural networks (DNNs) are used for estimation. The use of many different feature representations for the NB and HB components has been reported, e.g. linear prediction coefficients (LPCs) [81, 82], line spectral frequencies (LSFs) [83, 84] and Mel-frequency cepstral coefficients (MFCCs) [85]. A mixed approach reported in [16] uses NB autocorrelation coefficients to estimate HB cepstral coefficients. Additional features are also sometimes added to the NB feature set to improve estimation performance [78, 86, 87].
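Of the estimation techniques listed above, GMM-based mapping admits a compact illustration. The sketch below fits a joint GMM on concatenated NB/HB feature vectors and estimates the HB envelope as the conditional mean E[y|x] (MMSE regression); the feature matrices, component count and dimensions are illustrative assumptions, not the configuration of any cited work.

```python
# Joint-GMM MMSE mapping from NB features x to HB features y:
# fit p([x; y]) with a GMM, then estimate y as a responsibility-weighted
# sum of per-component conditional means E[y | x, k].
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X_nb, Y_hb, n_components=8, seed=0):
    Z = np.hstack([X_nb, Y_hb])                      # joint training vectors
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(Z)

def estimate_hb(gmm, x, dx):
    """MMSE estimate of y given one NB vector x of dimension dx."""
    mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
    # responsibilities p(k | x) from the marginal Gaussians over x
    lik = np.array([w[k] * multivariate_normal.pdf(x, mu[k, :dx], S[k, :dx, :dx])
                    for k in range(len(w))])
    post = lik / (lik.sum() + 1e-300)
    y_hat = np.zeros(mu.shape[1] - dx)
    for k in range(len(w)):
        Sxx, Syx = S[k, :dx, :dx], S[k, dx:, :dx]
        y_hat += post[k] * (mu[k, dx:] + Syx @ np.linalg.solve(Sxx, x - mu[k, :dx]))
    return y_hat
```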

Codebook mapping

The earliest approaches to ABE use the codebook mapping method [88, 89, 90, 91, 92] for estimation. It involves the training of two codebooks. A primary codebook is trained on NB feature vectors x via vector quantisation (VQ), e.g. using the well-known LBG algorithm [93]. This is equivalent to clustering the training vectors into N clusters, where the cluster centroids form the entries of the primary codebook. For every entry in the primary codebook, the average of the corresponding WB vectors y forms the corresponding entry in a shadow codebook. During extension, the NB feature vector x of each NB speech frame is compared with the primary codebook entries.
The closest entry is selected and the corresponding entry in the shadow codebook gives the estimated WB spectral envelope. Some methods use interpolation to improve the performance of codebook-based ABE approaches; in this case, instead of choosing one WB envelope from the shadow codebook, a weighted sum of all (or of the most probable) codebook entries is used. The use of split codebooks has also been reported, where separate codebooks are used for voiced and unvoiced frames, given that voiced and unvoiced spectral envelopes are characterised by different shapes [94].
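The training and lookup procedure just described can be sketched as follows; scikit-learn's k-means stands in for LBG-trained VQ (an assumption), and the soft variant implements the weighted-sum interpolation mentioned above, the exponential weighting being one assumed choice among many.

```python
# Codebook mapping: the primary codebook holds NB centroids, the shadow
# codebook holds the mean WB vector of the training data in each cluster.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(X_nb, Y_wb, n_entries=64, seed=0):
    km = KMeans(n_clusters=n_entries, random_state=seed, n_init=10).fit(X_nb)
    primary = km.cluster_centers_                    # NB codebook entries
    shadow = np.stack([Y_wb[km.labels_ == k].mean(axis=0)
                       for k in range(n_entries)])   # paired WB entries
    return primary, shadow

def extend_frame(x_nb, primary, shadow, soft=False, T=1.0):
    d = np.linalg.norm(primary - x_nb, axis=1)       # distance to every entry
    if not soft:
        return shadow[np.argmin(d)]                  # hard nearest-entry lookup
    w = np.exp(-d / T)                               # assumed distance-to-weight mapping
    return (w / w.sum()) @ shadow                    # interpolated WB envelope
```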
The performance of codebook mapping methods depends on the sizes of the codebooks: the higher the number of entries, the better the estimation performance, albeit at the cost of the increased memory required to store the codebooks. Performance also depends strongly on the choice of features x and y. More details on codebook mapping can be found in [95, Section 6.6] and [92, Section 3.1].

ABE approaches based on direct modelling of spectra

Some ABE approaches, which are not based on the source-filter model, operate directly on higher-dimensional complex speech spectra.
A handful of ABE approaches [138, 139, 140] operate directly on DFT coefficients: the missing frequency components are first generated by a simple non-linear operation, e.g. spectral mirroring, and the generated HB frequency components are then spectrally shaped by a set of parameters. An adaptive spline neural network (ASNN) is employed in [141] to map the NB DFT coefficients directly to the missing HB coefficients. The work in [138] combines the use of spline interpolation and a neural network to estimate a set of sub-band powers, parameters which are then used to adaptively tune the spectral shape of the missing HB. A spectral magnitude shaping curve (defined by five control points) is constructed or learned using cubic spline interpolation [142] or a neuro-evolutive neural network [139]; the curve is then used to shape the magnitude spectral coefficients of the HB components.
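For the simplest of these operations, spectral mirroring, a minimal numpy sketch is given below; the single fixed attenuation gain is a crude stand-in for the learned, multi-point shaping curves described above.

```python
# Spectral mirroring: reflect the NB DFT bins about the NB cutoff to
# populate the empty HB bins, then attenuate them to shape the HB.
import numpy as np

def mirror_extend(frame_nb, gain_db=-12.0):
    """frame_nb: one frame already upsampled to 16 kHz, so that all
    spectral content above the 4 kHz NB cutoff is missing."""
    X = np.fft.rfft(frame_nb)               # bins covering 0 .. 8 kHz
    n_bins = len(X)
    cut = n_bins // 2                       # bin index of the 4 kHz cutoff
    mirrored = np.conj(X[cut::-1])          # NB bins reflected about the cutoff
    X[cut:] = mirrored[:n_bins - cut] * 10 ** (gain_db / 20.0)
    return np.fft.irfft(X, n=len(frame_nb))
```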
Further improvements to the ABE approach in [142], obtained through more accurate control of the HB spectral shape, are reported in [140, 143]. Other approaches directly estimate the spectral coefficients of the missing HB via statistical models. Log-power spectrum (LPS) coefficients of the missing HB are estimated using a DNN in [144]. In [145], DNN-based ABE estimation performance is improved via the use of rich acoustic features for the input NB representation, while the over-smoothing problem of DNNs is reduced using global variance equalisation as a post-processing technique. The work in [146] uses a deep bidirectional long short-term memory (BLSTM)-based RNN; the target HB spectral features are generated through a weighted linear combination of real target exemplars, a method used as a post-processing step to reduce estimation errors. Inspired by their use in automatic speech recognition (ASR), the ABE approach reported in [147] exploits bottleneck (BN) features [148], extracted using a DNN-based classifier, to capture the linguistic information in the input NB speech. A deep LSTM-based RNN is then employed to estimate the HB components from these BN features, which are expected to capture phone-dependent characteristics and the energy distributions of HB spectra.
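As a concrete, hedged example of such direct estimation, the PyTorch sketch below regresses HB log-power-spectrum bins from NB LPS bins under a plain MSE loss; the dimensions, depth and random stand-in data are illustrative assumptions and do not reproduce the configurations of [144] or [145].

```python
# Feed-forward DNN mapping NB LPS bins to HB LPS bins under an MSE loss.
import torch
import torch.nn as nn

class LPSMapper(nn.Module):
    def __init__(self, nb_dim=129, hb_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(nb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hb_dim),          # estimated HB LPS bins
        )

    def forward(self, x_nb_lps):
        return self.net(x_nb_lps)

model = LPSMapper()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# one illustrative training step on random stand-in data
x = torch.randn(32, 129)   # batch of NB LPS frames
y = torch.randn(32, 128)   # target HB LPS bins
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```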

End-to-end approaches to ABE

Convolutional neural networks (CNNs) are capable of extracting useful features by operating directly on raw speech waveforms. Inspired by the success of WaveNet [153] and dilated convolutional architectures [154, 155], an approach to ABE using stacked dilated CNNs is proposed in [156]. The method avoids spectral analysis and phase modelling issues via direct modelling and generation of time-domain speech waveforms. The NB speech signals (at a sampling rate of 16 kHz) are first fed to dilated CNNs to generate either HB or WB speech signals; the generated waveforms are then added to the available NB signals after appropriate high-pass filtering. Modelling HB speech waveforms at the output of the CNNs was found to be more effective than modelling WB waveforms. Inspired by the SampleRNN architecture [157], the use of hierarchical recurrent neural networks (HRNNs) composed of LSTM cells and feedforward layers is investigated in [158] for ABE, and a comparison of several waveform modelling techniques is presented. HRNNs were shown to achieve better speech quality and run-time efficiency than dilated CNNs. The major drawback of waveform-based ABE methods remains their low run-time efficiency; the generation of speech samples is time-consuming.
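A minimal PyTorch sketch of the stacked dilated CNN idea is given below; the layer count, channel width and residual structure are assumptions for illustration, not the exact published topology of [156].

```python
# Stacked dilated 1-D CNN operating directly on the time-domain waveform:
# exponentially growing dilations widen the receptive field without pooling.
import torch
import torch.nn as nn

class DilatedABE(nn.Module):
    def __init__(self, channels=64, n_layers=8):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        self.blocks = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)  # length-preserving
            for i in range(n_layers)
        ])
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, nb_wave):             # (batch, 1, samples) NB signal at 16 kHz
        h = self.inp(nb_wave)
        for conv in self.blocks:
            h = h + torch.tanh(conv(h))     # simple residual connection
        return self.out(h)                  # estimated HB (or WB) waveform

hb_est = DilatedABE()(torch.randn(1, 1, 16000))  # one second of stand-in audio
```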

ABE with modified loss functions

Most ABE approaches employ the standard mean-squared error (MSE) criterion for optimisation, which leads to an over-smoothing problem: the MSE loss is minimised by averaging over all plausible outputs. A regression model trained with an MSE loss thus performs reasonably well in the average sense, but fails to model the energy dynamics of different voiced and unvoiced sounds. Generative adversarial networks (GANs) [163] provide an alternative to the MSE loss via adversarial learning, wherein the HB features produced by a generator network are compared against the true HB features and classified as real or fake by a discriminator network. The goal of adversarial learning is thus to make the generated HB features indistinguishable from the true HB features and thereby to produce perceptually better samples. The first application of GANs to ABE is reported in [163]. The work in [164] showed further improvements in ABE performance via the use of conditional GANs [165]. The approach in [166] employs GANs for ABE with a stabilised training procedure, obtained by adding a penalty on the weighted gradient norms of the discriminator network (proposed in [167]). Another variant of GANs, referred to as cycleGANs [168] and trained with a cycle-consistency loss, is explored in [67] for the application of ASR.
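The sketch below contrasts this adversarial criterion with the MSE loss discussed above; the network bodies and feature dimensions are placeholders, and only the generator/discriminator loss structure follows the GAN formulation.

```python
# Adversarial training step for ABE features: the generator maps NB features
# to HB features; the discriminator scores HB features as real or fake.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 20))  # NB -> HB
D = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))   # HB -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

x_nb, y_hb = torch.randn(32, 40), torch.randn(32, 20)  # stand-in feature batch

# discriminator step: true HB -> "real", generated HB -> "fake"
opt_d.zero_grad()
d_loss = bce(D(y_hb), torch.ones(32, 1)) + \
         bce(D(G(x_nb).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_d.step()

# generator step: fool the discriminator into labelling fakes as real
opt_g.zero_grad()
g_loss = bce(D(G(x_nb)), torch.ones(32, 1))
g_loss.backward()
opt_g.step()
```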

Table of contents:

Abstract
List of Abbreviations
List of Figures
List of Tables
1 Introduction 
1.1 Evolution of communication systems
1.1.1 Analog and digital telephony
1.1.2 Wireless cellular networks
1.2 Speech production
1.2.1 Speech sounds
1.2.2 Spectral characteristics of speech sounds
1.2.3 Effect of bandwidth on speech quality and intelligibility
1.3 Speech coding
1.3.1 Narrowband coding
1.3.2 Wideband coding
1.3.3 Super-wideband or full band coding
1.4 Artificial bandwidth extension
1.4.1 Non-blind methods
1.4.2 Blind methods
1.4.3 Motivation and applications
1.5 Super-wide bandwidth extension
1.6 Contributions
1.7 Outline of the thesis
2 Literature survey 
2.1 Non-model based ABE approaches
2.2 ABE approaches based on source-filter model
2.2.1 Extension of spectral envelope
2.2.2 Extension of excitation
2.3 ABE approaches based on direct modelling of spectra
2.4 End-to-end approaches to ABE
2.5 ABE with modified loss functions
2.6 Feature selection and memory inclusion for ABE
2.6.1 Feature selection
2.6.2 Memory inclusion
2.7 Evaluation of speech quality
2.7.1 Assessment of different ABE algorithms
2.8 Approaches to super-wide bandwidth extension (SWBE)
2.8.1 SWBE for audio signals (speech and music)
2.8.2 SWBE for speech only
2.9 Summary
3 Baseline, databases and metrics 
3.1 ABE algorithm
3.1.1 Training
3.1.2 Estimation
3.1.3 Resynthesis
3.2 Databases
3.2.1 TIMIT
3.2.2 TSP speech database
3.2.3 CMU-Arctic database
3.2.4 3GPP database
3.3 Data pre-processing and distribution
3.3.1 Data pre-processing
3.3.2 Training, validation and test data
3.4 Performance assessment
3.4.1 Subjective assessment
3.4.2 Objective assessment metrics
3.4.3 Mutual information assessment
4 ABE with explicit memory inclusion 
4.1 Memory inclusion for ABE
4.2 Brief overview of memory inclusion for ABE via delta features: Past work
4.2.1 Memory inclusion scenarios
4.2.2 Highband certainty
4.2.3 Analysis and results
4.2.4 Discussion
4.3 Assessing the benefit of explicit memory to ABE
4.3.1 Analysis
4.3.2 Findings
4.3.3 Need for dimensionality reduction
4.4 ABE with explicit memory inclusion
4.4.1 Training
4.4.2 Estimation
4.4.3 Resynthesis
4.5 Experimental setup and results
4.5.1 Implementation details and baseline
4.5.2 Objective assessment
4.5.3 Subjective assessment
4.5.4 Mutual information assessment
4.5.5 Discussion
4.6 Summary
5 ABE with memory inclusion using semi-supervised stacked autoencoders
5.1 Unsupervised dimensionality reduction
5.1.1 Principal component analysis
5.1.2 Stacked auto-encoders
5.2 ABE using semi-supervised stacked auto-encoders
5.2.1 Semi-supervised stacked auto-encoders
5.2.2 Application to ABE
5.3 Experimental setup
5.3.1 SSAE training, configuration and optimisation
5.3.2 Databases and metrics
5.4 Results
5.4.1 Speech quality assessment
5.4.2 Mutual information assessment
5.5 Summary
6 Latent representation learning for ABE 
6.1 Variational auto-encoders
6.1.1 Variational lower bound
6.1.2 Reparameterisation trick
6.1.3 Relation to conventional auto-encoders
6.1.4 VAEs for real valued Gaussian data
6.2 Conditional variational auto-encoders
6.3 Application to ABE
6.3.1 Motivation
6.3.2 Extracting latent representations
6.3.3 Direct estimation using CVAE-DNN
6.4 Experimental setup and results
6.4.1 CVAE configuration and training
6.4.2 Analysis of weighting factor
6.4.3 Objective assessment
6.4.4 Subjective assessment
6.5 Summary
7 Super-wide bandwidth extension 
7.1 Motivation
7.2 Past work
7.3 Super-wide bandwidth extension (SWBE)
7.3.1 High frequency component estimation
7.3.2 Low frequency component upsampling
7.3.3 Resynthesis
7.4 Spectral envelope extension
7.4.1 Effect of sampling frequency
7.4.2 Extension
7.4.3 Comparison
7.5 Experimental setup and results
7.5.1 Databases
7.5.2 Data pre-processing
7.5.3 Assessment and baseline algorithm
7.5.4 Objective assessment
7.5.5 Subjective assessment
7.5.6 Discussion
7.6 Summary
8 Conclusions and future directions 
8.1 Contributions and conclusions
8.2 Future directions
Bibliography 
