Non-diagonal smoothed shrinkage for robust audio denoising


Overview of single microphone speech enhancement system

In audio and speech enhancement, one of the most important tasks is the removal or reduction of background noise from a noisy signal. The observed signal is usually segmented, windowed and transformed into a representation domain. The clean signal coefficients are then estimated by applying an enhancement algorithm to the noisy observations in this domain. Figure 2.1 shows the block diagram of a basic single-channel speech enhancement system. A single microphone system consists of four blocks: decomposition, noise estimation, noise reduction and reconstruction. In short, the process is as follows. First, the noisy signal $y[n]$ is decomposed using a short time harmonic transform (STHT) in the decomposition block. Second, the noisy time-frequency coefficient $Y[m, k]$ is modified to obtain the enhanced coefficient $\hat{S}[m, k]$ in the noise reduction block. Note that the noise estimation block provides the noise power spectrum estimate $\hat{\sigma}_X^2[m, k]$, which is an important input of the noise reduction block. Finally, the enhanced signal $\hat{s}[n]$ is synthesized from the enhanced time-frequency coefficients $\hat{S}[m, k]$ in the reconstruction block. Specifically, all the algorithms in this thesis are implemented with the Hamming window and the 50% overlap-add method. The role of each block is described in detail in the following sub-sections.
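The four-block chain described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the DFT is used as one possible instance of the STHT, the frame length of 256 samples is arbitrary, and the names `stft`, `istft`, `enhance` and `gain_fn` are hypothetical helpers introduced here, not part of the thesis. Reconstruction divides by the summed window so that an all-ones gain returns the input unchanged.

```python
import numpy as np

def stft(y, frame_len=256):
    """Decomposition block: Hamming-windowed frames, 50% overlap, then DFT."""
    hop = frame_len // 2
    w = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[m * hop:m * hop + frame_len] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # noisy coefficients Y[m, k]

def istft(S_hat, frame_len=256):
    """Reconstruction block: inverse DFT followed by overlap-add."""
    hop = frame_len // 2
    w = np.hamming(frame_len)
    frames = np.fft.irfft(S_hat, n=frame_len, axis=1)
    n = hop * (len(frames) - 1) + frame_len
    out = np.zeros(n)
    norm = np.zeros(n)
    for m, frame in enumerate(frames):
        out[m * hop:m * hop + frame_len] += frame
        norm[m * hop:m * hop + frame_len] += w
    # The Hamming window never reaches zero, so the summed window is positive.
    return out / norm

def enhance(y, noise_psd, gain_fn, frame_len=256):
    """Chain the blocks; gain_fn(noisy_power, noise_psd) returns G[m, k]."""
    Y = stft(y, frame_len)
    G = gain_fn(np.abs(Y) ** 2, noise_psd)  # noise reduction block
    return istft(G * Y, frame_len)
```

With a gain equal to one everywhere, `enhance` acts as a pure analysis-synthesis chain, which is a convenient sanity check before plugging in an actual noise estimator and gain function.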

Decomposition block

In the decomposition block, the noisy observed signal is segmented, windowed and transformed by a computational harmonic transform. Indeed, most speech enhancement algorithms do not operate in the time domain but rather in a transformed domain, where the separation between clean signal and noise is easier. As mentioned above, we concentrate on the speech enhancement scenario where the noise is additive and uncorrelated with the speech. Therefore, the noisy signal is modeled by $y[n] = s[n] + x[n]$, where $s$ and $x$ are respectively the clean signal and the independent noise in the time domain and $n = 0, 1, \ldots, N - 1$ is the sampling time index. Most enhancement algorithms operate frame by frame, where only a finite collection of observations $y[n]$ is available. A time-domain window $w[n]$ is usually applied to the noisy signal, yielding the windowed signal $y_W[n] = y[n]\,w[n]$ (2.1).
In frame-based signal processing, the shape of the window is chosen as a trade-off between smearing and leakage effects [69]. The second parameter is the window length, which sets the trade-off between spectral resolution and statistical variance. In speech enhancement, if the window is too long, speech can no longer be considered stationary within a frame. On the other hand, if the window is too short, the spectral resolution may not be accurate enough. Based on these considerations, Hanning and Hamming window functions are often chosen to truncate the signal of interest in the considered frame. The shape of these windows is illustrated in Figure 2.2. In this thesis, we prefer the Hamming window function, which does not vanish at its end points. In its standard form, the Hamming window of length $L$ is defined as follows: $w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{L - 1}\right)$ for $n = 0, 1, \ldots, L - 1$.
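The difference between the two windows at the frame boundaries can be checked numerically. The sketch below assumes the standard textbook definitions of the Hamming and Hanning windows and a hop of half a frame; the function names and the frame length symbol `L` are illustrative choices, not notation taken from the thesis.

```python
import numpy as np

def hamming_window(L):
    """Standard Hamming window: 0.54 - 0.46 cos(2 pi n / (L - 1))."""
    n = np.arange(L)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))

def hann_window(L):
    """Standard Hanning (Hann) window: 0.5 - 0.5 cos(2 pi n / (L - 1))."""
    n = np.arange(L)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / (L - 1))

def windowed_frames(y, L):
    """Cut y into frames of length L with 50% overlap and apply the window."""
    hop = L // 2
    w = hamming_window(L)
    starts = range(0, len(y) - L + 1, hop)
    return np.stack([y[s:s + L] * w for s in starts])  # rows are y_W for each frame
```

At the first and last samples the Hamming window equals 0.54 - 0.46 = 0.08, whereas the Hanning window equals zero there, which is the property mentioned in the text.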

Noise estimation block

The noise estimation block aims at estimating the noise power spectrum $\sigma_X^2[m, k] = E[|X[m, k]|^2]$. Noise estimation is a central block for which various techniques have been proposed. In this section, we discuss only some general points for completeness. For further details about noise estimation, readers are invited to consult Chapter 3 in Part II. Most noise estimation algorithms are based on the following assumptions [1, Chapter 9]:
(A1) As mentioned above, the speech signal is degraded by a statistically independent additive noise.
(A2) Speech is not always present. Thus, we can always find an analysis segment, formed by some consecutive frames, that contains a speech pause, i.e. noise only.
(A3) Noise is more stationary than clean speech so that we can assume that noise remains stationary within a given analysis segment.
As an example, we detail one of the first noise power spectrum estimators, which is based on minimum statistics (MS) [70]. This algorithm tracks the minimum value of the noisy speech power spectrum within an analysis segment. Because noise and speech are statistically independent (A1), the periodogram of noisy speech is approximated as $|Y[m, k]|^2 \approx |X[m, k]|^2 + |S[m, k]|^2$.
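The minimum-tracking idea can be illustrated with a deliberately simplified sketch. It is not the full MS algorithm of [70]: it omits, among other things, the bias compensation of the minimum, and it uses a fixed smoothing constant. The function name and the parameters `win` (analysis segment length in frames) and `alpha` (smoothing constant) are assumptions made for illustration.

```python
import numpy as np

def min_tracking_noise_psd(periodogram, win=50, alpha=0.85):
    """Simplified minimum tracking of the noise power spectrum.

    periodogram: array of shape (frames, bins) holding |Y[m, k]|^2.
    Returns an array of the same shape with the tracked minimum per bin.
    """
    P = np.asarray(periodogram, dtype=float)
    # First-order recursive smoothing of the periodogram along time.
    smoothed = np.empty_like(P)
    smoothed[0] = P[0]
    for m in range(1, len(P)):
        smoothed[m] = alpha * smoothed[m - 1] + (1 - alpha) * P[m]
    # Minimum of the smoothed spectrum over the last `win` frames, per bin.
    noise = np.empty_like(P)
    for m in range(len(P)):
        first = max(0, m - win + 1)
        noise[m] = smoothed[first:m + 1].min(axis=0)
    return noise
```

Under assumption (A2), an analysis segment longer than a typical speech burst contains at least some noise-only frames, so the tracked minimum stays close to the noise floor even while speech is present.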

Noise reduction block

Once the noise power spectrum estimate is obtained, a single microphone system uses a noise reduction algorithm to retrieve the enhanced coefficient $\hat{S}[m, k]$. As for the noise estimation block, for the sake of completeness, we choose to present in this section one of the first noise reduction methods, the power spectral subtraction algorithm, which is computationally efficient [9]. Further details will be given in the following chapters. For most noise reduction algorithms, we can define a gain function $G[m, k]$ such that the enhanced amplitude of the signal of interest $\hat{A}_S[m, k]$ is obtained as follows: $\hat{A}_S[m, k] = G[m, k] A_Y[m, k]$.
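As an illustration of the gain formulation, the sketch below uses the common textbook form of power spectral subtraction, $G[m, k] = \sqrt{\max(1 - \hat{\sigma}_X^2[m, k] / |Y[m, k]|^2, 0)}$, with negative values floored at zero. The exact variant used in the thesis may differ; the function names and the small constant `eps` are assumptions added here.

```python
import numpy as np

def power_spectral_subtraction_gain(noisy_power, noise_psd, eps=1e-12):
    """Gain G = sqrt(max(1 - noise_psd / |Y|^2, 0)), assumed textbook form."""
    ratio = noise_psd / np.maximum(noisy_power, eps)  # eps avoids division by zero
    return np.sqrt(np.maximum(1.0 - ratio, 0.0))

def apply_gain(Y, noise_psd):
    """Scale the noisy amplitude by G and keep the noisy phase."""
    G = power_spectral_subtraction_gain(np.abs(Y) ** 2, noise_psd)
    return G * Y
```

With this gain, the enhanced power equals the noisy power minus the estimated noise power wherever that difference is positive, and zero elsewhere.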

Table of contents:

Acknowledgements
Summary in French
Abstract
Summary
Acronyms
List of Figures
List of Tables
I Introduction
1 Introduction
1.1 Context of the thesis
1.2 A brief history of speech enhancement
1.2.1 Unsupervised methods
1.2.2 Supervised methods
1.3 Thesis motivation and outline
2 Single microphone speech enhancement techniques
2.1 Introduction
2.2 Overview of single microphone speech enhancement system
2.2.1 Decomposition block
2.2.2 Noise estimation block
2.2.3 Noise reduction block
2.2.4 Reconstruction block
2.3 Performance evaluation of speech enhancement algorithms
2.3.1 Objective tests
2.3.2 Mean opinion scores subjective listening test
2.4 Conclusion
II Noise: Understanding the Enemy
3 Noise estimation block
3.1 Introduction
3.2 DATE algorithm
3.3 Weak-sparseness model for noisy speech
3.4 Noise power spectrum estimation by E-DATE
3.4.1 Stationary WGN
3.4.2 Colored stationary noise
3.4.3 Extension to non-stationary noise: The E-DATE algorithm
3.4.4 Practical implementation of the E-DATE algorithm
3.5 Performance evaluation
3.5.1 Number of parameters
3.5.2 Noise estimation quality
3.5.3 Performance evaluation in speech enhancement
3.5.4 Complexity analysis
3.6 Conclusion
III Speech: Improving you
4 Spectral amplitude estimator based on joint detection and estimation
4.1 Introduction
4.2 Signal model in the DFT domain
4.3 Strict presence/absence estimators
4.3.1 Strict joint STSA estimator
4.3.2 Strict joint LSA estimator
4.4 Uncertain presence/absence estimators
4.4.1 Uncertain joint STSA detector/estimator
4.4.2 Uncertain joint LSA estimator
4.5 Experimental results
4.5.1 Database and Criteria
4.5.2 STSA-based results
4.5.3 LSA-based results
4.6 Conclusion
5 Non-diagonal smoothed shrinkage for robust audio denoising
5.1 Introduction
5.1.1 Motivation and organization
5.1.2 Signal model and notation in the DCT domain
5.1.3 Sparse thresholding and shrinkage for detection and estimation
5.2 Non-diagonal audio estimation of Discrete Cosine Coefficients
5.2.1 Non-parametric estimation by Block-SSBS
5.2.2 MMSE STSA in the DCT domain
5.2.3 Combination method
5.3 Experimental Results
5.3.1 Parameter adjustment
5.3.2 Speech data set
5.3.3 Music data set
5.4 Conclusion
IV Conclusion
6 Conclusions and Perspectives
6.1 Conclusion
6.2 Perspectives
A Lemma of the integral optimization problem
B Detection threshold under joint detection and estimation
B.1 Strict model
B.2 Uncertain model
B.2.1 Independent estimators
B.2.2 Joint estimator
C Semi-parametric approach
C.1 The unbiased estimate risk of block for Block-SSBS
C.2 The MMSE gain function in the DCT domain
D Author Publications
Bibliography