Table of contents
1 Introduction
1.1 Motivation
1.1.1 Audio source separation
1.1.2 Speech and music separation
1.1.3 Single-channel and multichannel separation
1.1.4 Deep neural networks (DNNs)
1.2 Objectives and scope
1.3 Contributions and organization of the thesis
2 Background
2.1 Audio source separation
2.1.1 Sources and mixture
2.1.2 Source separation
2.2 Automatic speech recognition (ASR)
2.3 Time-frequency representation
2.4 State-of-the-art single-channel audio source separation
2.4.1 Time-frequency masking
2.4.2 Non-negative matrix factorization (NMF)
2.4.3 DNN based single-channel audio source separation
2.4.3.1 Basics of DNNs
2.4.3.2 DNN based separation techniques
2.5 State-of-the-art multichannel audio source separation
2.5.1 Beamforming
2.5.2 Expectation-maximization (EM) based multichannel audio source separation framework
2.5.2.1 Multichannel Gaussian model
2.5.2.2 General iterative EM framework
2.5.3 DNN based multichannel audio source separation techniques
2.5.3.1 Utilizing multichannel features for estimating a single-channel mask
2.5.3.2 Estimating intermediate variables for deriving a multichannel filter
2.5.3.3 Directly estimating a multichannel filter
2.5.3.4 Summary
2.6 Positioning of our study
3 Estimation of spectral parameters with DNNs
3.1 Research questions
3.2 Iterative framework with spectral DNNs
3.3 Experimental settings
3.3.1 Task and dataset
3.3.2 An overview of the speech enhancement system
3.3.3 DNN spectral models
3.3.3.1 Architecture
3.3.3.2 Inputs and outputs
3.3.3.3 Training criterion
3.3.3.4 Training algorithm
3.3.3.5 Training data
3.4 Source spectra estimation
3.5 Impact of spatial parameter updates
3.6 Impact of spectral parameter updates
3.7 Comparison to NMF based iterative EM algorithm
3.7.1 Source separation performance
3.7.2 Speech recognition performance
3.8 Impact of environment mismatches
3.9 Summary
4 On improving DNN spectral models
4.1 Research questions
4.2 Cost functions for spectral DNN
4.2.1 General-purpose cost functions
4.2.2 Task-oriented cost functions
4.3 Impact of the cost function
4.3.1 Experimental settings
4.3.2 Source separation performance
4.3.3 Speech recognition performance
4.4 Impact of time-frequency representations, DNN architectures, and DNN training data
4.4.1 Experimental settings
4.4.1.1 Time-frequency representations
4.4.1.2 DNN architectures and inputs
4.4.1.3 DNN training criterion, algorithm, and data
4.4.1.4 Multichannel filtering
4.4.2 Discussions
4.5 Impact of a multichannel task-oriented cost function
4.5.1 Experimental settings
4.5.1.1 Task and dataset
4.5.1.2 An overview of the singing-voice separation system
4.5.1.3 DNN spectral models
4.5.2 Discussions
4.5.2.1 Task-oriented cost function
4.5.2.2 Comparison with the state of the art
4.5.2.3 Data augmentation
4.6 Summary
5 Estimation of spatial parameters with DNNs
5.1 Research questions
5.2 Weighted spatial parameter updates
5.3 Iterative framework with spectral and spatial DNN
5.4 Experimental settings
5.4.1 Task and dataset
5.4.2 An overview of the speech enhancement system
5.4.3 DNN spectral models
5.4.3.1 Architecture, inputs, and outputs
5.4.3.2 Training criterion, algorithm, and data
5.4.4 DNN spatial models
5.4.4.1 Architecture, input, and outputs
5.4.4.2 Training algorithm and data
5.4.5 Design choices for the DNN spatial models
5.4.5.1 Cost functions
5.4.5.2 Architectures and input variants
5.5 Estimation of the oracle source spatial covariance matrices
5.6 Spatial parameter estimation with DNN
5.7 Impact of different spatial DNN architectures
5.8 Impact of different spatial DNN cost functions
5.9 Comparison with GEV-BAN beamforming
5.10 Summary
6 Conclusions and perspectives
6.1 Conclusions
6.2 Perspectives
Bibliography
