Single-channel source separation may be useful in many different scenarios:
• Fine-level decomposition: for each instrument, each note must be extracted as an individual source. In contrast with polyphonic music transcription, recovering source estimates makes it possible to perform modifications such as editing notes or changing the length of one note without changing that of the others.
• Instrument-per-instrument separation: this is the kind of task proposed in separation campaigns such as MIR-1K or SiSEC. While there may be several instruments, in popular songs these are highly synchronized. Due to its particular role in Western popular music, extracting the voice from the accompaniment is of particular interest. It may also be interesting to extract solo instruments (e.g., the electric guitar in rock and roll recordings).
• "Nonstationary" denoising: while denoising has been a successful application of signal processing since the 1970s, it is limited to stationary noise, e.g., ambient noise in room recordings, whose spectral properties are considered constant over time. In field recordings, on the contrary, or in real-life situations, what we commonly experience as noise may have time-varying spectral properties: wind, traffic, water trickling, etc. In this case, denoising is no longer an appropriate term and source separation methods should be used.
• Source separation can also be used as a pre-processing step for other tasks, such as automatic speech recognition (ASR). When there are sources of interference (other voices for instance), source separation can be used to clean the target speech signal from interferences and feed it to a standard recognizer. This approach to speech enhancement is used by many participants in the PASCAL CHiME Challenge and allows achieving recognition rates that are close to those of human hearing.
Source separation databases
• SiSEC: held in 2008, 2010, and 2011. Several tasks were proposed, including Professionally Produced Music Recordings. Short 10-second excerpts from songs are provided, together with the original sources, as training data. For test data, other songs were released in full. This data set is particularly challenging because the instruments change entirely from train to test. Moreover, training samples are very short (10 seconds), so sophisticated methods are likely to overfit.
• QUASI: released in 2012, it is very similar to the Professionally Produced Music Recordings task of SiSEC, but provides more training data.
• CHiME challenge: held in 2011, it aims at evaluating the performance of automatic speech recognition (ASR) in real-world environments. Sentences from the Grid corpus are mixed with environmental noise. There are several hours of training data for environmental noise and clean speech. In this setting, source separation algorithms were key to providing reliable estimates of clean speech on test data.
• NOIZEUS dataset: this corpus contains thirty short English sentences (each about three seconds long) spoken by three female and three male speakers.
Time-frequency representations of sound signals
In this short section we present as much mathematical content as is necessary for the reader to understand their essential properties, in particular the short-time Fourier transform. A complete presentation of time-frequency representations can be found in books such as [Kovacevic et al., 2012, Mallat, 2008].
Given a time signal x ∈ R^T, the Fourier transform of x is defined as:

x̂_k = Σ_{t=0}^{T−1} x_t exp(−2iπkt/T),  k = 0 … T − 1.  (1.4)

x̂ has Hermitian symmetry around T/2:

∀k, x̂_{T−k} = x̂_k*,  k = 0 … T − 1.  (1.5)
Hermitian symmetry compensates for the fact that x̂ lives in C^T, which has twice as many dimensions as R^T.
Coefficient k of the Fourier transform describes the contribution of a sinusoid at frequency f_s k/T, from 0 Hz to the Shannon-Nyquist rate f_s/2 Hz. In sound signals, coefficient k = 0 is always zero, and Fourier coefficients decay fast with k, provided there are no discontinuities in the signal (which happens at note onsets and offsets in music signals, at the beginnings and ends of utterances in speech, and so on).
The Fourier transform is invertible and conserves energy, i.e.,

∀x, ‖x‖² = ‖x̂‖²  (1.6)

so the contribution of all sinusoids is sufficient to describe the whole signal. Note that a circular translation of x does not change the modulus of its Fourier coefficients: thus if we keep only the magnitude coefficients |x̂| we obtain a translation-invariant representation of x.
This representation is particularly relevant for sound signals at time scales of the order of a few tens of milliseconds, which typically have few nonzero coefficients in the Fourier domain but are very dense in the time domain. This sparsity property is exploited in lossy compression schemes such as AAC.
On the other hand, at larger time scales, localized structures such as sequences of vowels in speech or of notes in music cannot be described by the Fourier transform, while they are easily discriminable in the time domain. Trading off time localization against frequency localization is at the heart of time-frequency representations of signals, as we shall now see.
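As an illustrative sketch (in numpy, not code from the thesis), the properties above (energy conservation, Hermitian symmetry, and translation invariance of the magnitude spectrum) can be checked numerically on a random signal:

```python
import numpy as np

# Sketch (numpy, not from the thesis): numerically check three properties
# of the discrete Fourier transform stated above on a random real signal.
rng = np.random.default_rng(0)
T = 512
x = rng.standard_normal(T)
X = np.fft.fft(x)

# Energy conservation (Eq. 1.6); numpy's unnormalized FFT convention gives
# a factor T between time- and frequency-domain energies.
assert np.isclose(np.sum(x ** 2), np.sum(np.abs(X) ** 2) / T)

# Hermitian symmetry (Eq. 1.5): X[T - k] == conj(X[k]) for k = 1 .. T - 1.
assert np.allclose(X[1:][::-1], np.conj(X[1:]))

# A circular translation changes the phases but not the moduli |X|.
assert np.allclose(np.abs(np.fft.fft(np.roll(x, 37))), np.abs(X))
```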
Short time Fourier transform
Time-frequency representations are a compromise between the time- and frequency-localized properties of sound signals. In this thesis, we use a well-known time-frequency representation called the short-time Fourier transform (STFT).
The STFT is computed in two steps: first the signal is cut into smaller pieces x^(n) and multiplied by a smoothing window. The Fourier transform of each x^(n) is then computed and those are stacked column by column into X ∈ C^{F×N}. These operations may be summed up in the following formula:

S(x)_{f,n} = Σ_{t=0}^{F−1} w_t x_{nL+t} exp(−2iπft/F)  (1.7)

where w ∈ R^F is a smoothing window of size F, L is a shift parameter (also called hop size), or equivalently H = F − L is the overlap parameter.
Thus each column of X gives a complete description of the frequency content of x locally on an interval of the form [nL, nL + F].
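A minimal implementation of Formula 1.7 might look as follows (a sketch with hypothetical parameter values F = 512 and L = 128; real implementations differ in their padding and framing conventions):

```python
import numpy as np

# A minimal STFT following the two-step description above: cut the signal
# into frames of length F shifted by L samples, multiply by a smoothing
# window w, take the DFT of each windowed frame, and stack the results as
# columns. (Sketch for illustration; boundary handling is simplified.)
def stft(x, F=512, L=128):
    w = np.hanning(F)                        # smoothing window of size F
    n_frames = 1 + (len(x) - F) // L
    frames = np.stack([w * x[n * L : n * L + F] for n in range(n_frames)],
                      axis=1)                # shape (F, N)
    return np.fft.fft(frames, axis=0)        # each column: local spectrum

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
X = stft(x)
print(X.shape)  # (512, 29)
```

Each column X[:, n] describes the frequency content of x on the interval [nL, nL + F], as stated above.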
More concisely, let S^⊤ be the conjugate transpose of S. We then have S^⊤S = F·Id, where Id ∈ R^{T×T} is the identity matrix. A generalization of condition 1.8 exists if the analysis window in 1.7 and the synthesis window in 1.9 are not equal. In the rest of this thesis, S† will stand for the inverse short-time Fourier transform. Note that the term inverse is not meant in a mathematical sense; we will come back to this point later.
Once the STFT is computed, phase coefficients are discarded and either the magnitude coefficients |X_{fn}| or the squared power |X_{fn}|² is kept. As noticed in the last section, the magnitude of the Fourier transform |x̂| is invariant to circular translation. As for the magnitude spectrogram, only translations by a multiple of the shift parameter preserve it strictly. For smaller shifts, the discrepancy between the magnitude spectrogram of the shifted signal and that of the original is small if the signal is stationary inside each analysis window: it is close to 0 where the signal is sinusoidal, and the largest discrepancies are found at transients.
On the other hand, enforcing translation invariance in linear models is so costly in terms of computational resources that, on the whole, the magnitude spectrogram is widely accepted as a convenient tool providing approximate translation invariance at minimal cost in terms of accuracy.
Recovery of source estimates via time-frequency masking
Given estimates V̂^(g) of the source power spectrograms, a simple procedure would consist in keeping the same phase as the input mixture for each source:

Ŝ^(g)_{fn} = √(V̂^(g)_{fn}) exp(iφ_{fn})  (1.11)

where φ_{fn} is the phase of spectrogram X (modulo 2π). However, source spectrogram estimates are often noisy due to over-simplifying assumptions, suboptimal solutions, etc. Using reconstruction formula 1.11 would imply in particular that source estimates do not add up to the observed spectrogram: Σ_g Ŝ^(g) ≠ X. Instead, it is preferable to compute source estimates by filtering the input:

Ŝ^(g)_{fn} = M^(g)_{fn} X_{fn}  where  M^(g)_{fn} = V̂^(g)_{fn} / Σ_{g′} V̂^(g′)_{fn}.  (1.12)
We will show in Section 2.3.2 that if a Gaussian model is assumed for the source spectrograms S^(g), then this formula corresponds to computing Minimum Mean Square Error estimates of the sources. These particular coefficients M^(g)_{fn} will be referred to as Wiener masking coefficients, or oracle coefficients: indeed, for each time frame n and each source g, M^(g)_{fn} may be interpreted as the f-th Fourier coefficient of a linear filter. Linear filters determined by Formula 1.12 were derived by Wiener to estimate clean signals corrupted by Gaussian white noise.
Other probabilistic models imply different recovery formulae; see [Benaroya and Bimbot, 2003] for a discussion.
Ideal binary masks also work surprisingly well:

M^(g)_{fn} = 1 if V^(g)_{fn} > max_{g′≠g} V^(g′)_{fn}, and 0 otherwise.  (1.13)
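As a sketch of both masking strategies (on toy random spectrograms, not thesis code), Wiener masks from Formula 1.12 and binary masks from Formula 1.13 can be built and applied as follows:

```python
import numpy as np

# Sketch (hypothetical toy data): given estimated source power
# spectrograms V[g], build Wiener masks (Eq. 1.12) and ideal binary masks
# (Eq. 1.13), then filter the mixture STFT X.
rng = np.random.default_rng(0)
F, N, G = 257, 100, 3
V = rng.gamma(1.0, 1.0, size=(G, F, N))            # estimated power spectrograms
X = rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))

eps = 1e-12                                         # avoid division by zero
wiener = V / (V.sum(axis=0, keepdims=True) + eps)  # M_fn^(g), sums to ~1 over g
binary = (V == V.max(axis=0, keepdims=True)).astype(float)

S_hat = wiener * X                                 # filtered source estimates

# Wiener-masked estimates add up to (almost exactly) the mixture.
assert np.allclose(S_hat.sum(axis=0), X, atol=1e-6)
```

By construction, the Wiener-masked estimates sum back to the mixture, which is precisely the conservativity that the phase-copy reconstruction of Formula 1.11 lacks.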
Models for audio source separation
Once a time-frequency representation has been computed, latent variable models are used to estimate the contribution of putative sources to the observed mixed signal. In this thesis we assume that the number of sources is known, as well as the source types (voice, instrument, environmental noise), although the source signals are not. Latent variable models capture typical sounds emitted by each source in a compact model called a dictionary. Given a mixed signal, the most plausible combination of dictionary atoms of each source is searched for and used to estimate source spectrograms.
The first latent variable models for single-channel source separation were based on independent component analysis (ICA) [Casey and Westner, 2000, Jang et al., 2003]. Note however that those works differ from classical approaches to ICA (see [Cardoso, 1998, Comon, 1994, Hyvarinen, 1999]), which require more channels than sources. In [Casey and Westner, 2000], each frequency of the STFT is considered as a channel. In this case, ICA can be viewed as an instance of a matrix factorization problem, sharing similarities with NMF but requiring different assumptions on the source signals. In [Jang et al., 2003], a single-channel recording is broken into a collection of 25-millisecond-long segments, without application of the Fourier transform. Those are passed as input to an ICA algorithm that enforces a translation-invariant representation of the signals.
At the same time, mixture models were proposed by [Roweis, 2001] to model the nonstationarity of source signals. While ICA may be seen as a matrix factorization technique similar in spirit to PCA and is widely used for multichannel source separation, it relies on the assumption that source spectrograms are independent. In many audio signals, such as music signals, this assumption is incorrect, since several instruments may play very similar notes at similar times. NMF was then introduced as a natural way to circumvent this problem while keeping the idea of a low-rank approximation of the observed spectrogram. It was first presented as a tool for polyphonic transcription [Smaragdis and Brown, 2003] and was intensively studied in the following years.
In this section, we will show how NMF may be seen as an extension of mixture models and outline the main challenges in learning appropriate models for source separation. In early contributions, one model was trained for each source in isolation and the models were then combined at test time to infer the state of each source. These models proved successful in controlled experimental conditions, but in real-world source separation benchmarks, learning models directly on mixed data became essential, as training data is scarce and sometimes missing.
We begin this section by presenting mixture models, with a special emphasis on the Gaussian scaled mixture model (GSMM) proposed by [Benaroya and Bimbot, 2003, Benaroya et al., 2006] for audio source separation and an application of hidden Markov models (HMM) proposed by [Roweis, 2001]. We then show that nonnegative matrix factorization may be seen as a relaxation of GSMM.
Mixture models: hard sparsity constraints
Latent variables H^(g) ∈ {0,1}^{K_g×N} represent the state of each source at a given time bin. A global matrix H = ((H^(1))^⊤, …, (H^(G))^⊤)^⊤ is created by concatenating the H^(g) row by row.
To each source is associated a dictionary of template spectra W^(g) ∈ R_+^{F×K_g}. Each column of W^(g) is interpreted as a typical spectrum observed in source g. Since the classes of sources encountered in this thesis are quite general (voice, guitar, piano, etc.), it is reasonable to assume that each source emits several typical spectra, the collection of which corresponds to W^(g).
These dictionaries are concatenated column by column to yield

W = (W^(1), …, W^(G)) ∈ R_+^{F×K}  (1.14)

where K = Σ_g K_g. Throughout this thesis we will assume, without loss of generality, that all K_g are equal to some constant K.
Scaling factors. As noticed in [Benaroya et al., 2006], the number of components may be reduced by introducing scaling factors: still only one component H_{kn} is active at a time, but it is allowed to take values in R_+ to compensate for amplitude modulations in the spectrogram (typically, a 5-second piano note with very slow damping would require many components to quantize the variations of intensity, whereas only one or two components suffice if scaling is allowed). We sketch this idea in Figure 1.6, where a scaled mixture model captures the whole data set with two components (dashed black lines), whereas mixture models need six components (red circles).
Learning: trained models or blind learning?
The key to successful inference is that the columns of W^(g) should provide good estimates of spectrograms generated by source g and bad estimates of the other sources. As proposed in [Benaroya and Bimbot, 2003, Roweis, 2001], the first part of this statement is achieved by optimizing the likelihood of isolated samples from source g with respect to W^(g). However, when benchmarks were introduced (SASSEC, SiSEC, CHiME, RWC), participants submitted mostly source separation systems where models were learnt partially or completely on mixed data, because training samples are sometimes missing or inadequate to represent mixed signals (variability between instrument classes or between singers for instance, usage of linear/nonlinear effects on instruments in some mixed signals, etc.). Blind learning of W is a non-convex problem involving continuous variables W and discrete variables H.
It is solved by an EM algorithm, which has the attractive property of being a descent algorithm [Dempster et al., 1977]. However, while for many choices of p(H) and p(V | V̂), inference in H is a convex problem, this is no longer the case when (W, H) are estimated jointly. This entails that the problem has many stationary points, and that the solution found by the EM algorithm will depend strongly on the chosen initial point. In practice, several initial points are tried and the one with the lowest objective cost is kept.
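The restart strategy can be sketched on a toy problem (a two-component 1D Gaussian mixture, not the thesis model): EM is run from several random initial points and the run with the lowest objective cost is kept.

```python
import numpy as np

# Sketch (toy 1D Gaussian mixture, not the thesis model): EM converges to
# a stationary point that depends on the initial point, so several random
# initializations are tried and the lowest-cost solution is kept.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 0.8, 200)])

def em_gmm(x, mu, n_iter=100):
    pi, var = np.array([0.5, 0.5]), np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        dens = (pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of weights, means and variances.
        nk = resp.sum(axis=0)
        pi, mu = nk / len(x), (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    nll = -np.log(dens.sum(axis=1)).sum()   # negative log-likelihood (cost)
    return nll, mu

# Several initial points are tried; the one with the lowest cost is kept.
runs = [em_gmm(x, rng.normal(0, 3, size=2)) for _ in range(10)]
best_nll, best_mu = min(runs, key=lambda r: r[0])
```

Individual runs may get stuck in poor stationary points (e.g. both means on the same cluster); keeping the lowest-cost run is what makes the procedure reliable in practice.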
Extension to hidden Markov models
In the early days of speech recognition, dynamic time warping (DTW) emerged as an essential tool for accurate recognition of phonemes [Sakoe and Chiba, 1978]. It was then superseded by hidden Markov models (HMM). Training HMMs for single-channel source separation was proposed in [Roweis, 2001].
This term replaces the static prior information log p(H) in 1.17. The stochastic matrix P^(g) describes the most likely pairwise sequences of states (H^(g)_{k,n}, H^(g)_{k,n+1}). p^(g), the a priori distribution of initial states, should be adapted to provide the best fit to the data. Figure 1.7 provides a graphical representation of the conditional independencies induced by this optimization problem. Inference in hidden Markov models is conducted with a forward-backward algorithm of complexity O(F K^{2G} N), where in each pass the objective function value must be evaluated for every combination of pairwise states ((k¹_n, k¹_{n+1}), …, (k^G_n, k^G_{n+1})).
Additional scaling factors may be added without changing the order of complexity, as shown in [Ozerov et al., 2009], at the cost however of a much larger multiplicative constant (typically 10 to 100 depending on the number of sources considered and the correlation of the design matrix W). Note that Markov models of higher order can be introduced to tie together sequences of three, four, or more states instead of two. However, the complexity then grows exponentially as O(F K^{(p+1)G} N), where p is the order of the Markov model.
Alternative mixture models. The ℓ₂ norm is a poor measure of distortion for audio signals, because the human ear is sensitive to variations on a log scale (in dB). We will discuss this choice in more detail in Chapter 2. [Roweis, 2001] takes a different point of view and uses the ℓ₂ norm as a measure of distortion while transforming the input data into log(V).
Nonnegative matrix factorization with sparsity constraints
In this section, we introduce nonnegative matrix factorization. As in the case of mixture models, there are several settings: either one learns a dictionary W^(g) on training data for each source, and then uses the concatenated dictionary W to infer decomposition coefficients H on mixed data, or one learns (W, H) blindly, directly on mixed data.
The key to successful inference is that W^(g) should be good at reconstructing source g (interpretability) and bad at reconstructing the others (incoherence). We argue in Section 1.3.2.1 that sparsity penalties are needed to learn incoherent dictionaries.
In the case of blind learning, sparsity in itself may not be sufficient to learn interpretable dictionaries and additional prior knowledge is necessary. The easiest case is when a learnt dictionary is provided and must be re-calibrated on mixed data rather than learnt from scratch (Section 1.3.2.2). Taking into account temporal dependencies is important when dealing with speech signals. We present in Section 1.3.2.3 counterparts of the hidden Markov model that have been proposed for NMF. Another way of enforcing prior knowledge is through re-parameterization, as we will see in Section 1.3.2.4.
Trained learning with sparsity
Placing a hard constraint on the number of nonzero coefficients in H entails a complexity of order O(F K^G N). By relaxing this constraint, instead of fitting a mixture of K one-dimensional half-lines, nonnegative matrix factorization consists in fitting a K-dimensional cone to the data:

min_{W ≥ 0, H ≥ 0} − Σ_{fn} log p(V_{fn} | V̂_{fn})  (1.24)

where V̂_{fn} = Σ_k W_{fk} H_{kn}. When only H is optimized for fixed W, this problem is referred to as nonnegative regression or nonnegative matrix division (NMD). We will arbitrarily use the latter name.
With the same number of dictionary elements, a much larger portion of the space is spanned by NMF than by mixture models. However, this advantage comes at a cost. Consider the cloud of points in Figure 1.8b, for instance. Two pairs of dictionary elements (W₁, W₂) and (W′₁, W₂) fit the data equally well. However, the cone generated by W′ is too big, so learning W′ on training data might fit well, but at test time the cone might contain data points from other sources: they would be mistakenly represented as points from source 2.
Selecting model W rather than W′ is a hard problem. However, in this simple case we can see that, given a data point, the latent coefficients satisfy h₁ + h₂ < h′₁ + h′₂.
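To make the relaxation concrete, here is a sketch of the factorization in Problem 1.24 (assuming a Euclidean loss and the classical Lee-Seung multiplicative updates; the thesis itself works with beta-divergences, discussed in Chapter 2):

```python
import numpy as np

# Sketch (Euclidean loss, Lee-Seung multiplicative updates; the thesis
# uses beta-divergences instead). W >= 0 and H >= 0 are updated
# alternately so that V_hat = W @ H fits the nonnegative data V.
rng = np.random.default_rng(0)
F, N, K = 64, 200, 5
V = rng.random((F, K)) @ rng.random((K, N))    # synthetic nonnegative data of rank K

W, H = rng.random((F, K)), rng.random((K, N))  # random nonnegative initialization
eps = 1e-12                                    # avoid division by zero
for _ in range(1000):
    H *= (W.T @ V) / (W.T @ W @ H + eps)       # NMD step: H update, W fixed
    W *= (V @ H.T) / (W @ H @ H.T + eps)       # W update

# Relative reconstruction error; small since V has exact rank K.
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Keeping W fixed and iterating only the H update is exactly the nonnegative matrix division (NMD) setting mentioned above; the multiplicative form of the updates preserves nonnegativity at every iteration.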
Table of contents:
1.1 Mixing assumptions and related tasks
1.1.1 Assumptions on the mixing process
1.1.2 Standard metrics for single-channel source separation
1.1.3 Related tasks
1.2 Time-frequency representations
1.2.1 Fourier Transform
1.2.2 Short time Fourier transform
1.2.3 Recovery of source estimates via time-frequency masking
1.3 Models for audio source separation
1.3.1 Mixture models: hard sparsity constraints
1.3.1.1 Learning: trained models or blind learning?
1.3.1.2 Extension to hidden Markov models
1.3.2 Nonnegative matrix factorization with sparsity constraints
1.3.2.1 Trained learning with sparsity
1.3.2.2 Partially blind learning or model calibration
1.3.2.3 Smoothness penalties
1.3.2.4 Parameterized atoms
1.3.3 Other latent variable models: Complex Matrix Factorization
1.3.4 Other approaches: Computational Audio Scene Analysis
1.3.4.1 Basic complexity issues
1.3.4.2 A clustering approach to audio source separation
2 Structured NMF with group-sparsity penalties
2.1 The family of NMF problems
2.1.1 The family of beta-divergences
2.1.2 Identification problems in NMF
2.1.3 Determining which divergence to choose
2.2 Optimization algorithms for the beta-divergences
2.2.1 Non-convexity of NMD with beta-divergences
2.2.2 MM algorithms and multiplicative updates
2.2.3 Expectation-Maximization algorithms
2.2.3.1 Itakura-Saito divergence
2.2.3.2 Kullback-Leibler divergence
2.3 Itakura-Saito NMF
2.3.1 Generative model
2.3.2 Recovery of source estimates
2.3.3 Consistent source estimates
2.3.4 Derivation of a descent algorithm
2.3.5 Discussion of convergence properties
2.3.6 Discussion of convergence properties: empirical results
2.3.7 Overview of the algorithm
2.3.8 Related Work
2.4 Group-sparsity enforcing penalty in NMF
2.4.2 Interpretation of the penalty term
2.4.3 Extension to block-structured penalties
2.4.4 Algorithm for group Itakura-Saito NMF
2.5 Model selection in sparse NMF
2.5.1 Kolmogorov-Smirnov statistic
2.5.2 Bayesian approaches
2.6 Experiments with group-sparse NMF
2.6.1 Validation on synthetic data
2.6.2 Results in single channel source separation
2.6.3 Block-structured penalty
2.7 Related work
3 Online NMF
3.1 Algorithm for online IS-NMF
3.1.1 Itakura-Saito NMF
3.1.2 Recursive computation of auxiliary function
3.1.3 Practical implementation
3.2 Experimental study
3.3 Related Work
4 User informed source separation
4.1 A GUI for time-frequency annotations
4.2 Annotated NMF
4.3 Towards automatic annotations
4.3.1 Learning algorithms
4.3.3 Raw Patches
4.3.4 Oriented filters
4.3.5 Averaging training labels
4.4 Experimental results
4.4.1 Description of music databases
4.4.2 Ideal performance and robustness
4.4.3 Evaluation of automatic annotations
4.4.4 Overall results
4.5.1 Detailed comparison of detectors
4.5.2 Extension of annotated NMF with more than one dominant source
4.5.2.1 Mathematical formulation
4.5.2.2 Source separation of three instrumental tracks
4.5.3 Handling uncertainty in automatic annotations
4.5.4 Predicting source-specific time intervals
4.6 Related work
A Projected gradient descent
B Description of databases
C Detailed results