Mixing assumptions and related tasks
Assumptions on the mixing process
Given G source signals s(g) 2 RT , and one microphone, we assume the acquisition process is well modelled by a linear instantaneous mixing model : xt = st(g) : (1.1) =1
It is standard to assume that microphones are linear as long as the recorded signals are not too loud. If signals are too loud, they are usually clipped. The mixing process is modelled as instantaneous as opposed to convolutive. Indeed, when multiple microphones are used(I > 1), the output xit at each microphone can be modelled by : xit = G +1 (g;i) (g) (g;i) are impulse re-sponses depending on the relative position of source g to microphone i and the con guration of the room in which signals are recorded. While crucial when using multiple microphones, taking into account convolutive mixing is not as important in the case I = 1 : in this case, we would recover a linear transformation of each source which can then be processed independently.
In the multiple microphone setting, source separation, also known as the \cocktail party problem », or the un-mixing problem, has given birth to the tech-nique of independent component analysis (ICA)[Comon, 1994, Cardoso, 1998, Hyvarinen, 1999].
Standard metrics for single-channel source separation
Listening tests provide the most valuable insights on the quality of source esti-mates s^(g). Purely quantitative criteria, on the other hand, are much less time consuming, and provide a good check of source separation quality before listening tests. The simplest performance criterion is the signal to noise ratio:
SN R = 10 log k 2k2 : (1.2)
Current standard metrics were proposed in [Vincent et al., 2006] and consist in decomposing a given estimate s^(g) as a sum : s^t(g) = starget + einterf + eartif : (1.3) where starget is an allowed deformation of the target source s(g), einterf is an allowed deformation of the sources which accounts for the interferences of the unwanted sources, eartif is an \artifact » term that accounts for deformations induced by the separation algorithm that are not allowed. Given such a decom-position, one can compute the following criteria
Higher values of the metrics are sought for (+1 means the true source signal is recovered exactly, that anything but the source signal can be heard in s^). Values of the SDR are hard to interpret, they depend highly on the structure of the signal. The SDR of a proposed method must always be compared to that obtained when using the mixed signal as a source estimate, in order to measure the improvement accomplished rather than an absolute value that is not always meaningful. 1. In Table 1.1 this is displayed in the column self. As we can see, the SDR values obtained by the proposed methods are above that threshold, which means that an improvement was obtained in extracting the source from the mixture. Incidentally, note that the SAR obtained by self is always 1, since the mixed signal x is a linear combination of source signals with no additional noise.
Allowing simple deformations of the target signal is important : for instance, if the estimate s^(g) was simply a scaled version of the true source signal s(g) then perceptually the result would be perfect, but the SNR would be 10 log10 2 2 (1 ) which can be arbitrarily low. However, the SDR and SAR (of this source) would still be +1 because scaling is one of the allowed deformations2. Other deforma-tions can be allowed in the bss eval toolbox released by Vincent et al. , which are especially useful in a multichannel setting.
Single-channel source separation may be useful in many di erent scenarios :
• Fine level decomposition : for each instrument, each note must be extracted as an individual source. In contrast with polyphonic music transcription, recovering source estimates may be useful in this case to perform modi – cations such as modifying notes or changing the length of a note without changing that of the others.
• Instrument per instrument separation : this is the kind of task proposed in separation campaigns such as MIR-1K or SISEC. While there may be several instruments, in popular songs those are highly synchronized. Due to its particular role in Western popular music, it is of particular interest to extract voice from the accompaniment. It might be also interesting to extract solo instruments (e.g., electric guitar in rock and roll recordings).
• \Nonstationary » denoising : while denoising has been a successful applica-tion of signal processing since the 1970’s, it is limited to stationary noise e.g., ambient noise in room recordings, the spectral properties of which are considered as constant throughout time. In eld recordings, on the con-trary, or in real-life situations, what we commonly experience as noise may have time-varying spectral properties : wind, tra c, water trickling, etc. In this case, denoising is no longer an appropriate term and source separation methods should be used.
• Source separation can also be used as a pre-processing step for other tasks, such as automatic speech recognition (ASR). When there are sources of interference (other voices for instance), source separation could be used to clean the target speech signal from interferences and feed it to a standard recognizer. This approach to speech enhancement is used by many partici-pants in the Pascal Chime Challenge and allows achieving recognition rates that are close to that of human hearing.
• SISEC3 : held in 2008,2010, 2011. Several tasks proposed, including Pro-fessionally Produced Music Recordings. Short 10 seconds’ excerpts from songs are provided as well as original sources as train data. For test data, other songs were released in full. This data set is particularly challenging because instruments change entirely from train to test. Moreover, train-ing samples are very short (10 seconds), so that sophisticated methods are likely to over t.
• QUASI4 : released in 2012, is very similar to the Professionally Produced Music Recordings task of SISEC, but provides more training data.
• Chime challenge5 : held in 2011, it aims at evaluating the performance of automatic speech recognizers (ASR) in real-world environments. Sen-tences from the grid corpus are mixed with environmental noise. There are several hours of training data for environmental noise and clean speech. In this setting, source separation algorithms were key in providing reliable estimates of clean speech on test data.
• NOIZEUS dataset6 : this corpus has thirty short English sentences (each about three seconds long) spoken by three female and three male speakers.
Time-frequency representations of sound signals
In this short section we present as much mathematical content as necessary for the reader to understand their essential properties, in particular the short time Fourier Transform. A complete presentation of time-frequency representations can be found in books such as [Kovacevi et al., 2012, Mallat, 2008].
Given a time signal x 2 RT , the Fourier transform of x is de ned as : Xt T 1 2 kt
x^k = xt exp( i ) k = 0 : : : T 1 : (1.4)
x^ has Hermitian symmetry around T =2:
8k; x^T k = x^k? k = 0 : : : T 1 : (1.5)
Hermitian symmetry compensates the fact that x^ lives in CT which has twice as many dimensions as RT .
Coe cient k of the Fourier transform describes the contribution of a sinusoid at frequency fs k=T , from 0Hz to the Shannon-Nyquist rate fs=2Hz. In sound signals, coe cient k = 0 is always zero, and Fourier coe cients decay fast with k, provided there are no discontinuities in the signal (which happens at notes onset and o set in music signals, at the beginning and end of utterances in speech, and so on).
The Fourier transform is invertible and conserves energy, i.e., 8x; kxk2 = kx^k2 (1.6) so the contribution of all sinusoids is su cient to describe the whole signal. Note that a circular translation of x does not change the modulus of its Fourier coe cients : thus if we keep only magnitude coe cients jx^j we obtain a translation invariant representation of x.
It is particularly relevant for sound signals at time scales of the order of a few tens of milliseconds, who typically have few nonzero coe cients in the Fourier domain, but are very dense in the time domain. This sparsity property is exploited in lossy compression schemes such as AAC.
On the other hand, at larger time scales, localized structures such as sequences of vowels in speech or of notes in music cannot be described by the Fourier transform, while they are easily discriminable in the time domain. Trading o time localization versus frequency localization is at the heart of time-frequency representations of signal, as we shall now see Time-frequency representations are a compromise between time and frequency localized properties of sound signals. In this thesis, we use a well-known time-frequency representation, called the short time Fourier Transform (STFT).
The STFT is computed in two steps : rst the signal is cut into smaller pieces and multiplied by a smoothing window. The Fourier transform of each x(n) is then computed and those are stacked column-by-column into X 2 RF N . These operations may be summed up in the following formula :
F 1 2 f t
S(x)f;n = wtxnL+t exp( i ) : (1.7)
where w 2 RF is a smoothing window of size F , L is a shift parameter (also called hop size), or equivalently H = F L is the overlap parameter.
Thus each column of X gives a complete description of the frequency content of x locally in an interval of the form [nL; nL + F ].
More concisely, let S> be the conjugate transpose of S. We then have S>S = F Id, where Id 2 RT T is the identity matrix. A generalization of condition 1.8 exists if the analysis window in 1.7 and the synthesis window in 1.9 are not equal. In the rest of this thesis, Sy will stand for the inverse short time Fourier transform. Note that the term inverse is not meant in a mathematical sense, we will come back to this point later.
Once the STFT is computed, phase coe cients are discarded and either the magnitude coe cients jXfnj or the squared power jXfnj2 is kept. As noticed in the last Section, the magnitude of the Fourier transform jx^j is invariant by circular translation. As for the magnitude spectrogram, only translations by a multiple of the shift parameter preserve it strictly. For smaller shifts, the discrepancy between the magnitude spectrogram of the shifted signal and that of the original is small if the signal is stationary inside each analysis window : it is close to 0 where the signal is sinusoidal, and the largest discrepancies are found at transients.
On the other hand, enforcing translation invariance in linear models is so costly in terms of computational resources that, on the whole, the magnitude spectrogram is widely accepted as a convenient tool to provide approximate trans-lation invariance at minimal cost in terms of accuracy.
Recovery of source estimates via time-frequency masking
Given estimates of the source power spectrograms V procedure would consist in keeping the same phase as the input mixture for each source ^(g) ^ (g) Sfn = Vfn exp(i fn) : (1.11) where fn is the phase of spectrogram X (modulo [0; 2 ]). However, source spec-trograms estimates are often noisy due to over-simplifying assumptions, subopti-mal solutions, etc. Using the reconstruction formula 1.11 would imply in particular that source estimates do not add to the observed spectrogram g Sfn
the input :
Instead, it is preferable to compute source estimates by lteringP
We will show in Section 2.3.2 that if a Gaussian model is assumed for the source spectrograms S(g), then this formula corresponds to computing Minimum Mean Square Estimates of the sources. These particular coe cients Mfn(g) will be referred to as Wiener masking coe cients, or oracle coe cients : indeed, for each time frame n and each source g, Mfn(g) may be interpreted as the f-th Fourier coe cient of a linear lter. Linear lters given determined by Formula 1.12 were derived by Wiener to estimate clean signals corrupted by Gaussian white noise.
Models for audio source separation
Once a time-frequency representation has been computed, latent variable models are used to estimate the contribution of putative sources to the observed mixed signal. In this thesis we assume that the number of sources is known as well as the source types (voice, instrument, environmental noise), although the source signals are not. Latent variable models capture typical sounds emitted by each source in a compact model called a dictionary. Given a mixed signal, the most plausible combination of dictionary atoms of each source are searched for, and used to estimate source spectrograms.
The rst latent variable models for single-channel source separation where based on independent component analysis [Casey and Wetsner, 2000, Jang et al., 2003] (ICA). Note however that those works di er from classical approaches of ICA (see [Cardoso, 1998, Comon, 1994, Hyvarinen, 1999]), which require more channels than sources. In [Casey and Wetsner, 2000], each frequency in the STFT operator is considered as a channel. In this case, ICA can be viewed as an instance of a matrix factorization problem, sharing similarities with NMF but requiring di erent assumptions on the source signals. In [Jang et al., 2003], a single-channel recording is broken into a collection of 25 milliseconds’ long segments, without aplication of the Fourier transform. Those are passed as input to an ICA algorithm that enforces a translation-invariant representation of the signals.
At the same time, mixture models where proposed by [Roweis, 2001] to model the nonstationarity of source signals. While ICA may be seen as a matrix factor-ization technique similar in spirit to PCA, and is widely used for multichannel source separation, it relies on the assumption that source spectrograms are inde-pendent. In many audio signals such as music signals, this assumption is incor-rect, since several instruments may play notes which are very similar at similar times. NMF was then introduced as a natural way to circumvent this problem while keeping the idea of a low-rank approximation of the observed spectro-gram. It was rst presented as a tool for polyphonic transcription [Smaragdis and Brown, 2003], and was intensively studied in the following years.
In this Section, we will show how NMF may be seen as an extension of mixture models and outline the main challenges in learning appropriate models for source separation. In early contributions, one model was trained for each source in isolation and then models were combined at test time to infer the state of each source. These models proved successful in controlled experimental conditions, but in real-world source separation benchmarks, learning models directly on mixed data became primordial as training data is scarce and sometimes missing.
We begin this section by presenting mixture models with a special emphasis on the Gaussian (scaled) mixture model (GSMM) proposed by [Benaroya and Bimbot, 2003, Benaroya et al., 2006] for audio source separation and an appli-cation of hidden Markov models (HMM) proposed by [Roweis, 2001]. We then show that nonnegative matrix factorization may be seen as a relaxation of GSMM where the sparsity of each source is no longer xed to one.
Mixture models : hard sparsity constraints
Latent variables H(g) 2 f0; 1gKg N represent the state of each source at a given time bin. A global matrix H = ((H(1))>; : : : ; (H(G))>)> is created by concate-nating H(g) row by row.
To each source is associated a dictionary of template spectra W (g) 2 RF+ Kg . Each column of W (g) is interpreted as a typical spectrum observed in source g. Since the class of sources encountered in this thesis are quite general (voice, guitar, piano, etc.), it is reasonable to assume that each source emits several typical spectra, the collection of which corresponds to W (g).
Scaling factors As noticed in [Benaroya et al., 2006], the number of compo-nents be may reduced by introducing scaling factors so that still only one compo-nent Hkn is active at a time, but it is allowed to take values in R+ to compensate for amplitude modulations in the spectrogram (typically a 5 seconds’ second pi-ano notes with very slow damping would require a linear number of components to quantize the variations of intensity whereas only one or two components su ce if scaling is allowed). We sketch this idea in Figure 1.6, where a scaled mixture model captures the whole data set with two components (dashed black lines), whereas mixture models need six components (red circles).
Solving for H in 1.18 is still of order O(F QGN), but with a much higher multiplicative constant since for each G-uplet of states (k1; : : : ; kG), a nonnegative matrix division problem must be solved. There is a tradeo to make between the decrease in the number of components Q and that multiplicative constant.
Learning: trained models or blind learning ?
The key to successful inference is that columns of W (g) should provide good estimates of spectrograms generated by source g and bad estimates of other sources. As proposed in [Benaroya and Bimbot, 2003, Roweis, 2001], the rst part of this statement is achieved by optimizing the likelihood of isolated sam-ples from source g, with respect to W (g). However, when benchmarks were in-troduced (SASSEC,SISEC, Chime, RWC), participants submitted mostly source separation systems where models where learnt partially or completely on mixed data, because training samples are sometimes missing or inadequate to represent mixed signals (variability in between instrument classes, in between singers for instance, usage of linear/nonlinear e ects on instruments in some mixed signals, etc.). Blind learning of W is a non-convex problem involving continuous variables
W and discrete variables H.
It is solved by an EM algorithm, which has the attractive property of being a descent algorithm [Dempster et al., 1977]. However, while for many choices of p(H) and p(V V ), inference in H is a convex problem, it is no longer the case when (W; H) are estimated jointly. This entails that there are many stationary points of the problem, and that the solution found by the EM algorithm will highly depend on the chosen initial point. In practice, several initial points are tried and that with the lowest objective cost function is kept.
Extension to hidden Markov models
In the early days of speech recognition, dynamic time warping (DTW) emerged as an essential tool for accurate recognition of phonemes [Sakoe and Chiba, 1978]. It was then superseded by hidden Markov models (HMM). Training HMMs for single-channel source separation was proposed in [Roweis, 2001]. In principle, a (factorial) HMM consists in introducing a probabilistic model for H with the following minus log-likelihood : X Xk
This term replaces static prior information log p(H) in 1.17. The stochastic matrix P (g) describes the most likely pairwise sequences of states (Hkn(g); Hkn(g)+1). p(g), the a priori distribution of initial states, should be adapted to provide the best t to data. Figure 1.7 provides a graphical representation of the condi-tional independencies induced by this optimization problem. Inference in hidden Markov models is conducted with a forward-backward algorithm of complexity O(F K2GN), where in each pass, the objective function value must be evaluated for every combination of pairwise states ((k1n; k1n+1); : : : ; (kGn; kGn+1)).
Additional scaling factors may be added without changing the order of com-plexity, as shown in [Ozerov et al., 2009], at the cost however of a much larger multiplicative constant (typically 10 to 100 depending on the number of sources considered and the correlation of the design matrix W ). Note that Markov models of higher order can be introduced to tie together sequences of three, four, states or more, instead of two. However the complexity grows exponentially as O(F K(p+1)GN), where p is the order of the Markov model.
Alternative mixture models The ‘2 norm is a poor measure of distortion for audio signals, because the human ear is sensitive to variations in log scale (in dB). Multiplying the amplitude of sound by 10 leads to an increase of 10dB whereas the ‘2 norm is sensitive to linear variations. The Itakura-Saito divergence is introduced in [Fevotte et al., 2009] to measure distortion as a function of rather than V V . We will discuss this choice in more details in Chapter 2. Roweis  take a di erent point of view and use the ‘2 norm as a measure of distortion while transforming the input data into log(V ). log V = log V (g) = log exp(log V (g)) : (1.22)
Since the function log(exp(x1) + exp(x2)) is roughly equivalent to max(x1; x2), [Roweis, 2001] propose replacing + by max in the mixture model, i.e., log Vfn = max log( Wfk Hkn ) :
Nonnegative matrix factorization with sparsity constraints
In this Section, we introduce nonnegative matrix factorization. Similarly to the case of mixture models, there are several settings : either one learns a dictionary W (g) on training data for each source, and then uses the concatenated dictionary W to infer decomposition coe cients H on mixed data, or one learns blindly (W; H) directly on mixed data.
The key to successful inference is that W (g) should be good at reconstructing source g (interpretability) and bad at reconstructing the others (incoherence). We argue in Section 184.108.40.206 that sparsity penalties are needed to learn incoherent dictionaries.
In the case of blind learning, sparsity in itself may not be su cient to learn interpretable dictionaries and additional prior knowledge is necessary. The eas-iest case is when a learnt dictionary is provided, and must be re-calibrated on mixed data, rather than be learnt from scratch (Section 220.127.116.11). Taking into ac-count temporal dependencies is important when dealing with speech signals. We present in Section 18.104.22.168 counterparts of the Hidden Markov Model that have been proposed for NMF. Another way of enforcing prior knowledge is through re-parameterization as we will see in Section 22.214.171.124.
Trained learning with sparsity
Placing a hard constraint on the number of nonzero coe cients in H entails a complexity of order O(F KGN) . By relaxing this constraint, instead of tting a mixture of K one-dimensional half-lines, nonnegative matrix factorization con-sists in tting a K dimensional cone to the data :
fn log p(VfnjVfn) : (1.24)
W 0;H 0:
where Vfn = k WfkHkn. When only H is optimized for xed W , this problem is referred to as nonnegative regression or nonnegative matrix division (NMD). We will arbitrarily use the latter name.
With the same number of dictionary elements, a much larger portion of the space is spanned in NMF than in mixture models. However, this advantage comes at a cost. Consider the cloud of points in Figure 1.8b, for instance. Two pairs of dictionary elements (W 1; W 2) and (W 01; W 2) t the data equally well. However, the cone generated by W 0 is too big so learning W 0 on training data might t well but at test time, it might contain data points from other sources : they would be mistakenly represented as points from source 2.
Selecting model W rather than W 0 is hard problem. However, in this simple case we can see that, given a data point, the latent coe cients satisfy h1 + h2 < h01 + h02.
Thus, a reasonable model selection procedure would consist in learning W for P good reconstruction while enforcing an upper-bound on the value of k Hkn for every n.
log p(V njVfn) :
P k Hkn C (1.25)
H 0 :
or equivalently learn W to minimize the reconstruction cost penalized by the sum of coe cients :
min log p(V ^ Pk;n H V ) + subject to njHfn 0 : kn (1.26)
This is a simple illustration of the bene ts of sparsity in the particular case of NMF. In the wider scope of dictionary learning (with or without nonnegativity constraints), there are other ways of learning source speci c dictionaries, see e.g., [Ramirez et al., 2010].
Partially blind learning or model calibration
Blind learning may be too hard in the sense that many local minima exist that do not yield satisfactory source estimates. However, given rough estimates of ~ (g) ~ (g) ) from prior training data, one may enforce the additional constraints (W ; H that estimates computed on new mixed data should be close to (W ; H). This type of solution can be straightforwardly addressed in a penalized likelihood setting, where prior distributions p(W jW ) and p(HjH) are chosen to be concentrated around W and H. For instance, when V is modelled as a multinomial, a Dirichlet modelled as gamma random variable, an inverse gamma prior with scale W could also be used.
Table of contents :
1.1 Mixing assumptions and related tasks
1.1.1 Assumptions on the mixing process
1.1.2 Standard metrics for single-channel source separation
1.1.3 Related tasks
1.2 Time-frequency representations
1.2.1 Fourier Transform
1.2.2 Short time Fourier transform
1.2.3 Recovery of source estimates via time-frequency masking
1.3 Models for audio source separation
1.3.1 Mixture models : hard sparsity constraints
126.96.36.199 Learning: trained models or blind learning ?
188.8.131.52 Extension to hidden Markov models
1.3.2 Nonnegative matrix factorization with sparsity constraints
184.108.40.206 Trained learning with sparsity
220.127.116.11 Partially blind learning or model calibration
18.104.22.168 Smoothness penalties
22.214.171.124 Parameterized atoms
1.3.3 Other latent variable models : Complex Matrix Factorization
1.3.4 Other approaches : Computational Audio Scene Analysis
126.96.36.199 Basic complexity issues
188.8.131.52 A clustering approach to audio source separation
2 Structured NMF with group-sparsity penalties
2.1 The family of NMF problems
2.1.1 The family of beta-divergences
2.1.2 Identication problems in NMF
2.1.3 Determining which divergence to choose
2.2 Optimization algorithms for the -divergences
2.2.1 Non-convexity of NMD with beta-divergences
2.2.2 MM algorithms and multiplicative updates
2.2.3 Expectation-Maximization algorithms
184.108.40.206 Itakura-Saito divergence
220.127.116.11 Kullback-Leibler divergence
2.3 Itakura-Saito NMF
2.3.1 Generative model
2.3.2 Recovery of source estimates
2.3.3 Consistent source estimates
2.3.4 Derivation of a descent algorithm
2.3.5 Discussion of convergence properties
2.3.6 Discussion of convergence properties: empirical results
2.3.7 Overview of the algorithm
2.3.8 Related Work
2.4 Group-sparsity enforcing penalty in NMF
2.4.2 Interpretation of the penalty term
2.4.3 Extension to block-structured penalties
2.4.4 Algorithm for group Itakura Saito NMF
2.5 Model selection in sparse NMF
2.5.1 Kolmogorov-Smirnov statistic
2.5.2 Bayesian approaches
2.6 Experiments with group-sparse NMF
2.6.1 Validation on synthetic data
2.6.2 Results in single channel source separation
2.6.3 Block-structured penalty
2.7 Related work
3 Online NMF
3.1 Algorithm for online IS-NMF
3.1.1 Itakura-Saito NMF
3.1.2 Recursive computation of auxiliary function
3.1.3 Practical implementation
3.2 Experimental study
3.3 Related Work
4 User informed source separation
4.1 A GUI for time-frequency annotations
4.2 Annotated NMF
4.3 Towards automatic annotations
4.3.1 Learning algorithms
4.3.3 Raw Patches
4.3.4 Oriented lters
4.3.5 Averaging training labels
4.4 Experimental results
4.4.1 Description of music databases
4.4.2 Ideal performance and robustness
4.4.3 Evaluation of automatic annotations
4.4.4 Overall results
4.5.1 Detailed comparison of detectors
4.5.2 Extension of annotated NMF with more than one dominant source
18.104.22.168 Mathematical formulation
22.214.171.124 Source separation of three instrumental tracks
4.5.3 Handling uncertainty in automatic annotations
4.5.4 Predicting source specic time intervals
4.6 Related work
A Projected gradient descent
B Description of databases
C Detailed results