Perceptually Important Statistics of Sound Textures
Julesz’s conjecture (Julesz, 1962) states that humans cannot distinguish between visual textures with identical statistics, and the idea holds for most visual textures (Portilla and Simoncelli, 2000). This means that the features of a visual texture can be well represented by a statistical model. The same should be true for sound textures, provided that their perceptually important properties can be described by statistical features. Under this assumption, the perceptually important statistics would also characterize the randomized events in sound textures and the structures created by the hidden stochastic processes behind them.
In the case of sound textures, statistics evaluated directly from the sound samples may not be perceptually meaningful. For example, randomly placed impulses with sufficient density sound the same as Gaussian noise. The time-domain moments of these two sounds are very different, but their moments in the time-frequency domain are similar. This will be discussed in chapter 3 and depicted in Fig. 3.2. McDermott’s work (McDermott and Simoncelli, 2011a) used a perceptually based ERB filterbank (Glasberg and Moore, 1990) to divide a signal into several sub-bands. We call the signal component in each sub-band a sub-band signal. In the same article, he suggests that a proper description of a sound texture is composed of envelope statistics of the sub-band signals. He also published experimental results (McDermott et al., 2013) further suggesting that time-averaged statistics are crucial to human perception of sound textures. Across these works, McDermott proposes that the perceptually important statistical properties consist of at least three components: moments, temporal correlation and spectral correlation.
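The impulse-versus-noise example above can be checked numerically. The following is a minimal sketch, not the procedure used in this thesis: the impulse density (10% of samples), the Hann window length, the hop size and the use of kurtosis as the compared moment are all assumptions chosen for illustration, and the helper names `kurtosis` and `stft_mag` are hypothetical.

```python
import numpy as np

def kurtosis(v):
    """Standardized fourth moment (3.0 for a Gaussian), pooled over all entries."""
    v = np.asarray(v, dtype=float)
    c = v - v.mean()
    return np.mean(c**4) / np.mean(c**2) ** 2

def stft_mag(x, n_win=1024, hop=256):
    """Magnitude STFT with a Hann window; one frame per row."""
    win = np.hanning(n_win)
    starts = np.arange(0, len(x) - n_win + 1, hop)
    frames = np.stack([x[s:s + n_win] * win for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

rng = np.random.default_rng(0)
n = 1 << 16
noise = rng.standard_normal(n)
# Assumed impulse density: roughly 10% of the samples carry a random impulse.
impulses = rng.standard_normal(n) * (rng.random(n) < 0.1)

k_time = (kurtosis(impulses), kurtosis(noise))
k_tf = (kurtosis(stft_mag(impulses)), kurtosis(stft_mag(noise)))
print("time-domain kurtosis (impulses, noise):", k_time)
print("STFT-magnitude kurtosis (impulses, noise):", k_tf)
```

Under these assumptions the time-domain kurtosis of the impulse train should be far larger than that of the noise, while the kurtosis values of the STFT magnitudes should lie close together.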
In addition to these three perceptually important statistical properties, (Mesgarani et al., 2009) and (Depireux et al., 2001) suggest another property, the spectro-temporal correlation. These works explain why spectro-temporal ripples are considered important to humans when recognizing frequency-modulated structures. (Mesgarani et al., 2009) states that the reconstruction of spectro-temporal stimuli can be improved with prior knowledge of statistical regularities. Depireux’s work (Depireux et al., 2001) indicates that a spectro-temporal envelope cannot always be separated into purely temporal and purely spectral response functions in the auditory cortex. This inseparability indicates that some stages of processing in the auditory system are spectrally and temporally intertwined. Given these experimental results, it seems reasonable to treat spectro-temporal correlations as one of the perceptually important statistics.
The following subsections will briefly introduce these properties and discuss how we establish a statistical model based on these previous discoveries.
While the temporal correlation and spectral correlation characterize the horizontal and vertical relationships in the time-frequency representation, there are also spectro-temporal structures in the time-frequency representation. This slanted relationship is a correlation that involves both time and frequency; one can consider it a spectral correlation with a time delay. It is common for one frequency component to appear shortly after another frequency component, for example in chirps and vibrating whispers.
In the time-frequency domain, spectro-temporal correlation can be characterized by the delayed cross-correlation, that is, the terms with non-zero lags in the cross-correlation function (1.3).
Cross-Correlation Function (CCF):
$$
C_{x,y}(\tau) = \int_{-\infty}^{\infty} x(t)\, y(t+\tau)\, dt \tag{1.3}
$$
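A discrete, finite-length counterpart of (1.3) can be sketched as follows. This is only an illustration: it assumes real signals, takes the second factor in (1.3) to be y(t + τ), and replaces the integral with a sum over the samples that overlap at each lag. The function name `ccf` and the test signals are hypothetical.

```python
import numpy as np

def ccf(x, y, max_lag):
    """Return lags and C_xy(tau) = sum_t x[t] * y[t + tau] for |tau| <= max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    values = np.empty(len(lags))
    for i, tau in enumerate(lags):
        if tau >= 0:
            # y is read tau samples ahead of x
            values[i] = np.dot(x[:n - tau], y[tau:])
        else:
            # negative lag: y is read |tau| samples behind x
            values[i] = np.dot(x[-tau:], y[:n + tau])
    return lags, values

rng = np.random.default_rng(0)
delay = 7
x = rng.standard_normal(2000)
y = np.roll(x, delay)   # y[t] = x[t - delay]
y[:delay] = 0.0         # discard the samples wrapped around by np.roll

lags, c = ccf(x, y, max_lag=20)
peak_lag = lags[np.argmax(c)]
print("lag of the largest cross-correlation term:", peak_lag)
```

When x and y are two sub-band signals, a dominant term at a non-zero lag like this one corresponds to the slanted structure described above: activity in one band followed, after a short delay, by activity in the other.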
Most works on sound texture synthesis fall into the model-based category. However, the variety among these models is large. In (Hanna and Desainte-catherine, 2003), stochastic noises are synthesized with randomized short sinusoids. The intensity of the noise is kept during the transformation by preserving the statistical moments with the short sinusoids. (Athineos and Ellis, 2003) uses linear prediction in both the time and frequency domains (TFLPC), preserving the short-term temporal correlations in order to synthesize the brief transients in sound textures. Zhu (Zhu and Wyse, 2004) proposed another TFLPC-based approach with a two-level foreground-background representation: the foreground consists of the transients and the background is the remainder. The two levels are synthesized separately and then mixed together. The two-level approach also appears in the environmental sound scene processor proposed by (Misra et al., 2006). Verron (Verron et al., 2009) proposed a parametric synthesizer based on additive synthesis. Given proper parameters, the synthesizer is capable of generating environmental sounds such as rain or fire from a limited number of basic sounds. Bruna (Bruna and Mallat, 2011) proposed a new wavelet transform which provides a better view of textures while capturing high-order statistics. Kersten’s work (Kersten and Purwins, 2012) aimed to resynthesize fire crackling sounds with a model similar to the common foreground-background model (Athineos and Ellis, 2003, Misra et al., 2006, Zhu and Wyse, 2004).
McDermott (McDermott et al., 2009) proposed a model, adapted from (Portilla and Simoncelli, 2000), which resynthesizes target sound textures from high-order statistics and spectral correlations between sub-band signal envelopes. In his article, the statistical properties are imposed on Gaussian noise to generate different sound textures. McDermott’s study was the first to support the view that sound texture perception is related to moments and correlations that may be computed at different locations of the auditory system. The model was further refined in (McDermott and Simoncelli, 2011a). It uses an ERB filterbank to divide the signal into perceptual bands, and each perceptual band is further divided into smaller modulation bands. The statistics used include the first four moments of each modulation band and perceptual band, along with the cross-correlations between neighbouring perceptual bands and modulation bands.
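To make "the first four moments" concrete, the sketch below computes them for a single envelope. It assumes one common convention (mean, population variance, standardized skewness and standardized kurtosis); the exact normalization used in the cited works is not stated here, and the function name `first_four_moments` is hypothetical.

```python
import numpy as np

def first_four_moments(envelope):
    """Mean, variance, skewness and kurtosis of a sub-band envelope."""
    env = np.asarray(envelope, dtype=float)
    mean = env.mean()
    centred = env - mean
    var = np.mean(centred**2)               # population variance
    skew = np.mean(centred**3) / var**1.5   # standardized third moment
    kurt = np.mean(centred**4) / var**2     # standardized fourth moment
    return mean, var, skew, kurt

# Toy envelope, only to show the call; a real envelope would come from a filterbank.
moments = first_four_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(moments)
```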
Some other works seek to develop a high-quality synthesis algorithm for a specific sound texture by studying the physical process that generates the sound. These approaches analyse the physical process behind the sound texture and use the result as a stepping stone for developing algorithms for high-quality synthesis and transformation. (Oksanen et al., 2013) used parallel waveguide models to synthesize the sound of jackhammers; the model is capable of synthesizing a series of jackhammer impacts. (Conan et al., 2013) proposes a model that characterizes rubbing, scratching and rolling sounds as series of impacts. The work comes with a schematic control strategy that achieves sound morphing between these sounds by preserving the invariants between different kinds of sounds.
Rather than proposing a model or dealing with the statistics directly, some works aim to resynthesize sound textures with granular synthesis. Dubnov (Dubnov et al., 2002) represents a sound texture as a wavelet tree and reshuffles sub-trees if their difference is smaller than a threshold. This results in a new sample of the texture that is composed entirely of rearranged segments of the original sample. Fröjd (Fröjd and Horner, 2009) proposed a straightforward approach: several blocks are cut from the original sound texture, and new samples are generated by rearranging and cross-fading these blocks. O’Leary (O’Leary and Robel, 2014) took a path similar to Dubnov’s to synthesize sound textures without using wavelets. His algorithm searches for atoms in the time-frequency domain by evaluating correlations, then finds proper points at which to cut and rearrange these atoms. More variety can be achieved if atoms can also be replicated instead of only shuffled. He called this mechanism the ’montage approach’. Both of these algorithms create new samples with few or no artifacts, which is a great advantage of this kind of algorithm. Schwarz (Schwarz, 2004), on the other hand, proposed a descriptor-driven, corpus-based approach to synthesizing sound textures. The input texture is first transformed into audio descriptors; the synthesis then proceeds by combining sounds selected from the corpus database such that their combination fits the audio descriptors. His work is closer to an orchestration system dedicated to sound textures. Later he proposed a statistical model (Schwarz and Schnell, 2010) which uses histograms and Gaussian mixture models to model the descriptors. This model enhances the controllability of his corpus-based synthesis.
Selection of Time-Frequency Representation
A Time-Frequency Representation (TFR) is a two-dimensional signal representation over both time and frequency (Cohen, 1995). It provides a way to analyze signals in the time-frequency domain. In this thesis, we discuss only a subset of TFRs that are achievable by filter banks. TFRs are usually composed of complex-valued coefficients over the time-frequency domain. The columns are also known as analysis frames.
The selection of a time-frequency representation for sound texture processing is not a trivial issue. For processing speech or instrumental sounds, the STFT is a suitable choice: its linear frequency scale is convenient when dealing with the harmonic structures of speech and instruments. However, since most sound textures do not have harmonic structures, the suitability of the STFT should be reconsidered. Another problem is that McDermott’s experiments (McDermott and Simoncelli, 2011a) were conducted with perceptual bands, which follow a logarithmic frequency scale. It is important to know whether his conclusions are also applicable to linear frequency scales.
The next subsections will discuss what kind of TFRs are suitable to be the basis of a time-frequency domain statistical description for sound textures.
Data Grid Regularity
Most traditional TFRs, such as the STFT and the CQT, have a regular sampling interval within each frequency bin: the time difference between two horizontally neighbouring coefficients is always the same. Some TFRs do not follow this rule, for example the non-stationary Gabor frame (Balazs et al., 2011) and the adaptive spectrogram (Chi et al., 2005, Liuni et al., 2011). In order to evaluate autocorrelation and cross-correlation functions, we need regularity on both the time and frequency axes, that is, a regular sampling interval that applies to all frequency bins, as in the STFT. The condition may sound strict, but many irregular TFRs, including most wavelet transforms, can satisfy it with configuration changes at the cost of extra computation. One example is the invertible CQT combined with the non-stationary Gabor frame (ERBlet constant-Q transform via non-stationary Gabor filterbanks) (Necciari et al., 2013).
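The reason for requiring a regular grid can be seen in a small sketch: when every bin shares the same frame index, a lag of τ frames means the same time shift in every bin, so correlation functions reduce to sums over shifted rows of one array. The code assumes a magnitude TFR stored as a bins-by-frames array; whether correlations are taken on magnitudes or on complex coefficients is an assumption of this example, and the names `temporal_acf` and `spectral_ccf` are hypothetical.

```python
import numpy as np

def temporal_acf(tfr_mag, max_lag):
    """Per-bin autocorrelation A_k(tau) = sum_n M[k, n] * M[k, n + tau], tau = 0..max_lag."""
    m = np.asarray(tfr_mag, dtype=float)
    n_frames = m.shape[1]
    return np.stack(
        [np.sum(m[:, :n_frames - tau] * m[:, tau:], axis=1) for tau in range(max_lag + 1)],
        axis=1,
    )

def spectral_ccf(tfr_mag, max_lag):
    """Cross-correlation between adjacent bins: sum_n M[k, n] * M[k + 1, n + tau]."""
    m = np.asarray(tfr_mag, dtype=float)
    n_frames = m.shape[1]
    return np.stack(
        [np.sum(m[:-1, :n_frames - tau] * m[1:, tau:], axis=1) for tau in range(max_lag + 1)],
        axis=1,
    )

rng = np.random.default_rng(0)
grid = np.abs(rng.standard_normal((4, 100)))   # toy TFR: 4 bins, 100 frames
acf = temporal_acf(grid, max_lag=5)            # shape (4, 6)
ccf_adjacent = spectral_ccf(grid, max_lag=5)   # shape (3, 6)
```

On an irregular grid the rows would have different lengths or frame times, and the shifted products above would no longer line up.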
STFT v.s. invertible ERBlet CQT
It seems that the STFT fits both criteria and is less computationally intensive. Under the conditions above, the STFT seems to be a reasonable choice. The only problem lies in its linear frequency scale. All of McDermott’s theories are based on perceptual frequency bands, which follow a logarithmic frequency scale. However, we can show that, if a perceptual band is divided into several smaller sub-bands, preserving the correlation functions of the sub-bands will also preserve the autocorrelation of the perceptual band.
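If a real signal s = x + y, where x and y are real, preserving the correlation functions of x and y will also preserve the autocorrelation of s. Consider the equation below:
$$
\begin{aligned}
A_{x+y}(\tau) &= \int_{-\infty}^{\infty} \big(x(t) + y(t)\big)\big(x(t+\tau) + y(t+\tau)\big)\, dt \\
&= \int_{-\infty}^{\infty} \big(x(t)x(t+\tau) + y(t)x(t+\tau) + x(t)y(t+\tau) + y(t)y(t+\tau)\big)\, dt \\
&= A_x(\tau) + C_{y,x}(\tau) + C_{x,y}(\tau) + A_y(\tau), \quad x \in \mathbb{R},\ y \in \mathbb{R} \\
&= A_x(\tau) + A_y(\tau) + C_{x,y}(\tau) + C_{x,y}(-\tau).
\end{aligned}
\tag{3.1}
$$
The last step uses the fact that, for real signals, C_{y,x}(τ) = C_{x,y}(−τ); the two cross terms combine into 2C_{x,y}(τ) only when C_{x,y} is symmetric in τ.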
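The four-term expansion in (3.1), A_x + C_{y,x} + C_{x,y} + A_y, can be verified numerically with finite sums in place of the integrals. This is a sketch under that assumption; the helper `corr` and the random test signals are hypothetical.

```python
import numpy as np

def corr(x, y, max_lag):
    """Finite-sum correlation sum_t x[t] * y[t + tau] for tau in [-max_lag, max_lag]."""
    n = len(x)
    out = []
    for tau in range(-max_lag, max_lag + 1):
        if tau >= 0:
            out.append(np.dot(x[:n - tau], y[tau:]))
        else:
            out.append(np.dot(x[-tau:], y[:n + tau]))
    return np.array(out)

rng = np.random.default_rng(1)
x = rng.standard_normal(512)
y = rng.standard_normal(512)
L = 16

lhs = corr(x + y, x + y, L)                                            # A_{x+y}
rhs = corr(x, x, L) + corr(y, x, L) + corr(x, y, L) + corr(y, y, L)    # expansion
identity_holds = np.allclose(lhs, rhs)

# For real signals C_{y,x}(tau) equals C_{x,y}(-tau), i.e. the reversed lag axis.
mirror_holds = np.allclose(corr(y, x, L), corr(x, y, L)[::-1])
print(identity_holds, mirror_holds)
```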
According to the result of (3.1), preserving the autocorrelation and cross-correlation functions of the individual components also preserves the autocorrelation function of their sum. This means that, if a perceptual band is divided into several linear frequency bands, preserving the correlation functions of these bands will also preserve the autocorrelation function of the perceptual band. This result strengthens the case for linear frequency scales. However, if the frequency resolution of a linear frequency setup is too low, one linear frequency band may contain multiple perceptual bands in the low-frequency range. In this case, preserving the correlation functions of the linear frequency band cannot guarantee the correlation functions of the underlying perceptual bands. Therefore, the frequency resolution should be chosen such that the bandwidth of each bin is not greater than that of the narrowest perceptual band.
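As a rough illustration of this resolution constraint, the sketch below picks the smallest power-of-two FFT size whose bin spacing does not exceed the narrowest perceptual band. It makes several assumptions that are not taken from this thesis: the perceptual bandwidth is given by the ERB formula of (Glasberg and Moore, 1990), the bin spacing (sample rate divided by FFT size) is used as a proxy for the bin bandwidth, and the sample rate and lowest centre frequency are example values. The function names are hypothetical.

```python
import math

def erb_bandwidth_hz(f_hz):
    """ERB of the auditory filter centred at f_hz (Glasberg and Moore, 1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def min_fft_size(sample_rate_hz, lowest_centre_hz):
    """Smallest power-of-two FFT size with bin spacing <= the narrowest ERB."""
    narrowest = erb_bandwidth_hz(lowest_centre_hz)   # ERB grows with frequency
    n_min = math.ceil(sample_rate_hz / narrowest)
    return 1 << (n_min - 1).bit_length()             # round up to a power of two

# Assumed setup: 44.1 kHz audio, lowest perceptual band centred at 50 Hz.
n_fft = min_fft_size(44100, 50.0)
bin_spacing = 44100 / n_fft
print(n_fft, bin_spacing)
```

The effective bandwidth of an STFT bin also depends on the analysis window, so a practical setup may need a larger size than this lower bound.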
Another good choice is the invertible ERBlet CQT (Necciari et al., 2013). It has a logarithmic frequency scale, which fits auditory perception. A setup which satisfies the data grid regularity can be built with the aid of the LTFAT toolbox (Průša et al., 2014). The resynthesized ERBlet CQT spectrogram is shown in Fig. 3.1. Unfortunately, even though the proposed algorithm properly generates magnitudes for the TFR, we cannot assign proper phase values to it: conventional phase reconstruction algorithms do not work for the invertible ERBlet CQT. Therefore, in the end, we select the STFT as the base TFR for the statistical description to achieve sound texture analysis/synthesis. However, readers should note that the proposed statistical description can be applied to any TFR which satisfies the two criteria above.
Table of contents :
Declaration of Authorship
List of Figures
List of Tables
1.1 What are Environmental Sounds and Sound Textures
1.2 Research Motivation
1.3 Difficulties of Sound Texture Transformation
1.4 Perception and Statistics
1.5 Signal Representation
1.6 Perceptually Important Statistics of Sound Textures
1.6.2 Temporal Correlation
1.6.3 Spectral Correlation
1.6.4 Spectro-Temporal Correlation
2 State of the Art
2.1 Early Attempts
2.2 Model-Based Synthesis
2.3 Granular-Based Synthesis
3 Statistical Description of Sound Textures over TFR
3.1 Selection of Time-Frequency Representation
3.1.2 Data Grid Regularity
3.1.3 STFT v.s. invertible ERBlet CQT
3.2 Overview of the Statistical Description
3.3 Evaluate Statistics from TFR
4 Imposing Statistics
4.1 Full Imposition of Correlation Functions
4.2 Imposition of Statistical Moments
4.2.1 Temporal Domain Imposition
4.2.2 Spectral Domain Imposition
4.3 Partial Imposition of Correlation Functions
5 Proposed Method, Summary
5.2.1 Initialization, Preprocessing
5.2.2 Correlation Function Imposition
5.2.3 Moment Imposition
5.2.4 Phase Reconstruction
6.1 Objective Evaluation
6.1.2 Measurement of Statistics of Resynthesized Sounds
6.2 Subjective Evaluation
6.2.1 Experiment 1: The effect of different cross-correlation function length
6.2.2 Experiment 2a: Compare with Bruna’s work
6.2.3 Experiment 2b: Compare with McDermott’s work
7 Conclusion & Perspectives
A Raw Moments in Terms of Spectral Correlation Functions
B The Complex Differentiability of the partial correlation imposition
C The SNR (Signal-to-Noise Ratio) of Correlation Functions for the Sound Texture Samples