Interactions Between Sequential and Simultaneous Integrations


General Acoustic Regularities used for Primitive Segregation

As explained above, these regularities of the world are used for scene analysis, even if the listener is not familiar with the signal. Bregman reports four of them that have been identified as utilized by the auditory system:
1. It is extremely rare that sounds without any relations between them start and stop precisely at the same time.
2. Progression of the transformation:
i. The properties of an isolated sound tend to change continuously and slowly.
ii. The properties of a sequence of sounds arising from the same source tend to change slowly.
3. When a sounding object is vibrating at a repeated period, its vibrations give rise to an acoustic pattern with frequency components that are multiples of a common fundamental frequency.
4. “Common fate”: Most modifications arising from an acoustic signal will affect all components of the resulting sound, identically and simultaneously.
The first general regularity is used by the auditory system through what Bregman calls the “old-plus-new” heuristic: when a spectrum suddenly becomes more complex, while holding its initial frequency components, it is interpreted by the nervous system as a continuation of a former signal to which is added a new signal.
The second regularity gives rise to two rules that the auditory system exploits, both related to the sequential modification of sounds in the environment.
The first rule concerns the “sudden transformation” of the acoustic properties of a signal, which is interpreted as the beginning of a new signal. It is guided by the old-plus-new heuristic. The suddenness of the spectral transformation acts as any other cue in scene analysis: the greater it is, the more it affects the grouping process. The second rule leads to “grouping by similarity.” Similarity of sounds is not well understood, but is related to the fundamental frequency, the timbre (spectrum shape) and spatial localization. Similarity (as well as proximity, which is related) is discussed in section 1.3.5.

Apparent Motion and Auditory Streaming

Körte [Kör15] formulated several laws about the impression of movement produced by a panel of electric light bulbs flashed alternately in sequence. His third law states that when the distance between the lamps increases, the alternation of flashes must be slowed down to keep a strong impression of motion. Consider a device like the one depicted in figure 1.12a, whose lights are switched at a moderate speed according to, for example, the pattern 142536: the observer should see an irregular motion between members of the two separate sets of lamps. But as the speed increases, the motion appears to split into two separate streams, one occurring in the right triplet (123), and the other in the left one (456). This phenomenon occurs because, at that speed, the distance between the two triplets is too great for a move between triplets to be plausible, as predicted by Körte’s law. We get exactly the same phenomenon of streaming in audition, when listening at high speed to the looping sequence presented in figure 1.12b: the heard pattern is not 361425 as it is at a lower speed, but is divided into two streams, 312 (corresponding to the low tones) and 645 (corresponding to the high tones). According to Körte’s law, with melodic motion taking the place of spatial motion, the distance in frequency between the two groups of tones is too great for the speed of movement between them.
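The splitting of the looping sequence 361425 into the streams 312 and 645 can be illustrated with a toy rule. The sketch below is not a perceptual model: the pitch values, the `SLOPE` constant, and the linear relation between tempo and the largest tolerated pitch jump are all hypothetical choices made only to mimic the qualitative behavior described by Körte’s third law (faster alternation tolerates smaller distances).

```python
def split_into_streams(sequence, pitch, max_jump):
    """Greedily assign each tone to the stream whose last tone is closest in pitch.

    A tone joins that stream only if the pitch jump does not exceed max_jump;
    otherwise it starts a new stream.
    """
    streams = []
    for tone in sequence:
        best = min(
            streams,
            key=lambda s: abs(pitch[s[-1]] - pitch[tone]),
            default=None,
        )
        if best is not None and abs(pitch[best[-1]] - pitch[tone]) <= max_jump:
            best.append(tone)
        else:
            streams.append([tone])
    return streams


# Hypothetical pitches in semitones: tones 1-3 are low, tones 4-6 are high.
PITCH = {1: 0, 2: 2, 3: 4, 4: 20, 5: 22, 6: 24}
SEQUENCE = [3, 6, 1, 4, 2, 5]

# Arbitrary assumption: the tolerated jump grows linearly with the
# inter-onset interval (in seconds), so slower sequences tolerate larger jumps.
SLOPE = 100.0  # semitones per second, illustrative only


def max_jump_for(inter_onset_interval):
    return SLOPE * inter_onset_interval


slow = split_into_streams(SEQUENCE, PITCH, max_jump_for(0.4))
fast = split_into_streams(SEQUENCE, PITCH, max_jump_for(0.1))
print(slow)  # one stream: [[3, 6, 1, 4, 2, 5]]
print(fast)  # two streams: [[3, 1, 2], [6, 4, 5]]
```

With these made-up values, the slow presentation keeps all six tones in a single stream, whereas the fast presentation separates the low tones from the high ones, as in the example of figure 1.12b.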

The Principle of Exclusive Allocation

On the left side of figure 1.13, the part of the drawing at which the irregular form overlaps the circle (shown with a bold stroke on the right side of the figure) is generally seen as part of the irregular shape: it belongs to the irregular form. It can be seen as part of the circle with an effort. Be that as it may, the principle of “belongingness,” introduced by the Gestalt psychologists, designates the fact that a property is always a property of something.
This principle is linked to that of “exclusive allocation of evidence,” illustrated in figure 1.14, which states that a sensory element should not be used in more than one description at a time. Thus, in the figure, we can see the separating edges as exclusively allocated to the vase at the center, or to the faces at the sides, but never to both of them. So, this second principle corresponds to the belongingness one, with a unique allocation at a time.
These principles of vision can be applied to audition as well, as shown by the experiment by Bregman and Rudnicky [BR75] illustrated in figure 1.15. The task of the subject was to determine the order of the two target tones A and B: high-low or low-high. When the pattern AB is presented alone, the subject easily finds the correct order. But, when the two tones F (for “flankers”) are added, such that we get the pattern FABF, subjects have difficulty hearing the order of A and B, because they are now part of an auditory stream. However, it is possible to assign the F tones to a different perceptual stream from that of the A and B tones, by adding a third group of tones, labeled C for “captors.”

Localization Cues in Auditory Scene Analysis

This section concerns the integration of spatial cues in the auditory scene analysis (ASA) process. As explained by Bregman in [Bre94] (see section 1.3), the auditory system, in its grouping process, seems to act as a voting system based on heuristic criteria. Spatial cues form part of these criteria, but do not necessarily overpower other heuristics when in conflict with them; for instance, we are able to segregate different voices from a monophonic recording.
The sequential and simultaneous integration of spatial cues are discussed in separate parts, but because these two kinds of integration interact with each other, a third part deals with this interaction. The last part treats the particular case of speech.

Interactions Between Sequential and Simultaneous Integrations

Despite the small influence of spatial cues on simultaneous integration, the streaming of sound elements over time is more strongly influenced by spatial location, particularly for speech intelligibility. Darwin and Hukin [DH00] asked subjects to report a target word contained in a target carrier phrase, while a second carrier phrase was presented simultaneously.
Thereby, two candidate target words were presented simultaneously during a time-aligned temporal gap present in both the target and competing carrier phrases. Despite the presence of grouping cues opposed to spatial cues, the subjects reported the target word that spatially matched the target phrase. Shinn-Cunningham [Shi05] proposed a simplistic view of how spatial cues affect auditory scene analysis:
1. Spatial cues do not influence grouping of simultaneous sources. Instead other sound features determine how simultaneous or near-simultaneous sounds are grouped locally in time and frequency, forming “snippets” of sound.
2. Once a sound snippet is formed, its spatial location is computed, based primarily on the spatial cues in the sound elements grouped into that snippet.
3. Sound snippets are then pieced together across time in a process that relies heavily on perceived location of the snippets.
However, it has been shown by Darwin and Hukin [DH97, DH98] that spatial cues can also influence simultaneous grouping when other grouping cues are ambiguous. With stimuli in which a target tone could logically fall into one of two streams, one a sequence of repeated tones and the other a simultaneous harmonic complex, they measured the degree to which the ambiguous target was heard as part of the harmonic complex. The results showed that when the spatial cues in the target and the harmonic complex matched, the tone was heard more prominently in the harmonic complex than when the spatial cues were uninformative. But this study also shows that spatial cues can influence grouping when top-down listener expectations also influence grouping: the results of a given stimulus depend on what subjects heard in past trials. In the same spirit, research conducted by Lee and Shinn-Cunningham [LSO05] provided evidence that the perceptual organization of a mixture depends on what a listener is attending to. In particular, with the previous tone paradigm and by changing which object a given listener was asked to attend to (holding the same stimuli), they found that there was no predictive relationship between the degree to which the target was in one auditory object and the degree to which it was out of the other.


Mid/Side Stereo Coding

Mid/Side Stereo Coding [JF92] is a form of matrixing used to reduce correlations between the two channels of a stereo signal. The principle is to transform the left and right channels into a sum channel, called mid channel, $m[n] = \frac{1}{\sqrt{2}}(l[n] + r[n])$, and a difference channel, called side channel, $s[n] = \frac{1}{\sqrt{2}}(l[n] - r[n])$, which carries the residuals. This way, the m and s channels are less correlated than the original l and r channels, and in particular, the entropy of s is reduced. At the decoding phase, by using the same operations on the transformed channels, the original stereo channels can be fully recovered, provided that no additional lossy coding has been used on either of the transformed channels (see section 2.4.1). M/S coding is one of the two joint stereo techniques, together with ISC (presented in section 2.5), and is usually applied for coding low frequencies, whereas ISC is used for high-frequency coding.
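As a minimal sketch of the mid/side transform described above, the following Python code applies the sum/difference matrixing with the $1/\sqrt{2}$ scaling and checks that applying the same operations again recovers the original channels. The function names and the synthetic test signal are illustrative assumptions, not part of [JF92].

```python
import numpy as np


def ms_encode(left, right):
    """Transform left/right channels into mid (sum) and side (difference)."""
    mid = (left + right) / np.sqrt(2.0)
    side = (left - right) / np.sqrt(2.0)
    return mid, side


def ms_decode(mid, side):
    """Apply the same sum/difference operations to recover left and right."""
    left = (mid + side) / np.sqrt(2.0)
    right = (mid - side) / np.sqrt(2.0)
    return left, right


# Hypothetical test signal: two strongly correlated channels.
rng = np.random.default_rng(0)
common = rng.standard_normal(1024)
left = common + 0.05 * rng.standard_normal(1024)
right = common + 0.05 * rng.standard_normal(1024)

mid, side = ms_encode(left, right)
left_rec, right_rec = ms_decode(mid, side)

print("perfect reconstruction:", np.allclose(left_rec, left) and np.allclose(right_rec, right))
print("variance of left:", np.var(left))
print("variance of side:", np.var(side))
```

For such correlated channels, the side channel has a much smaller variance than either original channel, which is what makes it cheaper to code. Because the matrix is orthonormal, the total energy of the pair is unchanged by the transform.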

Meridian Lossless Packing

For the multi-channel audio case, Meridian Audio proposes in [GCLW99] an invertible matrixing technique that reduces inter-channel correlations before linear prediction is applied to each channel separately. The resulting scheme, called Meridian Lossless Packing (MLP), is used on DVD-Audio discs and forms the basis of Dolby’s TrueHD codec on Blu-ray discs. MLP typically provides a 2:1 compression for music content.

Matrixing Based on Channel Covariance

Yang et al. [YAKK03, YKK04] proposed a high bit-rate coding model based on inter-channel redundancy removal, called modified AAC with Karhunen-Loève transform (MAACKLT). In this method, the input channels are statistically decorrelated by applying a Karhunen-Loève transform (KLT). The benefit of this transform is that most of the energy is compacted into the first few resulting channels, allowing for significant data compression by entropy coding or by keeping only the channels associated with the highest variances. Energy masking thresholds are computed from the transformed signals, and their frequency components are then quantized. As its name indicates, MAACKLT is designed to be incorporated into the AAC coding scheme [BBQ+97].
The Karhunen-Loève transform, also known as principal components analysis (PCA), is a linear transformation projecting data onto the eigenvector basis of their covariance matrix. In our case, data are the $n$ (correlated) channels of the signal, represented by the $n \times k$ matrix $V$ ($k$ is the number of samples of a temporal frame): $V = [V(1), \ldots, V(i), \ldots, V(k)]$, with $V(i) = [x_1, x_2, \ldots, x_n]^T$.
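The projection onto the eigenvector basis of the covariance can be sketched with NumPy as follows. This is a generic KLT/PCA illustration under stated assumptions, not the MAACKLT implementation: the mean removal, the function names, and the synthetic three-channel frame are choices made for this example only.

```python
import numpy as np


def klt(V):
    """Karhunen-Loeve transform of an n x k frame (n channels, k samples)."""
    mean = V.mean(axis=1, keepdims=True)      # assumption: remove channel means
    centered = V - mean
    cov = centered @ centered.T / V.shape[1]  # n x n channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    eigvals = eigvals[order]
    basis = eigvecs[:, order]                 # columns are eigenvectors
    U = basis.T @ centered                    # decorrelated channels
    return U, basis, eigvals, mean


def inverse_klt(U, basis, mean):
    """Invert the transform (the basis is orthonormal)."""
    return basis @ U + mean


# Hypothetical frame: three channels sharing one source plus small noise.
rng = np.random.default_rng(1)
source = rng.standard_normal(2048)
V = np.vstack([
    1.0 * source + 0.1 * rng.standard_normal(2048),
    0.8 * source + 0.1 * rng.standard_normal(2048),
    0.6 * source + 0.1 * rng.standard_normal(2048),
])

U, basis, eigvals, mean = klt(V)
V_rec = inverse_klt(U, basis, mean)

print("share of variance per transformed channel:", eigvals / eigvals.sum())
print("reconstruction exact:", np.allclose(V_rec, V))
```

In this synthetic case almost all of the variance ends up in the first transformed channel, which illustrates the energy compaction that the method relies on. The eigenvector basis has to be available at the decoder for the inverse transform.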

Parametric Spatial Audio Coding

So far, in matrixing techniques, the effort was put on trying to transmit all the channels by combining them or by eliminating redundancies between them. A rather different approach, presented in this section, is to transmit a parametric representation of the spatial attributes of the sound scene. As depicted in figure 2.4, the general idea is, on the one hand, to use a time-frequency representation to extract from the input channels a set of “spatial” parameters describing the spatial organization of the scene/channels (see section 2.5.1), and, on the other hand, to group all these channels using a downmixing technique (see section 2.5.2) to form a single mono or stereo signal, thereby reducing inter-channel redundancies. The spatial parameters are then used in the decoding phase (see section 2.5.3) to reconstruct from the downmix an approximation of the original channels, or even to generate a new set of channels adapted to another loudspeaker setup.
These schemes rely in particular on the assumption that the transmission of the parameters takes only a few kbit/s, which is very small compared to the bit rate dedicated to the audio channel(s). This means that the quantization process of these parameters has to be performed carefully to ensure a reliable spatial representation with only a few bits (see section 2.5.4). Besides, as already mentioned in section 2.4.1, if the downmix is perceptually encoded prior to transmission, binaural unmasking has to be taken into account to avoid noise unmasking [tK96].
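The encode/decode chain outlined above can be sketched for a stereo input as follows. This is a deliberately simplified illustration under several assumptions that do not come from the text: the only spatial parameter is an inter-channel level difference per frame and band, frames are rectangular and non-overlapping, bands are uniform, the downmix is a plain average, and the synthesis assumes in-phase (amplitude-panned) channels. The names `encode`, `decode`, `FRAME` and `N_BANDS` are hypothetical.

```python
import numpy as np

FRAME = 256   # samples per analysis frame (assumed)
N_BANDS = 8   # parameter bands per frame (assumed)


def band_edges(n_bins, n_bands=N_BANDS):
    """Uniform band limits over the FFT bins (a simplification)."""
    return np.linspace(0, n_bins, n_bands + 1).astype(int)


def encode(left, right):
    """Return a mono downmix and one level difference (dB) per frame and band."""
    n_frames = len(left) // FRAME
    n = n_frames * FRAME
    spec_l = np.fft.rfft(left[:n].reshape(n_frames, FRAME), axis=1)
    spec_r = np.fft.rfft(right[:n].reshape(n_frames, FRAME), axis=1)
    edges = band_edges(spec_l.shape[1])
    level_diff_db = np.empty((n_frames, N_BANDS))
    for b in range(N_BANDS):
        band = slice(edges[b], edges[b + 1])
        power_l = np.sum(np.abs(spec_l[:, band]) ** 2, axis=1)
        power_r = np.sum(np.abs(spec_r[:, band]) ** 2, axis=1)
        level_diff_db[:, b] = 10.0 * np.log10((power_l + 1e-12) / (power_r + 1e-12))
    downmix = 0.5 * (left[:n] + right[:n])
    return downmix, level_diff_db


def decode(downmix, level_diff_db):
    """Rebuild approximate left/right channels from the downmix and parameters."""
    n_frames = level_diff_db.shape[0]
    spec_m = np.fft.rfft(downmix.reshape(n_frames, FRAME), axis=1)
    edges = band_edges(spec_m.shape[1])
    spec_l = np.zeros_like(spec_m)
    spec_r = np.zeros_like(spec_m)
    for b in range(N_BANDS):
        band = slice(edges[b], edges[b + 1])
        # Amplitude ratio left/right implied by the level difference.
        ratio = 10.0 ** (level_diff_db[:, b:b + 1] / 20.0)
        spec_l[:, band] = spec_m[:, band] * 2.0 * ratio / (1.0 + ratio)
        spec_r[:, band] = spec_m[:, band] * 2.0 / (1.0 + ratio)
    left = np.fft.irfft(spec_l, n=FRAME, axis=1).reshape(-1)
    right = np.fft.irfft(spec_r, n=FRAME, axis=1).reshape(-1)
    return left, right


# Hypothetical amplitude-panned source: the case this toy synthesis handles exactly.
rng = np.random.default_rng(2)
source = rng.standard_normal(2048)
left, right = 0.8 * source, 0.4 * source

downmix, level_diff_db = encode(left, right)
left_rec, right_rec = decode(downmix, level_diff_db)
print("parameters transmitted:", level_diff_db.size, "for", 2 * len(left), "input samples")
```

For this amplitude-panned test signal the reconstruction is essentially exact; for real recordings, with phase and coherence differences between channels, a level difference alone would only give an approximation, which is why such schemes extract several kinds of spatial parameters.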

Table of contents:

1 Background on Hearing 
1.1 Ear Physiology
1.1.1 Body and Outer Ear
1.1.2 Middle Ear
1.1.3 Inner Ear
1.2 Integration of Sound Pressure Level
1.2.1 Hearing Area and Loudness
1.2.2 Temporal and Frequency Masking Phenomena
1.2.3 Critical Bands
1.3 Auditory Scene Analysis
1.3.1 General Acoustic Regularities used for Primitive Segregation
1.3.2 Apparent Motion and Auditory Streaming
1.3.3 The Principle of Exclusive Allocation
1.3.4 The Phenomenon of Closure
1.3.5 Forces of Attraction
1.4 Spatial Hearing
1.4.1 Localization in Azimuth
1.4.2 Localization in Elevation
1.4.3 Localization in Distance
1.4.4 Apparent Source Width
1.4.5 Localization Performance
1.4.6 Binaural Unmasking
1.5 Localization Cues in Auditory Scene Analysis
1.5.1 Sequential Integration
1.5.2 Simultaneous Integration
1.5.3 Interactions Between Sequential and Simultaneous Integrations
1.5.4 Speech-Sound Schemata
1.6 Conclusions
2 State-of-the-Art of Spatial Audio Coding 
2.1 Representation of Spatial Audio
2.1.1 Waveform Digitization
2.1.2 Higher-Order Ambisonics
2.2 Coding of Monophonic Audio Signals
2.2.1 Lossless Coding
2.2.2 Lossy Coding
2.3 Lossless Matrixing
2.3.1 Mid/Side Stereo Coding
2.3.2 Meridian Lossless Packing
2.4 Lossy Matrixing
2.4.1 Perceptual Mid/Side Stereo Coding
2.4.2 Matrix Encoding
2.4.3 Matrixing Based on Channel Covariance
2.5 Parametric Spatial Audio Coding
2.5.1 Extraction of the Spatial Parameters
2.5.2 Computation of the Downmix Signal
2.5.3 Spatial Synthesis
2.5.4 Quantization of the Spatial Parameters
2.6 Conclusions
3 Spatial Blurring 
3.1 Motivations
3.2 Terms and Definitions
3.3 Paradigm for MAA Assessment
3.4 Stimuli
3.5 Subjects, Rooms and General Setup
3.6 Experiment 1: Spatial Blurring From One Distracter
3.6.1 Setup
3.6.2 Procedure
3.6.3 Data Analysis and Results
3.7 Experiment 2: Effect of the Signal-to-Noise Ratio
3.7.1 Setup
3.7.2 Procedure
3.7.3 Tasks
3.7.4 Adaptive Method Setup
3.7.5 Data Analysis and Results
3.8 Experiment 3: Effect of the Distracter Position
3.8.1 Setup
3.8.2 Procedure
3.8.3 Data Analysis and Results
3.8.4 Validity of Our Experimental Protocol
3.9 Experiment 4: Interaction Between Multiple Distracters
3.9.1 Setup
3.9.2 Procedure
3.9.3 Data Analysis and Results
3.10 Summary and Conclusions
4 Towards a Model of Spatial Blurring and Localization Blur 
4.1 Assumptions
4.2 Formalism and Overview
4.3 Computation of Masking Thresholds
4.4 Reference Value of Spatial Blurring
4.5 Accounting for the Effect of SNR
4.6 Additivity of Distracters
4.7 Resulting Localization Blur
4.8 Simplification of the Model
4.9 Conclusions
5 Multichannel Audio Coding Based on Spatial Blurring
5.1 Dynamic Bit Allocation in Parametric Schemes
5.1.1 Principle Overview
5.1.2 Use of our Psychoacoustic Model of Spatial Blurring
5.1.3 Bit Allocation of the Spatial Parameters
5.1.4 Transmission and Bitstream Unpacking
5.1.5 Informal Listening
5.2 Dynamic Truncation of the HOA Order
5.2.1 Spatial Distortions Resulting from Truncation
5.2.2 Principle Overview
5.2.3 Modes Of Operation
5.2.4 Time-Frequency Transform
5.2.5 Spatial Projection
5.2.6 Spatio-Frequency Analysis
5.2.7 Psychoacoustic Model
5.2.8 Space Partitioning
5.2.9 Space Decomposition
5.2.10 HOA Order Truncation
5.2.11 Bit-Quantization by Simultaneous Masking
5.2.12 Bitstream Generation
5.2.13 Decoding
5.3 Conclusions
A Instructions Given to the Subjects 
A.1 Left-Right/Right-Left Task
A.2 Audibility Task
B Study of Inter- and Intra-subject variability 
C Audio Coding Based on Energetic Masking 
C.1 Principle Overview
C.2 Quantization Errors
C.3 Modeling Simultaneous Masking Curves
C.4 Computation of Masking Curves
C.5 Bit Allocation Strategies
C.5.1 A Simple Allocation Method by Thresholding
C.5.2 Perceptual Entropy
C.5.3 Optimal Bit Allocation for a Fixed Bitrate
C.6 Bitstream Format

