Apparent Motion and Auditory Streaming
Körte formulated [Kör15] several laws about the impression of movement produced by a panel of electric light bulbs flashed alternately in sequence. His third law states that when the distance between the lamps increases, the alternation of flashes must be slowed down to keep a strong impression of motion. In an experiment where the lights of a device like the one depicted in figure 1.12a are switched at a sufficient speed, for example according to the pattern 142536, an irregular motion between members of the two separate sets of lamps should be seen. But as the speed increases, the motion appears to split into two separate streams, one occurring in the right triplet (123), and the other in the left one (456). This phenomenon occurs because the distance between the two triplets is too great for a move between triplets to be plausible at that speed, as predicted by Körte’s law. We get exactly the same phenomenon of streaming in audition, when listening at high speed to the looping sequence presented in figure 1.12b: the heard pattern is not 361425 as it is at a lower speed, but is divided into two streams, 312 (corresponding to the low tones) and 645 (corresponding to the high tones). According to Körte’s law, with melodic motion taking the place of spatial motion, the distance in frequency between the two groups of tones is too great for the speed of movement between them.
The Principle of Exclusive Allocation
On the left side of figure 1.13, the part of the drawing at which the irregular form overlaps the circle (shown with a bold stroke on the right side of the figure) is generally seen as part of the irregular shape: it belongs to the irregular form. With some effort, it can also be seen as part of the circle. Be that as it may, the principle of “belongingness,” introduced by the Gestalt psychologists, designates the fact that a property is always a property of something.
This principle is linked to that of “exclusive allocation of evidence,” illustrated in figure 1.14, which states that a sensory element should not be used in more than one description at a time. Thus, in the figure, we can see the separating edges as exclusively allocated to the vase at the center, or to the faces at the sides, but never to both of them. So, this second principle corresponds to the belongingness one, with a unique allocation at a time.
These principles of vision can be applied to audition as well, as shown by the experiment by Bregman and Rudnicky [BR75] illustrated in figure 1.15. The task of the subject was to determine the order of the two target tones A and B: high-low or low-high. When the pattern AB is presented alone, the subject easily finds the correct order. But when the two tones F (for “flankers”) are added, such that we get the pattern FABF, subjects have difficulty hearing the order of A and B, because they are now part of an auditory stream. However, it is possible to assign the F tones to a different perceptual stream from that of the A and B tones, by adding a third group of tones, labeled C for “captors.”
Figure 1.15: The tone sequence used by Bregman and Rudnicky to underline the exclusive allocation of evidence. (Reprinted from [Bre94].)
Figure 1.16: An example of the closure phenomenon. Shapes are strong enough to complete evidence that has gaps in it. (After [Fou11].)
When the C tones were close to the F tones in frequency, the latter were captured into a stream CCCFFCC by the former, and the order of AB was clearer than when C tones were much lower than F tones. Thus, when the belongingness of the F tones is switched, the perceived auditory streams are changed.
The Phenomenon of Closure
Also proposed by the Gestalt psychologists, the phenomenon of closure represents the tendency to close certain “strong” perceptual forms such as circles or squares, by completing evidence that has gaps in it. Examples can be seen on the left of figure 1.13 and in figure 1.16.
However, when the forces of closure are not strong enough, as shown in figure 1.17, the presence of the mask may be necessary to provide us with information about which spaces have been occluded, giving us the ability to discriminate the contours that have been produced by the shape of the fragments themselves from those that have been produced by the shape of the mask that is covering them. This phenomenon is called the phenomenon of “perceived continuity,” and has an equivalent in audition. Figure 1.18 presents an experiment where an alternately rising and falling pure-tone glide is periodically interrupted. In this case, several short rising and falling glides are heard. But in the presence of a loud burst of broad-band noise exactly matching the silences, a single continuous sound is heard. Note that to be successful, the interrupting noise must be loud enough and have the right frequency content, corresponding to the interrupted portion of the glide. This is also an illustration of the old-plus-new heuristic (see section 1.3.1).
Figure 1.17: Objects occluded by a masker. On the left, fragments are not in good continuation with one another, but with the presence of the masker, on the right, we get information about occlusion, and the fragments are then grouped into objects. (After [Bre94].)
Forces of Attraction
Let’s make another analogy with vision. In figure 1.19, two processes of perceptual organization are highlighted. The first one (shown on the left side of the figure) concerns the similarity grouping: because of the similarity of color, and thus of the contrast between the black and white blobs, two clusters appear, as in audition when sounds of similar timbre group together.
The second process of perceptual organization is about grouping by proximity and is shown on the right side of the figure. Here, the black blobs fall into two separate clusters, because each member of one cluster is closer to the other members of its cluster than to those of the other one. This Gestalt law has a direct analogy in audition. Figure 1.20 illustrates an experiment in which two sets of tones, one high and the other low in frequency, are shuffled together. As when looking at the figures, listening to the third sequence (on the right) will show greater perceptual segregation than the second, and the second greater than the first.
Thus, forces of attraction act to group objects together perceptually, the most important in audition being proximity in time and in frequency (corresponding to distance in vision). Note that only one factor, time, is really involved in sound proximity, since frequency is highly related to time. So, two kinds of perceptual grouping coexist: one “horizontal,” the sequential grouping, related to time and melody, and one “vertical,” the simultaneous grouping, related to frequency and harmony; these grouping factors interact with each other since time is involved in both. But what if these forces conflict? An experiment by Bregman and Pinker [BP78], displayed in figure 1.21, addresses this question. It consists of a repeating cycle formed by three pure tones A, B and C, arranged in such a way that the frequencies of A and B lie roughly in the same region, and that B and C are roughly synchronous. The experiment showed that it was possible to hear the sequence in two different ways. In the first one, the A and B tones are streamed together, depending on their proximity in frequency. In the second one, the B and C tones are fused into a complex sound if their synchrony is sufficient. It was as if A and C were competing to see which one would get to group with B.
Finally, our perceptual system tries to integrate these grouping laws in order to build a description of the scene. However, the resulting description is not always correct, as shown by an illusion set up by Diana Deutsch [Deu74a]. The listener is presented with a continuously repeating alternation of two events. The first event is a low tone presented to the left ear synchronously with a high tone (one octave above) to the right ear. The second event is the reverse: left/high and right/low. However, many listeners described another experience. They heard a single sound bouncing back and forth between the ears, and alternating between high and low pitch. The explanation is that, assuming the existence of a single tone, the listeners derived two different descriptions of it from two different types of perceptual analysis, and combined them incorrectly.
The auditory system is able, even when sight is not available, to derive more or less precisely the position of sound sources in three dimensions, thanks to its two auditory sensors, the ears. A spherical coordinate system is useful to represent each sound source of the auditory scene with three coordinates relative to the center of the listener’s head: the azimuth, the elevation, and the distance, thereby defining the three dimensions of sound localization. Localization in azimuth (see section 1.4.1) is mainly attributed to a binaural processing of cues, based on the integration of time and intensity differences between ear inputs, whereas localization in elevation (see section 1.4.2) is explained by the use of monaural cues. Monaural cues, however, also play a role in localization in azimuth. Localization in distance (see section 1.4.3) is more related to characteristics of the sources, like spectral content and coherence.
Localization in Azimuth
As illustrated in figure 1.22, from a physical point of view, if one considers a single monochromatic sound source, the incident sound wave will directly reach the closest ear (the ipsilateral ear). Before the wave reaches the other ear (the contralateral ear), the head of the listener constitutes an obstacle to its propagation and, depending on its wavelength, the wave is partly diffracted and partly reflected by the head. The greater distance of the contralateral ear from the sound source, in conjunction with the diffraction of the wave by the head, induces a delay between the times of arrival of the wave at each ear, namely an interaural time difference (ITD). The reflection by the head attenuates the wave before it reaches the contralateral ear, resulting in an interaural intensity difference (IID), also known as interaural level difference (ILD). The duplex theory, proposed by Lord Rayleigh [Ray07] in 1907, states that our lateralization ability (localization along only one dimension, the interaural axis) is actually based on the integration of these interaural differences. It has been confirmed by more recent studies [Bla97] that ITDs and ILDs are indeed used as cues to derive the position of sound sources in azimuth.
Neural processing of interaural differences
This section aims to give a physiological justification of the binaural cues ITD and ILD. Following on from section 1.1, which describes the ear physiology, the inner hair cells of the organ of Corti convert the motions occurring along the basilar membrane into electrical impulses which are transmitted to the brain through the auditory (or cochlear) nerve. Hence, each fiber in the nerve is related to a particular band of frequencies from the cochlea and has a particular temporal structure depending on impulses through the fiber. When phase-locking is effective (that is for low frequencies [PR86]), discharges through the fiber occur within a well-defined time window relative to a single period of the sinusoid. The signal coming from the cochlea passes through several relays in the auditory brainstem (see figure 1.23) before reaching the auditory cortex. At each relay, the initial tonotopic coding from the cochlea is projected, as certain neurons respond principally to components close to their best frequency. Note that the two parts (left ear and right ear) of this brainstem are interconnected, allowing for binaural processing of information. The center that interests us is the superior olivary complex (SOC). In most mammals, two major types of binaural neurons are found within this complex.
In 1948, Jeffress [Jef48] proposed a model of ITD processing which is consistent with more recent studies [JSY98, MJP01]. A nucleus of the SOC, the medial superior olive (MSO), hosts cells designated as excitatory-excitatory (EE) because they receive excitatory input from the cochlear nucleus (CN) of both sides. An axon from one CN and an axon from the contralateral CN are then connected to an EE cell. An EE cell is a “coincidence detector” neuron: its response is maximum for simultaneous inputs. Since each axon from a CN has its own conduction time, this CN-EE-CN triplet is sensitive to a particular ITD, and the whole set of such triplets finally acts as a cross-correlator. Consequently, phase-locking is an essential prerequisite for this process to be effective.
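The cross-correlator view of the Jeffress model can be illustrated with a minimal signal-processing sketch. It is not a physiological simulation: each candidate lag merely stands in for one coincidence detector with a fixed internal delay, and the estimated ITD is the lag whose detector responds most strongly. The function name `estimate_itd`, the 1 ms search range and the sign convention are assumptions made for this illustration.

```python
import numpy as np

def estimate_itd(left, right, fs, max_itd=1e-3):
    """Estimate the ITD (in seconds) as the lag maximising the cross-correlation.

    A positive result means the right-ear signal lags the left-ear signal.
    max_itd (hypothetical default: 1 ms) bounds the search; it must span at
    least one sample at the sampling rate fs.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    max_lag = int(round(max_itd * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    core = slice(max_lag, -max_lag)  # drop samples affected by wrap-around
    # One "coincidence detector" per lag: correlate left with a shifted right.
    scores = [np.dot(left[core], np.roll(right, -lag)[core]) for lag in lags]
    return lags[int(np.argmax(scores))] / fs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 48000
    left = rng.standard_normal(4800)
    # Delay the right ear by 12 samples (250 microseconds at 48 kHz).
    right = np.concatenate([np.zeros(12), left[:-12]])
    print(estimate_itd(left, right, fs))
```

In this sketch the whole band is correlated at once; a model closer to the description above would run one such correlator per frequency band.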
The second type of binaural neuron [Tol03] is a subgroup of cells of a nucleus of the SOC, the lateral superior olive (LSO), which are excited by the signals from one ear and inhibited by the signals from the other ear, and thus are designated as excitation-inhibition (EI) type. To do so, the signal coming from the contralateral CN is presented to the medial nucleus of the trapezoid body (MNTB), another nucleus of the SOC, which makes it inhibitory and presents it to the LSO. Also, the LSO receives an excitatory signal from the ipsilateral CN, and thus acts as a subtractor of patterns from the two ears. The opposite influence of the two ears makes these cells sensitive to ILD. It is also believed that cells from the LSO are involved in the extraction of ITDs by envelope coding [JY95].
The processing from MSO and LSO is then transmitted to the inferior colliculus (IC), where further processing takes place before transmission to the thalamus and the auditory cortex (AC).
Validity of interaural differences
Depending on frequency, ITD and ILD cues are more or less exploitable by the auditory system. At low frequencies, where the wavelength is large compared to the head radius, the sound wave is reflected to a negligible degree, and thus the ILD is almost nil (see note 4). As the wave frequency increases, the wavelength gets smaller with respect to the head radius, and the reflected part of the sound wave increases, until the wave is completely reflected at high frequencies. By definition, the ILD equals:
\[ \mathrm{ILD} = 20 \log_{10} \frac{|p_L|}{|p_R|} \]
where p_L and p_R are respectively the acoustic pressures on the left and right eardrums. At low frequencies, the ITD can equally be described as an interaural phase difference (IPD), approximated by [Kuh77]:
\[ \mathrm{IPD} \approx 3ka \sin\theta \]
with k = 2π/λ being the wave number, a the head radius (modeled as a rigid sphere), λ the wavelength, c the sound speed in air, and θ the source azimuth. Under the assumption that (ka)² ≪ 1, the IPD can be expressed as an ITD independent of frequency:
\[ \mathrm{ITD} = \frac{\mathrm{IPD}}{\omega} \approx 3 \frac{a}{c} \sin\theta \]
where ω is the angular frequency. Above 1500 Hz, the wavelengths fall below the interaural distance, which is about 23 cm, and thus delays between the ears can exceed a period of the wave and become ambiguous. Moreover, the phase-locking ability of the auditory system (that is its ability to encode the phase information in the auditory nerve and neurons) decreases with increasing stimulus frequency [PR86], and is limited to frequencies below 1.3 kHz [ZF56]. However, it has been shown [MP76] that when listening to a signal at one ear, and to an envelope-shifted version of it at the other ear, the ITD is still effective, even if the carrier is above 1500 Hz (but this ability decreases for frequencies above 4 kHz [MG91]). In that case, the ITD is well described by Woodworth’s model [WS54], which is independent of frequency:
\[ \mathrm{ITD} = \frac{a}{c} \left( \sin\theta + \theta \right) \]
Note 4: In the proximal region (i.e., within one meter of the listener’s head), however, the sound wave can no longer be treated as planar but as spherical. Because of the inverse relationship between sound pressure and distance, the difference in distance between each ear and the source then implies a significant difference in pressure between the ears, even if no head shadowing occurs [SSK00].
So, two mechanisms are involved in the integration of the ITD: phase delay at low frequencies, and envelope delay at higher frequencies. Note that formulas (1.8) and (1.9) assume a spherical head (whereas the head is actually more oval) and ears at ±90° (whereas they are actually a few degrees backward). Moreover, these models do not depend on elevation, which implies that cones of constant azimuth share the same ITD value. A more realistic model has been designed by Busson [Bus06].
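As a numerical illustration of the two ITD regimes, the sketch below evaluates the commonly quoted rigid-sphere forms: 3(a/c) sin θ for the low-frequency approximation attributed to [Kuh77], and (a/c)(θ + sin θ) for Woodworth’s model. It also gives the ILD in decibels and the wavelength used in the ambiguity argument. The head radius of 8.75 cm, the sound speed of 343 m/s and all function names are assumptions made for this example, not values taken from the text.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air (assumed value)
HEAD_RADIUS = 0.0875    # m, rigid-sphere head model (assumed value)

def wavelength(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength in metres of a tone at the given frequency."""
    return c / frequency_hz

def itd_low_frequency(azimuth_rad, a=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Low-frequency ITD in seconds, valid when (ka)^2 is much smaller than 1."""
    return 3.0 * (a / c) * math.sin(azimuth_rad)

def itd_woodworth(azimuth_rad, a=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Woodworth's frequency-independent ITD in seconds, for azimuths in [0, pi/2]."""
    return (a / c) * (azimuth_rad + math.sin(azimuth_rad))

def ild_db(p_left, p_right):
    """ILD in dB from the pressure amplitudes at the left and right eardrums."""
    return 20.0 * math.log10(abs(p_left) / abs(p_right))

if __name__ == "__main__":
    print(f"wavelength at 1500 Hz: {wavelength(1500.0) * 100:.1f} cm")
    for deg in (0, 30, 60, 90):
        theta = math.radians(deg)
        print(f"{deg:2d} deg: low-frequency {itd_low_frequency(theta) * 1e6:6.1f} us, "
              f"Woodworth {itd_woodworth(theta) * 1e6:6.1f} us")
```

Under these assumptions the low-frequency form always gives the larger delay for a lateral source, which matches the distinction between phase delay and envelope delay made above.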
The formulas above for ILD and ITD are physical descriptions of interaural cues, but the way these cues are integrated by the auditory system is still unclear. It is generally accepted, given the neural processing of interaural differences, that ITD and ILD are processed in narrow bands by the brain before being combined with information from other modalities (vision, vestibular system, etc.) to derive the position of the identified auditory objects (see section 1.5). Besides, additional results support a frequency-specific encoding of sound locations (see section 1.5.2). Such processing in narrow bands ensures the ability to localize concurrent sound sources with different spectra. Therefore, the integration of interaural differences is often simulated by processing the two input channels through filter banks and by deriving interaural differences within each pair of sub-bands [BvdPKS05].
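The sub-band processing described above can be sketched very roughly as follows. This is not the filter bank of [BvdPKS05]: for brevity, rectangular bands cut out of a single FFT stand in for an auditory filter bank, and only the ILD is derived in each band. The function name `subband_ild`, the band edges and the small constant guarding against empty bands are all assumptions of this example.

```python
import numpy as np

def subband_ild(left, right, fs, band_edges_hz):
    """Return one ILD value (dB, left relative to right) per frequency band.

    band_edges_hz lists increasing band limits; band i spans
    [band_edges_hz[i], band_edges_hz[i + 1]).
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    power_l = np.abs(np.fft.rfft(left)) ** 2
    power_r = np.abs(np.fft.rfft(right)) ** 2
    freqs = np.fft.rfftfreq(len(left), d=1.0 / fs)
    eps = 1e-12  # avoids a division by zero in empty or silent bands
    ilds = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = (freqs >= low) & (freqs < high)
        energy_l = power_l[band].sum() + eps
        energy_r = power_r[band].sum() + eps
        ilds.append(10.0 * np.log10(energy_l / energy_r))
    return np.array(ilds)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    left = rng.standard_normal(2048)
    right = 0.5 * left  # right ear attenuated by a factor of two in every band
    print(subband_ild(left, right, 48000, [0, 500, 1500, 4000, 12000]))
```

A per-band ITD could be obtained in the same spirit by cross-correlating the band-limited signals of the two ears.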
Finally, the ITD and the ILD are complementary: in most cases, at low frequencies (below 1000 Hz), the ITD gives most of the information about lateralization; roughly above this threshold, as the ITD becomes ambiguous, the reliability of the ILD increases, and the ILD takes the lead above about 1500 Hz. Using synthetic and conflicting ITD and ILD, Wightman and Kistler [WK92] showed that the ITD is dominant for the low-frequency lateralization of a wide-band sound. However, for a sound without any low frequencies (below 2500 Hz), the ILD prevails. Gaik showed [Gai93] that conflicting cues induce localization artifacts and modifications of the perception of tone color.
Limitations of ITD and IID
Assuming the simple models of ILD and ITD described above, these two cues do not depend on frequency, and in particular do not depend on elevation either. Hence, particular loci, called “cones of confusion,” were introduced by Woodworth; they are centered on the interaural axis and correspond to an infinite number of positions for which the ITD and ILD are constant (see figure 1.24). Actually, these cones do not strictly make sense, and in reality would rather correspond to a set of points sharing the same ITD/ILD pair. Indeed, ITD and ILD are more complex than the simple models suggest, and measured iso-ITD or iso-ILD curves are not strictly cone-shaped (see figure 1.24b). In any case, the threshold necessary to detect small variations of position (see section 1.4.5) increases the number of points sharing the same ITD/ILD pair. The term “cones of confusion” can also refer to a single binaural cue, ITD or ILD, considering the iso-ITD or the iso-ILD curves only.
Figure 1.24: (a) Iso-ITD (left side) and iso-IID (right side) contours in the horizontal plane, in the proximal region of space. In the distal region, however, iso-ITD and iso-IID surfaces are similar. (After [SSK00].) (b) Superimposition of measured iso-ITD (in red, quite regular and roughly concentric, steps of 150 µs) and iso-ILD (in blue, irregular, steps of 10 dB) curves in the distal region. (Reprinted from [WK99].)
Thus, the duplex theory does not explain our discrimination ability along these “cones of confusion,” which implies a localization in elevation and in distance (also with an extracranial perception). The asymmetrical character of the ILD could be a front/back discrimination cue. However that may be, these limitations suggested the existence of other localization cues, the monaural spectral cues, which are discussed in the next section. It has also been shown that monaural cues intervene in localization in azimuth for some congenitally monaural listeners [SIM94], and thus might be used as well by normal listeners. In particular, monaural cues are believed to be used to reduce front/back confusions.
Localization in Elevation
So far, the presented localization cues, based on interaural differences, were not sufficient to explain the discrimination along cones of confusion. Monaural cues (or spectral cues) put forward an explanation based on the filtering of the sound wave of a source, due to reflections and diffractions by the torso, the shoulders, the head and the pinnae before reaching the tympanic membrane. The resulting colorations for each ear of the source spectra, depending on both direction and frequency, could be a localization cue.
Assuming that xL(t) and xR(t) are the signals at the left and right auditory canal inputs for a source signal x(t), this filtering can be modeled as:
\[ x_L(t) = h_L(t) * x(t), \qquad x_R(t) = h_R(t) * x(t) \]
where * denotes convolution, and hL and hR designate the impulse responses of the wave propagation from the source to the left and right auditory canals, and thus the previously mentioned filtering phenomena. Because the transfer functions from the auditory canals to the eardrums are direction-independent, they are not included in hL and hR. The equivalent frequency-domain filtering model is given by:
\[ X_L(f) = H_L(f)\, X(f), \qquad X_R(f) = H_R(f)\, X(f) \]
The HL and HR filters are called head-related transfer functions (HRTF), whereas hL and hR are called head-related impulse responses (HRIR). The postulate behind localization in elevation is that this filtering induces peaks and valleys in XL and XR (the resulting spectra of xL and xR) varying with the direction of the source as a “shape signature,” especially at high frequencies [LB02]. The auditory system would first learn these shape signatures, and then use this knowledge to associate a recognized shape with its corresponding direction (especially in elevation). Consequently, this localization cue requires a certain familiarity with the original source to be efficient, especially in the case of static sources with no head movements. When confusions of source position remain within cones of constant azimuth, due to similar spectral contents, the ambiguity can be resolved by left/right and up/down head movements [WK99]. These movements also improve localization performance [Wal40, TMR67]. Actually, a slow motion of the source in space is sufficient to increase localization performance, implying that knowledge of the relative movements is not necessary.
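The filtering model above, a convolution with an HRIR in the time domain or equivalently a multiplication by an HRTF in the frequency domain, can be checked numerically with a short sketch. The impulse responses used here are toy values chosen for the example (a pure delay and an attenuation), not measured HRIRs, and the function names are hypothetical.

```python
import numpy as np

def binaural_filter(x, h_left, h_right):
    """Time-domain model: convolve the source with each ear's impulse response."""
    return np.convolve(x, h_left), np.convolve(x, h_right)

def filter_in_frequency_domain(x, h):
    """Frequency-domain model: multiply the spectra, then transform back.

    Zero-padding both signals to len(x) + len(h) - 1 samples makes the
    circular convolution implied by the FFT equal to the linear one.
    """
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

if __name__ == "__main__":
    x = np.array([1.0, -2.0, 3.0, 0.5])
    h_left = np.array([1.0, 0.0, 0.0])   # toy HRIR: direct, unattenuated path
    h_right = np.array([0.0, 0.0, 0.5])  # toy HRIR: 2-sample delay, halved level
    x_left, x_right = binaural_filter(x, h_left, h_right)
    print(x_left, x_right)
    print(np.allclose(filter_in_frequency_domain(x, h_right), x_right))
```

With measured HRIRs the same two functions apply unchanged; only the arrays passed as `h_left` and `h_right` differ.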
Localization in Distance
The last point in localization deals with the remaining coordinate of the spherical system: the distance. The available localization cues are not very reliable, which is why our perception of distance is quite imprecise [Zah02]. Four cues are involved in distance perception [Rum01]. First, the perceived proximity increases with the source sound level.
Second, the ratio of direct-field to reverberant-field energy takes higher values for closer sources, and this ratio is assessed by the auditory system through the degree of coherence between the signals at the two ears. Third, high frequencies are attenuated by air absorption, so distant sources have less high-frequency content. And finally, for more distant sources, the delay between the arrival of the direct sound and that of the first floor reflections is smaller.
Apparent Source Width
The apparent source width (ASW) has been studied for the acoustics of concert halls and deals with how large a space a source appears to occupy from a sonic point of view. It is related to interaural coherence (IC) for binaural listening or to inter-channel coherence (ICC) for multichannel reproduction, which are defined as the maximum absolute value of the normalized cross-correlation between the left (xl) and right (xr) signals:
\[ \mathrm{IC} = \max_{\tau} \left| \frac{\sum_t x_l(t)\, x_r(t+\tau)}{\sqrt{\sum_t x_l^2(t) \sum_t x_r^2(t)}} \right| \]
When IC = 1, the signals are coherent, but may have a phase difference (ITD) or an intensity difference (ILD), and when IC = 0 the signals are independent. Blauert [Bla97] studied the ASW phenomenon with white noises and concluded that when IC = 1, the ASW is reduced and confined to the median axis; as IC decreases, the apparent width increases until the source splits up into two distinct sources for IC = 0.
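The coherence measure described above, the maximum absolute value of the normalized cross-correlation, can be sketched as follows. The function name `interaural_coherence`, the restriction of the search to a window of lags, and the normalization by the total energy of each signal are assumptions of this example.

```python
import numpy as np

def interaural_coherence(x_l, x_r, max_lag):
    """Maximum absolute normalized cross-correlation over lags in [-max_lag, max_lag].

    Both signals are assumed to have the same length, larger than max_lag.
    """
    x_l = np.asarray(x_l, dtype=float)
    x_r = np.asarray(x_r, dtype=float)
    norm = np.sqrt(np.dot(x_l, x_l) * np.dot(x_r, x_r))
    full = np.correlate(x_l, x_r, mode="full")
    zero_lag = len(x_r) - 1  # index of lag 0 in the "full" output
    window = full[zero_lag - max_lag: zero_lag + max_lag + 1]
    return float(np.max(np.abs(window)) / norm)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(4096)
    delayed = np.concatenate([np.zeros(5), noise[:-5]])  # same noise, 5 samples late
    independent = rng.standard_normal(4096)
    print(interaural_coherence(noise, noise, 48))        # identical signals
    print(interaural_coherence(noise, delayed, 48))      # coherent despite the delay
    print(interaural_coherence(noise, independent, 48))  # independent noises
```

The delayed copy stays close to full coherence because the search over lags compensates for the time shift, which is why a pure ITD does not reduce IC.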
The estimation by the auditory system of sound source attributes (such as loudness, pitch, spatial position, etc.) may differ to a greater or lesser extent from the real characteristics of the source. This is why one usually differentiates sound events (physical sound sources) from auditory events (sound sources as perceived by the listener) [Bla97]. Note that a one-to-one mapping does not necessarily exist between sound events and auditory events. The association between sound and auditory events is of particular interest in what is called auditory scene analysis (ASA, see section 1.3) [Bre94].
Localization performance covers two aspects. Localization error is the difference in position between a sound event and its (supposedly) associated auditory event, that is to say the accuracy with which the spatial position of a sound source is estimated by the auditory system. Localization blur is the smallest change in position of a sound event that leads to a change in position of the auditory event, and thereby is a measure of sensitivity. It reflects the extent to which the auditory system is able to spatially discriminate two positions of the same sound event, that is the auditory spatial resolution. When it characterizes the sensitivity to an angular displacement (either in azimuth or in elevation), the localization blur is sometimes expressed as a minimum audible angle (MAA). MAAs in azimuth constitute the main topic of this thesis and are especially studied in chapter 3. We will see in the following that both localization error and localization blur mainly depend on two parameters characterizing the sound event: its position and its spectral content.
Localization errors were first studied by Lord Rayleigh [Ray07] using vibrating tuning forks, after which several studies followed. Concerning localization in azimuth, studies [Pre66, HS70] depicted in figure 1.25 using white noise pulses have shown that localization error is the smallest in the front and back (about 1°), and much greater for lateral sound sources (about 10°). Carlile et al. [CLH97] reported similar trends using broadband noise bursts.
Figure 1.25: Localization error and localization blur in the horizontal plane with white noise pulses [Pre66, HS70]. (After [Bla97].)
azimuth is almost independent of elevation. Localization blur in azimuth follows the same trend as localization error (and is also depicted in figure 1.25), but is slightly worse in the back (5.5°) compared to frontal sources (3.6°), and reaches 10° for sources on the sides. Perrott [PS90], using click trains, got a smaller mean localization blur of 0.97°. The frequency dependence of localization error in azimuth can be found for example in [SN36], where maximum localization errors were found around 3 kHz using pure tones, these values declining for lower and higher frequencies. This type of shape for localization performance in azimuth as a function of frequency, showing the largest values at mid frequencies, is characteristic of both localization error and localization blur and is usually interpreted as an argument supporting the duplex theory. Indeed, in that frequency range (1.5 kHz to 3 kHz), the frequency is too high for phase locking to be effective (which is necessary for ITD), and the wavelength is too long for head shadowing to be efficient (which is necessary for ILD), thereby reducing the available localization cues [MG91]. The frequency dependence of localization blur has been studied by Mills [Mil58] (see figure 1.26) and varies between 1° and 3° for frontal sources. Boerger [Boe65] got similar results using Gaussian tone bursts of critical bandwidth. Substantial front/back confusions for pure tones, compared to broadband stimuli, are reported in [SN36]. Moreover, the study from Carlile et al. [CLH97] confirms that only a few front/back confusions are found with broadband noise. This is a characteristic result concerning narrow-band signals, given that monaural cues are poor in such cases and cannot help discriminate the front from the back by signal filtering.
Concerning localization error in elevation, several studies report worse performance than in azimuth. Carlile et al. [CLH97] reported 4° on average with broadband noise bursts. Damaske and Wagener [DW69], using continuous familiar speech, reported localization error and localization blur in the median plane that increase with elevation (see figure 1.27). Oldfield and Parker [OP84], however, report an error independent of the elevation. Blauert [Bla70] reported a localization blur of about 17° for forward sources, which is much more than Damaske and Wagener’s estimation (9°), but using unfamiliar speech.
Figure 1.26: Frequency dependence of localization blur in azimuth (expressed here as a “minimum audible angle”) using pure tones, as a function of the sound source azimuth position θ. (After [Mil58].)
Figure 1.27: Localization error and localization blur in the median plane using familiar speech [DW69]. (After [Bla97].)
Figure 1.28: Localization error and localization blur for distances using impulsive sounds [Hau69]. (After [Bla97].)
obtain optimal localization performance in elevation. Perrott [PS90], using click trains, got a mean localization blur of 3.65°. He also measured localization blur for oblique planes, and reported values below 1.24° as long as the plane is rotated more than 10° away from the vertical plane. Grantham [GHE03], on the contrary, reported a higher localization blur for a 60° oblique plane than for the horizontal plane, but still a lower blur than for the vertical plane. Blauert [Bla68] brought to light an interesting phenomenon concerning localization in the median plane of narrow-band signals (bandwidth of less than 2/3 octave): the direction of the auditory event does not depend on the direction of the sound event, but only on the frequency of the signal. Once again, this is explained by the fact that monaural cues are non-existent for such narrow-band signals.
Finally, studies dealing with localization in distance suggest that performance depends on the familiarity of the subject with the signal. Gardner [Gar69] studied localization in the range of distance from 0.9 to 9 m with a human speaker whispering, speaking normally, and calling out loudly. For normal speaking, performance is excellent, whereas distances for whispering and calling out voices are under- and over-estimated, respectively. Good performance for the same range of distances was reported by Haustein [Hau69] using impulsive sounds (see figure 1.28), but using test signals that were demonstrated beforehand from different distances. On the contrary, Zahorik [Zah02] reports a compression phenomenon: sources closer than one meter are over-estimated, whereas far-away sources are under-estimated.
Besides the sensitivity to a physical displacement of a sound source, which is characterized by the notion of localization blur, some research has studied the sensitivity to the cues underlying the localization process, namely ITD, ILD, IC, and spectral cues. In this section, we will focus on results concerning the sensitivity to changes in the binaural cues only (ITD, ILD and IC). This can be measured by manipulating localization cues to generate a pair of signals (specific to each ear) and playing them over headphones to a listener. For instance, an artificial ITD can be introduced between the two signals to test the smallest variation of ITD a listener is able to detect, i.e., the just-noticeable difference (JND) of ITD. The sensitivity to a given cue can potentially depend on the following main parameters: the reference value of this cue (the initial value from which the sensitivity to a slight variation is tested), the frequency (content) of the stimulus, the level of the stimulus (apart from the potential presence of an ILD), and the actual values of the other localization cues.
For stimuli with frequencies below 1.3 kHz, the ITD can be described as a phase difference (IPD, see section 1.4), and for a null reference IPD, the JND of IPD does not depend on frequency and equals about 0.05 rad [KE56]. This sensitivity tends to increase as the reference IPD increases [Yos74, HD69]. There does not seem to be an effect of the stimulus level on the JND of ITD [ZF56], although there is a decrease in sensitivity for very low levels [HD69]. Finally, the sensitivity to a change of ITD decreases with an increasing ILD [HD69].
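To make the 0.05 rad figure more concrete, it can be converted into a time difference. The sketch below assumes a pure tone, for which the standard relation IPD = 2πf·ITD holds; the function names are illustrative only.

```python
import math


def ipd_to_itd(ipd_rad, freq_hz):
    """Convert an interaural phase difference (rad) to a time difference (s).

    Assumes a pure tone of frequency freq_hz, so that IPD = 2*pi*f*ITD.
    """
    return ipd_rad / (2.0 * math.pi * freq_hz)


def itd_to_ipd(itd_s, freq_hz):
    """Inverse conversion, under the same pure-tone assumption."""
    return 2.0 * math.pi * freq_hz * itd_s


# A constant JND of IPD implies a JND of ITD that shrinks with frequency.
for freq_hz in (250.0, 500.0, 1000.0):
    itd_us = ipd_to_itd(0.05, freq_hz) * 1e6
    print(f"{freq_hz:6.0f} Hz: JND of ITD of about {itd_us:.1f} microseconds")
```

Under this assumption, a JND of 0.05 rad corresponds to roughly 16 microseconds at 500 Hz, and to half that value at 1 kHz.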
For a null reference ILD, the JND of ILD is relatively independent of stimulus level [Yos72] (except again for very low levels [HD69]) and of stimulus frequency [Gra84]. As the reference ILD increases, the sensitivity to a change of ILD decreases [Mil60, RT67, YH87]: from between 0.5 and 1 dB for a reference ILD of 0 dB, to between 1.5 and 2 dB for a reference ILD of 15 dB. The dependence of ILD sensitivity on the reference ITD is not clear, since Yost [Yos72] reported that this sensitivity increases when the ITD increases, whereas no dependence is reported in [HD69].
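As a rough illustration of how these figures might be used, the sketch below estimates the JND of ILD for an intermediate reference ILD. Only the two end points come from the ranges quoted above (their midpoints, 0.75 dB and 1.75 dB); the linear interpolation between them, the clamping outside 0 to 15 dB, and the function names are assumptions made for this example, not results from the cited studies.

```python
def ild_jnd_db(reference_ild_db):
    """Estimate the JND of ILD (dB) for a given reference ILD (dB).

    Hypothetical model: linear interpolation between the midpoints of
    the ranges quoted in the text (0.75 dB at 0 dB, 1.75 dB at 15 dB).
    The magnitude of the reference is used and clamped to [0, 15] dB.
    """
    jnd_at_0_db = 0.75    # midpoint of 0.5 to 1 dB
    jnd_at_15_db = 1.75   # midpoint of 1.5 to 2 dB
    ref = min(max(abs(reference_ild_db), 0.0), 15.0)
    return jnd_at_0_db + (jnd_at_15_db - jnd_at_0_db) * ref / 15.0


def is_ild_change_detectable(reference_ild_db, delta_db):
    """True if a change of delta_db exceeds the estimated JND."""
    return abs(delta_db) >= ild_jnd_db(reference_ild_db)


print(ild_jnd_db(7.5))                      # estimate halfway between end points
print(is_ild_change_detectable(0.0, 1.0))   # 1 dB change around a 0 dB reference
print(is_ild_change_detectable(15.0, 1.0))  # same change around a 15 dB reference
```

The point of the sketch is only that the same ILD change can be audible around a centered reference and inaudible around a strongly lateralized one.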
Concerning the JND of IC, a strong dependence on the reference IC has been shown [RJ63, LJ64, GC81, CCS01]: from 0.002 for a reference IC of +1, to a necessary variation about 100 times larger for a reference IC of 0. The sensitivity to a change of IC does not depend on stimulus level [HH84], except for low levels.
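Stimuli with a controlled reference IC are commonly built by mixing independent noises. The sketch below shows one such construction; it is not necessarily the method used in the cited studies. The function names are illustrative, and the coherence is approximated here by the normalized correlation at zero lag, which is a simplification of the usual definition of IC.

```python
import math
import random


def coherent_noise_pair(target_ic, n_samples, seed=0):
    """Return (left, right) Gaussian noises with a target correlation.

    The right signal mixes the left noise with an independent noise:
    right = c * n1 + sqrt(1 - c^2) * n2, which has unit variance and an
    expected correlation of c with the left signal.
    """
    if not -1.0 <= target_ic <= 1.0:
        raise ValueError("target_ic must lie in [-1, 1]")
    rng = random.Random(seed)
    mix = math.sqrt(1.0 - target_ic ** 2)
    left, right = [], []
    for _ in range(n_samples):
        n1 = rng.gauss(0.0, 1.0)
        n2 = rng.gauss(0.0, 1.0)
        left.append(n1)
        right.append(target_ic * n1 + mix * n2)
    return left, right


def zero_lag_correlation(left, right):
    """Normalized cross-correlation at lag zero (signals assumed zero-mean)."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den


for target in (1.0, 0.5, 0.0):
    left, right = coherent_noise_pair(target, 20000)
    print(target, round(zero_lag_correlation(left, right), 3))
```

With a finite number of samples, the measured value only approximates the target, which matters when the increment under test is as small as the 0.002 reported near an IC of +1.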
Table of contents:
1 Background on Hearing
1.1 Ear Physiology
1.1.1 Body and Outer Ear
1.1.2 Middle Ear
1.1.3 Inner Ear
1.2 Integration of Sound Pressure Level
1.2.1 Hearing Area and Loudness
1.2.2 Temporal and Frequency Masking Phenomena
1.2.3 Critical Bands
1.3 Auditory Scene Analysis
1.3.1 General Acoustic Regularities used for Primitive Segregation
1.3.2 Apparent Motion and Auditory Streaming
1.3.3 The Principle of Exclusive Allocation
1.3.4 The Phenomenon of Closure
1.3.5 Forces of Attraction
1.4 Spatial Hearing
1.4.1 Localization in Azimuth
1.4.2 Localization in Elevation
1.4.3 Localization in Distance
1.4.4 Apparent Source Width
1.4.5 Localization Performance
1.4.6 Binaural Unmasking
1.5 Localization Cues in Auditory Scene Analysis
1.5.1 Sequential Integration
1.5.2 Simultaneous Integration
1.5.3 Interactions Between Sequential and Simultaneous Integrations
1.5.4 Speech-Sound Schemata
2 State-of-the-Art of Spatial Audio Coding
2.1 Representation of Spatial Audio
2.1.1 Waveform Digitization
2.1.2 Higher-Order Ambisonics
2.2 Coding of Monophonic Audio Signals
2.2.1 Lossless Coding
2.2.2 Lossy Coding
2.3 Lossless Matrixing
2.3.1 Mid/Side Stereo Coding
2.3.2 Meridian Lossless Packing
2.4 Lossy Matrixing
2.4.1 Perceptual Mid/Side Stereo Coding
2.4.2 Matrix Encoding
2.4.3 Matrixing Based on Channel Covariance
2.5 Parametric Spatial Audio Coding
2.5.1 Extraction of the Spatial Parameters
2.5.2 Computation of the Downmix Signal
2.5.3 Spatial Synthesis
2.5.4 Quantization of the Spatial Parameters
3 Spatial Blurring
3.2 Terms and Definitions
3.3 Paradigm for MAA Assessment
3.5 Subjects, Rooms and General Setup
3.6 Experiment 1: Spatial Blurring From One Distracter
3.6.3 Data Analysis and Results
3.7 Experiment 2: Effect of the Signal-to-Noise Ratio
3.7.4 Adaptive Method Setup
3.7.5 Data Analysis and Results
3.8 Experiment 3: Effect of the Distracter Position
3.8.3 Data Analysis and Results
3.8.4 Validity of Our Experimental Protocol
3.9 Experiment 4: Interaction Between Multiple Distracters
3.9.3 Data Analysis and Results
3.10 Summary and Conclusions
4 Towards a Model of Spatial Blurring and Localization Blur
4.2 Formalism and Overview
4.3 Computation of Masking Thresholds
4.4 Reference Value of Spatial Blurring
4.5 Accounting for the Effect of SNR
4.6 Additivity of Distracters
4.7 Resulting Localization Blur
4.8 Simplification of the Model
5 Multichannel Audio Coding Based on Spatial Blurring
5.1 Dynamic Bit Allocation in Parametric Schemes
5.1.1 Principle Overview
5.1.2 Use of our Psychoacoustic Model of Spatial Blurring
5.1.3 Bit Allocation of the Spatial Parameters
5.1.4 Transmission and Bitstream Unpacking
5.1.5 Informal Listening
5.2 Dynamic Truncation of the HOA Order
5.2.1 Spatial Distortions Resulting from Truncation
5.2.2 Principle Overview
5.2.3 Modes Of Operation
5.2.4 Time-Frequency Transform
5.2.5 Spatial Projection
5.2.6 Spatio-Frequency Analysis
5.2.7 Psychoacoustic Model
5.2.8 Space Partitioning
5.2.9 Space Decomposition
5.2.10 HOA Order Truncation
5.2.11 Bit-Quantization by Simultaneous Masking
5.2.12 Bitstream Generation