Methods for producing individualised HRTFs

auditory cues to sound location

The human auditory system has adapted to make use of the most salient acoustic cues from sound sources in our environment. Unlike other spatial senses such as vision, where there is a topographic projection from receptor epithelia into the central nervous system, the auditory system encodes the amplitude of the energy entering the ears as a function of frequency. Differences between the energy at the two ears are translated into information about the sound source's location. As explained by Blauert (1997), "… the system does not use every detail of the complicated interaural dissimilarities, but rather derives what information is needed from definite, easily recognisable attributes."
It is also important to note that the auditory system uses a variety of different auditory cues depending on the environment (a reverberant room, for example), as discussed below and detailed in chapter 4.
Figure 1: Representation of the two coordinate systems used in the current body of work: (a) hoop coordinate system, and (b) lateral/polar coordinate system. Red circles represent the position directly in front of the listener, at azimuth 0° and elevation 0°, included in the figure as a reference. The larger blue circle at the centre of the sphere represents the position of the listener's head.

Dominant interaural cues

The Interaural Time Differences (ITDs) and Interaural Level Differences (ILDs) between the two ears for a single sound source in space are two crucial cues for determining location. ITDs refer to the difference in travel time of incident sound waves between the two ears for sound sources that are not on the midline (i.e. not at a point equidistant from both ears). For example, if a sound were to originate to the right of a listener, the incident waves would arrive at the right ear before the left ear. Input from both ears meets at a structure along the auditory pathway known as the superior olivary complex, which is sensitive to small time differences. The normal human threshold for detection of an ITD is a difference as small as 10 µs, with a relatively large degree of variance between individuals. Experiments conducted using a sphere to model the shape of the head, with a distance of approximately 22-23 cm between the two ears, measured maximum ITDs of approximately 660 µs (Woodworth and Schlosberg, 1965).
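For illustration, the approximate origin of the 660 µs figure can be reproduced with Woodworth's classic spherical-head formula, ITD = (r/c)(θ + sin θ). The following is a minimal sketch; the head radius and speed of sound are assumed round values, not parameters taken from the cited study:

```python
import numpy as np

# Woodworth's spherical-head approximation of the ITD for a far-field
# source at lateral angle theta from the median plane:
#   ITD = (r / c) * (theta + sin(theta))
# Assumed round values: head radius r = 8.75 cm, speed of sound c = 343 m/s.
HEAD_RADIUS_M = 0.0875
SPEED_OF_SOUND = 343.0  # m/s

def woodworth_itd(theta_rad):
    """Model ITD in seconds for a far-field source at angle theta_rad."""
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta_rad + np.sin(theta_rad))

# The maximum occurs for a source on the interaural axis (theta = 90 degrees):
print(f"{woodworth_itd(np.pi / 2) * 1e6:.0f} us")  # ~656 us, close to the cited 660 us
```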
Studies measuring the accuracy of listeners' judgements of sound source location (localisation tasks) using different stimuli suggest that, whilst frequency-dependent interaural phase differences are detectable at low frequencies (Zwislocki and Feldman, 1956; Palmer and Russell, 1986), subjects are insensitive to them when an ITD is maintained (Kulkarni et al., 1999). Similar tests have shown that ITDs are probably encoded by mostly low-frequency auditory neurons (Middlebrooks and Green, 1990) for frequencies below about 2 kHz (Blauert, 1997). ITDs are also known to dominate ILD cues for broadband stimuli (Wightman and Kistler, 1992).
ILDs are caused by the absorption of energy, primarily by the head but also by the body, for sound sources off the midline, which produces a shadowing of the farther ear. At low frequencies, where the wavelength of sound approaches or exceeds the distance between the listener's ears, the head does not diffract the incident waves; ILDs are then quite weak and thus not a salient feature for localisation. Experiments with spherical head models, using sounds with wavelengths much smaller than the sphere's diameter (i.e. the distance between the ears), have measured a maximum ILD of 6 dB for a sound source positioned along the interaural axis (Shaw, 1974). The minimum thresholds for ILDs are less than 1 dB (Mills, 1960). The auditory system most probably integrates level differences (and time differences) over discrete frequency channels, using the most salient and easily recognisable features available (Macpherson and Middlebrooks, 2002).
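For reference, the frequency-dependent ILD can be read directly off a measured HRTF pair. The following is a minimal sketch, assuming complex left and right transfer functions sampled on a common frequency grid (the function name and sign convention are illustrative, not from the thesis):

```python
import numpy as np

def ild_db(h_left, h_right, eps=1e-12):
    """Frequency-dependent ILD in dB between left- and right-ear HRTFs
    (complex spectra on the same frequency grid). Positive values mean
    the left ear receives more energy; eps guards against log(0) at
    deep spectral notches."""
    return 20.0 * np.log10((np.abs(h_left) + eps) / (np.abs(h_right) + eps))
```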

Spectral cues

The interaural cues described previously are used by the auditory system to estimate a sound source's position in space. However, ITDs and ILDs alone do not provide enough information for localisation in three-dimensional space: a specific interaural difference describes not a single point but every position in space whose distances to the two ears differ by the same amount. These positions form a surface in the shape of a hyperboloid (a hyperbola rotated around the interaural axis), the so-called cone of confusion, as represented in figure 2 (see Katz et al., 2005, for an analysis of cone of confusion forms).
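The geometry behind the cone of confusion can be made explicit. Placing the ears at foci $(\pm a, 0, 0)$ on the interaural axis, the set of points whose distances to the two ears differ by a fixed amount $\Delta = c \cdot \mathrm{ITD}$ is, for $k = \Delta/2 < a$, one sheet of a hyperboloid of revolution:

\[
\frac{x^2}{k^2} - \frac{y^2 + z^2}{a^2 - k^2} = 1, \qquad k = \frac{c \cdot \mathrm{ITD}}{2}.
\]

In the far field this sheet approaches its asymptotic cone, which is why the surface is conventionally described as a cone rather than a hyperboloid.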
If any position lying on a cone of confusion produces the same interaural cues, the auditory system needs more information in order to resolve the ambiguity. The additional spectral cue comes from the interaction of sounds with the external ear, namely the pinna. Reflections of sound waves within the folds of the pinna are an ideal candidate for producing the additional spectral information: because of the asymmetrical shape of the ear, the reflections create a filter for sound sources that depends on their position in space. A change in the reflection path, determined by a change in the position of the sound source, alters the spectral features of the filter. This position-dependent information is what codes for the location of a sound source on the cone of confusion (defined by the interaural cues), and is important for determining elevation (up-down) as well as for coding whether the sound is coming from in front of or behind the listener (Wightman and Kistler, 1999). Spectral cues can be thought of as mostly monaural, as the majority of localisation studies have found no evidence for an interaction between the signals at the two ears (Carlile et al., 2005). There is, however, some work suggesting that the auditory system codes for differences between the spectral cues at the two ears for specific regions in space (Jin et al., 2004).
An example of the mentioned spectral cues, for a selection of positions in space for the left ear of the author, is shown in figure 3 as a function of frequency. The locations are all along the midline. Each location is labelled using azimuth and elevation angles in the form (azimuth, elevation). From top to bottom in the figure, locations run along an arc from below and in front of the listener to below and behind the listener. The figure shows that the spectral features of the filters vary with elevation, particularly for frequencies above about 3 kHz; the ear is simply too small to interact with the longer wavelengths of lower frequencies. Also shown are the effects of the head and torso, which are evident at lower frequencies. The frequencies are displayed on a logarithmic scale, as this is a good approximation of the resolution of the basilar membrane at different frequencies. Magnitude is also presented on a logarithmic scale, as this approximates how the auditory system interprets differences in loudness. The variations in the magnitude spectrum provide salient cues to location, given that humans have a just-noticeable difference for pure tones of about 1 dB (Zwislocki and Feldman, 1956).
The variations in the filtering effects of the listener's morphology are crucial to coding for the sound source's location in space. Figure 4(a) shows the detail of these variations in the left-ear filter imposed by the listener's morphology for all positions in front of and to the left of the listener (azimuth locations from 0° to -150° at an elevation of 0°). The colour contours of the two charts represent the magnitude of the spectrum in decibels. Each position in space is represented horizontally on the charts, i.e. each horizontal slice represents the variations in the gain of the filter. The auditory system learns the colouration of the spectrum at different positions as a signature for location and compares the filtering effects of sound sources with those stored in memory. In comparison to the filter colouration, figure 4(b) displays the level of detail that is actually registered by the auditory system when taking into account the effects of what is termed the cochlear filter; this filter is distinct from the one caused by the pinna and represents the frequency dependence of auditory sensitivity (Carlile and Pralong, 1994). Despite the reduction in detail between the two plots in figure 4, the change in spectral cues as a function of azimuth is still clear in figure 4(b): different horizontal slices will always be perceptually different due to the different colourations across frequencies. A detailed description of the different known spectral cues and their suspected purpose will be provided in chapter 4.

Minimum-phase and pure delay

For the implementation of binaural synthesis using DTFs, the most common approach for generating the complex transfer function is to use a minimum-phase filter and a pure delay. A minimum-phase spectrum provides a Head-Related Impulse Response (HRIR) with the same magnitude spectrum as the original HRTF, but with the energy redistributed to a single main impulse. A pure delay is assigned to the minimum-phase filter and is a coarse approximation of the HRTF phase, since it does not depend on frequency. The use of such an approximation is justified as long as the low-frequency Interaural Time Difference (ITD) information is available; the human auditory system is not sensitive to interaural HRTF phase spectra as long as the overall ITD of the low-frequency components provides a reliable cue (Kulkarni et al., 1999).
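A minimal sketch of one standard way to compute the minimum-phase part is the real-cepstrum (homomorphic) method shown below; the function name and the numerical floor on the magnitude are illustrative choices, not taken from the thesis:

```python
import numpy as np

def minimum_phase_hrir(hrir):
    """Minimum-phase reconstruction of an HRIR via the real cepstrum.

    Keeps the magnitude spectrum of the input but replaces its phase with
    the minimum-phase response, concentrating the energy into a single
    main impulse at the start of the filter.
    """
    n = len(hrir)
    # Log-magnitude spectrum; the floor avoids log(0) at deep spectral notches.
    log_mag = np.log(np.maximum(np.abs(np.fft.fft(hrir)), 1e-12))
    cepstrum = np.fft.ifft(log_mag).real
    # Fold the anti-causal half of the cepstrum onto the causal half.
    window = np.zeros(n)
    window[0] = 1.0
    window[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        window[n // 2] = 1.0
    min_phase_spectrum = np.exp(np.fft.fft(window * cepstrum))
    return np.fft.ifft(min_phase_spectrum).real
```

The pure delay for each ear can then be reintroduced, for instance by prepending zeros according to a broadband ITD estimate such as the cross-correlation method described next.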
It has also been shown that localisation accuracy for virtual sound sources is not affected by modelling HRTFs as minimum-phase filters and pure delays for most locations in space (Kistler and Wightman, 1992). Minimum-phase modelling works well for frontal sources with relatively small ITDs, but not for sound sources near the interaural axis and behind the listener (Katz et al., 2005). There are various techniques in the literature for estimating the pure delay component for a minimum-phase representation of HRTFs (see Nicol, 2010, for a review). Probably the most common method, and the technique of choice in this body of work, is the maximum of the interaural cross-correlation between the left and right HRIRs. This method specifically measures the time shift between the HRIR envelopes and resembles the mechanism used by the auditory system to determine ITDs, which makes it particularly attractive.
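As a sketch of this estimator, assuming left and right HRIRs at a common sample rate and taking the envelopes as Hilbert-transform magnitudes (names and sign convention are illustrative):

```python
import numpy as np
from scipy.signal import hilbert

def estimate_itd_xcorr(hrir_left, hrir_right, fs):
    """Broadband ITD estimate: the lag maximising the cross-correlation
    between the left and right HRIR envelopes.

    Returns the ITD in seconds; a positive value means the left-ear
    response lags the right-ear response (source on the listener's right).
    """
    env_left = np.abs(hilbert(hrir_left))
    env_right = np.abs(hilbert(hrir_right))
    xcorr = np.correlate(env_left, env_right, mode="full")
    lag_samples = int(np.argmax(xcorr)) - (len(env_right) - 1)
    return lag_samples / fs
```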

Smoothing of HRTF spectrum

Smoothing the irregularities in the HRTF magnitude spectrum is also a common final processing step. Because the frequency resolution of the auditory system approximates a logarithmic scale, much of the fine detail in the magnitude of the recorded HRTF is not perceived. The cochlear filter removes many features that may have been present in the initial recording but will not be encoded by the auditory system (see Carlile and Pralong, 1994, and figure 4 from section 2.2.2 in the previous chapter). In fact, spectral cues are robust, in terms of localisation accuracy, to smoothing that results in a loss of frequency resolution far beyond that imposed by the auditory filtering (see Kulkarni and Colburn, 1998; Macpherson and Middlebrooks, 2003).
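One common realisation of such smoothing is fractional-octave averaging of the magnitude spectrum, which mimics the roughly logarithmic frequency resolution of hearing. The sketch below uses an illustrative 1/3-octave width, not necessarily the resolution used in the studies cited above:

```python
import numpy as np

def fractional_octave_smooth(magnitude, freqs, fraction=3):
    """Smooth a magnitude spectrum with a window of constant relative
    bandwidth (1/fraction octave), so the averaging window widens in
    proportion to frequency."""
    half_width = 2.0 ** (1.0 / (2.0 * fraction))  # half-window as a frequency ratio
    smoothed = np.empty_like(magnitude, dtype=float)
    for i, f in enumerate(freqs):
        if f <= 0.0:
            smoothed[i] = magnitude[i]  # leave the DC bin untouched
            continue
        in_band = (freqs >= f / half_width) & (freqs <= f * half_width)
        smoothed[i] = magnitude[in_band].mean()
    return smoothed
```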

headphones in binaural synthesis

Headphones play a crucial role in effectively rendering sound sources in VAS. There exists an interaction between the listener's outer ear and the signal from the transducer of a headphone that resembles, to some degree, the interaction with sound waves from a point source in space that is embodied in the HRTF. In addition, headphones have their own spectral characteristics that can influence the naturalness of a binaural synthesis due to colouration. Chapter 6 studies these interactions and the characteristics of headphones in detail using a localisation task. The following sections provide a background to the role of headphones in binaural synthesis.

Choice of headphone type

In real-world environments, as opposed to laboratory conditions, there are a wide variety of headphones used by listeners. Each headphone has its own characteristics that affect the sound produced. The most obvious difference between headphones is their physical ability to reproduce all frequencies in the audible range. For example, headphones that are inserted into the ear have smaller membranes at the transducer that impose limitations at lower frequencies. More specifically, for a binaural synthesis there are particular frequency regions in the spectrum of the signals presented to the listener that are more important than others and can be emphasised, as will be described in chapter 4 and analysed in further detail in chapter 8. If a headphone does not effectively produce any signal above a certain frequency, say in a region crucial to communicating spectral cues for elevation judgements, sound sources in VAS may appear poorly defined in terms of their perceived position in space. Ideally, headphones will produce all audible frequencies at the same level, known as a flat frequency response, so that the filtering effects of the HRTF imposed via the binaural synthesis are not affected. In reality, in order to compensate for the physical limitations of a transducer, many headphone manufacturers choose to produce frequency responses that emphasise some frequencies over others, giving the headphones their own spectral flavour and identity.

Variations in frequency response

The ability of headphones to reproduce a signal is determined by the frequency response of the hardware itself, which is for the most part determined by the manufacturer. The most comprehensive investigation of a variety of commercially available headphones was performed by Møller et al. (1995a). In that study, headphones were organised into three main categories: supra-aural, in which the headphone rests on the ear; circumaural, in which the headphone completely covers the ear, making contact only with the head; and free-from-ear, in which the headphone makes no contact with the listener's head or ear but sits close to the entrance of the ear canal. Two more categories could be added to this list in order to encompass the majority of the different types of headphones on the market. The first is the intra-aural, in which the headphone is small enough to be placed inside the ear canal of the listener, and the second is bone conduction, in which vibrations against the bones connected to the inner ear transmit sound to the cochlea. The latter has been used for augmented reality due to certain advantages it affords, such as leaving the ear completely unobscured so that environmental sounds can be heard (Walker et al., 2007).
In the study by Møller et al. (1995a), the frequency response of the headphones was measured in much the same way as HRTFs are: a small microphone was placed at the entrance of the ear canal, a broadband stimulus was presented from the headphones worn by different listeners, and the response recorded. Results from the comparison of 14 different headphones belonging to the first three categories demonstrated significant differences between the frequency responses of the headphones. In general, the headphone responses were not flat, showing large fluctuations with frequency. The frequency responses were smooth up to approximately 3 kHz; above this range, responses were characterised by large peaks and notches, similar to those found in subject HRTFs. Kulkarni and Colburn (2000) have shown that these spectral features can be of similar magnitude and bandwidth as those in HRTFs.
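As an illustration of the measurement principle (not the exact processing chain of Møller et al.), the transfer function can be estimated from such a recording by regularised spectral division of the recorded signal by the known stimulus; the function name and regularisation constant below are illustrative:

```python
import numpy as np

def estimate_response(recorded, stimulus, eps=1e-8):
    """Estimate a headphone (or HRTF-style) transfer function from a
    known broadband stimulus x and the signal y recorded at the entrance
    of the ear canal: H = Y X* / (|X|^2 + eps).

    The small constant eps regularises the division at frequencies where
    the stimulus carries little energy.
    """
    n = len(recorded) + len(stimulus) - 1  # zero-pad for linear (not circular) deconvolution
    X = np.fft.rfft(stimulus, n)
    Y = np.fft.rfft(recorded, n)
    return Y * np.conj(X) / (np.abs(X) ** 2 + eps)
```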
A significant amount of the variation between headphones can be explained by resonances forming inside the headphone cavity, particularly for the supra-aural and circumaural types, as shown by Xie et al. (2009). In their study, the frequency responses of two circumaural headphones and one research-grade intra-aural headphone (the same as used in the study described in chapter 6) were measured using a dummy head with small microphones built into the ear canals. The intra-aural headphone displayed less dramatic peaks and notches in its frequency response than the other headphones tested. The two circumaural headphones displayed vastly different frequency responses.

Table of contents

i background and literature review 
1 general introduction 
2 human auditory perception 
2.1 Coordinate system
2.2 Auditory cues to sound location
2.2.1 Dominant interaural cues
2.2.2 Spectral cues
2.2.3 Distance cues
2.3 The Head-Related Transfer Function (HRTF)
3 binaural synthesis 
3.1 Measuring HRTFs
3.2 Processing HRTFs
3.2.1 Equalisation
3.2.2 Minimum-phase and pure delay
3.2.3 Smoothing of HRTF spectrum
3.3 Headphones in binaural synthesis
3.3.1 Choice of headphone type
3.3.2 Variations in frequency response
3.3.3 Headphone transfer function
3.3.4 Headphone equalisation
3.4 Interpolating HRTFs
3.5 Measuring the effectiveness of binaural synthesis
3.5.1 Localisation
3.5.2 Listening tests
3.5.3 Externalisation
4 spectral cues 
4.1 Frequency range of spectral cues
4.2 Monaural vs binaural spectral cues
4.3 Temporal and level factors
4.4 Role of spectral detail
4.5 Spectral features
4.5.1 Overt features
4.5.2 Covert features
4.5.3 Spectral cues using broadband models
4.6 Morphological influence on HRTFs
5 hrtf individualisation 
5.1 Using non-individualised HRTFs
5.2 Methods for producing individualised HRTFs
5.2.1 Reduced measurement sequences
5.2.2 Not requiring HRTF measurements on listener
5.2.3 Not requiring HRTF measurements
ii research work 
6 role of headphones in binaural synthesis 
6.1 Background
6.2 Method
6.2.1 Experimental procedure
6.2.2 Stimulus duration and level
6.2.3 LISTEN HRTF database
6.2.4 HRTF selection
6.2.5 Headphone types used
6.2.6 Headphone frequency responses
6.2.7 Headphone equalisation
6.2.8 Localisation task
6.2.9 Measurement of localisation accuracy
6.3 Results
6.3.1 Results of localisation tasks
6.3.2 Lateral angle errors
6.3.3 Polar angle errors
6.3.4 Global measures of localisation accuracy
6.3.5 Effectiveness of the headphone equalisation
6.4 Discussion
6.5 Conclusion
7 perceptual judgements of hrtfs using listening tests 
7.1 Background
7.2 Outline
7.3 Listening Test 1
7.3.1 Listening Test 1 procedure
7.3.2 Listening Test 1 results
7.4 Listening Test 2
7.4.1 Listening Test 2 procedure
7.4.2 Listening Test 2 results
7.5 Listening Test 2.1
7.5.1 Listening Test 2.1 procedure
7.5.2 Listening Test 2.1 results
7.6 Listening Test 2.2
7.6.1 Listening Test 2.2 procedure
7.6.2 Listening Test 2.2 results
7.7 Listening Test 2.3
7.7.1 Listening Test 2.3 procedure
7.7.2 Listening Test 2.3 results
7.8 Listening Test 3
7.8.1 Listening Test 3 design
7.8.2 Listening Test 3 interface
7.8.3 Listening Test 3 results
7.9 Listening Test 3.1
7.9.1 Listening Test 3.1 procedure
7.9.2 Listening Test 3.1 results
7.10 Listening Test 3.2
7.10.1 Listening Test 3.2 subject categories
7.10.2 Listening Test 3.2 procedure
7.10.3 Listening Test 3.2 results
7.10.4 Listening Test 3.2 subjective reports
7.10.5 Listening Test 3.2 reproducibility of responses
7.10.6 Listening Test 3.2 subject expertise
7.10.7 Listening Test 3.2 analysis of judgement time
7.11 Discussion
7.12 Conclusion
8 salient spectral cues for binaural synthesis 
8.1 Background
8.2 Method
8.2.1 Database analysis
8.2.2 HRTF and morphology database
8.2.3 Subjects represented in multidimensional spaces
8.2.4 Validation of multidimensional spaces
8.3 Results
8.3.1 Validation of principal components
8.3.2 Statistical analysis
8.3.3 Optimal frequency range and dimensions
8.3.4 Inspection of principal components
8.3.5 Validation of optimised multidimensional spaces
8.3.6 Comparison of different multidimensional spaces
8.4 Discussion
8.5 Conclusion
9 significant morphological parameters for binaural synthesis 
9.1 Background
9.2 Prediction of subject location in multidimensional spaces
9.2.1 Method
9.2.2 Results
9.3 Machine learning
9.3.1 Decision trees
9.3.2 Method
9.3.3 Results
9.3.4 Support vector machines
9.3.5 Results
9.4 Modification of dummy head pinnae
9.4.1 Method
9.4.2 Results
9.5 Discussion
9.6 Conclusion
10 general conclusion 
10.1 Findings from the research
10.2 Potential applications of research
iii appendix 
a appendix a 
a.1 Listening Test 3.2 protocol
bibliography 
