Chapter 3. Signal processing and measurement techniques in voice analysis
This chapter examines three methods of measuring the speech signal, along with the signal processing methods used to analyse these measurements. Acoustic recordings form the baseline for voice research, so any such research needs to be considered in the context of the acoustic signal. To examine the mechanisms of voice production, two further techniques are discussed: laryngoscopy, which provides a video view of the vocal folds during phonation, and electroglottography (EGG), which uses the changing electrical resistance across the larynx to estimate vocal fold contact. Because all three methods examine the same overall physiological behaviour, they are interrelated, but for the purposes of this thesis the primary measurement used will be EGG. Although these methods can be applied to measuring voice and voice quality in general, all of them are used in this thesis to analyse age-related changes in voice quality.
3.1. Common speech measurement techniques
The acoustic waveform constitutes the speech signal that is passed from a speaker to a listener. It contains all of the information that a listener can infer through hearing alone. In research that aims to understand perceivable features of the speech signal, the particular features that can be perceived from the acoustic signal are important. This thesis primarily deals with changes in voice quality that occur with age, and these changes can often be perceived by listeners (Baken, 2005).
The acoustic speech signal is very rich in information, and ultimately is the most relevant in terms of understanding how we perceive speech. Researchers have been employing acoustic speech recordings in their research for decades and there exists a plethora of different microphones and mechanisms for recording speech signals.
Several measures based on the acoustic speech waveform have been developed over the years. Two of the simpler, common measures derived from acoustic recordings are the fundamental frequency (F0) and sound pressure level (SPL) (Schötz, 2007, De Cheveigne and Kawahara, 2001, Noll, 1967). Accurate calculation of the SPL does require the microphone used for the recordings to be calibrated (Winholtz and Titze, 1997).
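The dependence of SPL on calibration can be illustrated with a minimal sketch. It assumes the standard definition of SPL relative to a reference pressure of 20 micropascals; the function name `spl_db` and the calibration factor `pascals_per_unit` are hypothetical names introduced here for illustration and do not come from the cited work.

```python
import math

P_REF = 20e-6  # standard reference pressure for dB SPL, in pascals


def spl_db(samples, pascals_per_unit):
    """Return the level of a block of samples in dB SPL.

    pascals_per_unit is a hypothetical calibration factor converting raw
    sample values to pascals. Without it, only relative levels are meaningful.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    # Mean square pressure over the block, after conversion to pascals.
    mean_square = sum((s * pascals_per_unit) ** 2 for s in samples) / len(samples)
    if mean_square == 0.0:
        raise ValueError("a silent block has no finite level")
    return 10.0 * math.log10(mean_square / P_REF ** 2)


# Illustrative input: 100 cycles of a 1 kHz tone sampled at 48 kHz,
# with a peak of sqrt(2) units so that the RMS value is 1 unit.
tone = [math.sqrt(2.0) * math.sin(2.0 * math.pi * 1000.0 * n / 48000.0)
        for n in range(4800)]
level_if_one_unit_is_one_pascal = spl_db(tone, 1.0)
level_if_one_unit_is_half_pascal = spl_db(tone, 0.5)
```

The same recorded samples yield levels roughly 6 dB apart under the two assumed calibration factors, which is why an uncalibrated recording supports only relative comparisons of level.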
When measuring voice quality and stability, a common approach is to track the movement of F0 and SPL. Human hearing is sensitive to both pitch and loudness, so F0 and SPL provide quantitative measures of features that relate to perceivable differences in pitch and loudness. The standard deviations of these measures can give an indication of stability, but a more common approach is to use perturbation measures of F0 and SPL. A perturbation measure quantifies the random fluctuation of a given feature from cycle to cycle. In acoustic analysis, the cycles used are defined by the period of F0. The perturbation of F0 is referred to as jitter, while the perturbation of SPL is referred to as shimmer.
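As an illustration of a cycle-to-cycle perturbation measure, the sketch below uses one common "local" definition: the mean absolute difference between consecutive cycles, expressed as a percentage of the mean value. This is an assumption made for illustration; several other jitter and shimmer definitions exist, and the text above does not specify which is used. The function name `local_perturbation` is hypothetical, and the shimmer example uses per-cycle peak amplitudes as a stand-in for per-cycle level.

```python
def local_perturbation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean.

    values holds one measurement per glottal cycle, for example the cycle
    periods (giving a jitter figure) or the cycle peak amplitudes (giving a
    shimmer figure).
    """
    if len(values) < 2:
        raise ValueError("at least two cycles are required")
    mean_value = sum(values) / len(values)
    if mean_value == 0:
        raise ValueError("mean value must be non-zero")
    # Absolute differences between each cycle and the one that follows it.
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / mean_value


# Illustrative data only: periods in seconds and peak amplitudes in
# arbitrary units for four consecutive cycles.
periods = [0.0100, 0.0102, 0.0100, 0.0102]
amplitudes = [1.00, 0.95, 1.00, 0.95]
jitter_percent = local_perturbation(periods)
shimmer_percent = local_perturbation(amplitudes)
```

A perfectly regular voice would give zero for both figures, while larger values indicate less stable phonation. Because each value is tied to a single cycle, the result depends on how reliably the cycle boundaries were identified in the first place.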
Laryngoscopy is one of the most direct ways of observing vocal fold vibration during phonation. A view of the vocal folds is achieved through either a rigid scope inserted through the mouth, or a flexible scope usually inserted through the nose. Unlike EGG, laryngoscopy provides information about the changing glottal area, and allows examination of the extent of closure. In terms of video laryngoscopy, the two main approaches are high speed videoendoscopy and videostroboscopy (Olthoff et al., 2007, Deliyski and Hillman, 2010). While high speed videoendoscopy provides the most detail for analysis of cycle to cycle variations in vocal fold vibration, videostroboscopy is still useful for real time visualisation of vocal fold vibration with simultaneous audio playback. Videostroboscopy is also the most widely used laryngeal imaging technique in clinical settings (Mehta and Hillman, 2012).
Typical frame rates for high speed laryngoscopy reach 2 kHz (Kendall, 2009), which allows examination of the cycle to cycle variations in vocal fold vibration. However, this requires expensive equipment, and examination of the vibratory behaviour requires later analysis, because the vibrations are too fast for the human eye to follow in real time.
Stroboscopy is more commonly used in a clinical setting when making a diagnosis (Olthoff et al., 2007, Mehta and Hillman, 2012). By strobing the light source for the camera at a rate close to the fundamental frequency of the voiced speech being produced, an aliasing effect is achieved, slowing the apparent rate of vibration to a speed that is easier for a clinician to interpret. For instance, a strobing rate that is 1 Hz higher or lower than the F0 of the voice will produce an apparent vocal fold vibration rate of 1 Hz. As the individual frames recorded are each from a different vocal fold cycle, the observed vibratory waveform is, in a sense, an average across many cycles. This reduces the capability to analyse cycle to cycle stability, but it is still possible to make qualitative assessments of the vibratory behaviour based on stroboscopic video.
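The aliasing effect can be sketched numerically. The example below assumes perfectly periodic vibration and a strobing rate close to F0 itself (not to a multiple of it), in which case the apparent rate is the absolute difference between the two. The function names `apparent_rate_hz` and `flash_phases` and the example frequencies are hypothetical, chosen only for illustration.

```python
def apparent_rate_hz(f0_hz, strobe_hz):
    """Apparent vibration rate when the strobe rate is close to F0."""
    return abs(f0_hz - strobe_hz)


def flash_phases(f0_hz, strobe_hz, n_flashes):
    """Position within the vocal fold cycle (0 to 1) at each strobe flash.

    Flash n occurs at time n / strobe_hz, by which point the folds have
    completed n * f0_hz / strobe_hz cycles. Only the fractional part is
    visible in the image.
    """
    return [(n * f0_hz / strobe_hz) % 1.0 for n in range(n_flashes)]


# Illustrative values: a 120 Hz voice strobed at 119 Hz.
rate = apparent_rate_hz(120.0, 119.0)
phases = flash_phases(120.0, 119.0, 119)
```

With these values each flash captures the folds 1/119 of a cycle later than the previous flash, so 119 flashes, taken from 119 different real cycles over roughly one second, are needed to build up one apparent cycle. This makes concrete why the stroboscopic image cannot show cycle to cycle variation.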
Aside from the lower level of detail, the primary disadvantage of stroboscopy is the need for real time tracking of F0 to allow appropriate strobing speeds. As such, it is best applied to sustained phonation tasks, preferably with stable phonation characteristics (Deliyski and Hillman, 2010). Concurrent audio and EGG recordings can provide the F0 information necessary to select appropriate strobing speeds, in addition to allowing assessment of the cycle to cycle stability within the segments analysed.
Unfortunately, by virtue of how the camera is able to view the vocal folds, both high speed laryngoscopy and stroboscopy are more invasive than the other procedures described in this chapter. When using a rigid laryngoscope inserted via the mouth, the best view of the vocal folds is obtained when the tongue is maximally depressed, as it is in the production of the vowel /a:/. Flexible laryngoscopy via the nose benefits from maximal tongue fronting from vowels like /i:/ because this helps prevent the tongue from obscuring the view of the vocal folds. In addition, the presence of a laryngoscope in either the mouth or nose can be uncomfortable.
Consequently, there is potential for reduced naturalness of the voice during laryngoscopic recordings. Vowel choice for analysis is often limited to the vowels /i:/ or /a:/ in order to keep the tongue from obstructing the view of the vocal folds.
Ideally, we would like to have a method for getting vibratory data from the vocal folds that is more direct than models created from the acoustic waveform, but is less invasive and more conducive to normal voice production than laryngoscopy. In this thesis an appropriate middle ground is found in EGG.
Electroglottography based techniques in the analysis of age related changes in the adult male voice