Vowels MRI database
The Vowels MRI database [vow] includes 3D scans (stacks of 2D slices) of eleven American English vowels from four speakers (2 males – 2 females). The vowels included are /aa, ae, ah, eh, er, ey, ih, iy, ow, uh, uw/. For all phonemes there are 36 axial slices, and for most of them there are an additional 38 coronal slices. Image resolution is 256 × 256. Two types of clean audio recording are provided at 8 kHz: one with the subject lying on a sofa and sustaining phonation, and one with the subject standing and speaking naturally. A potential use of this database is examining the intersubject variability of the oral area during the production of several vowels [HJPA+03]. Sample images from this database can be seen in Fig. 2.4.
ATR MRI database
The ATR MRI database [KTAH09] includes 3D static MRI scans of the five Japanese vowels from one male subject. Every vowel was repeated 64 times. Fifty-one slices were acquired at a resolution of 512 × 512. The database also includes cineMRI data at a resolution of 256 × 256 and 30 fps (frames per second), as well as simultaneous and separate audio recordings at a sampling frequency of 48 kHz. Studies based on this database include the investigation of the correlation between body size, vocal tract length, and formant and pitch frequencies [HKT+12], vocal tract length estimation based on vowels [KKT+14], and temporal changes in the area function [THM+06], among others. Examples of images from this database can be seen in Fig. 2.5.
rtMRI-TIMIT database
The rtMRI-TIMIT database [NBG+11] includes 2D real-time scans of the midsagittal plane from ten subjects (5 males – 5 females). The resolution of the images is 68 × 68 and the frame rate is 23.18 fps. Subjects utter 460 sentences from the MOCHA-TIMIT corpus [Wre00b]. The database covers a wide range of American English phonemes in several contexts. It also includes simultaneously acquired audio recordings aligned with the video. Audio was acquired at a sampling frequency of 20 kHz and was denoised with the algorithm presented in [BNNN06]. Phonetic transcriptions are provided for all the data using the algorithm presented in [KBG+11]. For four of the subjects (2 males – 2 females), EMA recordings with sound of the same 460 sentences are also provided. This database has several applications, such as estimating articulatory dynamics [PLK+11], dynamic articulatory modeling [BKGN10], articulatory-acoustic mapping [GN10], articulatory recognition [KBRN11], and phoneme recognition [DKM18]. Examples of images from the rtMRI-TIMIT database can be seen in Fig. 2.6.
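Since the frame rate (23.18 fps) does not divide the audio sampling frequency (20 kHz) evenly, relating a video frame to the audio samples it covers involves rounding. The following is a minimal sketch of that mapping. It assumes that audio and video start at the same instant and that both rates are constant; the function name and its default values are illustrative and are not part of the tools distributed with the database.

```python
def frame_to_sample_range(frame_index, fps=23.18, sample_rate=20_000):
    """Return the half-open [start, stop) range of audio samples for one frame.

    Assumption: frame 0 and audio sample 0 are acquired at the same instant.
    Boundaries are rounded to the nearest sample, so consecutive frames
    cover contiguous, non-overlapping sample ranges.
    """
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    start = round(frame_index / fps * sample_rate)
    stop = round((frame_index + 1) / fps * sample_rate)
    return start, stop


if __name__ == "__main__":
    for i in range(3):
        print(i, frame_to_sample_range(i))
```

Rounding both boundaries with the same rule, rather than adding a fixed number of samples per frame, keeps the rounding error from accumulating over a long recording.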
rtMRI database for Portuguese
The rtMRI database for Portuguese [TMO+12] includes 2D real-time scans of the midsagittal plane from one female subject. The frame rate is 14 fps and the resolution of the images is 64 × 64 in some cases and 128 × 128 in the rest. This database focuses mainly on the nasal vowels of Portuguese. It includes rtMRI scans of the nasal vowels in several positions within a word, and of the isolated nasal and oral vowels, with the aim of studying gestural dynamics. It also includes scans of nasal consonants in VCV context. The corpus was designed so that it can be compared with the EMA data presented in [OMT09]. Simultaneous audio recordings were acquired at a sampling frequency of 16 kHz and synchronised with the MRI videos. Audio was denoised with the OptiMRI software. Potential uses of this database include automatic segmentation of the vocal tract [ST15], the study of velar movement in nasal vowels [MOST12], and the study of the oral articulation of nasal vowels [OMST12]. Images from the rtMRI database for Portuguese can be seen in Fig. 2.7.
USC-EMO-MRI corpus
The USC-EMO-MRI corpus [KTK+14] includes 2D rtMRI midsagittal scans of emotional speech (anger, sadness, happiness, neutral) from ten actors (5 males – 5 females). The corpus consists of the grandfather passage and seven additional custom-made sentences. Speakers read the corpus in American English several times in each of the four emotional states. More specifically, the grandfather passage was read once per emotion at a normal pace, and twice for neutral: once at a normal pace and once at a fast pace. Each of the seven sentences was uttered seven times per emotion. Image resolution is 68 × 68 and the frame rate is 23.18 fps. Synchronised audio from simultaneous recordings is also provided. Audio was sampled at 20 kHz and denoised using the algorithm presented in [BNNN06]. Applications of this database could include the study of the correlation between articulation and emotion [KLN11] or vocal tract segmentation [KKLN14]. Images from the USC-EMO-MRI corpus can be seen in Fig.
USC Speech and Vocal Tract Morphology MRI database
This database includes 2D rtMRI scans of the midsagittal plane and 3D volumetric MRI scans [SST+17] from 17 subjects (8 males – 9 females). Apart from some morphological indicators, the 3D acquisitions include scans of 13 sustained vowels and 14 sustained consonants, each within one phonetic context. The 3D volume resolution is 150 × 180 × 60. The 2D corpus mainly includes several repetitions of 24 CVC and 54 VCV sequences, passage reading, sentence reading, and spontaneous speech. The resolution of the 2D images is 68 × 68 and the frame rate is 23.18 fps. Aligned, simultaneous audio recordings are also included in the database. Audio was recorded at a sampling frequency of 100 kHz, downsampled to 20 kHz, and denoised using the algorithm described in [BNNN06]. The USC Speech and Vocal Tract Morphology MRI database could be useful for implementing airway segmentation algorithms for the vocal tract [EL] or speech identification algorithms [SSF18]. Sample dynamic and static images from this database can be seen in Fig. 2.9 and Fig. 2.10 respectively.
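To make the dynamic acquisitions easier to compare, their main parameters can be collected in one small data structure. The sketch below copies the values from the descriptions in this section; the class, field, and function names are illustrative choices, and the static 3D acquisitions are left out on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DynamicMriDatabase:
    name: str
    subjects: int
    frame_sizes: tuple   # possible (width, height) values, in pixels
    fps: float           # video frame rate
    audio_khz: float     # sampling frequency of the distributed audio


# Values taken from the database descriptions in this section.
DATABASES = (
    DynamicMriDatabase("ATR MRI (cineMRI)", 1, ((256, 256),), 30.0, 48.0),
    DynamicMriDatabase("rtMRI-TIMIT", 10, ((68, 68),), 23.18, 20.0),
    DynamicMriDatabase("rtMRI for Portuguese", 1, ((64, 64), (128, 128)), 14.0, 16.0),
    DynamicMriDatabase("USC-EMO-MRI", 10, ((68, 68),), 23.18, 20.0),
    # Audio recorded at 100 kHz, distributed after downsampling to 20 kHz.
    DynamicMriDatabase("USC Speech and Vocal Tract Morphology", 17, ((68, 68),), 23.18, 20.0),
)


def select(databases, min_subjects=1, min_fps=0.0):
    """Return the names of the databases meeting both minimum requirements."""
    return [
        d.name
        for d in databases
        if d.subjects >= min_subjects and d.fps >= min_fps
    ]


if __name__ == "__main__":
    print(select(DATABASES, min_subjects=10))
    print(select(DATABASES, min_fps=25.0))
```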
Impact of approximation at the level of velum and epiglottis
Geometric modeling of the vocal tract is used in particular to produce input data for articulatory synthesis [Bir13]. One of the challenges is to obtain a concise description and to remove geometric details that do not change the acoustic parameters significantly, from a perceptual point of view. Those simplifications could lead to a reduction of the number of parameters used to describe the vocal tract geometry, and consequently make the calculation simpler.
In general, more attention is paid to the jaw, tongue, lips and larynx, compared to the velum and epiglottis. The velum is indirectly taken into account more for representing the opening of the velopharyngeal port than for its impact on the oral cavity.
Concerning the epiglottis, its position depends on the size of the pharyngeal cavity, and thus on the tongue position. For a vowel with a large back cavity (as in /i/), the epiglottis stays apart from the back of the tongue. On the other hand, when the back cavity is more constricted (as for /÷/), the epiglottis is sometimes pressed against the back of the tongue (Fig. 3.9).
Our approach to geometric modeling of the vocal tract is based on an articulatory model that independently controls each of the articulators. The first articulator is the mandible (which corresponds to the opening of the jaw) because it influences the tongue and the lips. Two parameters (jaw angle and jaw horizontal position) are sufficient to control the opening of the jaw with good precision. The tongue is the articulator that reaches the greatest number of places of articulation, and its description must be fine enough to reach a precise position and shape. For this reason, and unlike the Maeda model [Mae90], some approaches use between 6 and 10 deformation factors. The influence of the jaw is taken into account when determining the contributions of the tongue and lips.
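The structure of such a model can be illustrated with a short sketch. It assumes a linear model, a common choice for articulatory models of this kind, in which a contour is the mean shape plus the jaw contribution plus a weighted sum of tongue deformation factors. The function name, the flattened coordinate layout, and all numerical values are hypothetical; the toy example uses a single tongue factor instead of the 6 to 10 mentioned above.

```python
def reconstruct_contour(mean, jaw_influence, tongue_factors, jaw_params, tongue_weights):
    """Rebuild a tongue contour from articulatory parameters (linear model).

    mean           -- flattened contour coordinates [x0, y0, x1, y1, ...]
    jaw_influence  -- one displacement vector per jaw parameter
    tongue_factors -- one displacement vector per tongue deformation factor
    jaw_params     -- jaw parameter values (e.g. angle, horizontal position)
    tongue_weights -- weights of the tongue deformation factors
    """
    if len(jaw_params) != len(jaw_influence):
        raise ValueError("one jaw parameter is needed per jaw influence vector")
    if len(tongue_weights) != len(tongue_factors):
        raise ValueError("one weight is needed per tongue deformation factor")

    contour = list(mean)
    # Jaw contribution first: the jaw carries the tongue with it.
    for value, vector in zip(jaw_params, jaw_influence):
        for i, displacement in enumerate(vector):
            contour[i] += value * displacement
    # Then the deformation of the tongue itself.
    for weight, vector in zip(tongue_weights, tongue_factors):
        for i, displacement in enumerate(vector):
            contour[i] += weight * displacement
    return contour


if __name__ == "__main__":
    # Toy contour with two points; every number here is made up.
    shape = reconstruct_contour(
        mean=[0.0, 0.0, 1.0, 0.0],
        jaw_influence=[[0.0, -1.0, 0.0, -1.0], [1.0, 0.0, 1.0, 0.0]],
        tongue_factors=[[0.0, 0.5, 0.0, -0.5]],
        jaw_params=[0.2, 0.1],
        tongue_weights=[2.0],
    )
    print(shape)
```

Keeping the jaw term separate from the tongue term is what allows each articulator to be driven independently.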
The epiglottis is actually a cartilage, and therefore the influence of other articulators that interact with the epiglottis, i.e. the mandible, tongue, and larynx, is decisive. Hence, their contribution through linear regression factors is more important than its intrinsic deformation factors. Once the midsagittal shape is calculated, it is necessary to find all the resonating cavities, their area functions and the global topology to run the acoustic simulation [EL16]. Geometrical simplifications allow faster simulations and avoid changes of the global topology when a small cavity appears.
The objective is to investigate the impact of geometric simplifications in order to better understand which ones can be made without removing important acoustic cues. Unlike Arnela’s work [ADB+16], which treats the vocal tract as a whole by transforming it into a piecewise elliptical and then cylindrical tube, we treat the articulators separately because articulatory synthesis requires that each of them be controlled independently of the others.
We used MRI data of the vocal tract with simultaneous speech recordings of five French vowels to study the effects of the articulators. The real speech signal was used as a reference. We edited the images to remove the velum and the epiglottis and then used acoustic simulations to see how the transfer function of the vocal tract was affected and, therefore, what role these two articulators play in phonation.
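To illustrate how a change in geometry shows up in the transfer function, the sketch below computes the resonances of a vocal tract described as a sequence of cylindrical sections. It is not the simulation method used in this work: it is a textbook one-dimensional, lossless chain-matrix model with a closed glottis and an ideal open end at the lips, swapped in for illustration. The physical constants, the function names, and the example area functions are assumptions.

```python
import math

SPEED_OF_SOUND = 350.0  # m/s, assumed value
AIR_DENSITY = 1.2       # kg/m^3, assumed value


def _matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


def chain_d(areas_m2, lengths_m, freq_hz):
    """D element of the glottis-to-lips chain matrix (real when lossless).

    With zero pressure at the lips, U_lips / U_glottis = 1 / D, so the
    resonances of the tract are the zeros of D.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    total = ((1 + 0j, 0j), (0j, 1 + 0j))
    for area, length in zip(areas_m2, lengths_m):  # ordered glottis -> lips
        z = AIR_DENSITY * SPEED_OF_SOUND / area
        c, s = math.cos(k * length), math.sin(k * length)
        total = _matmul(total, ((c, 1j * z * s), (1j * s / z, c)))
    return total[1][1].real


def resonances(areas_m2, lengths_m, f_max=5000.0, step=10.0):
    """Locate the zeros of D by a coarse scan followed by bisection."""
    def positive(freq):
        return chain_d(areas_m2, lengths_m, freq) > 0.0

    found = []
    f_prev = step
    while f_prev + step <= f_max:
        f = f_prev + step
        if positive(f_prev) != positive(f):  # sign change: one zero inside
            lo, hi = f_prev, f
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                if positive(lo) != positive(mid):
                    hi = mid
                else:
                    lo = mid
            found.append(0.5 * (lo + hi))
        f_prev = f
    return found


if __name__ == "__main__":
    uniform = resonances([3e-4] * 10, [0.0175] * 10)
    narrow_back = resonances([1e-4, 7e-4], [0.0875, 0.0875])
    print([round(f) for f in uniform])
    print([round(f) for f in narrow_back])
```

For a uniform tube the zeros of D fall at odd multiples of c/(4L), which gives a simple sanity check. Narrowing the back section while keeping the total length raises the first resonance, which is the kind of shift that comparing the original and the edited geometries is meant to quantify.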
Table of contents:
1.1 Applications of speech production knowledge
1.2 Purpose and motivation
1.3 Global overview
1.3.1 Techniques to capture articulatory information
1.3.2 Speech synthesis approaches
1.3.3 Articulatory data augmentation
1.3.4 Generic speaker modeling
1.4 Thesis organization
2.1 Requirements for a database
2.2 MRI databases for speech production research
2.2.1 Vowels MRI database
2.2.2 ATR MRI database
2.2.3 rtMRI-TIMIT database
2.2.4 rtMRI database for Portuguese
2.2.5 USC-EMO-MRI corpus
2.2.6 USC Speech and Vocal Tract Morphology MRI database
2.2.7 « Seeing speech » database
2.3.1 General description of the ArtSpeechMRIfr database
2.3.2 Data acquisition
2.3.3 Database description
2.4 Conclusion of Databases
3 Acoustic Simulations
3.1 Comparison between various types of simulations
3.1.1 Introduction about acoustic simulations
3.1.2 Data acquisition
3.1.3 Data processing
3.1.4 Acoustic simulations
3.1.5 Electrical simulation
3.1.7 Discussion about various types of simulations
3.2 Impact of head position on phonation
3.2.1 Introduction about the effect of head position on phonation
3.2.3 Discussion about the effect of head position on phonation
3.3 Impact of approximation at the level of velum and epiglottis
3.3.1 Introduction about geometric simplifications of the vocal tract
3.3.3 Discussion about the effect of velum and epiglottis simplification
3.4 Discussion about acoustic simulations
4 2D to 3D extension
4.1 Introduction about 2D to 3D extension
4.2 Dynamic 3D vocal tract shape generation
4.2.1 Acquiring the data
4.2.2 Phonetic alignment of sound recordings
4.2.3 Image transformation
4.2.4 Denoising procedure
4.2.5 Experiments on 3D shape generation
4.2.6 Conclusions about dynamic 3D vocal tract shape generation
4.3 Further extensions
4.3.1 Vocal tract sagittal slices estimation from MRI midsagittal slices
4.3.2 Synthesize MRI vocal tract data using « silence » MR Images
4.4 Discussion about 2D to 3D extension
5 Generic speaker model
5.1.2 Data acquisition
5.1.3 Vocal tract measurements
5.1.4 Atlas construction
5.3 Discussion about generic speaker model
6.1 Contributions of thesis
6.2 Selection of unexplored research questions
6.3 Directions to expand this thesis
7 Résumé détaillé en français
7.2 Bases de données
7.3 Simulations acoustiques
7.4 Transformation 2D à 3D
7.5 Modèle générique de locuteur