Modelling approach: An example

"But why make models?" (Derek Zoolander, probably)
In this thesis we will evaluate computational implementations of one-step theories of nonnative speech perception. Notably, we will investigate models from the field of automatic speech recognition (ASR) which are a direct implementation of the Bayesian model shown in equation 3.14. While only the one-step family of theories will be evaluated in this work, we encourage further research to be done with similar methodologies in order to investigate all of the various co-existing proposals.
Indeed, using computational models to investigate competing theories is beneficial in several ways. Firstly, the need to translate the theories into model implementations forces model ideators to provide a mathematically and/or algorithmically well-defined model. This is in contrast with vaguer, more ambiguous verbally defined theories that leave more room for reader interpretation. Having more rigorous model definitions also allows us to better understand competing theories (what is the exact nature of the input? which grammar constraints are applied and how? …), meaning that it is easier to compare proposals and see where they do or do not differ significantly.
Secondly, obtaining a computational implementation of a theory makes it possible to derive predictions from the models in question. These predictions can then be examined qualitatively and quantitatively, and compared to what is observed in behavioural data.
For the skeptical reader, possibly left frowning after reading the above statements, we will briefly develop an example of how modelling can allow us to test theories in ways that may be unfeasible otherwise. The specific example, the details of which can be found in Appendix A, is from the literature of developmental psycholinguistics, concerning how acoustic differences in Infant-Directed Speech (IDS) might or might not promote language learning for infants, compared to Adult-Directed Speech (ADS). Indeed, IDS presents very salient prosodic, lexical, syntactic, and temporal properties (see [Soderstrom, 2007, Golinkoff et al., 2015] for a review).
The hypothesis (which we refer to as the Hyper Learnability Hypothesis; HLH) was advanced by [Kuhl et al., 1997]. The authors in this study analysed the acoustics of the vowels located at the extremities of the vowel triangle (i.e., /i, a, u/). Analyses showed an increase of the vowel triangle area (in formant space) for IDS compared to ADS. The authors interpreted this as an enhancement of phonemic contrasts, which might help infants identify and acquire phonemic categories more easily. The expansion of the vowel triangle in IDS was also attested in other studies [Andruski et al., 1999, Bernstein Ratner, 1984, Burnham et al., 2002, Cristia and Seidl, 2014, Liu et al., 2003, McMurray et al., 2013, Uther et al., 2007], but not systematically across the vowel inventory [Cristia and Seidl, 2014] and, importantly, IDS presented increased within-category acoustic variability [McMurray et al., 2013, Cristia and Seidl, 2014, Kirchhoff and Schimmel, 2005]. With increased vowel separation and increased within-category variability being opposite effects, we wondered whether or not the discriminability of IDS vowels was higher than that of ADS vowels.

4. The equivalent of a weighted sampling procedure was preferred over MAP estimation for percept selection, since participant responses in previous experimental work on epenthesis tended to show variation and were not deterministic.
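The footnote above contrasts weighted sampling with MAP estimation for percept selection. As a minimal sketch of that distinction (the function name and data layout are our own, purely illustrative), given a posterior over candidate percepts, MAP always returns the single most probable percept, while weighted sampling draws percepts in proportion to their posterior and thereby reproduces trial-to-trial response variation:

```python
import random

def select_percept(posteriors, mode="sample", rng=random):
    """Select a percept from a posterior over candidate transcriptions.

    posteriors: dict mapping candidate percept -> posterior probability.
    mode="map" returns the single most probable percept (deterministic);
    mode="sample" draws a percept with probability proportional to its
    posterior, reproducing variation across simulated trials.
    """
    if mode == "map":
        return max(posteriors, key=posteriors.get)
    candidates = list(posteriors)
    weights = [posteriors[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Under sampling, a percept with posterior 0.7 is produced on roughly 70% of simulated trials rather than on all of them, matching the non-deterministic responses observed experimentally.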
Using a computational model of the ABX discrimination task, [Martin et al., 2015] assessed the discriminability of Japanese phonemes in a large corpus of Japanese IDS and ADS. Against expectations, phonemes in IDS were on average less discriminable than in ADS. In Appendix A, we investigated whether the acoustic and phonological advantage of IDS might surface at the level of words. However, we found that words in IDS were also, on average, less discriminable than in ADS.
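The machine ABX task can be sketched in a few lines (a simplified, illustrative version: real implementations typically use DTW over frame sequences and symmetrize over both categories, whereas here tokens are single feature vectors and only one direction is scored): for every triple where A and X are distinct tokens of one category and B a token of the other, the trial counts as correct when X is closer to A than to B.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def abx_score(a_tokens, b_tokens, dist=euclidean):
    """Machine ABX discriminability of category A versus category B.

    a_tokens, b_tokens: lists of feature vectors, one per token.
    For every triple (A, B, X) with A and X distinct tokens of
    category A and B a token of category B, the trial is correct
    when X is closer to A than to B. Returns the proportion of
    correct trials (0.5 = chance, 1.0 = fully discriminable).
    """
    correct = total = 0
    for i, a in enumerate(a_tokens):
        for j, x in enumerate(a_tokens):
            if i == j:
                continue
            for b in b_tokens:
                correct += dist(a, x) < dist(b, x)
                total += 1
    return correct / total
```

Averaging such scores over all phoneme (or word) pairs in a register yields the register-level discriminability compared between IDS and ADS.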
In this set of studies, the modelling approach allowed us to quantitatively and qualitatively test the HLH at a large scale (several million simulated ABX experimental trials), with systematic comparisons of all phonemes/words in the language, without assuming a specific learning algorithm, and using a richer representation of the acoustics based on a model of how speech is processed by the auditory system (as opposed to the formant values used in other studies). Importantly, modelling allowed us to evaluate the interaction between effects supporting opposing hypotheses, and showed us the resulting predictions in a computationally interpretable format. A study of this magnitude would not have been possible with traditional experimental techniques, which makes modelling a welcome addition to the experimental evidence gathered for and against the HLH.
In an analogous manner, we will use computational models of nonnative speech perception in this thesis in order to investigate the underlying mechanisms of perceptual vowel epenthesis in ways that may not be possible with behavioural experiments alone.
Why do people of different linguistic backgrounds sometimes perceive the same acoustic signal differently? In particular, how is this nonnative acoustic signal processed to become what the listener ends up perceiving? How much of this process is guided by the information directly accessible in the acoustic signal? What is the contribution of the native phonology? How are these two elements combined when computing the native percept?
In order to answer these questions, various mechanisms underlying nonnative speech perception have been put forward; however, many lack a formal definition that would allow them to be tested empirically. In this dissertation, we select one of the proposals advanced in the psycholinguistics literature. Namely, we investigate one-step models of nonnative speech perception [Dupoux et al., 2011, de Jong and Park, 2012, Wilson and Davidson, 2013, Durvasula and Kahng, 2015], which postulate that the acoustic match and the sequence match between the nonnative input and the native percept are optimised simultaneously. To do so, we test a proof-of-concept computational implementation of the model as defined by [Wilson and Davidson, 2013].
We present various methodologies for qualitatively and quantitatively evaluating the reverse inference proposal. We do this by focusing on the phenomenon of perceptual vowel epenthesis, namely the phenomenon by which listeners may hallucinate vowels when hearing nonnative speech that does not conform to the structural constraints of their native language. Of interest are both the rates of vowel epenthesis (i.e., how often do participants experience this?) and variations in epenthetic vowel quality (i.e., which vowel is epenthesized?).
Following the experimental approach recommended by [Vendelin and Peperkamp, 2006], the data arising from the computational models are compared to data from psycholinguistic experiments. In these, nonnative speech perception is evaluated using paradigms that tap into the online (i.e., real-time, individual) perception of nonwords, in order to reduce the influence of confounds such as orthography and semantics. In other words, we subject the proposed computational models to tasks analogous to those completed by human participants and analyse their behaviour both quantitatively and qualitatively. Do we find acoustics-based mechanisms to be necessary to predict perceptual vowel epenthesis in human listeners? If so, do they suffice?
This dissertation is divided into two main sections. First, in Chapter 2, we use an identification paradigm to investigate the influence of acoustic details on modulations of epenthetic vowel quality. We discuss the implications of our results in the context of the opposition between the two-step and one-step theories of nonnative speech perception. We find that acoustic details modulate epenthetic vowel quality, results that are in agreement with one-step theories. Building on these results, we present a basic model of speech perception exclusively reliant on acoustic matching between minimal pairs of nonnative and native speech exemplars. Namely, we build non-parametric exemplar-based models of perception. Relative to human results, we find the models able to reproduce some qualitative effects linked to the role of coarticulation in epenthetic vowel quality; however, the models are limited by their inability to output responses other than those derived from their specific inventory of exemplars.
In Chapter 3 we turn to a parametric implementation of a one-step proposal, using tools from the field of automatic speech recognition (ASR). We present an HMM-GMM speech recogniser composed of independent acoustic and language (i.e., phonotactic) models. These can be tweaked as necessary to test hypotheses about the underlying mechanisms of nonnative speech perception. We propose a novel methodology for testing ASR systems whose language models are represented by Weighted Finite State Transducers (W-FSTs) in identification tasks analogous to those used to test human participants. Using this method, we test the predictive power of the acoustic model on patterns of vowel epenthesis. We find that the acoustic model alone better predicts human results than when accompanied by language models, at least when the latter are n-gram based phonotactic models with phones as units. We further test whether some effects traditionally attributed to phonology may actually be predicted from acoustics alone. Following promising but not perfect results, we propose future research paths for enhancing the methodology.
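The phone-level n-gram phonotactic models mentioned above can be illustrated with a small bigram sketch (our own illustrative code, not the W-FST implementation used in the thesis, which would typically be built with an ASR toolkit): the model is trained on attested phone sequences and assigns a log-probability to any candidate sequence, with add-k smoothing so unseen bigrams keep nonzero probability.

```python
import math
from collections import Counter

def train_bigram(corpus, smoothing=1.0):
    """Phone-level bigram phonotactic model with add-k smoothing.

    corpus: list of phone sequences, e.g. [["e", "b", "u", "z", "o"], ...].
    Returns a function scoring the log-probability of a new sequence,
    with "<s>"/"</s>" padding marking word boundaries.
    """
    unigrams, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for seq in corpus:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(seq)
        unigrams.update(padded[:-1])          # contexts
        bigrams.update(zip(padded[:-1], padded[1:]))
    V = len(vocab)

    def logprob(seq):
        padded = ["<s>"] + list(seq) + ["</s>"]
        return sum(
            math.log((bigrams[(p, q)] + smoothing)
                     / (unigrams[p] + smoothing * V))
            for p, q in zip(padded[:-1], padded[1:]))
    return logprob
```

Trained on a native lexicon, such a model assigns higher scores to phonotactically legal sequences (e.g., with an epenthetic vowel breaking an illegal cluster) than to illegal ones, which is exactly the sequence-match term the one-step account combines with the acoustic match.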
Role of acoustic details in the choice of epenthetic vowel quality
As presented in the Introduction, work on loanword adaptation and online speech perception shows that listeners epenthesize or delete vowels from nonnative input when it does not conform to native phonotactics. While this statement seems to be generally accepted, the mechanisms underlying these phenomena are subject to more debate. In this chapter we will investigate the mechanisms underlying variations in epenthetic vowel quality.
One-step vs two-step theories
We saw that theories such as those by [Berent et al., 2007, Monahan et al., 2009] view perceptual vowel epenthesis as a two-step process. According to these proposals, the quality of the epenthetic vowel is determined by a language-specific grammar after an initial parsing of the nonnative input. In contrast, one-step theories such as those proposed by [Dupoux et al., 2011, de Jong and Park, 2012, Wilson and Davidson, 2013, Durvasula and Kahng, 2015] argue that parsing is an optimisation problem in which the optimal output simultaneously maximises the acoustic/phonetic match to the input and the likelihood of the phonemic sequence in the native language.
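In schematic Bayesian notation (the symbols here are ours, chosen for illustration and not necessarily identical to those of the equation in Chapter 3), the one-step optimisation can be written as selecting, for an acoustic input $x$, the native phoneme sequence $s$ that jointly maximises acoustic fit and sequence likelihood:

```latex
\hat{s} \;=\; \operatorname*{arg\,max}_{s} P(s \mid x)
        \;=\; \operatorname*{arg\,max}_{s} \, p(x \mid s)\, P(s)
```

Here $p(x \mid s)$ plays the role of the acoustic/phonetic match and $P(s)$ that of the native-language sequence likelihood (e.g., a phonotactic model); the two terms are weighed against each other in a single step rather than applied serially.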
How can we confront and test these one-step and two-step proposals? For this, we can dissect the phenomenon of perceptual vowel epenthesis and split it into two subproblems:
1. When does epenthesis occur?
2. What vowel is epenthesized?
Concerning the first subproblem, neither one-step nor two-step theories give explicit predictions concerning the rate of epenthesis. It is even unclear whether the two-step theories exposed above allow for epenthesis not to happen. In the case of [Berent et al., 2007], not epenthesizing a vowel would require directly yielding the phonetic form, without repairs being performed by the grammar. While [Berent et al., 2007] hypothesize that this may happen in tasks requiring participants to pay more attention to phonetics, it is unclear in which cases listeners would directly retrieve the phonetic form within the same task, for similar stimuli. In the case of [Monahan et al., 2009], lack of epenthesis would involve a different syllabification of the input than when epenthesis happens. Therefore, a priori, epenthesis should always happen if the input is syllabified according to native phonotactics. In the case of reverse inference one-step theories [Dupoux et al., 2011, de Jong and Park, 2012, Wilson and Davidson, 2013, Durvasula and Kahng, 2015], lack of epenthesis might occur if the optimal match between the nonnative input and the native output is more strongly driven by acoustic/phonetic match than by sequence acceptability.
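The last point can be made concrete with a toy sketch (all names, candidates, and scores below are invented for illustration): under a one-step account, the faithful parse and the epenthesized parse each receive an acoustic log-likelihood and a phonotactic log-probability, and which one wins depends on how strongly the sequence term weighs against the acoustic term.

```python
def best_parse(candidates, weight=1.0):
    """Pick the optimal parse under a one-step account.

    candidates: dict mapping a candidate native transcription to a pair
    (acoustic_loglik, phonotactic_logprob). The winner maximises the
    acoustic match plus the weighted sequence acceptability; epenthesis
    fails to occur whenever the faithful parse wins despite its low
    phonotactic score.
    """
    return max(candidates,
               key=lambda c: candidates[c][0] + weight * candidates[c][1])

# Toy example: a faithful parse with an illegal cluster versus an
# epenthesized parse that fits the acoustics less well.
candidates = {
    "ebzo":  (-10.0, -8.0),   # good acoustic fit, phonotactically bad
    "ebuzo": (-14.0, -2.0),   # worse acoustic fit, phonotactically good
}
```

With `weight=1.0` the epenthesized parse "ebuzo" wins, while with a small weight (acoustics dominating) the faithful parse "ebzo" wins, i.e., no epenthesis occurs, which is exactly the flexibility the two-step accounts above lack.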
Table of contents:
1.1 Nonnative speech misperceptions
1.2 Perceptual vowel epenthesis
1.3 Processing steps in perceptual vowel epenthesis
1.4 Modelling approach: An example
2 Role of acoustic details in the choice of epenthetic vowel quality
2.1.1 One-step vs two-step theories
2.1.2 Role of acoustics
2.1.3 Chapter preview
2.2 Which epenthetic vowel? Phonetic categories versus acoustic detail in perceptual vowel epenthesis
2.2.4 Discussion and conclusion
2.3 Predicting epenthetic vowel quality from acoustics
2.3.2 Perception experiment
2.3.3 Acoustic analyses
2.3.4 Production-based exemplar model
2.4 Predicting epenthetic vowel quality from acoustics II: It’s about time!
2.5 General Discussion
3 Modelling speech perception with ASR systems
3.1.1 Implementation of a one-step model
3.1.2 Is the acoustic match sufficient?
3.1.3 Chapter preview
3.2 Anatomy of a HMM-based speech recogniser
3.2.3 Acoustic models
3.2.4 Lexicon & language models
3.2.6 Scoring: Assessing native performance
3.3 Investigating the role of surface phonotactics
3.3.2 Experiment 1
3.3.3 Experiment 2
3.3.6 Bigram (phone-level) language model
3.3.7 Online and Retro language models
3.4 Medley of epenthetic variations: Due to phonological processes or embedded in the phonetics?
3.4.2 Experiment 3: Variations due to native phonology
3.4.3 Experiment 4: Variations due to syllabic structure
3.5 General Discussion
3.5.1 Language model contributions
3.5.2 Model adequacy
3.5.3 Predictive power of the acoustic model
3.5.4 Model enhancements
3.5.5 Data enhancements