Decomposing the acoustic and linguistic modeling

Confusability: an ASR error analysis

Work on pronunciation modeling for automatic speech recognition (ASR) systems has had mixed results in the past; one likely reason for poor performance is the increased confusability in the lexicon that results from simply adding new pronunciation variants. In addition, increasing the number of pronunciations within a system often increases the decoding time. Both problems are related to the concept of confusability: words with similar phonetic sequences can be confused with each other. With more pronunciations, the rate of homophones (words that sound the same but are written differently) increases, which means that the additional variants may not help recognition performance (Tsai et al., 2001). Such cases, in particular for frequent words, can be responsible for the degradation of the ASR system when many alternatives are added.
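The effect of added variants on the homophone rate can be illustrated with a small sketch. The lexicon entries, the phone symbols and the helper `homophone_rate` below are hypothetical and chosen only for illustration; the rate is taken here, as an assumption, to be the fraction of words that share at least one pronunciation with another word.

```python
from collections import defaultdict

def homophone_rate(lexicon):
    """Fraction of words sharing at least one pronunciation with another word.

    `lexicon` maps each word to a set of pronunciations (tuples of phones).
    """
    words_by_pron = defaultdict(set)
    for word, prons in lexicon.items():
        for pron in prons:
            words_by_pron[pron].add(word)
    confusable = set()
    for words in words_by_pron.values():
        if len(words) > 1:  # this pronunciation is ambiguous
            confusable.update(words)
    return len(confusable) / len(lexicon) if lexicon else 0.0

# Toy baseline lexicon: one canonical pronunciation per word.
baseline = {
    "and": {("ae", "n", "d")},
    "an": {("ae", "n")},
    "in": {("ih", "n")},
    "then": {("dh", "eh", "n")},
}

# The same lexicon after adding reduced pronunciation variants.
expanded = {
    "and": {("ae", "n", "d"), ("ae", "n"), ("ax", "n")},
    "an": {("ae", "n"), ("ax", "n")},
    "in": {("ih", "n"), ("ax", "n")},
    "then": {("dh", "eh", "n")},
}

baseline_rate = homophone_rate(baseline)
expanded_rate = homophone_rate(expanded)
```

In this toy case no two words are homophones in the baseline lexicon, whereas three of the four words become mutually confusable once the reduced variants are added.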
At this stage, it is interesting to consider how confusions are perceived by humans and whether human perception corresponds to the way confusion is observed and analyzed in an ASR system. Identifying the unit of human acoustic perception is still an open issue, but many studies describe the segments that are confusable for the human auditory system, taking a sub-word (phonetic unit) approach. In segmental phonology, phonemes are the sounds that can distinguish one word from another. A phoneme is defined as a contrastive phonological segment (Chomsky and Halle, 1968). However, most machine-based perception methods perform the confusability analysis at the word level, limiting themselves to observed spoken examples (Mangu et al., 2000), (Goel et al., 2004). One attempt to address this weakness is the thorough descriptive work of (Greenberg et al., 2000), in which the outputs of eight recognition systems are compared at the phonetic level. Their analysis shows that phonetic and word errors are correlated, and they conclude that acoustic-phonetic variation is a leading contributor to word errors. The missing link in this work is, however, an analysis of how phonetic variability affects word recognition. A second drawback of these methods is that they are based on an a posteriori analysis of speech recognition errors and are unable to make any predictions (Printz and Olsen, 2000), (Deng et al., 2003). For example, (Chase, 1997) developed a technique for assigning blame for a speech recognition error to either the acoustic or the language model. This error assignment allows system developers to focus on certain speech segments in order to improve either of the two models; however, the technique cannot be used to generate new predictions. The capability to generalize to unseen speech data is thus missing, which restricts the use of such techniques.

Speech-dependent lexicons

Efforts have been made to construct lexicons that constrain the confusability caused by the recognition lexicon. One such effort is the construction of a speech-dependent lexicon, adapted to the data and ideally with suitably trained weights. For this purpose, the FST representation of the data has proven to be efficient (for more details on FSTs see Section 2.3). However, other representations of the phonemic sequences are also possible. In (Chen, 2011), for example, the reference and surface phones are aligned using an HMM representation. Most of these methods generate the uttered phoneme sequence, align it with the reference sequence and find the surface (spoken) pronunciations that correspond to the baseform pronunciations. These methods are of course limited to words present in the training set.
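The alignment step can be sketched as follows. This is not the FST or HMM formulation used in the cited works; as a simplifying assumption, it uses a plain unit-cost edit-distance alignment computed by dynamic programming, and the function name `align` and the example phones are hypothetical.

```python
def align(reference, surface):
    """Align two phone sequences with unit-cost edit distance.

    Returns a list of (ref_phone, surf_phone) pairs. None on the surface
    side marks a deleted phone; None on the reference side an inserted one.
    """
    n, m = len(reference), len(surface)
    # cost[i][j] = edit distance between reference[:i] and surface[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != surface[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the final cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + (reference[i - 1] != surface[j - 1])):
            pairs.append((reference[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((reference[i - 1], None))  # phone deleted in speech
            i -= 1
        else:
            pairs.append((None, surface[j - 1]))  # phone inserted in speech
            j -= 1
    pairs.reverse()
    return pairs

# Baseform of "and" aligned with a reduced surface realization.
pairs = align(["ae", "n", "d"], ["ae", "n"])
```

The aligned pairs show which baseform phones were kept, substituted or deleted in the spoken realization, which is the information a speech-dependent lexicon is built from.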
To circumvent this limitation, it is also possible to extract phonological rules once the alignment is done. These rules are not derived from linguistic knowledge, unlike the ones used in knowledge-based approaches (see Section 2.1). It is not even certain that they correspond to any linguistic phenomena. They simply adapt the baseform pronunciations to a transcription that better matches the spoken utterance. These rules can better represent a particular speaker or even compensate for errors of the ASR system. Some examples of such approaches are given in (Cremelie and Martens, 1999), (Riley et al., 1999), (Yang et al., 2002), (Akita and Kawahara, 2005) and (Van Bael et al., 2007).
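A minimal sketch of such data-driven rule extraction is given below. It does not reproduce any of the cited methods: it assumes that the alignments are already available as lists of (reference phone, surface phone) pairs, it uses one phone of left and right context, and it ignores insertions. The names `extract_rules` and `apply_rules`, the count threshold and the example words are all hypothetical.

```python
from collections import Counter

def extract_rules(alignments):
    """Count context-dependent rewrites (left, ref, right, surf).

    Each alignment is a list of (ref_phone, surf_phone) pairs, where
    surf_phone is None for a deleted phone. Insertions are ignored.
    """
    rules = Counter()
    for pairs in alignments:
        ref = [r for r, _ in pairs if r is not None]
        surf = [s for r, s in pairs if r is not None]
        padded = ["#"] + ref + ["#"]  # "#" marks a word boundary
        for k, (r, s) in enumerate(zip(ref, surf)):
            if r != s:
                rules[(padded[k], r, padded[k + 2], s)] += 1
    return rules

def apply_rules(pron, rules, min_count=2):
    """Rewrite a baseform with every rule seen at least `min_count` times.

    If several rules share a context, the last one stored wins.
    """
    active = {(l, r, rt): s
              for (l, r, rt, s), c in rules.items() if c >= min_count}
    padded = ["#"] + list(pron) + ["#"]
    variant = []
    for k, phone in enumerate(pron):
        out = active.get((padded[k], phone, padded[k + 2]), phone)
        if out is not None:  # None means the phone is deleted
            variant.append(out)
    return variant

# Alignments for two training words in which the final /d/ is dropped.
alignments = [
    [("ae", "ae"), ("n", "n"), ("d", None)],
    [("hh", "hh"), ("ae", "ae"), ("n", "n"), ("d", None)],
]
rules = extract_rules(alignments)

# The learned rule generalizes to a word absent from the training data.
variant = apply_rules(["b", "ae", "n", "d"], rules)
```

Because the rule is indexed by phonetic context rather than by word identity, it can produce a variant for a word that was never observed, which is how rule extraction sidesteps the training-set limitation.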

Combining g2p conversion and speech-dependent lexicons

An interesting approach to lexicon enhancement was presented in (Beaufays et al., 2003). Their procedure initializes a hypothesis with a g2p converter and then refines it with hypotheses from the joint alignment of phone lattices obtained from audio samples and the reference transcriptions. Good results are reported for proper name recognition using this method. Another example of lexicon enhancement combining g2p conversion and automatic transcriptions is presented in (Choueiter et al., 2007). Transcribed utterances can thus be used to correct a lexicon generated by g2p conversion, which is prone to errors, especially for low-frequency and irregular words. This idea is also developed in (Bodenstab and Fanty, 2007), using a multi-pass algorithm that combines audio information and g2p conversion. During the first pass, audio samples are analyzed and frequent phonetic deviations from the canonical pronunciation (generated previously by a g2p converter) are derived. The second pass then constrains the set of possible pronunciation variations and forces each audio sample to “choose” which pronunciation best represents its acoustics.
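The two-pass idea can be illustrated with a deliberately simplified toy. It is not the algorithm of (Bodenstab and Fanty, 2007): each audio sample is assumed to be already decoded into a phone string, and the acoustic match between a sample and a candidate pronunciation is replaced by a plain edit distance. The function names, the `max_variants` parameter and the example data are hypothetical.

```python
from collections import Counter

def edit_distance(a, b):
    """Unit-cost edit distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def two_pass(canonical, samples, max_variants=1):
    # Pass 1: count observed phone strings that deviate from the
    # canonical (g2p-generated) pronunciation.
    deviations = Counter(s for s in samples if s != canonical)
    # Constrain the candidates: canonical form plus frequent deviations.
    candidates = [canonical] + [p for p, _ in
                                deviations.most_common(max_variants)]
    # Pass 2: every sample "chooses" its closest candidate
    # (edit distance stands in for an acoustic score).
    choices = [min(candidates, key=lambda c: edit_distance(c, s))
               for s in samples]
    return candidates, choices

canonical = ("t", "ax", "m", "ey", "t", "ow")
variant = ("t", "ax", "m", "aa", "t", "ow")
noisy = ("t", "m", "aa", "t", "ow")  # one-off decoding, not kept as variant
samples = [variant, variant, canonical, variant, noisy]

candidates, choices = two_pass(canonical, samples)
```

Here the one-off noisy decoding is not admitted as a pronunciation of its own; in the second pass it is assigned to the frequent variant, the closest of the retained candidates.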

Table of contents:

Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Automatic Speech Recognition
1.2 Pronunciation variation
1.3 Grapheme-to-phoneme conversion
1.4 The confusability problem
1.5 Motivation
1.6 Thesis outline
2 Background and State-of-the-art
2.1 Grapheme-to-phoneme conversion
2.2 Phonemic confusability
2.2.1 Confusability: an ASR error analysis
2.2.2 Moderating confusability
2.2.3 Speech-dependent lexicons
2.2.4 Combining g2p conversion and speech-dependent lexicons
2.2.5 Phonemic confusability in the Keyword-Spotting task
2.3 FST background
2.3.1 Generalities
2.3.2 Semiring
2.3.3 Weighted Finite-State Transducers
2.3.4 Some useful semirings
2.3.5 Algorithms
2.3.6 Entropy semiring
2.3.7 Matchers
2.3.8 FST-based speech recognition
3 SMT-inspired pronunciation generation
3.1 Introduction
3.2 Methodology
3.2.1 Moses as g2p and p2p converter
3.2.2 Pivot paraphrasing approach
3.3 Experimental setup
3.4 Evaluation
3.4.1 Definition of evaluation measures
3.4.2 G2P conversion results
3.4.3 P2P conversion results
3.5 Speech recognition experiments
3.6 Conclusion
4 Pronunciation confusability
4.1 Introduction
4.2 A new confusability measure
4.2.1 ASR decoding with FSTs
4.2.2 Decomposing the acoustic and linguistic modeling
4.2.3 Definition of pronunciation entropy
4.3 Phoneme recognition
4.4 Pronunciation entropy results
4.5 Conclusion
5 Phoneme confusion model in ASR
5.1 Introduction
5.2 Problem set-up
5.3 Training criteria
5.3.1 The CRF model
5.3.2 Soft-margin CRF
5.3.3 Large-margin methods
5.3.3.1 Perceptron
5.3.3.2 Max-margin
5.3.4 Optimization algorithm
5.4 An FST-based implementation
5.4.1 Preprocessing
5.4.2 Defining the input and output FSTs
5.4.3 Computing the edit distance with FSTs
5.4.4 Discriminative training algorithms
5.4.4.1 Perceptron
5.4.4.2 Max-margin
5.4.4.3 CRF
5.4.4.4 Soft-margin CRF
5.5 Experimental setup
5.6 Phonemic analysis
5.7 Evaluation
5.7.1 Computation of the objective
5.7.2 Phoneme Accuracy
5.7.3 Decoding process
5.7.4 Discussion of the results
5.8 Conclusion
6 Confusion model for KWS
6.1 Introduction
6.2 Keyword spotting system
6.2.1 Indexing and searching representation
6.2.2 Confusion model
6.2.3 Confusion model initialization
6.3 Confusion model training
6.3.1 The Figure of Merit
6.3.2 Discriminatively optimizing the Figure of Merit
6.4 Experimental setup
6.5 Results
6.6 Conclusion
7 Conclusion and Perspectives
7.1 Thesis summary
7.2 Perspectives
Appendix A Phoneme set for American English
Appendix B Publications
Bibliography