Hanoi Vietnamese phonetics and phonology: Tonophone approach

HMM-based Vietnamese TTS

Many studies on HMM-based speech synthesis for tonal languages have been published, covering not only standard synthetic speech but also speech with different speaking styles and expressive speech. For instance, for Mandarin, the work of Duan et al. (2010), Guan et al. (2010), Hsia et al. (2010), Qian et al. (2006), Yu et al. (2013) and Zhiwei Shuang (2010) focused on the basic problem of improving the naturalness of HMM-based synthetic speech. Mixed-language or bilingual speech synthesis was studied by Qian and Soong (2012) and Qian et al. (2008), while Li et al. (2010, 2015) worked on expressive speech.
HMM-based speech synthesis for Thai has concentrated on improving tone correctness, as in the work of Chomphan (2011), Chomphan and Chompunth (2012), Chomphan and Kobayashi (2007, 2008) and Moungsri et al. (2014), or on speaker-dependent and speaker-independent systems, as in Chomphan (2009) and Chomphan and Kobayashi (2009).
For Vietnamese HMM-based speech synthesis, to the best of our knowledge, there are only two main groups: (i) the Institute of Information Technology (IoIT, which belongs to the Vietnamese Academy of Science and Technology) (Dinh et al., 2013; Phan et al., 2012, 2013a, 2013b, 2014; Vu et al., 2009) and (ii) Yunnan University, China (He et al., 2011; Kui et al., 2011). Both groups followed the core architecture of HTS to develop TTS systems for Hanoi Vietnamese.
We assume that the first publication on Vietnamese HMM-based speech synthesis was the work of IoIT (Vu et al., 2009). This system simply applied HTS to Vietnamese, with a training corpus of 3000 phonetically rich sentences, semi-automatically labeled at the phoneme level. Hanoi Vietnamese phonetics and phonology, including tonal aspects, were considered when building the phone and feature sets for the system. Features at the phoneme, syllable, word (including part of speech, POS), phrase and utterance levels were chosen. Compared to the feature set for English, there were additional features for the tone types of the preceding, current and succeeding syllables. On the basis of preliminary evaluations (the number of subjects was not mentioned), this work reported that the intelligibility of the synthetic utterances was approximately 100% and that the quality of the synthetic speech ranged from fair to good (3.23 on a 5-point MOS scale).
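The tone-context features described above can be illustrated with a minimal sketch. It is not the feature extractor of the cited system: the `ToneContext` record, the `tone_contexts` function, the boundary placeholder `"x"` and the tone label strings are all hypothetical names chosen for illustration. It only shows how each syllable can be paired with the tone of its preceding, current and succeeding syllable.

```python
from typing import List, NamedTuple, Tuple


class ToneContext(NamedTuple):
    syllable: str
    prev_tone: str   # tone of the preceding syllable
    tone: str        # tone of the current syllable
    next_tone: str   # tone of the succeeding syllable


# Hypothetical placeholder used when there is no neighbouring syllable.
BOUNDARY = "x"


def tone_contexts(syllables: List[Tuple[str, str]]) -> List[ToneContext]:
    """Attach preceding/current/succeeding tone labels to each syllable.

    `syllables` is a list of (syllable, tone_label) pairs; the tone labels
    themselves are assumed to come from an earlier transcription step.
    """
    tones = [BOUNDARY] + [tone for _, tone in syllables] + [BOUNDARY]
    return [
        ToneContext(syl, tones[i], tones[i + 1], tones[i + 2])
        for i, (syl, _) in enumerate(syllables)
    ]


# "tôi là sinh viên" with illustrative tone labels.
utterance = [("tôi", "ngang"), ("là", "huyền"), ("sinh", "ngang"), ("viên", "ngang")]
contexts = tone_contexts(utterance)
for ctx in contexts:
    print(ctx)
```

In a real HTS-style front end these three values would be only a few fields among many in the full context label of each phone.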

Main issues on Vietnamese TTS

The initial motivation of this work was to build a high-quality TTS system to help blind Vietnamese people access written text. The scope of this work was then narrowed to building a high-quality TTS system with unlimited vocabulary. Based on the above analyses of the two state-of-the-art TTS techniques, the HMM-based approach was chosen to build a TTS system for Vietnamese. Besides its advantages in general quality, footprint and robustness, this approach benefits from the core components provided by HTS and from a number of supporting platforms for building an HMM-based TTS system.
This section presents the main issues that we encountered while building our TTS system and introduces the general solutions proposed in this research.

Building a complete TTS system

The work of Tran (2007b) presented a complete architecture of a Vietnamese concatenative TTS system, comprising both high-level and low-level speech synthesis. However, there were still numerous issues in automatic text analysis, such as text normalization, word segmentation and POS tagging. To the best of our knowledge, most previous research on HMM-based Vietnamese TTS (Vu et al., 2009; Kui et al., 2011; Dinh et al., 2013; Phan et al., 2013b) adopted the HTS (HMM-based Speech Synthesis System) framework for experiments. These studies presented only the core architecture from HTS (Zen et al., 2007), which mainly covers the training and synthesis parts. All processes in these two parts can be performed using existing tools from HTS or other frameworks. However, the text analysis, or natural language processing, part was not investigated in detail.
Although Vietnamese uses an alphabetic script, automatic high-level text analysis (text normalization, word segmentation, POS tagging) raises issues because of a number of characteristics that distinguish the language from occidental languages. In occidental languages, spaces and punctuation can be used as the main predictors for word segmentation, yet in Vietnamese there is no word delimiter or specific marker that distinguishes the boundaries between words. Blanks are used not only to separate words but also to separate the syllables that make up words. Moreover, Vietnamese forms complex words by combining syllables that, most of the time, each carry an individual meaning. As a result, there are ambiguities in word segmentation that need to be addressed. Real Vietnamese texts often include many Non-Standard Words (NSWs) whose pronunciation cannot be found by applying letter-to-sound rules (e.g. numbers, abbreviations, dates). In addition, NSWs have a high degree of pronunciation ambiguity (higher than for ordinary words): many items have more than one plausible pronunciation, and the correct one must be chosen from context. This raises a real problem in text normalization. Vietnamese is an inflectionless language whose word forms never change, regardless of grammatical category. This leads to a linguistic phenomenon common in Vietnamese, called type mutation, in which a given word form is used in a capacity that is not its typical one (a verb used as a noun, a noun as an adjective, ...) without any morphological change (Le et al., 2010). This property introduces considerable ambiguity in POS tagging.
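The word segmentation ambiguity described above can be made concrete with a small sketch. It is an illustration only, not the segmenter used in this work: the function `segmentations`, the three-entry `TOY_LEXICON` and the example sentence are assumptions introduced here. Because blanks separate syllables rather than words, the same syllable string can be grouped into lexicon words in several ways.

```python
from typing import List, Set


def segmentations(syllables: List[str], lexicon: Set[str], max_len: int = 3) -> List[List[str]]:
    """Return every way to group `syllables` into words found in `lexicon`.

    A word is a run of 1..max_len consecutive syllables joined by a blank.
    """
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Toy lexicon (an assumption for this example): "học" (to study),
# "học sinh" (pupil), "sinh học" (biology).
TOY_LEXICON = {"học", "học sinh", "sinh học"}

sentence = "học sinh học sinh học"
candidates = segmentations(sentence.split(), TOY_LEXICON)
for words in candidates:
    print(" | ".join(words))
```

Exhaustive enumeration like this only exposes the ambiguity; choosing among the candidates requires a disambiguation model, which is exactly the issue the text raises.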
In this work, we present a complete architecture of an HMM-based TTS system comprising three parts: natural language processing, training and synthesis. The constituent modules of the natural language processing part were investigated and constructed. As a result, a complete HMM-based TTS system for Vietnamese was built.

Prosodic phrasing modeling

HMM-based speech synthesis is a statistical, machine-learning approach in which speech parameters and contextual features are force-aligned to train models. Each HMM also has a state-duration distribution that models the temporal structure of speech. As a result, prosodic cues such as intonation and duration can be learned well in context, which considerably increases the naturalness of the synthetic voice. The remaining problem in prosodic analysis is prosodic phrasing, including pause insertion and lower levels of syllable grouping. In an HMM-based TTS system, a pause is treated as a phoneme, so its duration can be modeled. However, the occurrence of pauses cannot be predicted by the HMMs. Lower phrasing levels, above the word, may not be modeled well with basic features alone.
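The division of labour described above can be sketched as follows. This is an illustration under stated assumptions, not the phrasing model of this work: the function `insert_pauses`, the pause symbol `"pau"` and the abstract phone symbols are hypothetical. The point is that a separate front-end component must decide where pauses occur and insert them into the unit sequence; the HMMs then only model how long each inserted pause lasts.

```python
from typing import List, Set

# Hypothetical symbol for the pause unit in the phone sequence.
PAUSE = "pau"


def insert_pauses(words: List[List[str]], breaks_after: Set[int]) -> List[str]:
    """Flatten per-word phone lists and insert a pause unit after chosen words.

    `breaks_after` holds the indices of words followed by a predicted pause;
    it is assumed to come from a separate phrasing predictor, not from the HMMs.
    """
    units: List[str] = []
    for index, phones in enumerate(words):
        units.extend(phones)
        if index in breaks_after:
            units.append(PAUSE)
    return units


# Abstract phone symbols stand in for a real transcription.
words = [["a", "b"], ["c"], ["d", "e"]]
units = insert_pauses(words, breaks_after={1})
print(units)
```

With the pause present in the unit sequence, it is handled like any other phone during training and synthesis, so its duration falls out of the state-duration model.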

Table of contents:

Notations and Abbreviations
List of Tables
List of Figures
Lists of Media files
1 Vietnamese Text-To-Speech: Current state and Issues
1.1 Introduction
1.2 Text-To-Speech (TTS)
1.2.1 Applications of speech synthesis
1.2.2 Basic architecture of TTS
1.2.3 Source/filter synthesizer
1.2.4 Concatenative synthesizer
1.3 Unit selection and statistical parametric synthesis
1.3.1 From concatenation to unit-selection synthesis
1.3.2 From vocoding to statistical parametric synthesis
1.3.3 Pros and cons
1.4 Vietnamese language
1.5 Current state of Vietnamese TTS
1.5.1 Unit selection Vietnamese TTS
1.5.2 HMM-based Vietnamese TTS
1.6 Main issues on Vietnamese TTS
1.6.1 Building phone and feature sets
1.6.2 Corpus availability and design
1.6.3 Building a complete TTS system
1.6.4 Prosodic phrasing modeling
1.6.5 Perceptual evaluations with respect to lexical tones
1.7 Proposition and structure of dissertation
2 Hanoi Vietnamese phonetics and phonology: Tonophone approach
2.1 Introduction
2.2 Vietnamese syllable structure
2.2.1 Syllable structure
2.2.2 Syllable types
2.3 Vietnamese phonological system
2.3.1 Initial consonants
2.3.2 Final consonants
2.3.3 Medials or Pre-tonal sounds
2.3.4 Vowels and diphthongs
2.4 Vietnamese lexical tones
2.4.1 Tone system
2.4.2 Phonetics and phonology of tone
2.4.3 Tonal coarticulation
2.5 Grapheme-to-phoneme rules
2.5.1 X-SAMPA representation
2.5.2 Rules for consonants
2.5.3 Rules for vowels/diphthongs
2.6 Tonophone set
2.6.1 Tonophone
2.6.2 Tonophone set
2.6.3 Acoustic-phonetic tonophone set
2.7 PRO-SYLDIC, a pronounceable syllable dictionary
2.7.1 Syllable-orthographic rules
2.7.2 Pronounceable rhymes
2.7.3 PRO-SYLDIC
2.8 Conclusion
3 Corpus design, recording and pre-processing
3.1 Introduction
3.2 Raw text
3.2.1 Rich and balanced corpus
3.2.2 Raw text from different sources
3.3 Text pre-processing
3.3.1 Main tasks
3.3.2 Sentence segmentation
3.3.3 Tokenization into syllables and NSWs
3.3.4 Text cleaning
3.3.5 Text normalization
3.3.6 Text transcription
3.4 Phonemic distribution
3.4.1 Di-tonophone
3.4.2 Theoretical speech unit sets
3.4.3 Real speech unit sets
3.4.4 Distribution of speech units
3.5 Corpus design
3.5.1 Design process
3.5.2 The constraint of size
3.5.3 Full coverage of syllables and di-tonophones
3.5.4 VDTS corpus
3.6 Corpus recording
3.6.1 Recording environment
3.6.2 Quality control
3.7 Corpus preprocessing
3.7.1 Normalizing margin pauses
3.7.2 Automatic labeling
3.7.3 The VDTS speech corpus
3.8 Conclusion
4 Prosodic phrasing modeling
4.1 Introduction
4.2 Analysis corpora and Performance evaluation
4.2.1 Analysis corpora
4.2.2 Precision, Recall and F-score
4.2.3 Syntactic parsing evaluation
4.2.4 Pause prediction evaluation
4.3 Vietnamese syntactic parsing
4.3.1 Syntax theory
4.3.2 Vietnamese syntax
4.3.3 Syntactic parsing techniques
4.3.4 Adoption of parsing model
4.3.5 VTParser, a Vietnamese syntactic parser for TTS
4.4 Preliminary proposal on syntactic rules and breaks
4.4.1 Proposal process
4.4.2 Proposal of syntactic rules
4.4.3 Rule application and analysis
4.4.4 Evaluation of pause detection
4.5 Simple prosodic phrasing model using syntactic blocks
4.5.1 Duration patterns of breath groups
4.5.2 Duration pattern of syllable ancestors
4.5.3 Proposal of syntactic blocks
4.5.4 Optimization of syntactic block size
4.5.5 Simple model for final lengthening and pause prediction
4.6 Single-syllable-block-grouping model for final lengthening
4.6.1 Issue with single syllable blocks
4.6.2 Combination of single syllable blocks
4.7 Syntactic-block+link+POS model for pause prediction
4.7.1 Proposal of syntactic link
4.7.2 Rule-based model
4.7.3 Predictive model with J48
4.8 Conclusion
5 VTED, a Vietnamese HMM-based TTS system
5.1 Introduction
5.2 Typical HMM-based speech synthesis
5.2.1 Hidden Markov Model
5.2.2 Speech parameter modeling
5.2.3 Contextual features
5.2.4 Speech parameter generation
5.2.5 Waveform reconstruction with vocoder
5.3 Proposed architecture
5.3.1 Natural Language Processing (NLP) part
5.3.2 Training part
5.3.3 Synthesis part
5.4 Vietnamese contextual features
5.4.1 Basic Vietnamese training feature set
5.4.2 ToBI-related features
5.4.3 Prosodic phrasing features
5.5 Development platform and configurations
5.5.1 Mary TTS, a multilingual platform for TTS
5.5.2 Mary TTS workflow of adding a new language
5.5.3 HMM-based voice training for VTED
5.6 Vietnamese NLP for TTS
5.6.1 Word segmentation
5.6.2 Text normalization (vted-normalizer)
5.6.3 Grapheme-to-phoneme conversion (vted-g2p)
5.6.4 Part-of-speech (POS) tagger
5.6.5 Prosody modeling
5.6.6 Feature Processing
5.7 VTED training voices
5.8 Conclusion
6 Perceptual evaluations
6.1 Introduction
6.2 Evaluations of ToBI features
6.2.1 Subjective evaluation
6.2.2 Objective evaluation
6.3 Evaluations of general naturalness
6.3.1 Initial test
6.3.2 Final test
6.3.3 Discussion on the two tests
6.4 Evaluations of general intelligibility
6.4.1 Measurement
6.4.2 Preliminary test
6.4.3 Final test with Latin square
6.5 Evaluations of tone intelligibility
6.5.1 Stimuli and paradigm
6.5.2 Initial test
6.5.3 Final test
6.5.4 Confusion in tone intelligibility
6.6 Evaluations of prosodic phrasing model
6.6.1 Evaluations of model using syntactic rules
6.6.2 Evaluations of model using syntactic blocks
6.7 Conclusion
7 Conclusions and perspectives
7.1 Contributions and conclusions
7.1.1 Adopting technique and performing literature reviews
7.1.2 Proposing a new speech unit – tonophone
7.1.3 Designing and building a new corpus
7.1.4 Proposing a prosodic phrasing model
7.1.5 Designing and constructing VTED
7.1.6 Evaluating the TTS system
7.2 Perspectives
7.2.1 Improvement of synthetic voice quality
7.2.2 TTS for other Vietnamese dialects
7.2.3 Expressive speech synthesis
7.2.4 Voice reader
7.2.5 Reading machine
List of publications
A Vietnamese syntax parsing
A.1 Syntax theory
A.1.1 Syntax and grammar
A.1.2 Parts Of Speech (POS)
A.1.3 Phrase structure grammar
A.1.4 Dependency structure grammar
A.2 Syntactic parsing techniques
A.2.1 Treebank corpus
A.2.2 Generative models
A.2.3 Discriminative models
A.2.4 Perceptron
A.2.5 Advanced parsing methods
A.3 Vietnamese classifiers
B Corpus design and prosodic phrasing modeling
B.1 Semi-automatic correction of breath noise labeling
B.2 VNSP-ThuTrang
B.3 Syntactic rules
B.3.1 Formal symbols representing syntactic rules
B.3.2 Proposal of syntactic rules
B.4 Breath groups and syllable ancestors
B.5 Syntactic blocks
B.6 Algorithm of syntactic-block division
B.7 Syntactic-block+link+POS model
C VTED design, construction and perceptual evaluations
C.1 The ToBI transcription model
C.2 Mary TTS platform
C.3 Examples of test GUI screens
C.4 Test corpus examples
Bibliography