Difficulties with Automatic Speech Recognition


Automatic Speech Recognition

Automatic Speech Recognition (ASR) is the interpretation of human speech by a machine. Jurafsky [16] gave a technical definition of ASR as the process of mapping acoustic signals to a string of words. He extended this concept to Automatic Speech Understanding (ASU), in which the process continues beyond the word string to produce some understanding of the sentence.
Mats Blomberg [5] reviewed several applications that use ASR and discussed the advantages they can bring. For example, a voice-controlled computer used for dictation could be an important application for physically disabled people, lawyers and others. Another application is environmental control, such as turning on a light or changing TV channels. In general, people whose hands are occupied by their jobs would benefit greatly from ASR-controlled applications.
In speech recognition, a phoneme is the smallest linguistic unit that can be combined with other phonemes to create the sound of a word. Written words are made of letters, whereas the sounds we hear are made of phonemes. For example, the letters « o », « n », « e » combine to make the word « one », and the phonemes « W », « AH », « N » make the sound of « one ». A sample sonogram of the word ’zero’, with marked boundaries for its four phonemes, is presented in Figure 2.1.

Acoustic Phonetic Approach

In the acoustic phonetic approach, each phoneme has a representation (or multiple representations) in the frequency domain, as illustrated in Figure 2.2. Some phonemes put more stress on low frequencies, others on high frequencies. In acoustics, these sets of frequencies are called « formants ».
During the recognition process, the speech signal is analyzed and the most probable distribution of formants is retrieved [13]. The approach is based on the assumption that a language has a finite number of identifiable phonetic units that can be mapped to a set of acoustic properties appearing in the speech signal. Even though the acoustic properties of phonetic units can vary, the rules are not complicated and can be learned by a machine from large amounts of data.
This method consists of several phases. In the first phase, spectral measurements are converted to a set of features that describe the acoustic properties of the different phonetic units. In the segmentation and labeling phase, the speech signal is segmented into stable acoustic regions, and one or more phonetic labels are attached to each segmented region, which results in a phoneme lattice characterization of the speech. Finally, a valid word is determined from the phonetic label sequences produced by the previous phase [30].

Pattern Recognition Approach

Another approach is pattern matching, which involves two essential steps: pattern training and pattern comparison. In this approach, a well-formulated mathematical framework is used to establish consistent speech pattern representations. The pattern comparison is then done based on a set of labeled training samples and a formal training algorithm [15, 17, 27].
The patterns are built in the training phase. During recognition, the method compares the unknown speech with each possible pattern and tries to identify it as one of the known patterns. During the last six decades, the pattern recognition approach has been the most important method for speech recognition [30].
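The two steps above can be sketched in a few lines. This is only an illustrative sketch and not the method of any cited work: it assumes each utterance has already been reduced to a fixed-length feature vector, it averages the training vectors per label to form a template, and it compares with Euclidean distance. The function names, the labels and the feature values are hypothetical.

```python
from math import dist
from statistics import fmean


def train_templates(samples):
    """Pattern training: average the labeled feature vectors of each label.

    `samples` maps a label to a list of equal-length feature vectors.
    """
    return {
        label: [fmean(column) for column in zip(*vectors)]
        for label, vectors in samples.items()
    }


def classify(templates, features):
    """Pattern comparison: return the label of the closest template."""
    return min(templates, key=lambda label: dist(templates[label], features))


# Hypothetical two-dimensional features for two spoken words.
training = {
    "yes": [[1.0, 0.0], [0.8, 0.2]],
    "no": [[0.0, 1.0], [0.2, 0.8]],
}
templates = train_templates(training)
best_label = classify(templates, [0.9, 0.1])
```

Real recognizers compare sequences of feature vectors of varying length, so the fixed-length vectors here are a deliberate simplification.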

Artificial Intelligence Approach (Knowledge Based Approach)

The next approach is a hybrid of the acoustic phonetic approach and the pattern recognition approach. This knowledge-based method uses phonetic, linguistic and spectrogram information. In this approach, the recognition system was developed to handle more speech sounds [2].
The approach provided little insight into human speech processing and made error analysis more difficult. However, it produced a good and helpful body of linguistic and phonetic literature for better understanding the human speech process. In this design, the recognition system is combined with an expert’s knowledge of speech. Usually, this knowledge is taken from a broad study of spectrograms and is expressed as rules and procedures [2].
Knowledge has also been used to guide the design of the models and algorithms of other techniques, such as template matching and stochastic modeling. This form of knowledge application draws an important distinction between the algorithms themselves and the knowledge that helps solve the problem: knowledge enables the algorithms to work better. Systems of this kind have contributed significantly to most successful designs. Knowledge plays an important role in the selection of a suitable input representation, the definition of the units of speech, and the design of the recognition algorithm itself [2].

Search by Voice Efforts at Google

Searching for information by voice has a long history and is not specific to computers or the web. Thirty years ago, people could already dial directory assistance and ask an operator for a telephone number [29].
One of the most significant efforts was 800-GOOG-411 [3], in which Google integrated speech recognition and web search in the form of an automated system.
This system was used to find and call businesses. It was based on store-and-forward technology: instead of direct interaction between the user and the operator, a request (city and state) was stored and later played to the operator. In between, the search was constrained to businesses in the requested city [29].
The next version of GOOG-411 was released in 2008. It could handle a search in a single utterance, so there was no need to state the location and the business separately. The interaction became faster and gave users greater flexibility in how they stated their needs. In this approach, the challenging part was extending the language model to a wider range than only the businesses in the requested city [31].
Another Google voice search effort was the Google Mobile App (GMA) for iPhone, which included a search-by-voice feature. It extends the domain of voice search from businesses on a map to the entire World Wide Web. Unlike GOOG-411, GMA is not very domain-dependent and must handle anything that Google search can handle. It is more challenging because of the large vocabulary and the complexity of the queries [29]. The basic system architecture of the recognizer behind Google voice search is illustrated in Figure 2.3.
Figure 2.3: Basic block diagram of a speech recognizer

Difficulties with Automatic Speech Recognition

The difficulties of automatic speech recognition can be divided into three categories. The first concerns the role of humans in the process, the second relates to the technology used to capture the speech, and the third concerns ambiguity and other properties of the language.


Human

Human comprehension of speech is not comparable to automatic speech recognition by computers. Humans use both their ears and their knowledge of the speaker and the subject context. They can predict words and sentences because they know the correct grammatical structures, the idioms and the way people say things. Body language, eye contact and posture are further advantages that humans have. In contrast, ASR employs only the speech signal together with statistical and grammatical models to guess the words and sentences [14].
Speaker variability is another potential difficulty in retrieving the speech. Each speaker has a unique voice, a unique speaking style and a unique way of pronouncing and emphasizing words. Variability increases further when the speaker shows feelings such as sadness, anger or happiness while speaking [14].



Technology

Technology affects the process of converting speech to text. Speech is uttered in an environment of sounds and noises, and these noises and echo effects have to be identified and filtered out of the speech signal by well-designed tools. Channel variability is another issue in ASR: the quality of the microphone affects the content of the acoustic waves on their way from the speaker to the computer. Moreover, matching the large number of sounds, words and sentences handled by the ASR system requires good analysis tools and a comprehensive lexicon. These examples show how technology plays a role in the speech recognition process [14].


Language

One of the main issues in ASR is language modeling, which has to account for the differences between spoken and written language. For instance, the grammar of spoken language differs from that of written language [1].
Another issue in modeling a language is ambiguity, where it cannot easily be decided which set of words is actually intended. Homophones, words that sound the same but are written differently, such as « their » and « there », are one example. Word boundary ambiguity is another issue, where there are multiple ways of grouping phones into words [12].
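Both kinds of ambiguity can be made concrete with a small sketch. The toy lexicon below is an assumption made for illustration only (the phone symbols are ARPAbet-like and the entries are hand-picked); the hypothetical function `segmentations` enumerates every way of grouping a phone sequence into lexicon words.

```python
# Toy pronunciation lexicon: phone sequence -> words pronounced that way.
# The entries are illustrative assumptions, not taken from a real lexicon.
LEXICON = {
    ("AY",): ["I"],
    ("AY", "S"): ["ice"],
    ("S", "K", "R", "IY", "M"): ["scream"],
    ("K", "R", "IY", "M"): ["cream"],
    ("DH", "EH", "R"): ["their", "there"],
}


def segmentations(phones, lexicon):
    """Return every word sequence whose phones concatenate to `phones`."""
    if not phones:
        return [[]]
    results = []
    for end in range(1, len(phones) + 1):
        # Try each lexicon word that matches the first `end` phones.
        for word in lexicon.get(tuple(phones[:end]), []):
            for rest in segmentations(phones[end:], lexicon):
                results.append([word] + rest)
    return results


# Word boundary ambiguity: one phone sequence, two groupings.
boundary = segmentations(["AY", "S", "K", "R", "IY", "M"], LEXICON)
# Homophones: one phone sequence, two spellings.
homophones = segmentations(["DH", "EH", "R"], LEXICON)
```

Here `boundary` holds the readings « I scream » and « ice cream », and `homophones` holds « their » and « there ». The acoustic signal alone cannot choose between them, which is why a language model is needed.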


Metrics

Evaluating the quality of a system requires choosing proper metrics. Metrics provide a better understanding of the system and its problems. Some of the well-known metrics in the literature are reviewed here to give an idea of how the proposed model can be evaluated.

Word Error Rate

Word Error Rate (WER) measures the performance of an ASR system. It compares the words output by the recognizer with a reference transcription of what the user said [29]. The errors counted by this metric are substitutions, insertions and deletions. It is defined as:
$$\mathrm{WER} = \frac{S + D + I}{N}$$
where
• S is the number of substitutions,
• D is the number of deletions,
• I is the number of insertions, and
• N is the number of words in the reference.

Following are some examples to clarify WER:
• Deletion: suppose the sentence to be recognized is « Have a nice time », but the ASR guess is « Have a time ». Here ’nice’ was dropped by the ASR, so a deletion occurred.
• Insertion: suppose the sentence to be recognized is « what a girl! », but the ASR guess is « what a nice girl ». Here ’nice’ was added by the ASR, so an insertion occurred.
• Substitution: suppose the sentence to be recognized is « Have a nice day », but the ASR guess is « Have a bright day ». Here ’nice’ was replaced with ’bright’ by the ASR, so a substitution occurred.
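The three examples can be checked with a short sketch. It assumes the usual convention that S + D + I is taken as the minimum number of word-level edits (the Levenshtein distance) that turn the reference into the hypothesis, and that both are tokenized on whitespace without punctuation. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (S + D + I) / N using word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the ref prefix seen so far into hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


deletion_wer = word_error_rate("have a nice time", "have a time")
insertion_wer = word_error_rate("what a girl", "what a nice girl")
substitution_wer = word_error_rate("have a nice day", "have a bright day")
```

Each example contains exactly one error, so the rates are 1/4, 1/3 and 1/4 respectively. Because insertions are counted against the reference length N, WER can exceed 1 when the hypothesis is much longer than the reference.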


Quality

When searching by voice, leaving one word out may not affect the final result. For instance, if the user searches for « what is the highest mountain in the world? », missing function words such as « in » generally do not change the result [29]. Likewise, misrecognizing the plural form of a word (a missing « s ») might not have an effect on the search either. These cases are taken into account by the Web-Score metric, which measures how many times the search result queried with the recognition hypothesis varies from the search result queried with a human transcription. The Web-Score can therefore indicate the semantic quality of a recognizer [29].

Out of Vocabulary Rate

The Out of Vocabulary (OOV) rate is the percentage of words that are not modeled by the language model. The metric matters because an OOV word is not only misrecognized itself but also tends to cause errors in the surrounding words, through subsequent poor predictions in the language model and acoustic misalignment. It is important to keep this rate as low as possible [29].
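As a minimal sketch, the OOV rate can be computed by counting the words of a test set that are missing from the vocabulary. The function name `oov_rate`, the toy vocabulary and the query are assumptions made for illustration, and the vocabulary is assumed to be stored in lower case.

```python
def oov_rate(words, vocabulary):
    """Return the percentage of `words` that are not in `vocabulary`."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word to compute the OOV rate")
    missing = sum(1 for word in words if word.lower() not in vocabulary)
    return 100.0 * missing / len(words)


# Hypothetical vocabulary of a tiny language model.
vocabulary = {"what", "is", "the", "highest", "mountain", "in", "world"}
query = "what is the highest mountain in kyrgyzstan".split()
rate = oov_rate(query, vocabulary)
```

One of the seven query words is missing from the vocabulary, so the rate is about 14.3 percent.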


Latency

Latency is defined as the time from when the user finishes speaking until the search result becomes visible on the screen. It is affected by many factors, such as the time needed to perform the query and the time it takes to render the search result [29].

Phonetic algorithm

Retrieving name-based information is complicated by misspellings, nicknames and cultural variations. Languages develop continuously over time, and in many languages, especially English, the way words are written may differ from the way they are pronounced. Orthography is therefore not a good basis for matching, whereas phonology reflects the current state of a language. Hence there is a demand for a set of phonetic representations for each word of a language [6].


Soundex

In 1918, Russell developed the Soundex algorithm. It was the first algorithm that tried to find a phonetic code for similar-sounding words, and it was based on the fact that the nucleus of names consists of some sounds, which inadequately define names. These sounds may consist of one or more letters of the alphabet. Since many names have several different spellings, the main idea of Soundex is to index a word by how it is pronounced rather than by how it is written. It is explained in more detail here because Soundex is the basis of other phonetic algorithms [8].
The Soundex code of a word consists of a letter followed by three digits. The letter is the first letter of the word, and the digits are in the range 1 to 6 and indicate different categories of letters according to Table 2.1. If the word is not long enough to generate three digits, the code is padded with zeros at the end, and if the generated code is longer than three digits, the rest is truncated. For instance, ’W252’ is the code that corresponds to ’Washington’ and ’W000’ is the code for ’Wu’ [8].
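The encoding can be sketched as follows. Table 2.1 is not reproduced in this text, so the sketch assumes the letter groups of the commonly used American Soundex variant (B, F, P, V as 1; C, G, J, K, Q, S, X, Z as 2; D, T as 3; L as 4; M, N as 5; R as 6) and its usual rules: vowels are dropped, adjacent letters with the same digit are coded once, and H and W do not separate such letters. The function name `soundex` is hypothetical.

```python
# Assumed letter groups (American Soundex); Table 2.1 itself is not shown here.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str) -> str:
    """Return the first letter of `word` followed by three Soundex digits."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal digits
        code = SOUNDEX_CODES.get(ch, "")  # vowels and Y map to no digit
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets `previous`, so a repeat is coded again
    # Pad with zeros, then truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


washington = soundex("Washington")
wu = soundex("Wu")
```

With these assumptions the sketch reproduces both codes from the text, ’W252’ and ’W000’. It also maps ’Robert’ and ’Rupert’ to the same code, ’R163’, which shows how differently spelled names end up under one index key.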
Soundex has some problems, chiefly that its accuracy is not good enough. Irrelevant matches (false positives) are possible, which lowers precision, and after receiving the encoded results the user has to pick the correct word out of many useless ones. This is a common problem for many key-based algorithms [24].

Table of contents:

1 Introduction 
1.1 Motivation
1.2 The Aim of This Work
1.3 Possible Solution
1.4 Overview
2 Background 
2.1 Automatic Speech Recognition
2.1.1 Acoustic Phonetic Approach
2.1.2 Pattern Recognition Approach
2.1.3 Artificial Intelligence Approach (Knowledge Based Approach)
2.2 Search by Voice Efforts at Google
2.3 Difficulties with Automatic Speech Recognition
2.3.1 Human
2.3.2 Technology
2.3.3 Language
2.4 Metrics
2.4.1 Word Error Rate
2.4.2 Quality
2.4.3 Out of Vocabulary Rate
2.4.4 Latency
2.5 Phonetic algorithm
2.5.1 Soundex
2.5.2 Daitch-Mokotoff Soundex System
2.5.3 Metaphone
2.5.4 Double Metaphone
2.5.5 Metaphone
2.5.6 Beider Morse Phonetic Matching
2.6 Google Glass
2.6.1 Google Glass Principles
2.6.2 User Interface
2.6.3 Technology Specs
2.6.4 Glass Development Kit
2.6.5 The Mirror API
2.7 Elastic-search
2.7.1 NGrams Tokenizer
2.7.2 Phonetic Filters
3 Related Works 
4 Implementation 
4.1 Motivation of selected Phonetic Algorithms
4.2 Motivation of Showing 6 Rows in Each Page
4.2.1 Google Glass Limitation
4.2.2 Users Behavior in Searching Areas
4.2.3 Conclusion
4.3 Data Set
4.4 Norconex
4.4.1 Importer Configuration Options
4.4.2 Committer Configuration Options
4.4.3 More Options
4.5 Elasticsearch
4.6 Demo Of Application
5 Evaluation 
5.1 Test data
5.2 Quantity Tests
5.3 Result
5.4 Best Algorithm For Google Glass
5.4.1 Improving Precision and F-measure
5.5 Conclusion
6 Discussion and Conclusions 
6.1 Development Limitations
6.2 Further Research
List of Tables
List of Figures

