Minimum classification error training


Weighted likelihood term

One potential problem with string-level MCE is that parts of the training string that do not result in errors are effectively ignored and therefore do not reinforce the associated parameters. Focusing solely on errors will result in overspecialization. Adding a weighted likelihood term (of the correct class) to the MCE loss function may therefore reduce overtraining: it will tend to reinforce correct substrings while still penalizing errors. The misclassification measure $d_i(O; \Theta)$ in Eq. (3.13) then becomes

$$\tilde{d}_i(O; \Theta) = d_i(O; \Theta) - \kappa\, p_i(O; \theta_i),$$

where $\kappa$ is the weighting of the additional likelihood term $p_i(O; \theta_i)$ of the correct class. This modification has been chosen so as to increase the gradient resulting from correct substrings by a factor $\kappa$. However, when using a sigmoid loss function, no modification (to the misclassification measure or the loss function) can be made that increases the gradient associated with correct substrings by a uniform factor $\kappa$ while leaving the gradient associated with incorrect substrings unaffected. Such a modification can therefore not be mathematically justified when using a smoothed zero-one loss function. It can, however, be implemented as a simple heuristic in which the gradient for the correct class is simply multiplied by a weighting factor $(1 + \kappa)$.
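As a concrete illustration, the following is a minimal Python sketch of this idea, not the implementation used in this work: the function names, the use of log-likelihood discriminants, and the smoothing constants eta and gamma are assumptions made for the example.

```python
import numpy as np

def misclassification_measure(g_correct, g_competitors, kappa=0.0, eta=1.0):
    """MCE misclassification measure with an optional weighted likelihood
    term; kappa = 0 recovers the unmodified measure of Eq. (3.13).

    g_correct     -- log-likelihood of the correct string
    g_competitors -- log-likelihoods of the N-best incorrect strings
    """
    g = np.asarray(g_competitors, dtype=float)
    # Smoothed maximum over the competing strings (hard max as eta -> inf).
    anti_discriminant = np.log(np.mean(np.exp(eta * g))) / eta
    # Subtracting kappa * g_correct scales the correct-class gradient
    # by a factor (1 + kappa).
    return -(1.0 + kappa) * g_correct + anti_discriminant

def sigmoid_loss(d, gamma=1.0):
    """Smoothed zero-one loss applied to the misclassification measure."""
    return 1.0 / (1.0 + np.exp(-gamma * d))

def heuristic_correct_class_gradient(grad_correct, kappa):
    # For the sigmoid case the weighting cannot be folded into the measure,
    # so the correct-class gradient is simply scaled by (1 + kappa).
    return (1.0 + kappa) * grad_correct
```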
Figure 3.10 presents the results for the MCE algorithm using the weighted likelihood term (MCE+WL) for the small training set (Ts). Significantly, when the weighted likelihood term is used, the variant with a sigmoid loss function performs considerably worse than the variant without one. Peak performance of 55.0% is attained with a weight $\kappa = 1.5$ for the algorithm without a sigmoid loss function (versus 53.8% with $\kappa = 0$).

Word MCE 

Presenting arbitrarily long strings to the string-level MCE algorithm is not optimal. An error at a specific point in time can result in incorrect segmentation at that point (not just incorrect labelling), and such a segmentation error will influence the recognition and segmentation of subsequent acoustic units. The use of a language model also tends to propagate an error at any given point into further errors later in the utterance. Errors occurring early in the recognition of a string therefore affect recognition for the remainder of the string, and our confidence in the accuracy of segmentation and classification after an error has occurred will tend to be low. Furthermore, as the N-best string outputs from the recognizer are used as discriminative training examples, the number of incorrect strings is limited. Most of these "incorrect" strings differ in only a few places, so that only a few potential errors are addressed during discriminative training. To address this, presenting smaller word-based strings to the string-level MCE algorithm is investigated. This is particularly appropriate when training speech recognizers on speech databases with long sentences; it does, however, require a dataset that is also labelled at word level. A sentence is then presented to the string-level MCE algorithm word by word in isolation: the N-best hypotheses (strings of phones) are generated for each word individually, using the relevant part of the utterance, and used to determine the MCE gradients and updates. This is in contrast to the standard string-level MCE algorithm, where the N-best hypotheses are generated for the entire sentence.
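The difference between the two regimes can be sketched as follows. This is an illustrative outline only: `decode_nbest`, `mce_update`, and the sentence/word attributes are hypothetical stand-ins for the recognizer and update machinery, and word boundaries are assumed to come from the word-level labels.

```python
def sentence_level_mce(sentence, recognizer, n_best=10):
    """Standard string-level MCE: one N-best decode over the whole
    utterance, so an early error can corrupt segmentation and
    recognition for the rest of the string."""
    hypotheses = recognizer.decode_nbest(sentence.features, n_best)
    return mce_update(sentence.phone_string, hypotheses)

def word_level_mce(sentence, recognizer, n_best=10):
    """Word-based MCE: each word is presented in isolation, using only
    the frames covered by that word (requires word-level labelling)."""
    updates = []
    for word in sentence.words:
        frames = sentence.features[word.start:word.end]
        hypotheses = recognizer.decode_nbest(frames, n_best)
        updates.append(mce_update(word.phone_string, hypotheses))
    return updates
```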
Figure 3.12 compares results for sentence- and word-based MCE for the small training set, plotted versus the number of training epochs. A sigmoid loss function is not used. The improvement in performance is marked, resulting in a 7.4% relative reduction in error rate.


Summary and discussion of results

Table 3.7 gives a summary of the results when using the different modifications proposed for MCE in this chapter. The full training set (T) is used. MCE training alone produces a 17.7% relative reduction in error rate over baseline maximum likelihood (ML) training; employing the modifications increases this to a relative reduction of up to 23.3% over ML training. Table 3.8 gives a summary of the results when the different modifications and the small training set (Ts) are used. Here, standard string-level MCE yields only a 2.5% relative reduction in error rate over baseline ML training, whereas the proposed modifications yield a relative reduction of up to 12.2%. The modifications thus have more of an effect when less training data is available and overtraining is more prevalent.
Tables 3.7 and 3.8 also provide results obtained when combining the word-based string-level MCE algorithm with the other modifications (penalty and weighted likelihood). Unfortunately, the weighted likelihood term fails to improve upon the performance of the word-based string-level MCE algorithm for either of the two datasets (MCE+WORD+WL versus MCE+WORD in the tables). This indicates that the effects of the two modifications are similar, as can be seen in the results presented for the individual procedures earlier: both, for example, reduce the degradation in performance after peak testing-set performance is reached. Limited improvements in performance were obtained when the variance penalty term was combined with the word-based string-level MCE algorithm. Overall, significant improvements in performance on the testing sets are obtained using the proposed modifications to MCE. The modifications are relatively simple to implement, limit overspecialization to some degree, and add negligible computational expense. Although performance varied with the use or non-use of a sigmoid loss function, there is little evidence to suggest that either choice is better, the resulting variation in performance generally being small compared to the improvements in error rate due to the modifications.

1 Introduction
1.1 Adaptation
1.2 Training
1.3 Problem statement
1.4 Organization of this thesis
1.5 Contributions of this thesis
2 Background
2.1 Hidden Markov models
2.2 Overtraining
2.3 Experimental procedure
2.4 Speech datasets
3 Minimum classification error training
3.1 Introduction
3.2 Minimum classification error training
3.3 Embedded MCE
3.4 Discussion and experiments
3.5 Summary
4 Bayesian adaptation
4.1 Introduction
4.2 Monte Carlo methods
4.3 Implementation of Bayesian HMM learning
4.4 Experiments
4.5 Summary
6 Conclusion
Bibliography
A. Probability distributions

