Continuous Space Neural Network Language Models 


Smoothing Techniques

Conventional smoothing techniques, such as Kneser-Ney and Witten-Bell, use lower order distributions to provide a more reliable estimate, especially for the probability of rare n-grams. (Chen and Goodman, 1998) provides an empirical overview and (Teh, 2006) gives a Bayesian interpretation for these methods.
These techniques smooth the MLE distributions in two steps: (i) discounting a probability mass from observed n-grams; (ii) redistributing this mass to unseen events. In general, the distributions are adjusted so that small probabilities are increased and high probabilities are decreased, which prevents zero probabilities and improves the prediction accuracy of the model. Smoothing techniques are divided into two types, interpolated and back-off, which differ crucially in how the probabilities of observed n-grams are determined: the former combines them with lower-order distributions, while the latter does not. Following (Kneser and Ney, 1995), back-off smoothing techniques can be described using the following equation:
$$P(w_n \mid w_1^{n-1}) = \begin{cases} \alpha(w_n \mid w_1^{n-1}) & \text{if } c(w_1^{n}) > 0,\\ \gamma(w_1^{n-1})\, P(w_n \mid w_2^{n-1}) & \text{otherwise,} \end{cases}$$
where $c(w_1^{n})$ is the training count of the n-gram, $\alpha$ is the discounted higher-order estimate and $\gamma$ is the back-off weight ensuring that the distribution sums to one.
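
To make the two steps concrete, below is a minimal Python sketch of an absolute-discounting back-off bigram model. The function names, the toy corpus and the discount value are invented for illustration only, and the lower-order distribution is not renormalized over unseen successors as a full implementation would require.

```python
# Illustrative sketch only: a toy absolute-discounting back-off bigram model.
# Names and the toy corpus are invented; this is not tied to any toolkit.
from collections import Counter

def train_counts(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def backoff_prob(w_prev, w, unigrams, bigrams, discount=0.75):
    """P(w | w_prev): discounted bigram estimate if the bigram was seen,
    otherwise redistribute the freed mass to a (simplified) unigram back-off."""
    total = sum(unigrams.values())
    c_bigram = bigrams[(w_prev, w)]
    c_prev = unigrams[w_prev]
    if c_bigram > 0:                                   # alpha(w | w_prev)
        return (c_bigram - discount) / c_prev
    # gamma(w_prev): mass freed by discounting the seen successors of w_prev
    seen_successors = sum(1 for (a, _) in bigrams if a == w_prev)
    gamma = discount * seen_successors / c_prev if c_prev > 0 else 1.0
    # Simplification: the unigram distribution is not renormalized over
    # unseen successors only, as a proper back-off model would do.
    return gamma * unigrams[w] / total

tokens = "the cat sat on the mat".split()
uni, bi = train_counts(tokens)
print(backoff_prob("the", "cat", uni, bi))   # seen bigram: discounted estimate
print(backoff_prob("cat", "the", uni, bi))   # unseen bigram: backs off to unigram
```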

Class-based Language Models

As an attempt to make use of the similarity between words, class-based language models were introduced in (Brown et al., 1992) and then investigated in depth in (Niesler, 1997). First, words are clustered into classes. Generalization is then achieved based on the assumption that predictions for rare or unseen word n-grams can be made more accurate by exploiting the richer statistics of their associated (less sparse) class n-grams. In general, the mapping from words to classes can be many-to-one or many-to-many, corresponding respectively to hard and soft clusterings. Soft clustering requires a marginalization over all possible classes when computing the n-gram probabilities, which is intractable in most cases. Therefore, in practice, each word is usually assumed to belong to exactly one class.
There are two major issues in the class-based approach. The first one concerns how word classes are used when estimating word probabilities. Let $k_i$ be the class of word $w_i$; several factorizations are then possible, for example, as in the original approach:
$$P(w_n \mid w_1^{n-1}) = P(w_n \mid k_n)\, P(k_n \mid k_1^{n-1})$$
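
As an illustration of this factorization under hard clustering, here is a minimal sketch with hand-set toy tables; the classes, probabilities and function names are invented and do not come from any model discussed in this thesis.

```python
# Minimal sketch of the hard-clustering factorization
#   P(w_n | history) ~ P(w_n | k_n) * P(k_n | k_{n-1})
# using a class bigram model with invented toy tables.
word_to_class = {"cat": "ANIMAL", "dog": "ANIMAL", "runs": "VERB", "sleeps": "VERB"}

# P(word | class): word emission probability within its class
p_word_given_class = {
    ("cat", "ANIMAL"): 0.6, ("dog", "ANIMAL"): 0.4,
    ("runs", "VERB"): 0.5, ("sleeps", "VERB"): 0.5,
}

# P(class | previous class): class-level bigram model
p_class_given_class = {("ANIMAL", "VERB"): 0.8, ("VERB", "ANIMAL"): 0.3}

def class_bigram_prob(prev_word, word):
    k_prev, k = word_to_class[prev_word], word_to_class[word]
    return p_word_given_class[(word, k)] * p_class_given_class.get((k_prev, k), 0.0)

# "dog runs" gets a sensible probability even if that word bigram was never
# observed, because the class bigram (ANIMAL, VERB) is well estimated.
print(class_bigram_prob("dog", "runs"))   # 0.5 * 0.8 = 0.4
```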

Structured Language Models

Structured Language Models (SLMs) (Chelba and Jelinek, 2000; Roark, 2001; Filimonov and Harper, 2009) are one of the first successful attempts to introduce syntactic information into statistical language models. The main idea is to use the syntactic structure (a binary parse tree) when predicting the next word so as to filter out irrelevant history words and to focus on the important ones.
In practice, as illustrated in Figure 1.1 for the trigram case, to predict the last word “after”, instead of using the two previous words “of cents” as an n-gram based LM would, an SLM uses the two last exposed headwords (the heads of the preceding phrases) given by the parse tree, “contract” and “ended”, which are intuitively much stronger predictors. Compared to standard n-gram LMs, an advantage of SLMs is that they can exploit long-distance information.
The algorithm proceeds from left to right, incrementally building the syntactic structure of the sentence. Since a sentence does not have a unique parse tree T, SLMs estimate the joint probability $P(w_0^L, T)$ and then derive the sentence probability by marginalizing out the variable T (a minimal sketch of this marginalization is given after the list below). SLMs consist of three modules:
WORD-PREDICTOR predicts the next word given context words and their associated POS tags.
TAGGER predicts the POS tag of the next word given this word, context words and their associated POS tags.
CONSTRUCTOR grows the existing parse tree of context words with the next word and its POS tag.
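
The sketch below illustrates only the marginalization over candidate parses; the joint scores are invented placeholders, whereas a real SLM would compute them incrementally with the three modules listed above.

```python
# Schematic, runnable sketch of the SLM scoring step: each candidate parse T
# contributes a joint probability P(w_0^L, T), and the sentence probability is
# obtained by marginalizing T out. The toy joint scores below are made up for
# illustration; a real SLM computes them with the WORD-PREDICTOR, TAGGER and
# CONSTRUCTOR modules described above.
import math

def sentence_log_prob(joint_log_probs):
    """log P(w_0^L) = log sum_T exp(log P(w_0^L, T)), computed stably."""
    m = max(joint_log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in joint_log_probs))

# Two candidate parse trees for the same sentence, with invented joint scores.
candidate_parses = {
    "tree_A (exposed headwords: contract, ended)": math.log(3e-7),
    "tree_B (alternative attachment)":             math.log(1e-8),
}
logp_sentence = sentence_log_prob(list(candidate_parses.values()))
print(math.exp(logp_sentence))   # ~3.1e-07: dominated by the best parse
```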

Enhanced training algorithm

A new training scheme for the class part of SOUL NNLMs, the part used to deal with OOS words, was proposed in (Le et al., 2011a). It makes better use of the available data during training, aiming to provide more robust parameter estimates for large-vocabulary tasks. Applying this scheme yields additional improvements in perplexity and in system performance.
The main motivation for this new scheme is a limitation of the resampling techniques described in Section 1.4.3.2. Resampling of the training data is conventionally used because it is computationally infeasible to train an NNLM on the same amount of data as a conventional n-gram language model. Usually, at each epoch, training examples amounting to up to several million words are randomly selected. When dealing with large vocabularies, the number of parameters in the output part of a SOUL NNLM is much larger than in a shortlist-based NNLM. As a result, using the same number of resampled examples as for shortlist-based NNLMs may be insufficient to obtain robust parameter estimates for SOUL NNLMs.
To make this clearer, let us consider again the output part of SOUL NNLMs. It comprises two parts: the first contains the main softmax layer, which directly models the probabilities of the most frequent (in-shortlist) words and of the top classes for OOS words; the second is composed of the remaining softmax layers used to deal with OOS words, as displayed in Figure 2.1. The parameters of the first part are updated for all training examples, since it covers the most frequent (in-shortlist) words and the top (most general) classes for the less frequent (OOS) words. N-grams ending with an in-shortlist word update only the parameters of the main softmax layer, leaving the other layers intact. The parameters of the other layers are updated with n-grams ending in an OOS word, and only the layers leading to the leaf with this particular word are activated and have their parameters updated. A shortlist usually covers a large portion of the training examples, so the parameters of the second part are updated less frequently. Moreover, when such an update occurs, it only affects the small subset of parameters corresponding to a particular path in the clustering tree, even though the number of parameters in this second part is much larger. As a result, the two parts of the SOUL output layer are not equally well trained.
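
The following toy sketch (not the SOUL implementation; the shortlist, tree paths and layer names are invented) illustrates why the two parts receive unequal numbers of updates: every training example updates the main softmax layer, while only the examples ending in an OOS word update the class sub-layers, and then only those on the path to that word.

```python
# Illustrative sketch of the uneven update pattern in a hierarchical output
# layer: all names below are hypothetical and chosen only for this example.
shortlist = {"the", "of", "and", "to"}                     # toy shortlist
oos_path = {"zygote": ["class_biology", "subclass_rare"]}  # toy clustering-tree paths

def layers_to_update(target_word):
    """Return the output layers whose parameters this training example updates."""
    layers = ["main_softmax"]                    # always updated (shortlist words + top classes)
    if target_word not in shortlist:
        layers += oos_path.get(target_word, [])  # only the path leading to this OOS word
    return layers

print(layers_to_update("the"))      # ['main_softmax']
print(layers_to_update("zygote"))   # ['main_softmax', 'class_biology', 'subclass_rare']
```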

Experimental evaluation

The advantages of the newly proposed SOUL NNLMs have been empirically demonstrated in various contexts, on large-scale ASR and SMT tasks in different languages (Mandarin, Arabic, French, English, . . . ). In this section, we summarize recent experimental results reported in (Le et al., 2011b; Le et al., 2011a; Allauzen et al., 2011).

Table of contents:

I Language Modeling and State-of-the-art Approaches 
1 Language Modeling 
1.1 Introduction
1.2 Evaluation Metrics
1.2.1 Perplexity
1.2.2 Word Error Rate
1.2.3 Bilingual Evaluation Understudy
1.3 State-of-the-art Language Models
1.3.1 Smoothing Techniques
1.3.1.1 Absolute discounting
1.3.1.2 Interpolated Kneser-Ney smoothing
1.3.1.3 Stupid back-off
1.3.2 Class-based Language Models
1.3.3 Structured Language Models
1.3.4 Similarity based Language Models
1.3.5 Topic and semantic based Language Models
1.3.6 Random Forest Language Models
1.3.7 Exponential Language Models
1.3.8 Model M
1.4 Continuous Space Language Models
1.4.1 Current Approaches
1.4.1.1 Standard Feed-forward Models
1.4.1.2 Log-bilinear Models
1.4.1.3 Hierarchical Log-bilinear Models
1.4.1.4 Recurrent Models
1.4.1.5 Ranking Language Models
1.4.2 Training algorithm
1.4.2.1 An overview of training
1.4.2.2 The update step
1.4.3 Model complexity
1.4.3.1 Number of parameters
1.4.3.2 Computational issues
1.4.4 NNLMs in action
1.5 Summary
II Continuous Space Neural Network Language Models 
2 Structured Output Layer 
2.1 SOUL Structure
2.1.1 A general hierarchical structure
2.1.2 A SOUL structure
2.2 Training algorithm
2.3 Enhanced training algorithm
2.4 Experimental evaluation
2.4.1 Automatic Speech Recognition
2.4.1.1 ASR Setup
2.4.1.2 Results
2.4.2 Machine Translation
2.4.2.1 MT Setup
2.4.2.2 Results
2.5 Summary
3 Setting up a SOUL network 
3.1 Word clustering algorithms
3.2 Tree clustering configuration
3.3 Towards deeper structure
3.4 Summary
4 Inside the Word Space 
4.1 Two spaces study
4.1.1 Convergence study
4.1.2 Word representation analysis
4.1.3 Learning techniques
4.1.3.1 Re-initialization
4.1.3.2 Iterative re-initialization
4.1.3.3 One vector initialization
4.2 Word representation analysis for SOUL NNLMs
4.3 Word relatedness task
4.3.1 State-of-the-art approaches
4.3.2 Experimental evaluation
4.4 Summary
5 Measuring the Influence of Long Range Dependencies 
5.1 The usefulness of remote words
5.1.1 Max NNLMs
5.1.2 Experimental evaluation
5.2 N-gram and recurrent NNLMs in comparison
5.2.1 Pseudo RNNLMs
5.2.2 Efficiency issues
5.2.3 MT Experimental evaluation
5.3 Summary
III Continuous Space Neural Network Translation Models 
6 Continuous Space Neural Network Translation Models 
6.1 Phrase-based statistical machine translation
6.2 Variations on the n-gram approach
6.2.1 The standard n-gram translation model
6.2.2 A factored n-gram translation model
6.2.3 A word factored translation model
6.2.4 Translation modeling with SOUL
6.3 Experimental evaluation
6.3.1 Tasks and corpora
6.3.2 N-gram based translation system
6.3.3 Small task evaluation
6.3.4 Large task evaluations
6.4 Summary
7 Conclusion 
A Abbreviation 
B Word Space Examples for SOUL NNLMs 
C Derivatives of the SOUL objective function 
D Implementation Issues 
E Publications by the Author 
Bibliography
