The computational bottleneck
The main limitation of neural network language models is their computational cost, as already pointed out by Bengio et al. (2003). For the feedforward model, Schwenk and Gauvain (2004) give the number of operations needed to obtain the output distribution:

$$((n - 1) \, d_r + 1) \, d_h + (d_h + 1) \, |V| \qquad (1.9)$$
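To get a feel for the operation count (1.9), it can be evaluated for concrete sizes. The following is a minimal sketch: the function name `feedforward_ops` and the example dimensions are illustrative assumptions, not values taken from the cited work.

```python
def feedforward_ops(n, d_r, d_h, vocab_size):
    """Split the operation count (1.9) into its two terms."""
    # Cost of going from the (n - 1) context word representations to the hidden layer.
    hidden = ((n - 1) * d_r + 1) * d_h
    # Cost of going from the hidden layer to one score per vocabulary word.
    output = (d_h + 1) * vocab_size
    return hidden, output


# Hypothetical sizes: 4-gram model, d_r = 100, d_h = 200, |V| = 100,000.
hidden, output = feedforward_ops(n=4, d_r=100, d_h=200, vocab_size=100_000)
share = output / (hidden + output)  # fraction of the cost due to the output layer
```

With these assumed sizes, the output layer accounts for more than 99% of the operations, which is why the vocabulary size dominates the cost.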
Since, for a language model, the quantities $d_r$ and $d_h$ satisfy $d_r \ll |V|$ and $d_h \ll |V|$, the computational cost depends mainly on the size of the vocabulary, with which it grows linearly. Early work on neural network language models proposed a number of tricks to accelerate training. One of them is mini-batching, i.e. parallelizing computation by forwarding (and backpropagating the gradient of) multiple examples at the same time instead of just one. In practice, it is very efficient, and it only requires transforming the vectors x, h and P of Section 1.2 into matrices. However, the most straightforward way to limit computation time is to limit the vocabulary size. Bengio et al. (2003) proposed removing words below a frequency threshold from the vocabulary and mapping them to a single token UNK, which represents all unknown words. That is what we do when we do not use the full training vocabulary. An improvement on that idea is the short-list (Schwenk and Gauvain, 2004), which limits the vocabulary of the neural language model and uses the probabilities obtained from a discrete language model for the remaining words. However, this solution limits the model's capacity to generalize to less frequent words. While the training and inference costs of a neural language model always make its practical use difficult, in the next chapter we will mainly be interested in reducing them where they matter most: for language models that use a large vocabulary.
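The frequency-threshold approach to limiting the vocabulary can be sketched as follows. This is only an illustration under stated assumptions: the helper names `build_vocab` and `map_unknown`, the threshold value and the toy corpus are all hypothetical.

```python
from collections import Counter


def build_vocab(tokens, min_count, unk="UNK"):
    """Keep words seen at least min_count times, plus the UNK token."""
    counts = Counter(tokens)
    vocab = {unk}
    vocab.update(word for word, count in counts.items() if count >= min_count)
    return vocab


def map_unknown(tokens, vocab, unk="UNK"):
    """Replace every out-of-vocabulary word by the UNK token."""
    return [word if word in vocab else unk for word in tokens]


corpus = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(corpus, min_count=2)
mapped = map_unknown(corpus, vocab)
```

Here only "the" and "cat" reach the threshold, so the output layer needs three entries (including UNK) instead of six.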
Hierarchical language models
To avoid the costly summations in the multinomial classification objective, we can modify the output layer to make this classification hierarchical. We first predict a class or cluster (or a series of them) and, given that class, predict the following word. While class-based models have been used extensively in language modeling (Brown et al., 1992; Kneser and Ney, 1993), they aimed at improving the model perplexity or reducing its size. Decomposing the vocabulary into classes in order to speed up computation was first applied by Goodman (2001a) to a maximum entropy language model. An extension of this idea was then applied to neural probabilistic language models, first by Morin and Bengio (2005). Here, the prediction of a word from a vocabulary V is replaced by a series of $O(\log |V|)$ decisions. That process can be seen as a hierarchical decomposition of the vocabulary, following a binary tree structure. Indeed, by building a binary hierarchical clustering of the vocabulary, we can represent each word as a bit vector $b(w) = (b_1(w), \dots, b_m(w))$ and compute the conditional probability of the next word as:

$$P(w \mid H) = \prod_{i=1}^{m} P(b_i(w) \mid b_{i-1}(w), \dots, b_1(w), H)$$
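The product of binary decisions can be illustrated on a toy vocabulary. This is a minimal sketch under assumptions: each tree node holds a single fixed score (in a real model it would be computed from the context H), each binary decision is modeled with a sigmoid, and the names `word_probability`, `codes` and `node_scores` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(bits, node_scores):
    """P(w|H) as a product of binary decisions along the path to w.

    bits: the code b(w) = (b_1, ..., b_m), each bit being 0 or 1.
    node_scores: maps a bit prefix (a tuple) to the score of the tree node
        reached by that prefix; sigmoid(score) is the probability that the
        next bit is 1 given the prefix and H.
    """
    prob = 1.0
    for i, bit in enumerate(bits):
        p_one = sigmoid(node_scores[tuple(bits[:i])])
        prob *= p_one if bit == 1 else 1.0 - p_one
    return prob


# Toy 4-word vocabulary arranged as a balanced binary tree of depth 2.
codes = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 1)}
node_scores = {(): 0.3, (0,): -1.2, (1,): 0.8}  # one score per internal node
probs = {w: word_probability(b, node_scores) for w, b in codes.items()}
total = sum(probs.values())
```

Each word requires only two node evaluations here, and the four probabilities sum to one by construction, without an explicit normalization over the vocabulary.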
Density estimation as a classification task: discriminative objectives
Noise Contrastive Estimation (NCE) was first described in Gutmann and Hyvärinen (2010, 2012) as a way of estimating a parametric probabilistic model from observed data when the probability function of the model is un-normalized. The first idea is to consider the partition function $Z$ as a separate parameter, instead of a value dependent on all the other parameters $\theta$. Then, a parametrized distribution $P_\theta$ is decomposed as:

$$\log P_\theta = \log p^0_{\theta^0} + c$$

with parameters $\theta = (\theta^0, c)$. Here, $c = -\log Z$ gives the un-normalized model its proper scaling, while the other parameters $\theta^0$ make the shape of the model match the shape of the data density. However, estimating $\theta^0$ and $c$ separately is not possible with maximum-likelihood estimation, since we can simply choose $c$ as large as we want to increase the likelihood. The authors therefore propose an objective function which mimics maximum-likelihood estimation by learning to discriminate between examples drawn from the data and examples generated from a noise distribution. This method has been applied to language modeling, as have other approaches based on discriminative objectives, which are described subsequently.
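The failure of maximum likelihood when the scaling parameter is free can be checked numerically. The sketch below is an illustration under assumptions: the toy scores and data are invented, `log_likelihood` is a hypothetical helper, and the normalizing value of c is taken to be the negative log of the sum of exponentiated scores.

```python
import math


def log_likelihood(data, scores, c):
    """Sum over the data of log P(x) = s(x) + c, with c a free parameter."""
    return sum(scores[x] + c for x in data)


scores = {"a": 1.0, "b": 0.2, "c": -0.5}  # un-normalized log-scores
data = ["a", "a", "b", "c"]

# Value of c for which exp(s(x) + c) sums to one over the three outcomes.
c_norm = -math.log(sum(math.exp(s) for s in scores.values()))

ll_norm = log_likelihood(data, scores, c_norm)
ll_large = log_likelihood(data, scores, c_norm + 10.0)

# Total mass assigned by the "model" once c has been inflated.
mass_large = sum(math.exp(s + c_norm + 10.0) for s in scores.values())
```

Raising c always raises the likelihood (here by 10 per data point), while the resulting model no longer sums to one, so the likelihood alone gives no reason to stop at the normalizing value.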
Noise Contrastive Estimation
With NCE, we learn a description of the data distribution $P_D$ relative to a reference noise distribution $P_n$, by learning their ratio $P_D / P_n$. This ratio is learned by discriminating between the two distributions. Concretely, we draw $k$ samples from the noise distribution for each tuple $(H, w) \in D$ and optimize our model to perform a classification task between them. We can consider our example as coming from the following mixture.
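The classification task can be sketched for a single training example. This follows the usual formulation of NCE and rests on assumptions not spelled out above: an example is taken to come from the data with prior 1/(k+1) and from the noise with prior k/(k+1), so that the posterior of the "data" class is p_model / (p_model + k * p_noise). The function names and all numerical values are hypothetical.

```python
import math


def posterior_data(p_model, p_noise, k):
    """Posterior probability that a word came from the data rather than the noise."""
    return p_model / (p_model + k * p_noise)


def nce_loss(data_pair, noise_pairs):
    """Binary classification loss for one data word and its k noise words.

    data_pair: (model probability, noise probability) of the observed word.
    noise_pairs: list of k such pairs, one per sampled noise word.
    The model probabilities may be un-normalized.
    """
    k = len(noise_pairs)
    p_model, p_noise = data_pair
    loss = -math.log(posterior_data(p_model, p_noise, k))
    for p_model, p_noise in noise_pairs:
        loss -= math.log(1.0 - posterior_data(p_model, p_noise, k))
    return loss


# A model that scores the observed word high and the noise words low...
good = nce_loss((0.5, 0.1), [(0.05, 0.3), (0.1, 0.2)])
# ...against one that does the opposite.
bad = nce_loss((0.05, 0.1), [(0.5, 0.3), (0.5, 0.2)])
```

The loss only ever touches the observed word and the k sampled words, which is what removes the dependence on the vocabulary size during training.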
Avoiding normalization by constraining the partition function
For some specific applications focused on reducing the inference cost of a Neural Probabilistic Language Model (NPLM), the techniques presented earlier are not a good solution. Hierarchical approaches are still costly, and importance sampling requires normalization at test time. Techniques based on discriminative objectives are fast during training, but since self-normalization is difficult to monitor, using a softmax may still be required at test time. To use an NPLM efficiently for decoding in a machine translation system, Devlin et al. (2014) introduced self-normalized neural networks, which are meant to be used for inference without performing a softmax explicitly. Instead, they add an explicit constraint to their objective function that pushes the partition function $Z_\theta(H)$ as close to 1 as possible during training:

$$\mathrm{NLL}_{\text{Self-normalized}}(\theta) = -\sum_{(H, w) \in D} \left[ s_\theta(w, H) - \alpha \log^2 Z_\theta(H) \right]$$

where $\alpha$ weights the penalty on $\log^2 Z_\theta(H)$.
Table of contents:
List of Figures
List of Tables
1 From Discrete to Neural Language Models
1.1 Discrete language models
1.2 Neural network language models
1.2.1 Feedforward language models
1.2.2 Recurrent neural network language models
1.3 Practical considerations
1.3.2 Choosing hyperparameters
1.3.3 The computational bottleneck
2 Avoiding direct normalization: Existing strategies
2.1 Hierarchical language models
2.2 Importance Sampling
2.2.1 Application to Language Modeling
2.2.2 Target Sampling
2.2.3 Complementary Sum-Sampling
2.3 Density estimation as a classification task: discriminative objectives
2.3.1 Noise Contrastive Estimation
2.3.3 Negative Sampling
2.4 Avoiding normalization by constraining the partition function
3 Detailed analysis of Sampling-Based Algorithms
3.1 Choosing k and Pn: impact of the parametrization of sampling
3.1.1 Effects on Importance Sampling
3.1.2 Effects on Noise-Contrastive Estimation
3.2 Impact of the partition function on the training behaviour of NCE
3.2.1 Self-normalization is crucial for NCE
3.2.2 Influence of the shape of Pn on self-normalization
3.2.3 How do these factors affect learning?
3.3 Easing the training of neural language models with NCE
3.3.1 Helping the model by learning to scale
3.3.2 Helping the model with a well-chosen initialization
3.3.3 Summary of results with sampling-based algorithms
4 Extending Sampling-Based Algorithms
4.1 Language model objective functions as Bregman divergences
4.1.1 Learning by minimizing a Bregman divergence
4.1.2 Directly learning the data distribution
4.2 Learning un-normalized models using Bregman divergences
4.2.1 Learning by matching the ratio of data and noise distributions
4.2.2 Experimenting with learning un-normalized models
4.3 From learning ratios to directly learning classification probabilities
4.3.1 Minimizing the divergence between posterior classification probabilities and link to NCE
4.3.2 Directly applying -divergences to binary classication
5 Output Subword-based representations for language modeling
5.1 Representing words
5.1.1 Decomposition into characters
5.1.2 Decomposing morphologically
5.2 Application to language modeling
5.3 Experiments on Czech with subword-based output representations
5.3.1 Influence of the vocabulary size
5.3.2 Effects of the representation choice
5.3.3 Influence of the word embeddings vocabulary size
5.4 Supplementary results and conclusions
5.4.1 Training with improved NCE on Czech
5.4.2 Comparative experiments on English
List of publications
A Proofs on Bregman divergences
B Subword-based models: supplementary results with NCE
C Subword-based models: supplementary results on embedding sizes influence
D Previous work on subword-based POS tagging