# Principle of empirical risk minimization


## Problem of modeling language

Language modeling involves estimating probability distributions over sequences of words. Consider a vocabulary V from which words are drawn to make discrete sequences. Let s = (s_1, s_2, ..., s_{|s|}) be a sequence with words s_i ∈ V and |s| the length, or number of words, of the sequence. We use s = (xy) to denote a split of the sequence, where y ∈ V is the word following the context x ∈ V^{|s|−1}. The suffix of a sequence x of length m is denoted by x_m, the subsequence consisting of the last m words of x. We wish to design an algorithm A that learns the probability distribution P(s) over sequences from a corpus, a collection of sequences of words from V. The input to A is a text corpus C that is used to learn the parameters of the model. Apart from the training set C, we additionally need a validation set, also called held-out data, distinct from C, to tune any necessary parameters. We distinguish the training set C from the test set C_t that is used to compare the performance of different models.
The corpus C is a sampling from the space of sequences; however, the space covered by C is significantly smaller than the complete space of sequences. This is particularly true for language, where numerous possibilities are either meaningless or grammatically incorrect and have zero probability. Further, the space of meaningful and grammatically correct sequences is so varied that it cannot be sampled sufficiently. This makes the estimation of the joint distribution P(s) of words in a sequence very hard. Hence, the focus is usually the much simpler problem of estimating the conditional distribution P(y | x) of a word y following the context x. The joint distribution of a sequence is inferred by taking the product of the conditionals, which is equivalent to predicting one word of the sentence at a time in that order. This makes sense for language modeling since the Markovian assumption applies naturally to discrete sequence data. In practice, one prescribes the maximal length of dependence in the sequences, which restricts the distance along the sequence that a word can influence. This is also called the order of the language model. Using sequences of length at most m leads to an m-gram language model, also called a language model of order (m−1). A context of length (m−1) is used to predict the following word in such models. In the language models that we will consider in the next section, the order is useful to tabulate occurrences of sequences.
Consider the (m+1)-gram model with sequences |s| = m + 1. A simple model that counts the number of occurrences to assign a probability from frequency estimates is

P_ML(y | x_m) = c(x_m y) / c(x_m),    (1.1)

where c(s) is the number of times s was observed in C. Taking the product of the probabilities for predicting each of the words in a sequence s, we get the joint probability,

P_ML(s) = ∏_{(xy) ∈ {(s_1, ..., s_i)}_{i=1}^{|s|}} P_ML(y | x_m).

The product of the probabilities of predicting all the sequences in a corpus C gives the probability of generating C, or the likelihood of C,

P_ML(C) = ∏_{s ∈ C} P_ML(s).
The probability estimate from the distribution defined in (1.1) maximizes the probability P_ML(C) of generating the corpus C for a given language model order m and hence has the name maximum likelihood estimate.
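As a concrete illustration, the count-based estimator in (1.1) can be sketched in a few lines of Python. The function name and the toy corpus are ours, not from the text; this is a minimal sketch, not a full language model.

```python
from collections import Counter

def mle_model(corpus, m):
    """Count-based maximum likelihood estimator for an (m+1)-gram model.

    Returns a function computing P_ML(y | x_m) = c(x_m y) / c(x_m),
    as in equation (1.1). The corpus is a list of word lists.
    """
    ngram, context = Counter(), Counter()
    for s in corpus:
        for i in range(m, len(s)):
            x = tuple(s[i - m:i])        # context x_m of length m
            ngram[x + (s[i],)] += 1      # c(x_m y)
            context[x] += 1              # c(x_m)
    def p_ml(y, x):
        x = tuple(x)
        return ngram[x + (y,)] / context[x] if context[x] else 0.0
    return p_ml

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
p = mle_model(corpus, m=1)
print(p("sat", ["cat"]))  # 0.5: "cat" is followed by "sat" once out of twice
```

Note that any bigram never observed in the toy corpus gets probability exactly zero, which is precisely the overfitting problem discussed next.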
However, the goal is not to maximize the likelihood of generating a given training corpus C but to minimize the errors in predictions on the test set C_t. The two are not equivalent: the test set typically contains data not seen during training, and the maximum likelihood estimator in (1.1) typically assigns such data small or zero probabilities because the estimator fits the training set too well. This situation is called overfitting, and the resulting error is referred to as generalization error. Overfitting is common in problems where enough data cannot be collected to learn parameters by simply maximizing the likelihood as in equation (1.1). This is referred to as data sparsity, and there is a significant amount of work in the statistics and machine learning communities on handling this issue for various models.
The distribution from the estimator in (1.1) varies with the value of m. Models with smaller m have fewer parameters to estimate from more data, in comparison to those with larger m, which have more parameters. A more confident estimate can be made when a parameter sees more data, a property characterized by smaller variance (often referred to as estimation error), which is the case when m is small. However, a small m assumes short dependence along the sequence, discarding information available from the data and leading to high bias (also called approximation error) in the estimate.
The assumption of short dependence does not hold, since randomly drawing words independent of their context returns meaningless sequences. On the contrary, a large m uses more of the information available from the training data, resulting in smaller bias; but since there are fewer observations with longer context lengths, the variance of the estimate shoots up. Bias and variance are two competing properties of an estimator, both of which contribute directly to the generalization error. Estimates from larger m will have higher variance and lower bias, and vice versa. It is important to balance bias and variance so that the resulting generalization error is small.
Generalization error can be improved for a fixed amount of data by an appropriately chosen modification of the estimator, referred to as regularization or smoothing. The trick is to design models that trade some bias to significantly reduce the variance and eventually improve generalization. The trade-off between bias and variance is central to statistical models and will be dealt with in detail in Chapter 2 as approximation and estimation errors. In the following section, we study such modifications to the estimator in (1.1), called smoothing models, that improve generalization. We also study parametric models, called maximum entropy models, that optimize the likelihood to learn the distributions P(y | x_m) from the data. These are approaches that have traditionally been investigated in the language processing literature. In Section 1.3 we study some recently introduced models for the same problem.

We will see in this section that the statistical estimation process in language modeling suffers from the high dimensionality of the problem, often called the curse of dimensionality. It leads to the problem of data sparsity, where the datasets are a sparse sampling of the space of sequences. This motivated the design of models that improve performance over the simple estimator in (1.1) for the same amount of data. Traditionally, this has been achieved through smoothing techniques that modify the distribution estimated by (1.1) and through regularization in parametric maximum entropy models. We discuss data sparsity and these models in this section.
Data sparsity. In the ideal case of an infinitely large number of samples, or a sufficiently large dataset, the estimator in (1.1) suffices to make an estimate with low variance and very good generalization. However, it is not possible to acquire a sufficiently large dataset for high dimensional problems, because the space of all possible sequences is too large to be sampled sufficiently. Specifically, in the case of language, the space of all possible sequences of length m with a vocabulary of size |V| is |V|^m. Consider a modestly sized vocabulary of the order of 10^4 with a relatively short dependence at m = 3, and we already have an unmanageably large space of 10^12 possible sequences. This is the problem of data sparsity, where the feature space is too large to be finitely summarized in a statistically representative dataset. A theoretically sound way to handle data sparsity is to exploit the bias-variance trade-off discussed in the previous section: design biased estimators that significantly reduce variance and hence improve generalization. We briefly discuss smoothing methods from the literature that achieve this for m-gram models by applying different variations of the estimator in (1.1).
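The arithmetic behind this blow-up is easy to check directly; the numbers are the ones used in the text.

```python
V = 10 ** 4   # vocabulary of the order of 10^4 words
m = 3         # relatively short dependence length
print(V ** m) # |V|^m possible sequences: 1000000000000, i.e. 10^12
```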

### Smoothing models

Consider estimating the probability of a sequence x_m y that was never seen in the training corpus C. We assume there are no words outside the vocabulary V. The estimator in (1.1) simply sets its probability to zero, which is inappropriate since the training data has not seen all possible sequences. It often overestimates the probability of sequences observed in C and underestimates it for unseen sequences. This overfitting to training data degrades the predictive performance of the model. Smoothing models adjust for this overfitting by modifying the predictive distributions. They achieve this in two different ways: redistribution of probability mass, and mixing distributions of multiple orders. Methods that redistribute mass explicitly modify the counts by some chosen heuristic. Models that mix multiple orders rely on lower orders when higher orders are unobserved. We briefly mention some of these methods here; they are detailed in Appendix A.1 (see Chen and Goodman, 1996, 1998, for a complete survey comparing various smoothing models).

Redistribute probability mass. Additive smoothing adds some fake samples to the original frequencies as if they were observed in C. This is the classical Laplace smoothing for categorical data, a kind of shrinkage estimator (Lidstone, 1920; Jeffreys, 1948), and has the following form:

P_LP(y | x_m) = (c(x_m y) + d) / (c(x_m) + |V| d),    with 0 < d ≤ 1,    (1.2)

where d is chosen arbitrarily. For d > 0, all unobserved sequences are assigned some mass, but Gale and Church (1994) argued that this is a poor choice for smoothing. Alternatively, the Good-Turing estimator redistributes the probability mass of sequences that occurred exactly (k+1) times in C among the sequences with frequency k,

P_GT( y | x_m, c(x_m y) = k ) ∝ (k + 1) C_{k+1} / E[C_k],    (1.3)

where C_k is the number of sequences with frequency k and E[•] denotes expectation. It redistributes the mass of unit-frequency sequences to sequences that we might encounter at test time but never saw in C. This, however, requires an estimate of the number of unseen sequences that we might encounter at the testing phase. Good (1953, 2000) explains the details of this estimator.
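A minimal sketch of the additive estimator (1.2) in Python; the counts, vocabulary size, and d are illustrative. A useful sanity check, included below, is that for any d > 0 the smoothed probabilities still sum to one over the vocabulary.

```python
from collections import Counter

def additive_prob(counts, context_count, y, vocab_size, d=0.5):
    """Lidstone/Laplace additive smoothing, equation (1.2):
    P_LP(y | x_m) = (c(x_m y) + d) / (c(x_m) + |V| d), with 0 < d <= 1.
    `counts[y]` holds c(x_m y) for one fixed context x_m."""
    return (counts[y] + d) / (context_count + vocab_size * d)

counts = Counter({"sat": 2, "ran": 1})   # observations after some context
total = sum(counts.values())             # c(x_m) = 3
# an unseen word now gets non-zero mass instead of zero:
print(additive_prob(counts, total, "slept", vocab_size=10, d=0.5))  # 0.0625
```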
In estimating P(y | x_m), the simplest way to avoid zero counts is to lower the order m and look for a smaller context in C. This is the back-off model proposed by Katz (1987), where the order is lowered until x_m y is found in C. In the extreme case, m goes to zero and the model uses the unigram probability of predicting y when no context is observed. This has to be done such that the resulting values sum to one over all y ∈ V for any given context, which requires weighting the lower orders by η(x_m). The resulting distribution takes the following recursive form:

P_KZ(y | x_m) = P(y | x_m) if c(x_m y) > τ (use the highest order model),
P_KZ(y | x_m) = η(x_m) P_KZ(y | x_{m−1}) otherwise,    (1.4)

where τ is a threshold chosen to avoid assigning mass to sequences that occur too few times. The value of τ could be chosen by cross-validating performance on held-out data, but it is not very critical and is often set to zero (Chen and Goodman, 1996). The probabilities P(y | x_m) at the highest order can come from any estimator, with a uniform distribution for the null context (m = 0), P(y) = 1/|V|. The weighting function η(x_m) is computed by redistributing the left-over mass from the highest order distribution to the lower order distributions,

η(x_m) = γ(x_m) / Σ_{v ∈ {z : c(x_m z) ≤ τ}} P_KZ(v | x_{m−1}),    γ(x_m) = 1 − Σ_{v ∈ {z : c(x_m z) > τ}} P(v | x_m).
Interpolation smoothing. A natural extension of the back-off model is a weighted combination of the highest order and lower order distributions. This was proposed by Jelinek and Mercer (1980) in their linear interpolation model, which results in a distribution of the following recursive form:

P_JM(y | x_m) = λ(x_m) P(y | x_m) + (1 − λ(x_m)) P_JM(y | x_{m−1}).    (1.5)
We can recover the back-off equation in (1.4) for the weighting 1 − λ(x_m) = η(x_m) when the corresponding lower order distribution is P(y | x_{m−1}). The weights λ(x_m) are learned using held-out data, and they should take values that increase with the frequency of the context x_m. But since there are too many values to be learned, they are pooled into buckets and all weights in a bucket are assigned the same value (Bahl et al., 1983). Witten and Bell (1991) argue that the interpolation weight λ(x_m) for a certain context x_m should be proportional to the number of distinct words following the context, C_{1+}(x_m •) = |{z | c(x_m z) ≥ 1}|, with • mapping to any word in V (also see Bell et al., 1990).
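A sketch of the recursive Jelinek-Mercer combination; the raw estimates p and the weights lam are assumed given (in practice λ is bucketed and tuned on held-out data), and the toy table is ours.

```python
def interp_prob(y, x, p, lam):
    """Jelinek-Mercer linear interpolation, recursive form:
    P_JM(y|x_m) = lam(x_m) P(y|x_m) + (1 - lam(x_m)) P_JM(y|x_{m-1})."""
    x = tuple(x)
    if not x:
        return p(y, ())                  # unigram base case
    l = lam(x)
    return l * p(y, x) + (1 - l) * interp_prob(y, x[1:], p, lam)

table = {(("cat",), "sat"): 0.5, ((), "sat"): 0.2}   # toy raw estimates
p = lambda y, x: table.get((x, y), 0.0)
lam = lambda x: 0.7                                   # toy fixed weight
print(interp_prob("sat", ("cat",), p, lam))           # 0.7*0.5 + 0.3*0.2
```

Unlike back-off, the lower order distribution contributes even when the higher order count is non-zero.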
Kneser-Ney smoothing. A straightforward alternative to linear interpolation is to explicitly subtract some counts from the highest order distribution and appropriately reweight the lower order distribution. This method, introduced by Ney et al. (1994), is called absolute discounting and has the following form:

P_AD(y | x_m) = max{c(x_m y) − D, 0} / c(x_m) + (1 − λ(x_m)) P_AD(y | x_{m−1}),

where D is the discount parameter tuned on a validation dataset. The weighting function for the lower orders takes the form

1 − λ(x_m) = (D / c(x_m)) C_{1+}(x_m •),

so that the mass of the resulting distribution adds to one. The performance of absolute discounting is comparable to that of linear interpolation with fewer parameters to tune (the single parameter D), and it also agrees with the intuition of Witten and Bell (1991) for weighting interpolated models. Kneser and Ney (1995) extended absolute discounting by using modified counts for the lower order distribution so that it is optimized for situations where the higher order counts are small. This modification follows the idea that the lower order probability should be proportional to the number of unique words preceding the sequence. The resulting Kneser-Ney smoothing formulae are,
at the largest m such that c(x_m y) > 0, absolute discounting plus a back-off term,

P_KN(y | x_m) = max{c(x_m y) − D, 0} / c(x_m) + (D C_{1+}(x_m •) / c(x_m)) P_KN(y | x_{m−1});

otherwise, the modified lower order distribution plus an interpolation term,

P_KN(y | x_{m−1}) = max{C_{1+}(• x_{m−1} y) − D, 0} / C_{1+}(• x_{m−1} •) + (D C_{1+}(x_{m−1} •) / C_{1+}(• x_{m−1} •)) P_KN(y | x_{m−2}),

where C_{1+}(• x_{m−1} y) = |{z | c(z x_{m−1} y) ≥ 1}| counts the number of unique words preceding x_{m−1} y, and C_{1+}(• x_{m−1} •) = |{(z_1, z_2) | c(z_1 x_{m−1} z_2) ≥ 1}| counts the number of distinct pairs sandwiching x_{m−1}. When x_{m−1} is null, C_{1+}(• •) takes the number of bigrams in C as its value. The discount parameter D is chosen by tuning on a validation dataset or set using heuristics (Ney et al., 1994). This formulation, when applied recursively down to the unigrams, is called interpolated Kneser-Ney. Kneser-Ney smoothing has been shown to consistently outperform other smoothing models (Chen and Goodman, 1996). Chen and Goodman (1998) also observe that varying the discount factor with the length of the sequence |x_m y| it is applied to further improves performance.
Discussion on smoothing techniques. Smoothing techniques have long delivered good performance and are still preferred over other models for time-intensive applications. Variants of Kneser-Ney are preferred over other methods owing to their superior performance. However, the suitability of a method depends on various factors such as the amount of training data available, constraints on response time, and so on. For example, Brants et al. (2007) show how they achieve performance close to Kneser-Ney for machine translation with a computationally cheaper and faster back-off model by significantly increasing the amount of data.
On the positive side, smoothing models are very well engineered and thoroughly studied models that are used in numerous applications, but they lack certain desirable properties. Chen and Goodman (1996) study various smoothing models and observe that the flexibility offered by models with more parameters is one important aspect that improves performance. Models with parameters tuned on held-out data are observed to perform better than others. While this is the case with some conventional m-gram models, the flexibility they offer is very limited. The necessity of parametric models that can learn from data goes beyond improvements in predictive performance: they open possibilities to custom-design models for specific tasks and fine-tune them using data. We further discuss this in Section 1.5. This requirement motivates the modeling of language using parametric models. In the next section we describe such parametric modeling of language using the maximum entropy principle.


#### Maximum entropy models

The task is to estimate a distribution explaining the given data using a parametric model. 'Explaining data' is used to imply that the distribution must be similar to the empirical distribution observed from the data, but not equivalent, to avoid overfitting. More precisely, the expectation of random variables under the distribution should equal their empirical observations from the data. There could be multiple such distributions with expectations matching empirical observations for any given dataset. The principle of maximum entropy states that, from among the various candidate distributions for a dataset, the one with the highest entropy must be chosen. Entropy is the measure of uncertainty in a random process, and a uniform distribution has the highest entropy. A distribution chosen by this criterion is the one closest to the uniform distribution that satisfies the above constraints for that dataset. Maximum entropy methods have been studied extensively in the statistical language processing literature (see Rosenfeld, 1998; Lau, 1994; Lau et al., 1993; Rosenfeld, 1996; Ratnaparkhi, 1998; Berger et al., 1996). Maximum entropy models bear relation to Gibbs and Boltzmann distributions in statistical physics, log-linear models and Markov random fields in probability theory, and Boltzmann machines in neural networks. They offer interesting properties as described by Della Pietra and Della Pietra (1993).
Let {φ_i(s)}_{i=1}^d be a set of binary random variables, or features, defined over sequences s. We use contextual features, where φ_i(s) indicates the occurrence of a certain suffix in the sequence s, as we describe shortly. Alternatively, one could use a different feature set and build other models. For example, Lau (1994) uses the occurrence of certain words called triggers in s to build a trigger-based language model. In a more generic setting, φ_i could take any non-negative value that quantifies a certain property, for example a quantity proportional to the length of s. If P_ME is the maximum entropy distribution, then the expectation of φ_i(s) must satisfy

Σ_{s ∈ C} P_ME(s) φ_i(s) = Σ_{s ∈ C} P_ML(s) φ_i(s),

where P_ML is the empirical distribution from equation (1.1). The maximum entropy distribution is the one that satisfies all such constraints laid down by C and has the maximum entropy among them. Essentially, these methods choose a distribution that makes no assumptions beyond those of the data, in our case C, and the choice of representation. We now describe a model based on this maximum entropy principle.
Consider a model with one parameter ω_i associated to each feature φ_i. Define an event T associated with a subset of the random variables S_T ⊂ {φ_i}_{i=1}^d. The maximum entropy model assigns to T a probability proportional to the product of the exponentials of all parameters ω_i corresponding to the relevant features S_T, or

P_ME(T) ∝ ∏_{i : φ_i ∈ S_T} e^{ω_i}.    (1.6)
The parameters of the model are estimated by maximizing the probability of observing the training data. We now look at the specific set of features and the estimation method for modeling language.
Maximum entropy language model. Consider a language model of order m that requires estimating P_ME(y | x_m). Let each sequence s of length at most (m+1) in the training corpus C be a feature, S = {s | s ∈ C ∧ |s| ≤ m + 1}.
Associate a parameter ω_i to each feature in S, resulting in a model with d = |S| parameters. The feature set gets richer upon increasing the amount of data, but so does the number of parameters that need to be estimated.
Given a sequence (x_m y), our features {φ_i(x_m y)} are occurrences of elements of S as suffixes of (x_m y). This can be described by an indicator function I(x_m y, i) that assigns one to φ_i(x_m y) if S_i and x_m y have a common non-null suffix, i.e., if x_k y = S_i for some 0 ≤ k ≤ m. The indicator function I(x_m y, i) takes the following form:

φ_i(x_m y) = I(x_m y, i) = 1 if x_k y = S_i for some 0 ≤ k ≤ m, and 0 otherwise.
Now, we need to write the probability P_ME(y | x_m) as defined in equation (1.6). This requires us to sum all parameters of features relevant to the event of y following the context x_m. The conditional probability is the exponential of this sum,

P_ME(y | x_m; ω) ∝ e^{Σ_{i=1}^d ω_i I(x_m y, i)}.

Actual probabilities are computed by taking ratios with the normalization factor, also called the partition function, that comes from marginalization of y over all of V:

Z(x_m; ω) = Σ_{v ∈ V} e^{Σ_{i=1}^d ω_i I(x_m v, i)}.    (1.7)

The corresponding probability of generating a corpus, which is also the likelihood of the corpus L(C; ω), would be,
L(C; ω) = P_ME(C; ω) = ∏_{x_m y ∈ C} P_ME(y | x_m; ω) = ∏_{x_m y ∈ C} e^{Σ_{i=1}^d ω_i I(x_m y, i)} / Z(x_m; ω).    (1.8)
The optimal set of parameters is recovered by maximizing L(C; ω), which is usually done by first taking the logarithm. The negative log-likelihood takes the following form:
−ln L(C; ω) = − Σ_{x_m y ∈ C} ln P_ME(y | x_m; ω),    and    ω* = argmin_{ω ∈ R^d} −ln L(C; ω),    (1.9)

where ω* is the set of optimal parameters in R^d. We briefly review below the methods to solve the optimization in (1.9).
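The conditional in (1.6)-(1.7) is simply a softmax over suffix-feature scores, which a short sketch can make concrete; the toy vocabulary, weights, and feature map below are ours, not from the text.

```python
import math

def maxent_cond_prob(y, x_m, vocab, weights, features):
    """Conditional maximum entropy probability with explicit normalization:
    P_ME(y | x_m; w) = exp(sum_i w_i I(x_m y, i)) / Z(x_m; w), as in
    equations (1.6)-(1.7). `features(seq)` returns the indices i with
    I(seq, i) = 1; here each observed suffix is its own feature index."""
    def score(v):
        return math.exp(sum(weights.get(i, 0.0) for i in features(x_m + (v,))))
    z = sum(score(v) for v in vocab)     # partition function Z(x_m; w)
    return score(y) / z

def suffix_features(seq):
    """Toy contextual features: every non-null suffix of the sequence."""
    return [seq[k:] for k in range(len(seq))]

vocab = ("sat", "ran")
weights = {("cat", "sat"): 1.0, ("sat",): 0.5}   # one parameter per suffix
print(maxent_cond_prob("sat", ("cat",), vocab, weights, suffix_features))
```

Note that Z(x_m; ω) requires a sum over the whole vocabulary for every context, which is the main computational burden of these models.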
Iterative scaling methods. The optimization described in (1.9) is convex with a unique minimum, and any suitable optimization method could be applied depending on the size of the problem. The earliest technique used is iterative scaling, introduced by Darroch and Ratcliff (1972). This method involves minimizing the difference in the likelihood at two different parameter estimates separated by Δω.
