
## Predictive performance and computational efficiency

Maximum entropy models are simple and intuitive with the flexibility to capture information of various types. For example, the model described above captures the information encoded in m-gram models, and Lau et al. (1993) show how trigger-based models and cache models can also be absorbed into this framework. These models have been used for various language applications such as disambiguation, sentence boundary identification, part-of-speech tagging, and numerous others. However, there are two major issues with the framework that need to be addressed: one relating to the predictive performance of the resulting model and the other to its computational efficiency. We discuss these aspects in detail below.

1. Overfitting. The parameters recovered by minimizing L fit the data in C very well, but fitting it too well degrades performance on any other sample C′ that deviates from C. This phenomenon of a model fitting the training data too well is called overfitting. Smoothing of the parameters, similar to that used in conventional n-gram models, is applied to avoid overfitting.

Regularization by constraint modification. Smoothing in maximum entropy models involves modifying the marginal equivalence constraint in (1.5). It is important to note that if the modification leads to an inconsistent set of constraints, the problem becomes infeasible and the solution space will be empty. Different modifications lead to different changes in the feasibility set, resulting in different solutions (Chen and Rosenfeld, 2000).

The earliest choice studied for smoothing of parameter estimates was to drop some of the constraints from the set described by the corpus C in (1.5). This widely used method is called constraint exclusion, and variants of the idea have been studied by Della Pietra and Della Pietra (1993), Lau (1994), Rosenfeld (1996) and Della Pietra et al. (1997). Typically, the constraints corresponding to sequences with very low frequencies are the ones that are excluded. The resulting model is observed to be very close to conventional n-gram models with a positive threshold on the frequency count (Lau, 1994; Rosenfeld, 1996). Replacing the empirical estimate on the right side of (1.5) with a discounted value is called Good-Turing smoothing of maximum entropy models, as described by Lau (1994) and Rosenfeld (1996). The resulting model is known to be similar to the Jelinek-Mercer interpolation smoothing for conventional n-gram models described above (Lau, 1994). Alternatively, Newman (1977) and Khudanpur (1995) relax the equality conditions in (1.5) to define an upper and a lower bound on the marginals under the resulting distribution PME, which smooths the estimates. The resulting inequalities are called fat constraints. An alternative is to optimize the unconstrained objective in (1.9) with an additional penalty term that accounts for parameter smoothing.

Regularization by penalization. Smoothing by adding a regularization term (also called a penalty) to the maximum likelihood formulation emerged as a simpler and more efficient alternative to modifying the constraint set. Various regularizers have been tried in the context of maximum entropy models, and a survey by Chen and Rosenfeld (2000) reports that they perform better than models smoothed by constraint modification.

Chen and Rosenfeld (1999) propose using a Gaussian prior on the parameters, which is equivalent to imposing an ℓ2-norm penalty on the parameters (Della Pietra and Della Pietra, 1993; Chen and Rosenfeld, 2000). Penalties similar to the ℓ1-norm have also been attempted by Goodman (2004). A combination of the two penalties, ℓ1 and ℓ2, similar to the elastic net, has been tried by Kazama and Tsujii (2003) and Chen (2009). The ℓ2 penalty and its combination with the ℓ1-norm have shown improvements (Chen, 2009; Alumae and Kurimo, 2010), while the ℓ1 penalty alone was less satisfactory (Goodman, 2004), indicating that uniform sparsity is not an appropriate prior for modeling language. These modifications lead to appropriate changes in the update equations of Table 1.1. We discuss the issue of overfitting in detail in Chapter 2, and in Chapter 3 we propose using structure-incorporating penalties to steer the optimization towards a more appropriate model.
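The penalties surveyed above are easy to state concretely. The following sketch is illustrative only; the function name, the regularization weight `lam` and the mixing weight `alpha` are assumptions, not the thesis' notation:

```python
import numpy as np

def penalty(w, kind="l2", lam=0.1, alpha=0.5):
    """Illustrative regularizers (hypothetical helper, not the thesis' code)."""
    if kind == "l1":                      # promotes uniform sparsity
        return lam * np.abs(w).sum()
    if kind == "l2":                      # equivalent to a Gaussian prior
        return lam * np.dot(w, w)
    if kind == "elastic":                 # convex mix of l1 and l2 terms
        return lam * (alpha * np.abs(w).sum() + (1 - alpha) * np.dot(w, w))
    raise ValueError(kind)

w = np.array([0.5, -0.2, 0.0, 1.0])
print(penalty(w, "l1"))       # 0.1 * 1.7  = 0.17
print(penalty(w, "l2"))       # 0.1 * 1.29 = 0.129
print(penalty(w, "elastic"))  # 0.1 * (0.5*1.7 + 0.5*1.29) = 0.1495
```

Each choice corresponds to a different prior on the parameters, which is why they affect the resulting model differently.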

2. Computation of the normalization factor. The update at every step of the optimization in Table 1.1 requires estimating the expectation of some quantity under the distribution PME(·), which depends on the current estimate ω. This means we need to recompute the expensive normalization factor Z at every iteration of the optimization. Wu and Khudanpur (2000) present a trick that reduces the time required for computing Z by storing calculations that are common to multiple terms of Z. The time complexity of computing Z is, however, still linear in the vocabulary size |V|. Goodman (2001) reduces this dependence on |V| by clustering words. This turns prediction into a two-stage problem of choosing the best cluster followed by picking the best word from the chosen cluster. The idea can be generalized to multiple levels of hierarchy, which in the ideal case brings the complexity down to logarithmic in |V|. The hierarchy, however, should be such that the resulting classifier has predictive performance close to the original one. The hierarchy is usually built ad hoc (Mnih and Hinton, 2008) or using standard Brown clustering (Brown et al., 1992; Goodman, 2001). In Chapter 4, we propose an alternative way to build this hierarchy that is very generic and applicable to building taxonomies for various problems.
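The two-stage idea can be sketched numerically. The cluster assignment, scores and toy vocabulary below are assumptions made purely for illustration; the point is that each prediction normalizes over one small cluster plus the set of clusters, rather than over all of V:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy setup: 6 words in 2 clusters; scores would come from the log-linear model.
clusters = {0: [0, 1, 2], 1: [3, 4, 5]}          # cluster -> word indices
cluster_scores = np.array([1.0, 0.5])            # one score per cluster
word_scores = np.array([0.2, 0.1, 0.0, 0.4, 0.3, 0.1])

def two_stage_prob(word, cluster_of):
    """P(word | x) = P(cluster | x) * P(word | cluster, x)."""
    c = cluster_of[word]
    p_cluster = softmax(cluster_scores)[c]
    members = clusters[c]
    # Normalize only over the words inside the chosen cluster.
    p_word = softmax(word_scores[members])[members.index(word)]
    return p_cluster * p_word

cluster_of = {w: c for c, ws in clusters.items() for w in ws}
probs = [two_stage_prob(w, cluster_of) for w in range(6)]
print(sum(probs))  # the factorization still yields a distribution over V
```

With k balanced clusters, each normalization involves about k + |V|/k terms instead of |V|, and repeating the split hierarchically gives the logarithmic cost mentioned above.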

### Penalized loss and structural risk minimization

In order to ensure good predictive performance, we need to minimize the excess risk in (2.3). Since we only have access to a finite sample of the actual distribution, this generalization is achieved by controlling the capacity of H. Define a parametric family of hypotheses H(λ) with the parameter λ controlling the capacity of the resulting class. We modify the empirical estimate of the risk in (2.4) to append the loss f with a penalty term Ω : H → [0, ∞) that depends only on h and not on the sample S_n. The resulting function, called the structural risk, is

$$R^{f+\Omega}_{S_n}[h] \;=\; \frac{1}{n}\sum_{i=1}^{n} f(h(x_i), y_i) \;+\; \lambda\,\Omega(h).$$
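As a hedged numerical sketch of this objective, with a squared loss and an ℓ2 penalty standing in for f and Ω, and `lam` for the capacity parameter λ (all choices illustrative, not the thesis' setup):

```python
import numpy as np

def structural_risk(w, X, y, lam):
    """Empirical risk (squared loss here, for illustration) plus lam * Omega(h)."""
    preds = X @ w
    empirical = np.mean((preds - y) ** 2)     # (1/n) sum_i f(h(x_i), y_i)
    penalty = np.dot(w, w)                    # Omega(h): depends only on h
    return empirical + lam * penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(structural_risk(w_true, X, y, lam=0.01))
```

Increasing `lam` shrinks the hypothesis class that effectively minimizes this objective, which is exactly the capacity control described above.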

#### Log-linear language model

Let V be the set of words, or more generally symbols, which we call the vocabulary. We are given samples in the form of a text corpus Sn ∼ C with words from V and are required to assign probabilities to sequences of words of length m+1. Similar to the maximum entropy methods considered in Chapter 1, we discriminatively model the probability of predicting a word y from its context xm of length m, P(y | xm). The joint distribution of a sequence s comes from predicting each of its words from the corresponding context.

This is a |V|-way choice problem where we must choose one word from a list of |V| choices. Such situations are modeled using the "categorical" distribution over the vector of probability masses for the different words. The probability of each word itself comes from the log-linear model described above as πv(x) ∝ exp(wv⊤ψm(x)), leading to the following:

$$y \,|\, x \;\sim\; \mathrm{Categorical}(\pi(x)) \;=\; \prod_{v \in V} \pi_v(x)^{\mathbb{I}[y=v]}.$$
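A minimal sketch of this categorical log-linear model, assuming a toy feature vector ψm(x) and weight matrix W (the vocabulary and all numbers are illustrative only):

```python
import numpy as np

def categorical_probs(W, phi):
    """pi_v(x) proportional to exp(w_v^T phi(x)), normalized over the vocabulary."""
    scores = W @ phi                # one score per word v in V
    scores -= scores.max()          # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

V = ["a", "c", "o"]
W = np.array([[0.5, -0.1],          # w_a
              [0.2,  0.3],          # w_c
              [-0.4, 0.1]])         # w_o
phi = np.array([1.0, 0.5])          # toy feature vector psi_m(x) for a context x
pi = categorical_probs(W, phi)
print(dict(zip(V, pi)))             # a proper distribution over V
```

The probability of a whole sequence is then the product of such conditionals, one per position.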

**Suffix encoding in tries and trees**

Suffix trees provide an efficient way to store and manipulate discrete sequences as they can be constructed in linear time due to Ukkonen (1995) when the vocabulary is fixed (also see Giegerich and Kurtz, 1997). They are extensively used in natural language processing and bioinformatics. Recent examples include language models based on a variable-length Markovian assumption (Kennington et al., 2012) and the sequence memoizer (Wood et al., 2010).

The data structure encodes all the unique suffixes observed in a sequence up to a given maximum length. It is organized as a tree with nodes lower in the tree holding longer suffixes. We distinguish a suffix trie from a suffix tree here. A trie stores the suffixes with one suffix per node. A tree, however, collapses paths in the trie that do not branch. A suffix tree can therefore hold multiple suffixes per node while preserving the hierarchy.

Collapsing such paths into single nodes saves the memory required to encode the sequence. Each internal node in a suffix tree has at least two children, while those in a suffix trie have one or more children. The number of nodes in the tree grows at a significantly lower rate than that of the trie with increasing sequence length. The exact relation between the sequence length and the node count varies from one application to another. In language, this has been observed to be linear by Wood et al. (2010).
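The trie-versus-tree saving can be illustrated on the toy sequence X = oacac used later in this chapter. The code below is a sketch, not the thesis' implementation: it inserts reversed contexts into a nested-dict trie and counts nodes before and after collapsing non-branching chains:

```python
def build_trie(seq, max_len):
    """Suffix trie: one symbol per edge, one node per distinct context prefix."""
    root = {}
    for i in range(len(seq)):
        node = root
        # Insert the context ending at position i, read right to left,
        # truncated to at most max_len symbols.
        for ch in reversed(seq[max(0, i + 1 - max_len):i + 1]):
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.values())

def count_collapsed(node):
    """Node count after merging non-branching chains (suffix-tree style)."""
    n = 1
    for child in node.values():
        # Single-child chains collapse into one node: skip through them.
        while len(child) == 1:
            child = next(iter(child.values()))
        n += count_collapsed(child)
    return n

X = "oacac"
trie = build_trie(X, max_len=3)
print(count_nodes(trie), count_collapsed(trie))  # 10 nodes vs 8 after collapsing
```

Even on this five-symbol toy sequence the collapsed structure is smaller; on long natural-language sequences the gap between the two growth rates is what makes the tree encoding practical.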

**Suffix trie structured vectors**

The feature vector ψm(x) encodes suffixes (or contexts) of increasing length and corresponds to one path of length m starting at the root of S(X). Thus, when building a log-linear language model, it is sufficient to consider weight vectors of size |S(X)|. In other words, there is one weight parameter per suffix per symbol, and the matrix of parameters W is of size |S(X)| × |V|. As illustrated in Figure 3.3(a), each column of W contains the weight vector wv associated with symbol v ∈ V, and |S(X)| = 13. The dimension of the weight vectors is bounded by the length of the training sequence. Hence, like the sequence memoizer (Wood et al., 2010), the model can be used as an infinite-length language model, though in practice it makes sense to restrict the dependency to the sentence length (or the document length).

Making use of the suffix trie encoding additionally offers the computational benefit of efficient matrix-vector (or vector-vector) multiplication, which lies at the core of the optimization in (3.2). Indeed, each inner product wv⊤ψm(x) reduces to the following three steps:

a. finding the longest suffix of x, of maximal length mmax, in S(X);

b. extracting the corresponding entries of wv; and

c. summing them after proper scaling.
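These three steps can be sketched as follows. The trie layout, node ids and per-depth decay factor are illustrative assumptions standing in for the model's scaling function:

```python
def score(trie, weights, context, decay=0.5):
    """Inner product w_v^T psi_m(x) computed by walking the suffix trie.

    `trie` maps symbol -> (node_id, children dict); `weights` maps
    node_id -> float. The per-depth `decay` is a stand-in scaling.
    """
    total, depth, node = 0.0, 0, trie
    # Step (a): follow the longest suffix of the context present in the
    # trie, reading the context right to left.
    for ch in reversed(context):
        if ch not in node:
            break
        node_id, node = node[ch]
        # Steps (b) and (c): pick up the weight at this node, scale, add.
        total += (decay ** depth) * weights[node_id]
        depth += 1
    return total

# Toy trie encoding the contexts "a" and "ca", with node ids 0 and 1.
trie = {"a": (0, {"c": (1, {})})}
weights = {0: 2.0, 1: 4.0}
print(score(trie, weights, "bca"))  # 2.0 + 0.5 * 4.0 = 4.0
```

The walk touches at most mmax nodes, so each inner product costs O(mmax) instead of the full dimension of the weight vector.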

**Word-specific suffix trie structured vectors**

While suffix tree encoding leads to substantial storage and computational gains, it might still cause problems in practice, as the number of nodes is large for most real data sequences. We can make further storage and computational gains by building one context tree for every symbol in the vocabulary and constraining the parameters to be non-negative.

We denote the resulting suffix tries by {Sv(X)}v∈V, with one trie per word in V. The number of free parameters in W is now Σv∈V |Sv(X)|, which is substantially smaller than |S(X)| × |V|. Figure 3.2(b) shows the word-specific suffix trees for X = oacac and V = {a, c, o}, and Figure 3.3(b) the corresponding matrix of parameters W. All other parameter values are fixed to zero. Note that the number of free variables varies from one vector to another in W.

**Complexity improvements from trees**

The complexity of the model, measured by the number of parameters, varies from one task to another and depends on the patterns within the sequences of that data. We measure this for English to compare the number of nodes in trie structures against that of tree structures. We consider four distinct subsets of the Associated Press News (AP-news) text corpus with train-test sizes of 100K-20K words for evaluation. The corpus was preprocessed as described in Bengio et al. (2003) by replacing proper nouns, numbers and rare words (occurring fewer than three times) with the special symbols "⟨proper_noun⟩", "#n" and "⟨unknown⟩" respectively. Sentence boundaries are indicated by manually appending a start tag and an end tag to the beginning and end of each sentence. Punctuation marks are retained and treated like other normal words. The vocabulary size for each of the training subsets was around 8,500 words.

Figure 3.4 plots the increase in the number of parameters against increasing context lengths for tries Sy(X) and trees My(X) for a sequence of one hundred thousand words. Models using the uncollapsed trie-structured vectors Sy(X) show a linear increase in the number of parameters as we account for longer contexts to predict the following word. This dependence is only logarithmic when we collapse non-branching paths along the tree that share a value, as in My(X). Fewer parameters require fewer updates at each iteration of the optimization and eventually result in a more compact model. This shows how sequences can be encoded efficiently to build models that capture longer dependencies within sequence data.

**Models with unstructured penalties**

In this section and the following one, we fit models with unstructured penalties and evaluate their predictive performance on data. We repeat this evaluation with models of structured penalties in the next section, showing how the choice of regularization affects predictive performance.

Evaluation criterion. There are multiple measures to evaluate language models depending on their application. The information-theoretic measure of perplexity is one such quantity used in the literature. It is ten raised to the power of the average negative log-likelihood (also referred to as the log-loss) of the test corpus. Its base-2 logarithm upper bounds the number of bits required per word to compress the text using that model. It serves as a generic measure to compare language models, but evaluation methods are often custom-designed for specific tasks. For example, word-error rate (WER; Hunt, 1990) is used to evaluate speech recognizers; bilingual evaluation understudy (BLEU score; Papineni and Roukos, 2002) or its variants like the metric for evaluation of translation with explicit ordering (Banerjee and Lavie, 2005) are used for translation systems; and recall-oriented understudy for gisting evaluation (ROUGE score; Lin, 2004, and its variants) is used for summarization, and so on. These measures are tailored to quantify application-specific properties and are sometimes hard to compute (see Chen et al., 1998a, for WER). There is further work discussing how accurately these measures capture their actual objectives (see, e.g., Wang et al., 2003).

We use the standard perplexity measure, which is generic and compares how different methods fare at predicting the test corpus. The perplexity is computed as follows:

$$P(C_t, W) \;=\; 10^{\,-\frac{1}{n_V}\sum_{(x,y)\in C_t} \mathbb{I}(y \in V)\,\log_{10} P(y\,|\,x;W)}.$$
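A minimal sketch of this computation, assuming for simplicity that every test word is in-vocabulary so that n_V is just the number of test words:

```python
import math

def perplexity(word_probs):
    """10 ** (average negative base-10 log-probability of the test words)."""
    n = len(word_probs)
    nll = -sum(math.log10(p) for p in word_probs) / n
    return 10 ** nll

# A model assigning probability 0.1 to every test word has perplexity 10:
print(perplexity([0.1] * 5))  # 10.0
```

Intuitively, perplexity is the effective number of equally likely choices the model faces per word, so lower is better.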

**Table of contents:**

**Detail of the contributions**

0.1 Modeling natural language

0.2 Existing methods

0.2.1 Smoothing methods

0.2.2 Maximum entropy and maximum likelihood

0.2.3 Recent methods

0.3 Motivations

0.4 Contributions and outline of the thesis

0.4.1 Structured penalties for language modeling

0.4.2 Taxonomy learning

0.4.3 Outline of the thesis

**1 Introduction **

1.1 Problem of modeling language

1.2 Traditional approaches

1.2.1 Smoothing models

1.2.2 Maximum entropy models

1.2.3 Predictive performance and computational efficiency

1.3 Recent trends

1.3.1 Bayesian models

1.3.2 Distributed representations

1.4 Summary and motivation

1.5 Contributions and outline

**2 Learning with structured penalties**

2.1 Principle of empirical risk minimization

2.2 Penalized loss and structural risk minimization

2.2.1 Unstructured penalties

2.2.2 Structured penalties

2.3 Optimizing penalized loss

2.4 Proximal minimization algorithms

2.4.1 Proximal operators for penalties

2.5 Conclusion

**3 Log-linear language model**

3.1 Introduction to Generalized Linear Models

3.2 Log-linear language model

3.3 Suffix encoding in tries and trees

3.3.1 Suffix trie structured vectors

3.3.2 Word-specific suffix trie structured vectors

3.3.3 Constraining to positive orthant

3.3.4 Word-specific suffix tree-structured vectors

3.3.5 Complexity improvements from trees

3.4 Models with unstructured penalties

3.4.1 Proximal projection with unstructured penalties

3.4.2 Performance evaluation with unstructured penalties

3.5 Models with structured penalties

3.5.1 Proximal projection with ℓ2T-norm

3.5.2 Proximal projection with ℓ1T-norm

3.5.3 Performance evaluation with structured norms

3.5.4 Feature weighting to avoid overfitting

3.5.5 Analysis of perplexity improvements

3.6 Time complexity of proximal operators

3.6.1 Properties of ℓ1T-norm

3.6.2 Fast proximal projection with ℓ1T-norm

3.7 Conclusion

**4 Efficient taxonomy learning**

4.1 Normalization in multi-class log-linear models

4.2 Tree embedding of classes

4.2.1 Measuring the class proximity

4.2.2 Split and merge operations

4.2.3 Top-down procedure

4.2.4 Bottom-up procedure

4.3 Node-specific classifiers

4.4 Evaluation of taxonomy methods

4.5 Conclusion

**5 Concluding remarks **

**Appendices **

**A Smoothing and maximum entropy models**

A.1 Smoothing models

A.1.1 Redistribute mass to handle unseen sequences

A.1.2 Avoid overfitting using lower order distributions

A.2 Iterative scaling methods

A.2.1 Generalized Iterative Scaling

A.2.2 Improved Iterative Scaling

A.2.3 Sequential Conditional Generalized Iterative Scaling

**B Synthetic Data Generation**

**C Effect of parameters**