Explorations in Cross-lingual Contextual Word Embedding Learning 


BERT: Bidirectional Encoder Representations from Transformers (Devlin et al., 2019)

After the breakthrough led by ELMo on six NLP tasks, another pre-trained language representation model, BERT, came out about one year later and advanced the state of the art on eleven NLP tasks. BERT involves two steps: pre-training and fine-tuning. In the pre-training stage, the BERT model is trained on unlabeled data with two objectives: masked language modeling (MLM), i.e. predicting a word that is randomly selected and masked, based on its context, and next sentence prediction. For fine-tuning, BERT starts from the pre-trained parameters and fine-tunes all of them on the downstream task. The architectures of these two steps are nearly the same, as shown in Figure 1.8, where the fine-tuning architecture has a different output layer depending on the downstream task.
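As an illustration of the pre-trained MLM objective at inference time, the following is a minimal sketch using the Hugging Face transformers library (not part of the original BERT release); the checkpoint name "bert-base-uncased" is one published model, and the example sentence is purely illustrative.

```python
# Minimal sketch: query BERT's masked language model head for one masked token.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Mask one token and let the pre-trained MLM head predict it from its context.
text = "The bank approved the [MASK] application."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring vocabulary entry.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # top prediction for the masked slot
```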

PMI + SVD: A straightforward and strong baseline method

Applying a singular value decomposition (SVD) to a pointwise mutual information (PMI) word-word matrix is a simple, straightforward method for word embedding learning. Given a word-word co-occurrence matrix, PMI measures the association between a word w and a context c by calculating the log of the ratio between their joint probability (the frequency with which they occur together) and their marginal probabilities (the frequencies with which they occur independently) (Levy and Goldberg, 2014b):
\[ \mathrm{PMI}(w,c) = \log \frac{\#(w,c)\cdot|D|}{\#(w)\cdot\#(c)} \tag{1.10} \]
Word embeddings are then obtained by low-rank SVD on this PMI matrix. In particular, the PMI matrix is found to be closely approximated by a low-rank matrix (Arora et al., 2016). Note that, as explained below, the PMI matrix is often replaced with its positive version (PPMI).
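The following is a minimal sketch of this baseline on a toy corpus: build a word-word co-occurrence matrix, turn it into a (positive) PMI matrix following Eq. (1.10), and keep the top singular vectors as word vectors. The corpus, window size and embedding dimension are illustrative choices, not taken from the original text.

```python
# Minimal PMI + SVD sketch on a toy corpus.
import numpy as np
from collections import Counter

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a fixed symmetric window.
counts = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(idx[w], idx[sent[j]])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (i, j), c in counts.items():
    M[i, j] = c

total = M.sum()                     # |D|
pw = M.sum(axis=1, keepdims=True)   # #(w)
pc = M.sum(axis=0, keepdims=True)   # #(c)

# Eq. (1.10); zero counts give -inf, which PPMI clips to 0.
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(M * total / (pw * pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Low-rank factorization: keep the top-k singular vectors as word vectors.
k = 2
U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :k] * S[:k]
print(embeddings.shape)  # (vocabulary size, k)
```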

LexVec: explicitly factorizes the PPMI matrix using SGD

Levy and Goldberg (2014b) mentioned that stochastic gradient descent (SGD) can be used for matrix factorization, which is an interesting middle ground between SGNS and SVD, but left it to future work. Salle et al. (2016) explored this direction and proposed the LexVec model for word embedding learning. Matrix generation (PPMI): the PPMI matrix is computed as presented earlier.
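The following is a minimal sketch of the core idea, not LexVec itself: sample observed (word, context) cells of the PPMI matrix and move the two embeddings so that their dot product approaches the PPMI value. LexVec's negative sampling and weighting details are omitted; `ppmi` is assumed to be a dense matrix such as the one built in the previous sketch.

```python
# Minimal sketch of SGD factorization of a PPMI matrix.
import numpy as np

def sgd_factorize(ppmi, dim=50, epochs=20, lr=0.025, seed=0):
    rng = np.random.default_rng(seed)
    n = ppmi.shape[0]
    W = rng.normal(scale=0.1, size=(n, dim))   # word vectors
    C = rng.normal(scale=0.1, size=(n, dim))   # context vectors
    rows, cols = np.nonzero(ppmi)              # train only on observed cells
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = W[i] @ C[j] - ppmi[i, j]     # squared-error gradient
            W[i], C[j] = W[i] - lr * err * C[j], C[j] - lr * err * W[i]
    return W, C
```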

AllVec: Alternative to SGD

According to Xin et al. (2018), although most state-of-the-art word embedding learning methods use SGD with negative sampling to learn word representations, SGD suffers from two problems, listed below, that affect its performance:
• SGD is highly sensitive to the sampling distribution and the number of negative samples. Unfortunately, sampling methods are biased.
• SGD suffers from dramatic fluctuation and overshooting around local minima.
A direct solution is full-batch learning, which does not use any sampling and updates the parameters only once per pass over all training samples. The drawback of this solution is also obvious: a large computational cost. Xin et al. (2018) proposed AllVec, which generates word embeddings from all training samples using batch gradient learning, as sketched below.
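The following is a minimal sketch of full-batch learning over all word-context pairs, including non-co-occurring ones: every update uses the gradient over the whole matrix instead of sampled pairs. The uniform weighting used here is a simplification of AllVec's actual weighting scheme, and `M` stands for any target association matrix.

```python
# Minimal sketch of full-batch (all-pairs) factorization.
import numpy as np

def allpairs_factorize(M, dim=50, epochs=100, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, m = M.shape
    W = rng.normal(scale=0.1, size=(n, dim))
    C = rng.normal(scale=0.1, size=(m, dim))
    for _ in range(epochs):
        E = W @ C.T - M            # residual over every (w, c) cell
        gW = E @ C / (n * m)       # full-batch gradients, no sampling
        gC = E.T @ W / (n * m)
        W -= lr * gW
        C -= lr * gC
    return W, C
```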

Hyperparameters setting for monolingual word embedding learning

The choice of a training model can indeed have a huge impact on the quality of word embeddings. However, the selection of hyperparameters can be just as crucial for the final result. Below I discuss the influence of some common hyperparameters used in both prediction-based and count-based methods, and how to understand them.

Contexts selection for training words

Most word embedding training methods are based on the Distributional Hypothesis (Harris, 1954). This hypothesis was famously articulated by Firth (1957) as "You shall know a word by the company it keeps". The company of a word is called its "context" in word embedding learning. In most current work in natural language processing, this context is defined as the words preceding and following the target word within a fixed distance (i.e. the window size). But the definition of context can vary along two dimensions: the context type and the context representation. Besides the linear context definition just introduced, context can also be defined by dependency relations in the dependency parse tree (Levy and Goldberg, 2014a). Whether we look at true linguistic dependencies (Harris, 1954) or at sequentially neighboring words, we expect that words that are closer in that representation of a sentence (direct predicate-argument relations, or immediate neighbors in the word sequence) provide a stronger context than words that are farther apart (more hops in the dependency tree, or more intervening words in the word sequence). We may quantify the strength of the connection by the distance between the context word and the target word; in dependency-based contexts, this strength varies according to the dependency type.
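The following is a minimal sketch of extracting dependency-based contexts with spaCy, as opposed to linear window contexts, roughly in the style of Levy and Goldberg (2014a): each word's contexts are its syntactic children, labelled with their relations, plus its head, labelled with the inverse relation. spaCy and its "en_core_web_sm" pipeline are illustrative choices assumed to be installed; they are not tools named in the original text.

```python
# Minimal sketch of dependency-based contexts with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Australian scientist discovers star with telescope")

for token in doc:
    contexts = []
    # The syntactic head, labelled with the inverse dependency relation.
    if token.head is not token:
        contexts.append(f"{token.head.text}/{token.dep_}-1")
    # The syntactic children, labelled with their dependency relations.
    for child in token.children:
        contexts.append(f"{child.text}/{child.dep_}")
    print(token.text, contexts)
```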

Tips for hyperparameters selection using PPMI, SVD, word2vec and GloVe

Levy and Goldberg (2014b) have shown that skip-gram with negative sampling implicitly factorizes a word-context matrix. In Levy et al. (2015), they show that, when comparing different word embedding learning methods, hyperparameter settings matter more for word embedding performance than the choice of method itself. The paper tries to answer several commonly asked (and important) questions:
• Can we approach the upper bound on word embedding evaluation tasks on the test set by properly tuning hyperparameters on the training set? Yes.
• A larger hyperparameter tuning space or a larger corpus, which is more worthwhile? It depends: for 3 of the 6 word similarity tasks, a larger hyperparameter space is better; for the other tasks, a larger corpus is better.
• Are prediction-based methods superior to count-based distributional methods? No.
• Is GloVe superior to SGNS? No. SGNS obtains a better score than GloVe in every task in this paper.
• Is PPMI on-par with SGNS on analogy tasks? No. SGNS has a better performance.
• Is similarity multiplication (3CosMul) always better than addition (3CosAdd)? Yes, for all methods and on every task (both scoring functions are sketched just after this list).
• CBOW or SGNS? SGNS.
• Always use context distribution smoothing (power 0.75).
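The two analogy scoring functions mentioned above differ only in how the three cosine terms are combined. The following is a minimal sketch (not the authors' code) of both, answering "a is to a* as b is to ?", assuming `vectors` is a dictionary of L2-normalized NumPy word vectors.

```python
# Minimal sketch of 3CosAdd and 3CosMul analogy scoring.
import numpy as np

def cos(u, v):
    return float(u @ v)  # vectors are assumed unit-length

def analogy(vectors, a, a_star, b, method="3CosMul", eps=1e-3):
    excluded = {a, a_star, b}          # never return one of the query words
    best_word, best_score = None, -np.inf
    for w, v in vectors.items():
        if w in excluded:
            continue
        if method == "3CosAdd":
            score = cos(v, vectors[a_star]) - cos(v, vectors[a]) + cos(v, vectors[b])
        else:  # 3CosMul: cosines shifted to [0, 1] so the ratio stays positive
            score = ((cos(v, vectors[a_star]) + 1) / 2) * ((cos(v, vectors[b]) + 1) / 2) \
                    / ((cos(v, vectors[a]) + 1) / 2 + eps)
        if score > best_score:
            best_word, best_score = w, score
    return best_word
```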

Improvements based on pre-trained word embeddings

Some methods take pre-trained word embeddings rather than a raw corpus as input. They aim to fine-tune monolingual word embeddings with additional supervision data. For instance, Faruqui and Dyer (2014) improve monolingual word embeddings by incorporating multilingual evidence: their method projects monolingual word embeddings onto a common vector space using bilingual alignment data.
Experiments in Faruqui and Dyer (2014) show that the inclusion of multilingual context is helpful for monolingual word embeddings generated by SVD and RNN models, but not for those generated by the Skip-gram model. See Section 2.4.3 for more details about this method, as it is also used for multilingual word embedding learning.
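Faruqui and Dyer's projection is based on canonical correlation analysis. The following is a minimal sketch in that spirit using scikit-learn's CCA (an illustrative choice, not the authors' implementation): X and Y hold the vectors of word pairs from a bilingual alignment dictionary, one translation pair per row; the random data is purely a placeholder.

```python
# Minimal sketch of CCA-based projection onto a common space.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # e.g. English vectors of dictionary entries
Y = rng.normal(size=(500, 100))   # their aligned foreign-language vectors

cca = CCA(n_components=50, max_iter=500)
cca.fit(X, Y)
X_common, Y_common = cca.transform(X, Y)   # both sides in the shared space
print(X_common.shape, Y_common.shape)      # (500, 50) each
```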

Evaluation Metrics

There is no absolute standard for deciding whether a word embedding is good or not. To evaluate the quality of a given word embedding, an evaluation task is always needed: either an intrinsic task, where the word embeddings are used directly, or an extrinsic task, where pre-trained word embeddings are used as an input (the same principle as pre-trained convnets for image classification). Here, my goal is to study the properties of the embedding space through intrinsic tasks rather than their influence on common NLP tasks through extrinsic tasks.
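As an illustration of a typical intrinsic evaluation, the following sketch compares cosine similarities of word pairs against human similarity judgements with Spearman's rank correlation. The `pairs` argument would normally come from a benchmark such as WordSim-353; its format here is an assumption for illustration.

```python
# Minimal sketch of a word similarity (intrinsic) evaluation.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(vectors, pairs):
    """pairs: list of (word1, word2, human_score) triples."""
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(gold)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```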

Prediction-based or count-based methods, which is better?

Since word2vec was introduced as one of the prediction-based methods for word embedding learning, "prediction-based or count-based methods, which is better?" has become a commonly asked question. To answer it, Baroni et al. (2014) performed an extensive evaluation on a wide range of lexical semantics tasks and across many parameter settings, showing that prediction-based methods are the winner.
Because this paper was published in 2014, its extensive evaluation serves as a good summary of the competition between traditional count-based methods (at that time, of course) and word2vec. The results (see Table 1.7) show that prediction-based approaches are a good direction for word embedding training, which has been confirmed by subsequent research after 2014. Note that in Figure 1.7, the count-based methods are PMI-SVD-based models with different parameter settings and the prediction-based methods are CBOW with different parameter settings.


Which factors of word embeddings may different tasks rely on? What are the correlations between factors and tasks like?

In most word embedding learning papers, evaluations stay in the experiments and results section, showing that the approach proposed in the paper achieves state-of-the-art performance on several tasks. While the evaluation tasks can be rich and varied, there is almost no detailed analysis of why certain word embeddings get better results on certain tasks, or of what the relations between different tasks are.
That is why Rogers et al. (2018)'s paper is crucial to word embedding learning and understanding. That paper tries to answer two questions:
• Which aspects of word embeddings may different tasks rely on? (factors of word embeddings).
• What are the properties of embedding X that could predict its performance on tasks Y and Z? (correlations between factors and tasks).
To answer the first question, they propose a Linguistic Diagnostics (LD) approach, as shown in Figure 1.10. For each word in a word embedding, LD first extracts its top n neighbors. Then, by applying linguistic analysis to these neighbor-target word pairs, LD obtains statistics of the different linguistic relations over all neighbor-word pairs extracted from the embedding. These statistics serve as factors (morphological, lexicographic, psychological and distributional) that represent each word embedding's linguistic characteristics. By comparing LD factors and word embedding performance on evaluation tasks across different models, a user can not only find which model performs better on a certain task, but also get a hint of the possible reasons: the model works better on a certain task because its word embedding is more representative in several linguistic aspects.
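The following is a minimal sketch of only the first LD step, extracting the top-n nearest neighbors of each word so that the neighbor-target pairs can then be categorized; the linguistic categorization itself is not reproduced here, and the function and argument names are illustrative.

```python
# Minimal sketch of top-n neighbor extraction for Linguistic Diagnostics.
import numpy as np

def top_n_neighbors(vectors, words, n=10):
    """vectors: (V, d) matrix of L2-normalized embeddings; words: list of V strings."""
    sims = vectors @ vectors.T           # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)      # exclude the word itself
    neighbors = {}
    for i, w in enumerate(words):
        top = np.argsort(-sims[i])[:n]   # indices of the n most similar words
        neighbors[w] = [words[j] for j in top]
    return neighbors
```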
To further confirm the possible reasons (and to answer the second question), a data visualization tool (shown in Figure 1.11) is used to display the correlations of LD factors and extrinsic/intrinsic task scores with one another, based on data from 60 GloVe and word2vec embeddings.

Word-level alignments based methods

Gouws and Søgaard (2015) introduced a simple but representative corpus-preparation-stage method. Instead of using a monolingual corpus as in monolingual word embedding training, this method shuffles the non-parallel bilingual corpora and then, using a small dictionary, replaces words with their equivalents (their corresponding translations) with probability 1/(2k), where k is the number of equivalents of the word. They then apply off-the-shelf word embedding algorithms to this mixed input. Note that the resulting bilingual word embedding was used for an unsupervised cross-language part-of-speech (POS) tagging task and semi-supervised cross-language super-sense (SuS) tagging tasks; we do not know whether it is useful for a bilingual lexicon induction task.
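The following is a minimal sketch of this corpus-mixing idea, not the authors' code: sentences from the two corpora are shuffled together and a word is replaced by one of its k dictionary equivalents with probability 1/(2k), reading the text's "1/2k" as 1/(2k). The dictionary format and function names are illustrative assumptions.

```python
# Minimal sketch of bilingual corpus mixing by word replacement.
import random

def mix_corpora(corpus_l1, corpus_l2, dictionary, seed=0):
    """dictionary: maps a word to the list of its translation equivalents."""
    rng = random.Random(seed)
    sentences = corpus_l1 + corpus_l2
    rng.shuffle(sentences)                     # shuffle the two corpora together
    mixed = []
    for sent in sentences:
        new_sent = []
        for word in sent:
            equivalents = dictionary.get(word, [])
            k = len(equivalents)
            if k > 0 and rng.random() < 1 / (2 * k):
                new_sent.append(rng.choice(equivalents))   # replace by a translation
            else:
                new_sent.append(word)
        mixed.append(new_sent)
    return mixed   # feed this to any off-the-shelf word embedding trainer
```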

Document-level alignments based methods

Until 2015, most bilingual word embedding learning methods relied on parallel sentences or dictionaries. Vulić and Moens (2015) came up with the idea of using only theme-aligned comparable Wikipedia corpora. They merge documents on the same theme in different languages and randomly shuffle the resulting documents (see Figure 2.2), then apply monolingual word embedding learning methods. Since the original sentence boundaries have been destroyed by shuffling and the "context" of training words is now at the document-theme level, they set a larger maximum window size. This method is not convincing for the same reason: the context of words is the key part of word embedding learning, and it should be discriminative with respect to different words. A large context range makes more distinct words share the same context, which weakens the discriminative power of the context.
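The following is a minimal sketch of this document-level preparation: merge two theme-aligned documents, shuffle the words, and train an off-the-shelf monolingual model with a large window on the result. gensim's Word2Vec is used as the trainer (gensim is discussed later in this thesis), and the tiny documents and parameter values are placeholders.

```python
# Minimal sketch of merge-and-shuffle pseudo-document training.
import random
from gensim.models import Word2Vec

def merge_and_shuffle(doc_l1, doc_l2, seed=0):
    """doc_l1, doc_l2: lists of tokens from two theme-aligned documents."""
    merged = doc_l1 + doc_l2
    random.Random(seed).shuffle(merged)   # destroy sentence boundaries on purpose
    return merged

pseudo_docs = [merge_and_shuffle(["bank", "money", "loan"],
                                 ["banque", "argent", "prêt"])]

# Large window, since the effective context is the whole document theme.
model = Word2Vec(sentences=pseudo_docs, vector_size=100, window=16, min_count=1)
```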

Table of contents:

List of figures
List of tables
Introduction
1 Monolingual Word Embedding and State-of-the-art Approaches
1.1 A brief history about the terminology “word embedding”
1.2 Prediction-based methods
1.2.1 A neural probabilistic language model
1.2.2 word2vec
1.2.3 fastText
1.2.4 Contextual word embedding learning
1.3 Count-based methods
1.3.1 PMI + SVD: A straightforward and strong baseline method
1.3.2 Pull word2vec into count-based methods category
1.3.3 The GloVe method
1.3.4 LexVec: explicitly factorizes the PPMI matrix using SGD
1.3.5 AllVec: Alternative to SGD
1.4 Hyperparameters setting for monolingual word embedding learning
1.4.1 Number of dimensions of word vectors
1.4.2 Contexts selection for training words
1.4.3 Tips for hyperparameters selection using PPMI, SVD, word2vec and GloVe
1.4.4 Improvements based on pre-trained word embeddings
1.5 Evaluation Metrics
1.5.1 Intrinsic tasks
1.5.2 Understanding of evaluation results
1.5.3 Caution when one method “outperforms” the others
2 Cross-lingual Word Embedding and State-of-the-art Approaches
2.1 Introduction
2.2 Corpus preparation stage
2.2.1 Word-level alignments based methods
2.2.2 Document-level alignments based methods
2.3 Training Stage
2.4 Post-training Stage
2.4.1 Regression methods
2.4.2 Orthogonal methods
2.4.3 Canonical methods
2.4.4 Margin methods
2.5 What Has Been Lost in 2019?
2.5.1 Supervised
2.5.2 Unsupervised
2.6 Evaluation Metrics
2.6.1 Word similarity
2.6.2 multiQVEC and multiQVEC-CCA
2.6.3 Summary of experiment settings for cross-lingual word embedding learning models
3 Generation and Processing of Word Co-occurrence Networks Using corpus2graph
3.1 Word co-occurrence network and corpus2graph
3.1.1 Word-word co-occurrence matrix and word co-occurrence network
3.1.2 corpus2graph
3.2 Efficient NLP-oriented graph generation
3.2.1 Node level: word preprocessing
3.2.2 Node co-occurrences: sentence analysis
3.2.3 Edge attribute level: word pair analysis
3.3 Efficient graph processing
3.3.1 Matrix-type representations
3.3.2 Random walk
3.4 Experiments
3.4.1 Set-up
3.4.2 Results
3.5 Discussion
3.5.1 Difference between word co-occurrence network and target-context word relation in word embeddings training
3.5.2 Three multiprocessing…
3.6 Conclusion
4 GNEG: Graph-Based Negative Sampling for word2vec 
4.1 Negative Sampling
4.2 Graph-based Negative Sampling
4.2.1 Word Co-occurrence Network and Stochastic Matrix
4.2.2 (Positive) Target Word Context Distribution
4.2.3 Difference Between the Unigram Distribution and the (Positive) Target Words Contexts Distribution
4.2.4 Random Walks on the Word Co-occurrence Network
4.2.5 Noise Distribution Matrix
4.3 Experiments and Results
4.3.1 Set-up and Evaluation Methods
4.3.2 Results
4.3.3 Discussion
4.4 The implementation of word2vec
4.4.1 The skip-gram model: Predict each context word from its target word?
4.4.2 Relation between learning rate and the number of iterations over the corpus
4.4.3 Gensim: Python version of word2vec
4.5 Conclusion
5 Explorations in Cross-lingual Contextual Word Embedding Learning 
5.1 Introduction
5.2 Related work
5.2.1 Supervised mapping
5.2.2 Unsupervised mapping: MUSE
5.3 Average anchor embedding for multi-sense words
5.3.1 Token embeddings
5.3.2 Average anchor embeddings for multi-sense words
5.3.3 Multi-sense words in dictionaries for supervised mapping
5.3.4 Multi-sense words for the unsupervised mapping in MUSE
5.4 Cross-lingual token embeddings mapping with multi-sense words in mind
5.4.1 Noise in dictionary for supervised mapping
5.4.2 Noisy points for unsupervised mapping in MUSE
5.5 Experiments
5.5.1 Token embeddings
5.5.2 Supervised mapping
5.5.3 Unsupervised mapping
5.5.4 Set-up for embedding visualization
5.6 Results
5.6.1 Visualization of the token embeddings of “bank”
5.6.2 Lexicon induction task
5.7 Discussion and future work
5.7.1 Clustering
5.7.2 Evaluations
5.8 Conclusion
Conclusion
References 
