Explorations in Cross-lingual Contextual Word Embedding Learning 


A brief history about the terminology “word embedding”

The roots of word embeddings can be traced back to the 1950s, when the distributional hypothesis, whose underlying idea is that “a word is characterized by the company it keeps” (Firth, 1957), was discussed in linguistics. The concept of “word embedding” and the models used to train embeddings have evolved along with progress in natural language processing. In fact, the terminology “word embedding” itself went through a long evolution to reach its current form (as shown in Figure 1.1). Below we give a brief history of this evolution.
As introduced before, word embedding starts with the distributional hypothesis in linguistics. This hypothesis suggests that words that are used and occur in the same contexts tend to have similar meanings (Harris, 1954). The semantic similarity between words thus becomes measurable by comparing the company they keep (their context words). In natural language processing, a similar idea called the Vector Space Model (VSM) was first introduced in information retrieval (Rocchio, 1971; Salton, 1962) to determine document similarities. Based on the term-document matrix, each document is represented as a vector in the VSM, where points that are close together in this space are semantically similar and points that are far apart are semantically distant (Turney and Pantel, 2010).
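The VSM idea above can be sketched in a few lines: represent each document as a term-frequency vector over the vocabulary (one column of the term-document matrix) and compare documents with cosine similarity. The toy documents below are hypothetical examples, not taken from the cited works.

```python
from collections import Counter
from math import sqrt

# Toy corpus: three tiny "documents" (hypothetical examples).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
    "d3": "stocks fell sharply in early trading".split(),
}

vocab = sorted({w for tokens in docs.values() for w in tokens})

def doc_vector(tokens):
    """Term-frequency vector over the vocabulary (one term-document column)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1, v2, v3 = (doc_vector(docs[d]) for d in ("d1", "d2", "d3"))
print(cosine(v1, v2))  # shared words ("the", "sat", "on") -> high, ~0.75
print(cosine(v1, v3))  # no shared words -> 0.0
```

Documents about the same topic share context words, so their vectors point in similar directions; unrelated documents are orthogonal.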
The notion of a distributed semantic representation, another important element of word embedding, builds on this vector representation basis. Because the term-document matrix used for information retrieval is too sparse to measure distributional similarity reliably, and grows ever larger as the amount of data increases, techniques such as singular value decomposition (SVD) and latent semantic analysis (LSA) were applied to reduce the number of dimensions in the VSM. After the success of LSA in information retrieval (Deerwester et al., 1990), Schütze (1993) introduced Word Space, distributed semantic representations for words derived from lexical co-occurrence statistics. Using distributed semantic representations for words led to many improvements in NLP tasks in the 2000s, such as word sense discovery (Rapp, 2003) and measuring the similarity of semantic relations (Turney, 2006).
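The dimensionality reduction step can be sketched with a truncated SVD, as LSA does: keep only the k largest singular values of the term-document matrix and work in the resulting dense k-dimensional latent space. The matrix below is a hypothetical toy example, not data from the cited papers.

```python
import numpy as np

# Hypothetical 6-term x 4-document count matrix: two "topics", with the first
# three terms appearing in documents 1-2 and the last three in documents 3-4.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

# Truncated SVD: keep the k largest singular values (the LSA step).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation of X

# Documents are now compared in the dense k-dimensional latent space
# instead of the original sparse term space.
doc_latent = (np.diag(s[:k]) @ Vt[:k, :]).T  # one k-dim vector per document
```

The rank-k matrix `X_k` is the closest rank-k approximation of `X` in the least-squares sense, which is why a small k can smooth over sparsity while preserving the dominant co-occurrence structure.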

Contextual word embedding learning

All word embeddings discussed before are context-independent, i.e. each distinct word has only one vector representation. This has two drawbacks:
• A word always has the same representation regardless of the context in which its individual tokens occur. In particular, for polysemous words a single vector representation cannot capture all of their different senses.
• Even for words with only one sense, their occurrences still have different aspects, including semantics, syntactic behavior, and language register/connotations, and different NLP tasks may need different aspects of a word.
To address these problems, the idea of contextual word embeddings emerged: each token receives a dynamic representation based on the context in which it is used, rather than a static context-independent embedding. Below I introduce two representative methods, which also led to breakthroughs in many NLP tasks.

BERT: Bidirectional Encoder Representations from Transformers (Devlin et al., 2019)

After the breakthrough led by ELMo on six NLP tasks, another pre-trained language representation model, BERT, came out about one year later and advanced the state of the art on eleven NLP tasks. BERT involves two steps: pre-training and fine-tuning. In the pre-training stage, the BERT model is trained on unlabeled data with two objectives: masked language modeling (MLM), i.e. predicting words that are randomly selected and masked based on their context, and next sentence prediction (NSP). For fine-tuning, BERT starts from the pre-trained parameters and fine-tunes all of them on the downstream task. The architectures of the two steps are nearly the same, as shown in Figure 1.8; only the output layer of the fine-tuning architecture changes depending on the downstream task.
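The MLM input corruption described by Devlin et al. (2019) can be sketched as follows. The token strings and toy sentence are illustrative (real BERT operates on WordPiece ids), but the selection scheme is the one from the paper: about 15% of positions are chosen, of which 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged.

```python
import random

MASK = "[MASK]"
# Hypothetical toy vocabulary used for the "random token" replacement case.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """BERT-style MLM corruption: returns (corrupted tokens, {pos: original})."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:          # select ~15% of positions
            targets[i] = tok                  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens)
```

The model sees `corrupted` as input and is trained to predict the original tokens stored in `targets`; keeping 10% of selected tokens unchanged prevents the model from assuming that every observed token is correct.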

Hyperparameters setting for monolingual word embedding learning

The choice of a training model can indeed have a huge impact on the quality of word embeddings. However, the choice of hyperparameters can be just as crucial for the final result. Below I discuss the influence of some common hyperparameters used in both prediction-based and count-based methods, and how to interpret their effects.

Tips for hyperparameters selection using PPMI, SVD, word2vec and GloVe

Levy and Goldberg (2014b) have shown that skip-gram with negative sampling (SGNS) implicitly factorizes a word-context matrix. In Levy et al. (2015), they show that hyperparameter settings matter more for word embedding performance than the choice among the different embedding learning methods. The paper answers several commonly asked (and important) questions:
• Can we approach the upper bound on word embedding evaluation tasks on the test set by properly tuning hyperparameters on the training set? Yes.
• Which is more worthwhile, a larger hyperparameter tuning space or a larger corpus? It depends: for three of the six word similarity tasks, a larger hyperparameter space is better; for the other tasks, a larger corpus is better.
• Are prediction-based methods superior to count-based distributional methods? No.
• Is GloVe superior to SGNS? No. SGNS obtains a better score than GloVe on every task in this paper.
• Is PPMI on par with SGNS on analogy tasks? No. SGNS performs better.
• Is similarity multiplication (3CosMul) always better than addition (3CosAdd)? Yes, for all methods and on every task.
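The two analogy scoring functions compared in the last question can be sketched directly. For an analogy "a is to a* as b is to ?", 3CosAdd sums cosine similarities while 3CosMul multiplies them; following Levy and Goldberg, cosines are shifted into (0, 1] before multiplying, and a small epsilon avoids division by zero. The 2-D vectors below are hypothetical toy values.

```python
from math import sqrt

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def shifted(u, v):
    return (cos(u, v) + 1) / 2  # map cosine from [-1, 1] into [0, 1]

# Score a candidate vector x for the analogy a : a_star :: b : ?
def cos_add(x, a, a_star, b):            # 3CosAdd
    return cos(x, a_star) - cos(x, a) + cos(x, b)

def cos_mul(x, a, a_star, b, eps=1e-3):  # 3CosMul
    return shifted(x, a_star) * shifted(x, b) / (shifted(x, a) + eps)

# Toy 2-D vectors: the expected answer lies near the direction b + (a_star - a).
a, a_star, b = (1, 0), (1, 1), (0, 1)
good, bad = (1, 4), (4, 1)   # "good" points near the expected direction
```

In practice the candidate x ranges over the whole vocabulary (excluding a, a*, and b), and the argmax of the chosen scoring function is returned as the analogy answer.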


Table of contents:

List of figures
List of tables
1 Monolingual Word Embedding and State-of-the-art Approaches
1.1 A brief history about the terminology “word embedding”
1.2 Prediction-based methods
1.2.1 A neural probabilistic language model
1.2.2 word2vec
1.2.3 fastText
1.2.4 Contextual word embedding learning
1.3 Count-based methods
1.3.1 PMI + SVD: A straightforward and strong baseline method
1.3.2 Pull word2vec into count-based methods category
1.3.3 The GloVe method
1.3.4 LexVec: explicitly factorizes the PPMI matrix using SGD
1.3.5 AllVec: Alternative to SGD
1.4 Hyperparameters setting for monolingual word embedding learning
1.4.1 Number of dimensions of word vectors
1.4.2 Contexts selection for training words
1.4.3 Tips for hyperparameters selection using PPMI, SVD, word2vec and GloVe
1.4.4 Improvements based on pre-trained word embeddings
1.5 Evaluation Metrics
1.5.1 Intrinsic tasks
1.5.2 Understanding of evaluation results
1.5.3 Caution when one method “outperforms” the others
2 Cross-lingual Word Embedding and State-of-the-art Approaches
2.1 Introduction
2.2 Corpus preparation stage
2.2.1 Word-level alignments based methods
2.2.2 Document-level alignments based methods
2.3 Training Stage
2.4 Post-training Stage
2.4.1 Regression methods
2.4.2 Orthogonal methods
2.4.3 Canonical methods
2.4.4 Margin methods
2.5 What Has Been Lost in 2019?
2.5.1 Supervised
2.5.2 Unsupervised
2.6 Evaluation Metrics
2.6.1 Word similarity
2.6.2 multiQVEC and multiQVEC-CCA
2.6.3 Summary of experiment settings for cross-lingual word embedding learning models
3 Generation and Processing of Word Co-occurrence Networks Using corpus2graph
3.1 Word co-occurrence network and corpus2graph
3.1.1 Word-word co-occurrence matrix and word co-occurrence network
3.1.2 corpus2graph
3.2 Efficient NLP-oriented graph generation
3.2.1 Node level: word preprocessing
3.2.2 Node co-occurrences: sentence analysis
3.2.3 Edge attribute level: word pair analysis
3.3 Efficient graph processing
3.3.1 Matrix-type representations
3.3.2 Random walk
3.4 Experiments
3.4.1 Set-up
3.4.2 Results
3.5 Discussion
3.5.1 Difference between word co-occurrence network and target-context word relation in word embeddings training
3.5.2 Three multiprocessing…
3.6 Conclusion
4 GNEG: Graph-Based Negative Sampling for word2vec
4.1 Negative Sampling
4.2 Graph-based Negative Sampling
4.2.1 Word Co-occurrence Network and Stochastic Matrix
4.2.2 (Positive) Target Word Context Distribution
4.2.3 Difference Between the Unigram Distribution and the (Positive) Target Words Contexts Distribution
4.2.4 Random Walks on the Word Co-occurrence Network
4.2.5 Noise Distribution Matrix
4.3 Experiments and Results
4.3.1 Set-up and Evaluation Methods
4.3.2 Results
4.3.3 Discussion
4.4 The implementation of word2vec
4.4.1 The skip-gram model: Predict each context word from its target word?
4.4.2 Relation between learning rate and the number of iterations over the corpus
4.4.3 Gensim: Python version of word2vec
4.5 Conclusion
5 Explorations in Cross-lingual Contextual Word Embedding Learning 
5.1 Introduction
5.2 Related work
5.2.1 Supervised mapping
5.2.2 Unsupervised mapping: MUSE
5.3 Average anchor embedding for multi-sense words
5.3.1 Token embeddings
5.3.2 Average anchor embeddings for multi-sense words
5.3.3 Multi-sense words in dictionaries for supervised mapping
5.3.4 Multi-sense words for the unsupervised mapping in MUSE
5.4 Cross-lingual token embeddings mapping with multi-sense words in mind
5.4.1 Noise in dictionary for supervised mapping
5.4.2 Noisy points for unsupervised mapping in MUSE
5.5 Experiments
5.5.1 Token embeddings
5.5.2 Supervised mapping
5.5.3 Unsupervised mapping
5.5.4 Set-up for embedding visualization
5.6 Results
5.6.1 Visualization of the token embeddings of “bank”
5.6.2 Lexicon induction task
5.7 Discussion and future work
5.7.1 Clustering
5.7.2 Evaluations
5.8 Conclusion

