Probing for Bridging Inference in Transformer Language Models


Composing word representations

In most NLP tasks, we are interested in representations of word sequences, that is, of larger linguistic units such as phrases, sentences, paragraphs, or documents, rather than of individual words. In these cases word representations are of little direct use. For example, consider sentence classification, where we want to assign a class to a sentence: here it is beneficial to have a sentence representation. Similarly, assigning topics such as Sports, Finance, or Medicine to documents requires document representations. It is usually assumed that linguistic structures are compositional, i.e. simpler elements are combined to form more complex ones: morphemes are combined into words, words into phrases, phrases into sentences, and so on. It is therefore reasonable to assume that the meaning of larger linguistic chunks such as phrases, sentences, paragraphs, and documents is composed of the meanings of their constituent words (Frege's principle). This compositional principle is used to obtain the representations of these larger chunks by composing the representations of their constituent words.7
Suppose u is any larger linguistic unit (phrase, sentence, or document) containing a sequence of words w_1, w_2, ..., w_l, and let w_1, w_2, ..., w_l denote their corresponding vector representations. Then the representation u of the linguistic unit u is obtained as:

u = f(w_1, w_2, ..., w_l)    (2.50)

An important consideration when acquiring the function f is that the syntax of the unit u should be taken into account, because the meaning of a word sequence is derived not only from the meanings of its constituent words but also from the syntax by which they are combined (Partee, 1995). For example, if the syntax of the sentence is not considered, the meaningful sentence "I ate pizza" will get a representation similar to that of "ate I pizza".
7 The representations of larger linguistic units can also be obtained without explicitly composing the representations of their constituent units; we omit those approaches, as they are not relevant to the present discussion. Hence the composition function of Eq. 2.50.
In addition to syntactic information, the meaning of a word sequence also depends on additional knowledge that lies outside the linguistic structure. This additional information includes both knowledge about the language itself and knowledge about the real world. For example, the sentence "Let's dig deeper." can mean either digging further into the soil or making extra efforts.8 So the composition function f needs to be extended again to incorporate this additional knowledge K. The modified composition function, which includes syntactic information S and knowledge K, is given as:

u = f(w_1, w_2, ..., w_l, S, K)    (2.51)

This composition function f can either be designed (fixed composition functions) or learned (learned composition functions). We look at them separately in the following paragraphs.

Fixed composition functions

These functions generally ignore the information K (Eq. 2.51) when obtaining the representation. It is also assumed that the vector representations of word sequences lie in the same vector space as those of the constituent words. Because of these assumptions, simple addition, averaging, or multiplicative functions can be used to obtain the composite representation (Foltz et al., 1998; Landauer and Dumais, 1997; Mitchell and Lapata, 2010; Zanzotto et al., 2010).9
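For concreteness, the following is a minimal sketch (not drawn from the cited works) of three such fixed composition functions over toy word vectors; it also illustrates the word-order limitation noted earlier, since "I ate pizza" and "ate I pizza" receive identical composite vectors:

```python
import numpy as np

# Toy word vectors; in practice these would come from word2vec, GloVe, FastText, etc.
vectors = {
    "I":     np.array([0.2, 0.7, 0.1]),
    "ate":   np.array([0.9, 0.1, 0.4]),
    "pizza": np.array([0.3, 0.8, 0.6]),
}

def compose_sum(words):
    """Additive composition: u = w_1 + w_2 + ... + w_l."""
    return np.sum([vectors[w] for w in words], axis=0)

def compose_average(words):
    """Average composition: u = (1/l) * sum_i w_i."""
    return np.mean([vectors[w] for w in words], axis=0)

def compose_multiplicative(words):
    """Element-wise multiplicative composition: u = w_1 * w_2 * ... * w_l."""
    return np.prod([vectors[w] for w in words], axis=0)

# Word order is ignored: both sequences get the same composite representation.
print(np.allclose(compose_average(["I", "ate", "pizza"]),
                  compose_average(["ate", "I", "pizza"])))  # True
```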

Learned composition functions

The previous approach of combining constituent word representations places many constraints on the design of the function; for instance, it assumes that the vector of a word sequence lies in the same space as the word vectors, which may not hold in reality. Moreover, the designed functions are often not effective because of their simplistic way of combining words. For these reasons, instead of manually designing these functions, the functions are parameterized and the parameters governing them are learned. The general definition of these functions differs slightly from Eq. 2.51 and is given as:

u = f(w_1, w_2, ..., w_l; Θ)    (2.56)
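As an illustration of Eq. 2.56 (a sketch under our own assumptions, not a model from the literature cited below), f can be parameterized with a small recurrent encoder in PyTorch; unlike the fixed functions above, this composition is order-sensitive and its parameters Θ are trainable:

```python
import torch
import torch.nn as nn

class LearnedComposition(nn.Module):
    """Minimal parameterized composition u = f(w_1, ..., w_l; Θ).

    Here f is a single-layer GRU followed by a linear projection; the GRU and
    projection weights together play the role of the parameters Θ in Eq. 2.56.
    """
    def __init__(self, word_dim: int, out_dim: int):
        super().__init__()
        self.encoder = nn.GRU(word_dim, out_dim, batch_first=True)
        self.project = nn.Linear(out_dim, out_dim)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, sequence_length, word_dim)
        _, last_hidden = self.encoder(word_vectors)   # (1, batch, out_dim)
        return self.project(last_hidden.squeeze(0))   # (batch, out_dim) = u

# Example: composing a batch of two 5-word sequences of 300-d word vectors.
f = LearnedComposition(word_dim=300, out_dim=128)
u = f(torch.randn(2, 5, 300))   # u has shape (2, 128)
```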
Here, the parameters Θ are learned with machine learning models. It is important to learn Θ in such a way that they capture the syntactic information present in the unit u. These parameters generally also capture a small amount of additional knowledge thanks to the context, but these methods still largely ignore external knowledge when acquiring the composite representation. Commonly, the parameters Θ are learned in either a task-agnostic or a task-specific fashion. In task-agnostic methods, the parameters are usually trained by unsupervised or semi-supervised learning, and the resulting representations can serve as features for many other NLP tasks such as text classification and semantic textual similarity. This includes recursive auto-encoders (Socher et al., 2011), ParagraphVector (Le and Mikolov, 2014), SkipThought vectors (Kiros et al., 2015), FastSent (Hill et al., 2016), Sent2Vec (Pagliardini et al., 2018), GRAN (Wieting and Gimpel, 2017), and transformer-based models like BERT (Devlin et al., 2019).
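For instance, a sentence representation can be obtained from a pretrained BERT model by composing its contextual token vectors; the sketch below uses mean pooling over the final layer via the HuggingFace transformers library (one common recipe, not necessarily the procedure used later in this thesis):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_representation(sentence: str) -> torch.Tensor:
    """Compose BERT's contextual token embeddings into one sentence vector (mean pooling)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state: (1, number_of_tokens, 768)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # (768,)

u = sentence_representation("I ate pizza.")
print(u.shape)  # torch.Size([768])
```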
On the other hand, in the task-specific approach, representation learning is combined with downstream applications and trained by supervised learning. Different deep learning models are trained to solve particular NLP tasks: FFNNs (Huang et al., 2013), recurrent neural networks (Chung et al., 2014; Hochreiter and Schmidhuber, 1997), CNNs (Kalchbrenner et al., 2014; Kim, 2014; Shen et al., 2014), and recursive neural networks (Socher et al., 2013).
Overall, approaches based on deep learning techniques have shown promising performance for learning these parameters. Socher et al. (2012) show the effectiveness of deep learning approaches by comparing them with the simple average of word vectors, element-wise multiplication, and concatenation. Similar results were later observed by Socher et al. (2013).

Knowledge graphs and representations

In the previous sections, we looked at various approaches for obtaining word embeddings and several composition methods for deriving word-sequence representations from them. However, these word embedding algorithms use only text data to learn representations; as a result, they fail to adequately acquire commonsense knowledge such as semantic and world knowledge. To address this limitation, various methods have been proposed to enrich word embeddings with commonsense knowledge (Faruqui et al., 2015; Osborne et al., 2016; Peters et al., 2019; Sun et al., 2020; Yu and Dredze, 2014).
As we also make use of such external knowledge in our work, in this section we describe one of the popular sources of commonsense knowledge, the Knowledge Graph, and approaches for representing the knowledge it holds. Specifically, in Section 2.6.1 we describe knowledge graphs and look at a popular lexical knowledge source used in this work: WordNet (Fellbaum, 1998). We also describe another knowledge source, TEMPROB (Ning et al., 2018a), which is specifically constructed to store probabilistic temporal relation information and is used for temporal relation classification in this work. Next, in Section 2.6.2, we discuss the problem of graph representation, which is challenging because the information present in the whole topology of the graph should be captured in the representation. Node embeddings learned over graphs have proved effective at capturing such knowledge (Hamilton et al., 2017), so we describe their general framework and two prominent families of approaches in the subsequent subsections. This background on the node embedding framework will be beneficial for understanding the specific node embedding algorithms used over WordNet and TEMPROB in Chapter 6.


Knowledge graphs

Commonsense knowledge is generally stored in a graph-structured format, commonly called a Knowledge Graph. In knowledge graphs, nodes denote real-world entities or abstract concepts, and edges denote relations between them. Because of this broad definition, knowledge graphs exist for a wide variety of data. Broadly speaking, they can be categorized as open-domain or domain-specific knowledge graphs. For example, popular knowledge graphs such as YAGO (Hoffart et al., 2011) and DBpedia (Lehmann et al., 2015) contain open-domain information: their nodes can be people, organizations, or places, and the multiple relations between them are denoted by edges. On the other hand, some knowledge graphs are designed for specific domains, such as language, in particular lexical resources like WordNet (Fellbaum, 1998), FrameNet (Ruppenhofer et al., 2006), and ConceptNet (Speer et al., 2018), as well as geography (Stadler et al., 2012), media (Raimond et al., 2014), and many more.
Formally, let G = (V, E, R) be a knowledge graph, where V denotes the set of nodes, E the set of edges, and R the set of possible relations between nodes. Graphs can carry different information depending on the type of their edges. An unlabeled knowledge graph contains edges that are simply pairs of nodes: E = {(u, v) : u, v ∈ V}. In a labeled knowledge graph, the edges are triples: E = {(u, r, v) : u, v ∈ V, r ∈ R}. In a probabilistic graph, in addition to the relation there is a scalar value denoting the strength of the edge: E = {(u, r, v, s) : u, v ∈ V, r ∈ R, s ∈ ℝ}.

In this work, we use two knowledge graphs: WordNet (Fellbaum, 1998) and TEMPROB (Ning et al., 2018a). We look at them in the following sections.
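A minimal sketch of the three edge types in Python is shown below; the relation names and the strength value are invented for illustration and are not taken from WordNet or TEMPROB:

```python
from typing import Set, Tuple

# Unlabeled edges: E = {(u, v)} -- only pairs of nodes.
unlabeled_edges: Set[Tuple[str, str]] = {("dog", "animal")}

# Labeled edges: E = {(u, r, v)} -- the relation r is drawn from the relation set R.
labeled_edges: Set[Tuple[str, str, str]] = {("dog", "hypernym", "animal")}

# Probabilistic edges: E = {(u, r, v, s)} -- a real-valued strength s is attached
# to each relation (the 0.79 below is a made-up example value).
probabilistic_edges: Set[Tuple[str, str, str, float]] = {
    ("explode", "before", "die", 0.79),
}
```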

Table of contents:

List of figures
List of tables
1 Introduction 
1.1 Automatic discourse understanding
1.2 Temporal processing and bridging resolution
1.3 Event and mention representations
1.4 Research questions and contributions
1.5 Organization of the dissertation
2 Background 
2.1 Tasks
2.1.1 Temporal relation classification
2.1.1.1 Definition
2.1.1.2 Supervised learning approach
2.1.1.3 Corpora
2.1.1.4 Evaluation
2.1.2 Bridging anaphora resolution
2.1.2.1 Definition
2.1.2.2 Supervised learning approach
2.1.2.3 Corpora
2.1.2.4 Evaluation
2.2 Artificial neural networks
2.3 Representation learning
2.4 Word representations
2.4.1 Distributed representations
2.4.1.1 Word2vec
2.4.1.2 Global vector (GloVe)
2.4.1.3 FastText
2.4.2 Contextual word representations
2.4.2.1 ELMo
2.4.2.2 BERT
2.5 Composing word representations
2.5.1 Fixed composition functions
2.5.2 Learned composition functions
2.6 Knowledge graphs and representations
2.6.1 Knowledge graphs
2.6.1.1 WordNet
2.6.1.2 TEMPROB
2.6.2 Graph node embeddings
2.6.2.1 Unified framework
2.6.2.2 Matrix factorization based approaches
2.6.2.3 Random walk based approaches
2.7 Summary
3 Related Work
3.1 Temporal relation classification
3.1.1 Work on event representations
3.1.1.1 Manually designed representations
3.1.1.2 Automatic representation learning
3.1.2 Work on models and inference
3.1.3 Summary
3.2 Bridging anaphora resolution
3.2.1 Work on mention representation
3.2.1.1 Manually designed representation
3.2.1.2 Automatic representation learning
3.2.2 Work on models and inference
3.2.3 Summary
4 Learning Rich Event Representations and Interactions 
4.1 Introduction
4.2 Effective event-pair representations
4.3 Method
4.3.1 Representation Learning
4.3.2 Interaction Learning
4.4 Experiments
4.4.1 Datasets and Evaluation
4.4.2 Training details
4.4.3 Baseline systems
4.4.4 Ablation setup
4.5 Results
4.5.1 Comparison to baseline Systems
4.5.2 Comparison with state-of-the-art
4.6 Ablation study
4.7 Conclusions
5 Probing for Bridging Inference in Transformer Language Models
5.1 Introduction
5.2 Probing transformer models
5.2.1 Probing for relevant information
5.2.2 Probing approaches
5.3 Methodology
5.4 Probing individual attention heads
5.4.1 Bridging signal
5.4.2 Experimental setup
5.4.3 Results with only Ana-Ante sentences
5.4.4 Results with all sentences
5.4.5 Discussion
5.5 Fill-in-the-gap probing: LMs as Bridging anaphora resolvers
5.5.1 Of-Cloze test
5.5.2 Experimental setup
5.5.3 Results and Discussion
5.5.3.1 Results on candidates scope
5.5.3.2 Results on Ana-Ante distance
5.6 Importance of context: Of-Cloze test
5.6.1 Experimental setup
5.6.2 Results on different contexts
5.7 Error analysis: Of-Cloze test
5.8 Conclusions
6 Integrating knowledge graph embeddings to improve representation 
6.1 Introduction
6.2 Commonsense knowledge
6.2.1 Significance for effective representation
6.2.2 Challenges in integration
6.3 Our approach
6.3.1 Knowledge graphs: WordNet and TEMPROB
6.3.1.1 WordNet
6.3.1.2 TEMPROB
6.3.2 Normalization: Simple rules and lemma
6.3.3 Sense disambiguation: Lesk and averaging
6.3.4 Absence of knowledge: Zero vector
6.4 Improved mention representation for bridging resolution
6.4.1 Knowledge-aware mention representation
6.4.2 Ranking model
6.4.3 Experimental setup
6.4.4 Results
6.4.5 Error analysis
6.4.5.1 Mention normalization and sense disambiguation
6.4.5.2 Anaphor-antecedent predictions
6.5 Improved event representation for temporal relation classification
6.5.1 Knowledge-aware event representations
6.5.2 Neural model
6.5.2.1 Constrained learning
6.5.2.2 ILP Inference
6.5.3 Experimental setup
6.5.4 Results
6.5.5 Discussion
6.6 Conclusion
7 Conclusions 
References
