Learning Rich Event Representations and Interactions 


Supervised learning approach

In our proposed approach, we cast temporal relation classification as a supervised learning problem. It consists of three main components: event and event-pair representations, models, and inference. We discuss them in the following paragraphs.

Event and event-pair representation

The essential part of supervised learning approaches to temporal relation classification is the representation of events. An event representation maps each event present in a document to a real-valued vector, which acts as the input to the models. A generic function $Z_E$ is designed to map an event to a $d_e$-dimensional vector:

$$Z_E : \mathcal{E} \to \mathbb{R}^{d_e}, \qquad e_i \mapsto Z_E(e_i) \tag{2.3}$$

where $e_i \in \mathcal{E}$.
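As a concrete illustration, a minimal sketch of one possible $Z_E$ is given below: it simply averages pretrained word embeddings over a small window around the event trigger. The function name, window size, and embedding source are illustrative assumptions; the representations actually learned in this thesis are described in Chapters 3 and 4.

```python
import numpy as np

def encode_event(tokens, trigger_index, embeddings, dim=300, window=2):
    """Illustrative event encoder Z_E: average pretrained word vectors
    in a small window around the event trigger.

    tokens        : list of word strings for the sentence
    trigger_index : position of the event trigger in `tokens`
    embeddings    : dict mapping a word to a d_e-dimensional numpy vector
    """
    start = max(0, trigger_index - window)
    end = min(len(tokens), trigger_index + window + 1)
    vectors = [embeddings[w] for w in tokens[start:end] if w in embeddings]
    if not vectors:                       # no known word in the window
        return np.zeros(dim)
    return np.mean(vectors, axis=0)       # d_e-dimensional event vector
```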
Next, it is equally important to combine the event representations into an event-pair representation, as a temporal relation is a binary relation (between a pair of events). Suppose that for $e_i, e_j \in \mathcal{E}$ the representations obtained with $Z_E$ are $\mathbf{e}_i := Z_E(e_i)$ and $\mathbf{e}_j := Z_E(e_j)$; then an effective event-pair representation is modeled as:

$$Z_P : \mathbb{R}^{d_e} \times \mathbb{R}^{d_e} \to \mathbb{R}^{d'_e}, \qquad (\mathbf{e}_i, \mathbf{e}_j) \mapsto Z_P(\mathbf{e}_i, \mathbf{e}_j) \tag{2.4}$$

In this thesis, we learn these functions ($Z_E$, $Z_P$) to obtain better event and event-pair representations and thereby solve the task more accurately. In the next chapter (Section 3.1.1), we first detail the previously proposed approaches to obtain these functions, and then in Chapters 4 and 6 we present our approach.

Models

To solve the temporal relation classification task, two types of models are commonly used: local models and global models. Local models learn their parameters without considering the temporal relations between other pairs (Chambers et al., 2007; Mani et al., 2006). This turns the task into a pairwise classification problem, where a confidence score for each temporal relation is predicted for a given pair of events. Generally, a local model learns a function of the form $P_{L,\theta} : \mathbb{R}^{d'_e} \to \mathbb{R}^{|\mathcal{R}|}$, where the $d'_e$-dimensional input vector is obtained from $Z_P$ and $\mathcal{R}$ is the set of possible temporal relations. In contrast, global models learn their parameters while taking the temporal relations between other pairs into account; the learned function thus takes all event-pair representations and outputs confidence scores for every pair, modeled as $P_{G,\phi} : \mathbb{R}^{n \times d'_e} \to \mathbb{R}^{n \times |\mathcal{R}|}$, where $n$ is the number of event pairs.
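To make these two mappings concrete, the following minimal sketch shows a common choice for $Z_P$ (concatenation of the two event vectors with their element-wise difference and product) followed by a local scorer $P_{L,\theta}$ implemented as a small feed-forward network. The architecture and feature choices are illustrative assumptions, not the models proposed in this thesis.

```python
import torch
import torch.nn as nn

class PairwiseScorer(nn.Module):
    """Illustrative local model: Z_P by concatenation + feed-forward P_L."""

    def __init__(self, d_e, num_relations, hidden=128):
        super().__init__()
        # Z_P output is [e_i; e_j; e_i - e_j; e_i * e_j]  ->  4 * d_e dims
        self.scorer = nn.Sequential(
            nn.Linear(4 * d_e, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_relations),  # one score per relation in R
        )

    def pair_representation(self, e_i, e_j):
        # Z_P: combine two event vectors into a single pair vector
        return torch.cat([e_i, e_j, e_i - e_j, e_i * e_j], dim=-1)

    def forward(self, e_i, e_j):
        # P_{L,theta}: confidence score for each temporal relation
        return self.scorer(self.pair_representation(e_i, e_j))
```

Calling such a scorer on two $d_e$-dimensional event vectors returns a vector of $|\mathcal{R}|$ confidence scores for that pair.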
Inference

Both these models produce a confidence score for each temporal relation over all event pairs. Therefore, a strategy must be designed to construct the temporal graph from these scores. The most straightforward strategy is to choose, for each pair, the temporal relation with the highest confidence score. However, this strategy may lead to an inconsistent temporal graph (for example, predicting A before B and B before C, but C before A). Therefore, a more global strategy needs to be designed. Initially, greedy approaches (Mani et al.; Verhagen and Pustejovsky, 2008) were used. These strategies start with an empty temporal graph and then add a node or an edge while maintaining the temporal consistency of the graph. Although they produce temporally consistent graphs, they fail to produce optimal solutions. To address this, the consistency constraints were encoded into an Integer Linear Programming (ILP) problem whose optimization objective is solved to produce the graph (Denis and Muller, 2011; Mani et al., 2006; Ning et al., 2017). In our work, we use a local model with a simple inference strategy to obtain rich event-pair representations in Chapter 4, and a global model with ILP-based inference in Chapter 6, where commonsense knowledge is integrated with contextual information. We also briefly discuss several previously proposed approaches to modeling and inference in Section 3.1.2 of the next chapter.
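For reference, a schematic version of such an ILP is shown below in our own notation; the exact objectives and constraint sets differ across Denis and Muller (2011) and Ning et al. (2017). A binary variable $x_{ij}^{r}$ indicates that relation $r$ is assigned to the pair $(e_i, e_j)$, and $s_{ij}^{r}$ is the model's confidence score for that assignment.

\begin{align*}
\max_{x} \quad & \sum_{(e_i, e_j)} \sum_{r \in \mathcal{R}} s_{ij}^{r}\, x_{ij}^{r} \\
\text{s.t.} \quad & \sum_{r \in \mathcal{R}} x_{ij}^{r} = 1 && \text{for every pair } (e_i, e_j) \\
& x_{ij}^{r_1} + x_{jk}^{r_2} - \sum_{r_3 \in \mathrm{Trans}(r_1, r_2)} x_{ik}^{r_3} \le 1 && \text{(transitivity)} \\
& x_{ij}^{r} \in \{0, 1\}
\end{align*}

Here $\mathrm{Trans}(r_1, r_2)$ denotes the set of relations consistent with composing $r_1$ and $r_2$ (for instance, before composed with before yields before).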

Bridging anaphora resolution

Bridging is an essential part of discourse understanding (Clark, 1975). The reader may have to bridge the currently encountered expression to previously known information, either from the text or from her memory. In his pioneering work, Clark (1975) described this broad phenomenon as bridging, which connects any expression that cannot be understood without context to a previously appearing phrase. The expression that cannot be interpreted without context is called the anaphor, and the phrase on which it depends for its meaning is referred to as the antecedent. The earlier definition of bridging included an identity relation between anaphor and antecedent, which is commonly known as coreference. Over time, however, the scope of bridging has changed, so that bridging now refers to any association between anaphor and antecedent except coreference. Another difference is that, in bridging as defined by Clark (1975), an antecedent can be a sentence or a clause that helps interpret an anaphor; in this work, however, we consider only those anaphor-antecedent pairs that are noun phrases (NPs), as is common in recent research. Apart from Clark, Hawkins (1978) and Prince (1981, 1992) also studied bridging but referred to the phenomenon differently: Hawkins (1978) termed it associative anaphora and only considered definite NPs as anaphors, whereas Prince (1981) referred to anaphors that can be inferred from previously mentioned expressions as inferrables.
With this understanding of bridging, the coming sections describe the computational task that identifies it automatically: bridging anaphora resolution. Section 2.1.2.1 formally defines the task, Section 2.1.2.2 discusses the main components of supervised learning approaches to solving it, Section 2.1.2.3 details the corpora used in this work, and finally Section 2.1.2.4 presents an evaluation metric.

Supervised learning approach

Similar to temporal relation classification, we take a supervised learning approach to solve bridging anaphora resolution. It involves three important components: mention representations, models, and inference. We detail them here.

Mention representations

As both anaphors and antecedents are assumed to be a subset of mentions, obtaining mention representations becomes essential. For that, a generic function $Z_M$ is defined as follows:

$$Z_M : \mathcal{M} \to \mathbb{R}^{d_m}, \qquad m_i \mapsto Z_M(m_i) \tag{2.7}$$

where $m_i \in \mathcal{M}$. We develop an approach to learn this function in which contextual and commonsense information is acquired (Chapter 6). Before that, in the next chapter, we detail previously proposed approaches to obtain this function (Section 3.2.1).

Models

As in temporal relation classification, both local and global models are used for bridging anaphora resolution. Local models predict a confidence score for a bridging anaphor and a previously occurring mention (Markert et al., 2003; Poesio et al., 2004). Formally, these models learn a function of the form $B_{L,\theta} : \mathbb{R}^{d_m} \times \mathbb{R}^{d_m} \to \mathbb{R}$. In contrast, global models find the corresponding antecedents for all anaphors simultaneously (Hou et al., 2013b). This global modeling approach did not explicitly learn mention representations but employed Markov Logic Networks (MLN) (Domingos and Lowd, 2009) for global inference.

Inference

In bridging anaphora resolution, inference is not as complicated as in temporal relation classification, as it is not governed by complex symmetry or transitivity constraints. The inference step in local models is similar to best-first clustering: antecedent candidates are ranked by the confidence scores predicted by the model, and the highest-scoring candidate is selected as the predicted antecedent for the bridging anaphor. In the case of a global model, Hou et al. (2013b) incorporated linguistic constraints into her inference strategy, such as an anaphor being less likely to serve as an antecedent, or antecedents being more likely to serve as antecedents for other anaphors, and performed global inference with MLN. We provide more details about it in the next chapter.
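As an illustration of the local inference step, the following sketch scores every preceding mention with a local model and selects the highest-scoring one as the antecedent. The scorer interface and argument names are assumptions made for the example, not the model used in this thesis.

```python
def resolve_anaphor(anaphor_vec, candidate_vecs, score_fn):
    """Best-first inference for a local bridging model.

    anaphor_vec    : d_m-dimensional representation of the anaphor (Z_M)
    candidate_vecs : list of d_m-dimensional vectors, one per preceding mention
    score_fn       : local model B_L(anaphor, candidate) -> confidence score
    Returns the index of the predicted antecedent, or None if no candidates.
    """
    if not candidate_vecs:
        return None
    scores = [score_fn(anaphor_vec, cand) for cand in candidate_vecs]
    return max(range(len(scores)), key=lambda i: scores[i])
```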


Word representations

In any text, words are considered core constituents and treated as the lowest meaningful units of a language. It is also assumed that the meanings of larger units of language, such as phrases, sentences, or documents, can be derived from their constituent words. Moreover, some form of text (i.e., a sequence of words) is often the input for NLP tasks. Therefore, it is essential to obtain meaningful representations of words to solve NLP tasks.

The ideal word representation algorithm would map every word in a language to a vector representation. However, covering all the words of a language is difficult, as language evolves and new words are added constantly. This is addressed by creating a large vocabulary containing millions or billions of words and obtaining a vector representation for each word in the vocabulary. A word representation learning algorithm thus finds a map from each word in the vocabulary to its corresponding $d$-dimensional vector. Let the vocabulary of words be $V$; then the algorithm finds the following map $f$:

$$f : V \to \mathbb{R}^{d}$$
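In practice, such a map is typically stored as a lookup from a vocabulary index into an embedding matrix. The following minimal sketch assumes a tiny toy vocabulary and randomly initialized vectors; real systems load pretrained embeddings such as those described in Section 2.4.

```python
import numpy as np

d = 50                                              # embedding dimension
vocabulary = ["the", "event", "before", "after"]    # toy vocabulary V
word_to_index = {w: i for i, w in enumerate(vocabulary)}

# Embedding matrix of shape |V| x d: one row per word in V
embedding_matrix = np.random.randn(len(vocabulary), d)

def f(word):
    """The map f : V -> R^d, realized as a row lookup."""
    return embedding_matrix[word_to_index[word]]

vector = f("event")    # d-dimensional representation of "event"
```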

Table of contents:

List of figures
List of tables
1 Introduction 
1.1 Automatic discourse understanding
1.2 Temporal processing and bridging resolution
1.3 Event and mention representations
1.4 Research questions and contributions
1.5 Organization of the dissertation
2 Background 
2.1 Tasks
2.1.1 Temporal relation classification
2.1.1.1 Definition
2.1.1.2 Supervised learning approach
2.1.1.3 Corpora
2.1.1.4 Evaluation
2.1.2 Bridging anaphora resolution
2.1.2.1 Definition
2.1.2.2 Supervised learning approach
2.1.2.3 Corpora
2.1.2.4 Evaluation
2.2 Artificial neural networks
2.3 Representation learning
2.4 Word representations
2.4.1 Distributed representations
2.4.1.1 Word2vec
2.4.1.2 Global vectors (GloVe)
2.4.1.3 FastText
2.4.2 Contextual word representations
2.4.2.1 ELMo
2.4.2.2 BERT
2.5 Composing word representations
2.5.1 Fixed composition functions
2.5.2 Learned composition functions
2.6 Knowledge graphs and representations
2.6.1 Knowledge graphs
2.6.1.1 WordNet
2.6.1.2 TEMPROB
2.6.2 Graph node embeddings
2.6.2.1 Unified framework
2.6.2.2 Matrix factorization based approaches
2.6.2.3 Random walk based approaches
2.7 Summary
3 Related Work
3.1 Temporal relation classification
3.1.1 Work on event representations
3.1.1.1 Manually designed representations
3.1.1.2 Automatic representation learning
3.1.2 Work on models and inference
3.1.3 Summary
3.2 Bridging anaphora resolution
3.2.1 Work on mention representation
3.2.1.1 Manually designed representation
3.2.1.2 Automatic representation learning
3.2.2 Work on models and inference
3.2.3 Summary
4 Learning Rich Event Representations and Interactions 
4.1 Introduction
4.2 Effective event-pair representations
4.3 Method
4.3.1 Representation Learning
4.3.2 Interaction Learning
4.4 Experiments
4.4.1 Datasets and Evaluation
4.4.2 Training details
4.4.3 Baseline systems
4.4.4 Ablation setup
4.5 Results
4.5.1 Comparison to baseline Systems
4.5.2 Comparison with state-of-the-art
4.6 Ablation study
4.7 Conclusions
5 Probing for Bridging Inference in Transformer Language Models
5.1 Introduction
5.2 Probing transformer models
5.2.1 Probing for relevant information
5.2.2 Probing approaches
5.3 Methodology
5.4 Probing individual attention heads
5.4.1 Bridging signal
5.4.2 Experimental setup
5.4.3 Results with only Ana-Ante sentences
5.4.4 Results with all sentences
5.4.5 Discussion
5.5 Fill-in-the-gap probing: LMs as Bridging anaphora resolvers
5.5.1 Of-Cloze test
5.5.2 Experimental setup
5.5.3 Results and Discussion
5.5.3.1 Results on candidates scope
5.5.3.2 Results on Ana-Ante distance
5.6 Importance of context: Of-Cloze test
5.6.1 Experimental setup
5.6.2 Results on different contexts
5.7 Error analysis: Of-Cloze test
5.8 Conclusions
6 Integrating knowledge graph embeddings to improve representation 
6.1 Introduction
6.2 Commonsense knowledge
6.2.1 Significance for effective representation
6.2.2 Challenges in integration
6.3 Our approach
6.3.1 Knowledge graphs: WordNet and TEMPROB
6.3.1.1 WordNet
6.3.1.2 TEMPROB
6.3.2 Normalization: Simple rules and lemma
6.3.3 Sense disambiguation: Lesk and averaging
6.3.4 Absence of knowledge: Zero vector
6.4 Improved mention representation for bridging resolution
6.4.1 Knowledge-aware mention representation
6.4.2 Ranking model
6.4.3 Experimental setup
6.4.4 Results
6.4.5 Error analysis
6.4.5.1 Mention normalization and sense disambiguation
6.4.5.2 Anaphor-antecedent predictions
6.5 Improved event representation for temporal relation classification
6.5.1 Knowledge-aware event representations
6.5.2 Neural model
6.5.2.1 Constrained learning
6.5.2.2 ILP Inference
6.5.3 Experimental setup
6.5.4 Results
6.5.5 Discussion
6.6 Conclusion
7 Conclusions 
References
