Neural Entity-Based Approach for Coreference Resolution in the Clinical Domain


Time in Computational Linguistics

Analyzing time from a linguistic perspective helps us understand the mechanisms involved in the realization of time in natural language. To model time from a computational linguistics perspective, we must settle on a linguistic model and derive an annotation schema that will be used to annotate time in text. In this section, we present the different approaches that have been devised in the computational linguistics literature to model the three primitives of time: temporal expressions, events and temporal relations.

Temporal Expressions: From TIMEX to TIMEX3

There are four main types of temporal expressions in the literature (Strötgen and Gertz 2016): dates, times, durations and sets. These expressions may be explicit, implicit, relative or underspecified (Strötgen and Gertz 2015). We note that this categorization corresponds roughly to the linguistic description from Section 2.2, in which temporal expressions may denote calendar dates, times of day, durations or sets of recurring times. Furthermore, the linguistic distinction between absolute and relative position in time is kept.
There have been several attempts to model temporal expressions in the computational linguistics literature. In this section, we describe the most influential models: the TIMEX model series.
TIMEX. This is one of the earliest attempts to create an annotation scheme for temporal expressions. It was developed for the MUC-5 conference (Sundheim 1993). The conference included a shared task on NER that involved the extraction and categorization of dates and times. The task was proposed again in subsequent MUC conferences until MUC-7 (Chinchor 1998).
The goal of the task was to identify time expressions that denote calendar dates or times and to annotate them with a TIMEX tag. The tag had a single type attribute that could take the value date or time. There was no task related to time expression normalization. TIMEX extraction and classification were part of a larger slot-filling task in which participants were asked to assign times to events (e.g. rocket launch dates).
TIDES TIMEX2. The second version of TIMEX was developed under the Defense Advanced Research Projects Agency (DARPA) research project Translingual Information Detection, Extraction and Summarization (TIDES) and the Automatic Content Extraction (ACE) Program. The development of this specification spanned five years, between 2000 and 2005. The final version of the annotation scheme is described in Ferro et al. (2005). The TIDES TIMEX2 annotation scheme, materialized by the tag TIMEX2, aimed at annotating a wider range of English temporal expressions. The TIMEX2 tag includes six attributes:
• VAL: it contains the ISO-8601 normalized value of the temporal expression when it represents a point or interval on a calendar or a clock.
• MOD: it is filled in when the time expression includes a modifier (e.g. no more than or approximately).
• ANCHOR_VAL and ANCHOR_DIR: these two attributes are used together to indicate the orientation and anchoring of time expressions.
• SET: it is used to mark time expressions that represent sets of times. Its only possible value is YES. The absence of the attribute implies that the time expression does not represent a set.
• COMMENT: this attribute may be used by annotators to justify decisions for ambiguous time expressions or to signal doubts during the annotation process.
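The attribute list above can be pictured as a small data structure. The following is a minimal sketch only: the class name `Timex2`, the `text` field, the `is_set` flag and the `to_tag` method are hypothetical names introduced for illustration, and the inline rendering is an assumption rather than the format prescribed by Ferro et al. (2005).

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Timex2:
    """Illustrative container for the six TIMEX2 attributes (hypothetical class)."""
    text: str                         # annotated span; not itself a TIMEX2 attribute
    val: Optional[str] = None         # ISO-8601 normalized value
    mod: Optional[str] = None         # modifier, if the expression has one
    anchor_val: Optional[str] = None  # used together with anchor_dir
    anchor_dir: Optional[str] = None
    is_set: bool = False              # rendered as SET="YES" when True, omitted otherwise
    comment: Optional[str] = None     # free-text annotator note

    def to_tag(self) -> str:
        """Render the annotation as an inline tag, skipping unset attributes."""
        attrs = [
            ("VAL", self.val),
            ("MOD", self.mod),
            ("ANCHOR_VAL", self.anchor_val),
            ("ANCHOR_DIR", self.anchor_dir),
            ("SET", "YES" if self.is_set else None),
            ("COMMENT", self.comment),
        ]
        rendered = "".join(
            f" {name}={quoteattr(value)}" for name, value in attrs if value is not None
        )
        return f"<TIMEX2{rendered}>{escape(self.text)}</TIMEX2>"


date_tag = Timex2("March 12, 2005", val="2005-03-12").to_tag()
set_tag = Timex2("every Monday", is_set=True).to_tag()
```

Modelling SET as a boolean mirrors the description above, where YES is the only possible value and absence means the expression is not a set. The example deliberately leaves VAL unset for the set expression, since its normalized value is not discussed here.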

Temporal Relation Extraction in the Clinical Domain: Adapting the Approach to French Clinical Text

The main motivation for this research effort is to evaluate whether our feature-based approach can be used for languages other than English, provided that the different language-sensitive resources along our preprocessing pipeline are replaced by equivalent resources in the target language. We experiment on the THYME and MERLoT corpora.
As in our participation in Clinical TempEval, we focused on temporal relation extraction and used the gold entities provided with the two corpora. We discarded inter-sentence containment relations as they are not annotated in the French dataset. The MERLoT corpus was transformed into a comparable corpus according to the process described in Section 3.2. The number of DCT relations per class for both corpora is presented in Table 3.10.
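Discarding inter-sentence containment relations amounts to keeping only the pairs whose two entities fall in the same sentence. The sketch below illustrates one way to do this under stated assumptions: entities carry character offsets, sentence boundaries are given as a sorted list of sentence start offsets, and an entity is assigned to the sentence containing its first character. The names `Entity`, `sentence_index` and `keep_intra_sentence` are hypothetical; the source does not describe how the filtering was implemented.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    """A gold entity identified by character offsets (hypothetical structure)."""
    start: int
    end: int


def sentence_index(offset: int, sentence_starts: List[int]) -> int:
    """Index of the sentence containing `offset`, given sorted sentence start offsets."""
    return bisect_right(sentence_starts, offset) - 1


def keep_intra_sentence(
    relations: List[Tuple[Entity, Entity]], sentence_starts: List[int]
) -> List[Tuple[Entity, Entity]]:
    """Keep only relations whose source and target start in the same sentence."""
    return [
        (src, tgt)
        for src, tgt in relations
        if sentence_index(src.start, sentence_starts)
        == sentence_index(tgt.start, sentence_starts)
    ]


sentence_starts = [0, 50, 120]            # three sentences (assumed offsets)
same = (Entity(5, 12), Entity(30, 38))    # both entities in the first sentence
cross = (Entity(40, 46), Entity(60, 70))  # entities in two different sentences
kept = keep_intra_sentence([same, cross], sentence_starts)
```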

Preprocessing and Feature Extraction

The THYME corpus was preprocessed using cTAKES (Savova et al. 2010), an open source natural language processing system for the extraction of information from electronic health records. We extracted several features from the output of cTAKES: sentence boundaries, tokens, POS tags, token types and the semantic types of the entities that were recognized by cTAKES and whose span overlaps with at least one EVENT entity of the THYME corpus.
Concerning the MERLoT corpus, no specific NLP pipeline exists for French clinical texts; we thus used the Stanford CoreNLP system (Manning et al. 2014) to segment and tokenize the text. We also extracted POS tags. As the corpus already provides a type for each EVENT, there is no need to detect other clinical information.
For both the DCT and CONTAINS relation extraction tasks, we used a combination of structural, lexical and contextual features derived from the corpora and the preprocessing steps. The choice of features is inspired by research efforts in the temporal information extraction domain (Bethard et al. 2015; UzZaman et al. 2013; Verhagen et al. 2007; Verhagen et al. 2010). These features are presented in Table 3.11.
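The span-overlap criterion used to attach semantic types to EVENT entities can be sketched as follows. This is an illustration only, assuming half-open character offsets `(start, end)`; the function names and the `TYPE_A`-style labels are placeholders and do not reflect the actual cTAKES output format.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def spans_overlap(a: Span, b: Span) -> bool:
    """True when the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def semantic_types_for_event(
    event: Span, recognized: List[Tuple[Span, str]]
) -> List[str]:
    """Collect the labels of recognized entities whose span overlaps the event."""
    return sorted({label for span, label in recognized if spans_overlap(span, event)})


# Placeholder entities standing in for the output of the clinical pipeline.
recognized = [((10, 19), "TYPE_A"), ((15, 30), "TYPE_B"), ((40, 48), "TYPE_C")]
event = (12, 18)
types = semantic_types_for_event(event, recognized)
```

With half-open offsets, two spans that merely touch, such as `(0, 5)` and `(5, 10)`, are not considered overlapping.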

Table of contents:

Abstract
Résumé
Remerciements
List of Figures
List of Tables
1 Introduction
1.1 Temporal Information Extraction
1.2 Coreference Resolution
1.3 Topic Interdependence
1.4 Research Questions
1.5 Contributions
1.6 Outline
1.7 Published Work
I Temporal Information Extraction from Clinical Narratives
2 It’s About Time: Temporal Information Extraction from Text
2.1 Introduction
2.2 Time in Natural Language: A Linguistic Perspective
2.2.1 Temporal Expressions
2.2.2 Events
2.2.3 Temporal Relations
2.3 Time in Computational Linguistics
2.3.1 Temporal Expressions: From TIMEX to TIMEX3
2.3.2 Events: Task-Dependent Modeling
2.3.3 Temporal Relations
2.4 Resources for Temporal Information Extraction
2.4.1 Full Annotation Schemes
2.4.2 Corpora and Associated Shared Tasks
2.5 Approaches for Temporal Information Extraction
2.5.1 Temporal Expression Extraction
2.5.2 Event Extraction
2.5.3 Relation Extraction
3 Feature-Based Approach for Temporal Relation Extraction
3.1 Introduction
3.2 Data
3.3 Model Overview
3.4 Evaluation on the THYME corpus
3.4.1 Preprocessing and Feature Extraction
3.4.2 Algorithm Selection
3.4.3 Results
3.4.4 Discussion
3.5 Adapting the Approach to French Clinical Text
3.5.1 Preprocessing and Feature Extraction
3.5.2 Experimental Setup
3.5.3 Results
3.5.4 Discussion
3.6 Conclusion
4 Neural Approach for Temporal Information Extraction
4.1 Introduction
4.2 Data
4.3 Model Overview
4.3.1 Entity Extraction
4.3.2 Event Attribute and Document Creation Time Extraction
4.3.3 Containment Relation Extraction
4.3.4 Input Word Embeddings
4.4 Influence of Categorical Features
4.4.1 Preprocessing
4.4.2 Experimental Setup
4.4.3 Results
4.4.4 Discussion
4.4.5 Perspective
4.4.6 A Word on Temporal Coherence
4.5 Evaluation on the THYME Corpus: Domain Adaptation for Temporal Information
4.5.1 Preprocessing
4.5.2 Architecture Description
4.5.3 Domain Adaptation Strategies
4.5.4 Network Training
4.5.5 Results
4.5.6 Discussion
4.6 Conclusion
II Clinical Event Coreference Resolution
5 Clinical Event Coreference Resolution
5.1 Introduction
5.2 Anaphora and Coreference: A Linguistic Perspective
5.3 Definitions and Terminology: The NLP imbroglio
5.4 Event Coreference Resolution
5.5 Annotated Corpora
5.6 A Word on Mention Extraction
5.7 Early Approaches for Coreference Resolution
5.8 Supervised Approaches for Coreference Resolution
5.8.1 Mention-Pair Model
5.8.2 Mention-Ranking Model
5.8.3 Entity-Based Models
5.8.4 Tree-Based Models
5.9 Coreference Resolution in the Clinical Domain
5.10 Evaluation Metrics
5.10.1 The MUC Score
5.10.2 The B3 Algorithm
5.10.3 Constrained Entity-Aligned F-Measure
5.10.4 BLANC
5.10.5 The CoNLL Score
5.11 Conclusion
6 Neural Entity-Based Approach for Coreference Resolution in the Clinical Domain
6.1 Introduction
6.2 Data
6.3 Task Division
6.4 Mention Extraction
6.5 Building a Temporal Feature
6.6 Neural Entity-Based Approach for Coreference Resolution
6.6.1 Input Embeddings
6.6.2 Mention Representation
6.6.3 Cluster-Level Representation
6.6.4 Pairwise Scorer
6.6.5 Training
6.6.6 Wrap-Up
6.7 Experimental Setup
6.7.1 Experiment Configurations
6.7.2 Hyperparameters
6.8 Results
6.9 Discussion
6.10 Conclusion
7 Conclusion
7.1 Summary
7.2 Future Research Directions
7.3 Extracting Clinical Timelines: Are We There Yet?
References