Intrinsic evaluation of compositionality prediction 

Compound selection

The initial set of idiomatic and partially compositional candidates was constructed by introspection, independently for each language. This list of compounds was complemented by selecting entries from lists of frequent adjective+noun and noun+noun pairs. These were automatically extracted through POS-sequence queries using the mwetoolkit (Ramisch 2015). The source corpora were ukWaC (Baroni, Bernardini, Ferraresi, et al. 2009), frWaC and brWaC (Boos, Prestes, and Villavicencio 2014), each containing between 1.5 and 2.5 billion tokens. We avoided selecting compounds in which the head was not necessarily a noun (e.g. FR aller simple ‘one-way ticket’ (lit. going simple), as aller doubles as the noun going and the infinitive of the verb to go). We also avoided selecting compounds whose literal sense was very common in the corpus (e.g. EN low blow). For PT and FR, we additionally discarded the compounds in which the complement was not an adjective (e.g. PT noun–noun abelha-rainha ‘queen bee’ (lit. bee-queen)), as these constructions are often seen as exocentric (no head/modifier distinction can be made between the compound elements).
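The candidate extraction can be illustrated with a short script. The sketch below is not the mwetoolkit implementation, but a minimal stand-in assuming the corpus is available as POS-tagged sentences (lists of (token, tag) pairs with a simplified tagset using "ADJ" and "NOUN"); it counts adjective+noun and noun+noun bigrams and keeps the most frequent ones.

```python
from collections import Counter

# Minimal stand-in for the POS-sequence queries (the actual extraction
# used the mwetoolkit). Assumes a simplified tagset with "ADJ" and "NOUN".
PATTERNS = frozenset({("ADJ", "NOUN"), ("NOUN", "NOUN")})

def extract_candidates(tagged_sentences):
    """Count adjective+noun and noun+noun bigrams in a POS-tagged corpus.

    tagged_sentences: iterable of sentences, each a list of (token, tag) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if (t1, t2) in PATTERNS:
                counts[(w1.lower(), w2.lower())] += 1
    return counts

# Hypothetical usage: keep the most frequent pairs as candidates
# for the manual pre-annotation step.
# candidates = extract_candidates(corpus).most_common(1000)
```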
For each language, a balanced set of 60 idiomatic, 60 partially compositional and 60 fully compositional compounds was selected by means of a coarse-grained manual pre-annotation. We did not attempt to select equivalent compounds across the three languages: a compound in a given language may correspond to a single word in the other languages, and even when it does translate as another compound, its POS-tag pattern and its level of compositionality may differ widely.

Sentence selection

For each compound, we selected 3 sentences from the corresponding WaC corpus in which the compound is used with the same meaning. Candidate sentences were sorted by length, so as to favor shorter ones, and we manually selected 3 examples that satisfy the following criteria (a minimal selection sketch follows the list):
• The occurrence of the compound must have the same meaning in all three sentences.
• Each sentence must contain enough context to enable a clear disambiguation of the compound.
• There must be enough inter-sentence variability, so as to provide a wider range of disambiguating contexts.
These sentences were meant to serve as disambiguating context for the annotators. For example, for the compound benign tumor, we present the following sentences: (1) “Prince came onboard to have a large benign tumor removed from his head”; (2) “We were told at that time it was a slow growing benign tumor and to watch and wait”; (3) “Completely benign tumor is oncocytoma (it represents about 5 % of all kidney tumors)”.
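As referenced above, the length-based ranking can be mimicked in a few lines. This is a minimal sketch under stated assumptions (the corpus reader and the simple substring match are illustrative); it returns the shortest matching sentences, from which the 3 examples were then chosen by hand against the criteria above.

```python
def shortest_matches(sentences, compound, k=20):
    """Return the k shortest sentences containing the compound, as
    candidates for the manual selection of 3 disambiguating examples."""
    matches = [s for s in sentences if compound in s.lower()]
    return sorted(matches, key=lambda s: len(s.split()))[:k]

# Hypothetical usage: rank candidates for "benign tumor", then manually
# keep 3 sentences that share one meaning and vary in context.
# candidates = shortest_matches(corpus_sentences, "benign tumor")
```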

Questionnaire design

The questionnaires were presented as online webpages and followed the same structure for each compound. Each questionnaire starts with a set of instructions that briefly describe the task and direct participants to fill in an external identification form. This form collects demographics about the annotators and ensures that they are native speakers of the target language, following Reddy, McCarthy, and Manandhar (2011). It also presents some example questions with annotated answers, for training. After filling in the identification form, users could start working on the task itself. The questionnaire was structured into 5 subtasks, presented to the annotators through these instructions:
1. Read the compound itself.
2. Read 3 sentences containing the compound.
3. Provide 2 to 3 synonym expressions for the target compound as used in the sentences. We asked annotators to prioritize short expressions, with 1 to 3 words each, and to try to include one of the words of the compound in their reply, thus eliciting a paraphrase.
4. Using a Likert scale from 0 (completely disagree) to 5 (completely agree), judge how much of the meaning of the compound comes from the modifier and from the head, separately. Figure 3.1 shows an example of the judgment of the literality of the head (tumor) in the compound benign tumor.
5. Using a Likert scale from 0 (completely disagree) to 5 (completely agree), judge how much of the meaning of the compound comes from both of its components (head and modifier) together. This judgment is requested through a question that paraphrases the compound: “would you say that a benign tumor is always literally a tumor that is benign?”. We deliberately required answers on a scale with an even number of points (0–5 yields 6 reply categories), since otherwise undecided annotators could be biased towards the middle score. As additional help for the annotators, when the mouse hovers over a reply to a multiple-choice question, we present a guiding tooltip, as in Figure 3.1. We avoid incomplete replies by making Subtasks 3–5 mandatory.
The order of the subtasks has also been taken into account. During a pilot test, we found that presenting the multiple-choice questions (Subtasks 4–5) before asking for synonyms (Subtask 3) yielded lower agreement, as users were often less self-consistent in the multiple-choice questions (e.g. replying that “benign tumor is not a tumor” in Subtask 4 while replying that “benign tumor is a tumor that is benign” in Subtask 5). This behavior was observed even when they later carefully selected their synonyms. Asking for synonyms in Subtask 3, prior to the multiple-choice questions, prompts the user to focus on the target meaning of the compound and gives them more examples (the synonyms) to draw on when considering the semantic contribution of each element of the compound. In this work, the synonyms were only used to motivate annotators to think about the meaning of the compound. In the future, this information could be exploited for compositionality prediction, but also for lexical substitution tasks (Wilkens, Zilio, Cordeiro, et al. 2017).

Judgment collection

Annotators participated via online questionnaires, with one webpage per compound. For EN and FR, annotators were recruited and paid through Amazon Mechanical Turk (AMT). For PT, we developed a standalone web interface that simulates AMT, as Portuguese speakers were rare on that platform; annotators for PT were undergraduate and graduate students of Computer Science, Linguistics and Psychology. For each compound, we collected judgments from around 15 annotators (for EN, the dataset includes the 90 compounds from Reddy, McCarthy, and Manandhar (2011), which are compatible with the other 90 compounds collected for the dataset). The responses from all annotators were then averaged into per-compound scores, as sketched after the list below. We obtained the following variables:
• cH: The contribution of the head to the meaning of the compound (e.g. is a busy bee literally a bee?), with standard deviation σH.
• cM: The contribution of the modifier to the meaning of the compound (e.g. is a busy bee literally busy?), with standard deviation σM.
• cWC: The degree to which the whole compound can be interpreted as a combination of its parts (e.g. is a busy bee a bee that is busy?), with standard deviation σWC.
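The aggregation referenced above is a plain per-compound mean over the roughly 15 Likert judgments, with the corresponding standard deviation. The sketch below assumes the answers are stored as lists of integers in the 0–5 range; the key names and the example values are illustrative, not taken from the dataset.

```python
from statistics import mean, stdev

def aggregate(judgments):
    """Average the 0-5 Likert answers per question for one compound.

    judgments: dict mapping "head", "modifier" and "whole" to the lists
    of answers given by the annotators for that compound.
    Returns a (mean, standard deviation) pair per question.
    """
    return {q: (mean(a), stdev(a)) for q, a in judgments.items()}

# Hypothetical answers for "busy bee" (head judged literal, the rest not):
busy_bee = {
    "head":     [5, 5, 4, 5, 4],   # is a busy bee literally a bee?
    "modifier": [1, 0, 2, 1, 0],   # is a busy bee literally busy?
    "whole":    [1, 1, 0, 2, 1],   # is a busy bee a bee that is busy?
}
scores = aggregate(busy_bee)  # e.g. scores["head"] -> (cH, sigma_H)
```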

Table of contents:

Abstract
Résumé
Resumo
Acknowledgments
List of Figures
List of Tables
Abbreviations
1 Introduction 
1.1 Contributions
1.2 Investigated hypotheses
1.3 Context of the thesis
1.4 Publications
1.5 Thesis structure
2 Background 
2.1 Basic terminology
2.2 Language and statistics
2.2.1 Occurrence counts
2.2.2 Co-occurrence counts
2.2.3 Association measures
2.2.4 Inter-rater agreement measures
2.2.5 Evaluation measures for continuous data
2.2.6 Evaluation measures for categorical data
2.3 Multiword expressions
2.3.1 Nominal compounds
2.3.2 Type discovery
2.3.3 Token identification
2.4 Word semantics
2.4.1 Symbolic representations
2.4.2 Numerical representations
2.5 MWE semantics
2.5.1 Symbolic representation
2.5.2 Numerical representation
2.5.3 Compositionality datasets in the literature
3 Compositionality datasets
3.1 Data collection
3.1.1 Compound selection
3.1.2 Sentence selection
3.1.3 Questionnaire design
3.1.4 Judgment collection
3.2.1 Score distribution
3.2.2 Difficulty of annotation
3.2.3 Estimating whole-compound from head/modifier
3.2.4 Correlation with distributional variables
3.3 Summary
4 Compositionality prediction 
4.1 Related work
4.2 Proposed model
4.3 Corpus preprocessing
4.4 DSMs
4.5 Parameters
4.6 Evaluation setup
5 Intrinsic evaluation of compositionality prediction 
5.1 Overall highest results per DSM
5.1.1 English
5.1.2 French
5.1.3 Portuguese
5.1.4 Cross-language analysis
5.2 DSM parameters
5.2.1 Context-window size
5.2.2 Number of dimensions
5.3 Corpus parameters
5.3.1 Type of preprocessing
5.3.2 Corpus size
5.3.3 Parallel predictions
5.4 Prediction strategy
5.5 Sanity checks
5.5.1 Number of iterations
5.5.2 Minimum count threshold
5.5.3 Windows of size 2+2
5.5.4 Higher number of dimensions
5.5.5 Random initialization
5.5.6 Data filtering
5.6 Error analysis
5.6.1 Frequency and compositionality prediction
5.6.2 Conventionalization and compositionality prediction
5.6.3 Human–system comparison
5.6.4 Range-based analyses
5.7 Summary and discussion
6 Extrinsic evaluation of compositionality prediction 
6.1 Proposed models of MWE identification
6.1.1 Rule-based identification model
6.1.2 Probabilistic identification model
6.2 Experimental setup
6.2.1 Reference corpora
6.2.2 Compositionality lexicons
6.2.3 Rule-based identification
6.2.4 Probabilistic identification
6.3 Results
6.3.1 Rule-based identification: baseline
6.3.2 Rule-based identification: compositionality scores
6.3.3 Probabilistic identification
6.4 Summary
7 Conclusions 
7.1 Contributions
7.2 Future work
References
