
## Snippet of Graph Theory

Graph theory deals with sets of objects structured by pairwise relations. Those structured sets are called graphs. It is a basic framework used for representing syntactic structures as well as language relationships.

A graph G = (V, E) is a set of vertices (or nodes) v ∈ V that may be linked to each other by edges e ∈ E representing relations between those vertices, with E ⊆ V × V. Two vertices v_i and v_j are said to be adjacent if they are linked by an edge e_ij. If the relation represented by E is symmetric, we say that the graph is undirected and we have e_ij = e_ji for all pairs of adjacent vertices. Figure 2.1 depicts a simple undirected graph with five vertices and four edges. If, however, the relation represented by E is asymmetric, the graph is directed. In that case, e_ij is the edge going from i to j and e_ji goes in the other direction. In general, the existence of an edge e_ij does not imply the existence of an edge e_ji going in the other direction. Figure 2.2 shows a directed graph. Some definitions allow self loops (edges going from a vertex to itself), but as they are mostly irrelevant for the structures we consider in this thesis, we will assume that self loops are not allowed. A graph can also be weighted. In that case, the graph is a triplet G = (V, E, W) with W = {w_ij ∈ ℝ | e_ij ∈ E} being the weights of the edges of G. Edges not in E are often given a weight of 0. Figure 2.3 shows a weighted graph.
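As a minimal sketch (not from the thesis, all names are ours), a weighted directed graph can be stored as a dictionary mapping edge pairs to weights, mirroring the triplet G = (V, E, W) with absent edges implicitly weighted 0:

```python
class Graph:
    """A weighted directed graph G = (V, E, W) as an edge-weight map."""

    def __init__(self, vertices):
        self.V = set(vertices)
        self.W = {}  # maps (i, j) -> weight; pairs not in E have implicit weight 0

    def add_edge(self, i, j, weight=1.0):
        self.W[(i, j)] = weight

    def weight(self, i, j):
        # Edges not in E are given a weight of 0
        return self.W.get((i, j), 0.0)

    def adjacent(self, i, j):
        # Two vertices are adjacent if an edge links them in either direction
        return (i, j) in self.W or (j, i) in self.W
```

An unweighted graph is the special case where every stored edge keeps the default weight.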

A path is a sequence p = (p_1, p_2, …, p_|p|) of edges of length |p|, such that if p_t = e_ij and p_{t+1} = e_{i'j'}, then j = i'. We also assume that for all t, t' ∈ {1, …, |p| − 1} with t < t', if p_t = e_ij and p_{t'} = e_{i'j'}, then i ≠ j'. In other words, a path cannot go through a vertex more than once, except possibly the first/last vertex of the path; in that case the path is also called a cycle. We say that two vertices v_i and v_j are connected if there exists a path going from v_i to v_j. A connected component is a maximal subset of V such that any two vertices in it are connected. We say that G is a tree if it is connected (it has only one connected component) and it has no cycle. This means that between any two vertices there is exactly one path, no more, no less. Figure 2.4 represents a tree.
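The definitions above can be sketched in code (our own illustration, not from the thesis): connected components via graph traversal, and the tree test using the standard fact that an undirected graph is a tree iff it is connected and has exactly |V| − 1 edges:

```python
from collections import defaultdict

def connected_components(vertices, edges):
    """Maximal subsets of mutually connected vertices (edges read as undirected)."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, components = set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:  # depth-first traversal collecting one component
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components

def is_tree(vertices, edges):
    # Connected and acyclic; for an undirected graph this is equivalent
    # to being connected with exactly |V| - 1 edges.
    return (len(connected_components(vertices, edges)) == 1
            and len(edges) == len(vertices) - 1)
```

Adding any extra edge to a tree necessarily creates a cycle, which the edge-count check captures.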

In a tree, a single vertex can be promoted to be the root of the tree. In a rooted tree, for a given edge e_ij, the vertex v_i is closer to the root than v_j. We say that v_i is the parent of v_j and that v_j is the child of v_i. The transitive closure of the child relation is called the descendant relation: the descendants of v_i are either v_i's children or children of another descendant of v_i. Conversely, all the vertices between v_j and the root of the tree are the ancestors of v_j.
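In a rooted tree these relations follow directly from parent links; the following sketch (our illustration, with assumed `parent` and `children` maps) computes ancestors by walking toward the root and descendants as the transitive closure of the child relation:

```python
def ancestors(parent, v):
    """All vertices between v and the root; `parent` maps each
    non-root vertex to its parent."""
    result = []
    while v in parent:  # the root has no parent entry
        v = parent[v]
        result.append(v)
    return result

def descendants(children, v):
    """Transitive closure of the child relation below v;
    `children` maps a vertex to the list of its children."""
    out, stack = [], list(children.get(v, []))
    while stack:
        u = stack.pop()
        out.append(u)            # u is a descendant of v
        stack.extend(children.get(u, []))  # and so are u's children
    return out
```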

### Dependency Parsing

We call a sentence x = x_1 x_2 … x_n a sequence of words, numbers and punctuation symbols of length |x| = n. In the following, we will use word as a generic term for words, numbers and punctuation symbols. A dependency relation is a directed relation between two words x_i and x_j such that one is the governor (or head, or parent) and the other is the dependent (or child). Typically, those relations encode syntactic/semantic facts, such as a determiner or an adjective attaching to a noun, or a noun used as a direct object attaching to a verb. A dependency graph G over a sentence x is a graph whose vertices V = {x_i | i ∈ N_n} are the words of x in sequential order and whose edges E = {e_ij | (i, j) ∈ N_n², i ≠ j} are dependency relations between those words.

Different linguistic theories of syntactic dependency can lead to different dependency graphs for a single sentence x. Some are arbitrary graphs while others are constrained. The most widely studied dependency graphs in NLP are by far dependency trees, meaning that each word has at most one governor and that exactly one word has no governor. This word is the root of the tree. Figure 2.5 gives an example of such a dependency tree.
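The tree constraints just stated can be checked mechanically. Below is a small sketch of our own (the function name and the `(head, dependent)` pair encoding are assumptions, not from the thesis): each word may have at most one governor, exactly one word has none, and every head chain must reach the root without a cycle:

```python
def is_dependency_tree(n, edges):
    """Check the tree constraints on a dependency graph over n words.
    edges: (head, dependent) pairs over 1-indexed word positions."""
    head = {}
    for h, d in edges:
        if d in head:        # a word with two governors: not a tree
            return False
        head[d] = h
    roots = [w for w in range(1, n + 1) if w not in head]
    if len(roots) != 1:      # exactly one word must have no governor
        return False
    # every word must reach the root without revisiting a vertex (no cycle)
    for w in range(1, n + 1):
        seen = set()
        while w in head:
            if w in seen:
                return False
            seen.add(w)
            w = head[w]
    return True
```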

Dependency relations can further be typed; in that case we say that the dependency structure is labeled. Typical labels for dependency relations are syntactic roles like Subject and Object, but again, types can vary from theory to theory. Figure 2.6 gives the labeled version of the structure from Figure 2.5. An annotation scheme is the translation of a dependency theory into a set of rules used to derive trees for sentences. Whilst most rules of a scheme are directly derived from a given theory, conventions are often necessary to keep the structures constrained. For example, most theories have subjects and objects governed by verbs. However, in sentences with several possible governors, like in "Peter bought and ate apples", where Peter could depend on both bought and ate, rules might be necessary to break ties, especially if one wants to work with trees. Dependency trees can be further qualified as projective or non-projective. A tree is projective if it can be drawn above the sentence without any crossing edges. More formally, a dependency relation e_ij on x is projective if for all k such that i < k < j (resp. j < k < i), x_k is a descendant of x_i. If all the dependencies of a tree are projective, then the tree itself is projective.
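Projectivity can be tested directly with the no-crossing formulation: a tree is projective iff no two dependency spans cross when drawn above the sentence. The sketch below is our own illustration (names and the `(head, dependent)` encoding are assumptions):

```python
def is_projective(edges):
    """True iff no two dependencies cross when drawn above the sentence.
    edges: (head, dependent) pairs over 1-indexed word positions."""
    spans = [(min(h, d), max(h, d)) for h, d in edges]
    for a, b in spans:
        for c, d in spans:
            # two spans cross when exactly one endpoint of the second
            # lies strictly inside the first
            if a < c < b < d:
                return False
    return True
```

Checking the condition over all ordered pairs of spans covers both crossing orientations, since each pair is visited twice with the roles swapped.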

#### Dependency Parsing as an NLP Task

The problem of dependency parsing is to find the best dependency structure (from now on, we will assume this structure is a tree) for a sentence, given a set of rules that we will define later. From a computational point of view, an algorithm to perform dependency parsing is an algorithm A of type R → (X → Y), where R is the set of sets of rules constraining the dependency structures, X is the set of sentences and Y is the set of dependency structures. Let r ∈ R be a specific set of rules; then the instance A_r is an algorithm that takes a sentence x as input and outputs a tree y. An example of such a rule would be that an adjective appearing to the left of a noun attaches to that noun, as is the case in French.

The sets of rules that constrain dependency trees can be given in very different ways. The first dependency parsers used explicit hand-crafted, hard-coded rules in the shape of dependency grammars [KMN09]. However, this requires a huge amount of error-prone work and human expertise, and because human languages evolve fast, are ambiguous and are full of idiosyncrasies, such hand-crafted rule sets are not able to generalise well to the broad range of language uses. Thus modern parsers rely on implicit rules given in the shape of sets of annotated example sentences called treebanks. Because of the implicit nature of those rules, parsers use machine learning techniques to learn them from annotated examples. We will describe those annotated examples in greater detail in Section 4.2. Assuming one has access to a mechanism that tells how good a tree y is for a given sentence x and a rule set r, such as a scoring function, then we would just need to search through the space of trees and pick the best: ŷ = argmax_y Score_r(x, y).
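The argmax above can be made concrete with a deliberately naive sketch of our own: enumerate every assignment of one head per word (0 standing for an artificial root), keep only valid trees, and return the highest-scoring one. The `score` function is an assumed black box; real parsers replace this exponential enumeration with the dedicated algorithms discussed in the following sections:

```python
from itertools import product

def _is_valid_tree(heads):
    # heads[i] is the governor of word i + 1; 0 is the artificial root.
    n = len(heads)
    for w in range(1, n + 1):
        seen, h = set(), w
        while h != 0:        # walk the head chain up to the root
            if h in seen:
                return False  # cycle: the chain never reaches the root
            seen.add(h)
            h = heads[h - 1]
    return True

def parse(words, score):
    """Brute-force sketch of y-hat = argmax over trees y of Score(x, y)."""
    n = len(words)
    best, best_score = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        # require a single root and no cycles
        if heads.count(0) != 1 or not _is_valid_tree(heads):
            continue
        s = score(words, heads)
        if s > best_score:
            best, best_score = heads, s
    return best
```

With a toy scoring function that rewards a hand-picked set of gold edges, `parse` recovers exactly that tree; the search space has (n+1)^n head assignments, which is why the exhaustive version is only viable on toy inputs.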

**Table of contents :**

**1 Introduction**

1.1 Outline

**2 Preliminaries**

2.1 Snippet of Graph Theory

2.2 Dependency Parsing

2.2.1 Dependency Parsing as an NLP Task

2.2.2 Graph-based Parsing

2.2.3 Transition-based Parsing

2.2.4 Other Approaches

2.2.5 Evaluation

2.3 Machine Learning and Structured Prediction

2.3.1 Structured Prediction

2.3.2 Learning Scoring Functions

2.3.3 Large Margin Classifiers

2.3.4 Online Learning

2.3.5 Neural Parsers

2.4 Conclusion

**3 Representing Word Information**

3.1 Lemmas, Parts-of-speech and Morphological Attributes

3.1.1 Parts-of-speech

3.1.2 Morphological Features

3.2 Learning Word Representation

3.2.1 The Different Views of a Word

3.2.2 Types of Word Representations

3.2.3 Beyond One-Hot Encoding

3.2.4 Distributional Hypothesis

3.2.5 Distributional Semantics

3.2.6 Continuous Representations

3.2.7 Discrete Representations

3.2.8 Engineering, Learning and Selection

3.3 Conclusion

**4 Related Works on Multi-Lingual Dependency Parsing**

4.1 Multi-Lingual Dependency Parsing

4.2 Universal Dependencies

4.3 Related Work

4.3.1 Delexicalised Parsers

4.3.2 Annotation Projection

4.3.3 Cross-Lingual Representations

4.3.4 Direct Transfer and Surface Form Rewriting

4.3.5 Multi-Lingual Dependency Parsing

**5 Delexicalised Word Representation**

5.1 Related Work

5.2 Delexicalised Words

5.3 Representation Learning

5.3.1 Delexicalised Contexts

5.3.2 Structured Contexts

5.3.3 Structured Delexicalised Contexts

5.3.4 Dimension Reduction

5.4 Dependency Parsing with Delexicalised Word

5.5 Experiments

5.5.1 Settings

5.5.2 Results

5.6 Conclusion

**6 Phylogenetic Learning of Multi-Lingual Parsers**

6.1 Related Work

6.1.1 Multi-Task Learning

6.1.2 Multi-Lingual Dependency Parsing

6.2 Phylogenetic Learning of Multi-Lingual Parsers

6.2.1 Model Phylogeny

6.2.2 Phylogenetic Datasets

6.2.3 Model Training

6.2.4 Sentence Sampling

6.2.5 Zero-Shot Dependency Parsing

6.3 Tree Perceptron

6.3.1 Averaging Policy

6.4 Neural Model

6.5 Experiments

6.5.1 Setting

6.5.2 Results with Training Data

6.5.3 Zero-Shot Dependency Parsing

6.6 Conclusion

**7 Measuring the Role of Morphology**

7.1 Morphological Richness

7.2 Measuring Morphological Richness

7.2.1 Related Work on Measures of Morphological Richness

7.2.2 Form per Lemma Ratio

7.3 Morphological Richness in Dependency Parsing

7.4 Measuring Morphology Syntactical Information

7.5 Annotation Scheme Design

7.5.1 Part-of-speech Tags

7.5.2 Word Tokenisation

7.5.3 Dependency Scheme

7.6 Experiments

7.6.1 Parsing Model

7.6.2 Word Representation

7.6.3 Experimental Setting

7.6.4 Results

7.6.5 Assessing the Impact of the Annotation Scheme

7.7 Conclusion

**8 Conclusion**

8.1 Contribution

8.2 Future Works

**9 Appendix**

9.1 Model Propagation for Dependency Parsing

9.1.1 Experiments

9.1.2 Results

9.2 Measuring the Role of Morphology