Heterogeneity in parsing: monolingual and cross-lingual issues 

Improving the classifiers: from the averaged perceptron to Stack-LSTMs

Since Collins and Roark (2004), the main classifier used to predict transitions in parsing has been the averaged multi-class perceptron.
The multi-class perceptron is a linear model that scores each possible decision t in a given configuration c by means of a dot product between a feature vector f(c, t) representing the configuration-action pair and a weight vector w. In transition-based parsing, the next transition to apply is thus t* = argmax_{t ∈ VALID(c)} w · f(c, t), where VALID(c) denotes the set of transitions that can be applied in configuration c (that is, whose preconditions are met).
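
To make the decision rule concrete, here is a minimal sketch in Python. It is illustrative rather than an actual parser implementation: feature_fn, valid_fn and the configuration object are hypothetical placeholders standing in for the real feature extractor and precondition checks.

```python
import numpy as np

def predict_transition(config, w, feature_fn, valid_fn):
    """Greedy decision rule: argmax over t in VALID(config) of w . f(config, t).

    w          -- weight vector of the linear model
    feature_fn -- maps a (configuration, transition) pair to a feature vector
    valid_fn   -- returns the set VALID(config) of applicable transitions
    """
    # Only transitions whose preconditions are met are ever scored.
    scores = {t: float(np.dot(w, feature_fn(config, t))) for t in valid_fn(config)}
    return max(scores, key=scores.get)
```
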
The weight parameters are trained using annotated data. For each training sentence, a derivation leading to the gold tree is given by an oracle; for each configuration along this derivation, the predicted action t− is compared to the gold action t+ (see Figure 3.4a). If they differ, the perceptron update rule is applied: w ← w + f(c, t+) − f(c, t−).
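
The update rule can be sketched as a short online training loop, reusing predict_transition from the previous sketch under the same assumptions; here the oracle is taken to yield (configuration, gold action) pairs along the gold derivation.

```python
def train_on_derivation(w, derivation, feature_fn, valid_fn):
    """One online pass over a sentence's oracle derivation.

    derivation -- sequence of (configuration, gold action t+) pairs
    """
    for config, t_gold in derivation:
        t_pred = predict_transition(config, w, feature_fn, valid_fn)
        if t_pred != t_gold:
            # Perceptron update: w <- w + f(c, t+) - f(c, t-)
            w = w + feature_fn(config, t_gold) - feature_fn(config, t_pred)
    return w
```
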
In the averaged perceptron variant (Freund and Schapire, 1999), all values taken by the w vector throughout training are remembered, and the value retained for w at the end of training is not its last value but the average of all its values over the whole run. This improves online training by alleviating its bias toward the last configurations seen. Averaging does not alter the convergence guarantees of perceptron updates, but it makes the long-term effect of each update harder to interpret than with standard training. For instance, if two sentences in the training data lead to opposite updates, in standard training their effects cancel out exactly, while with averaging the final value of the parameters depends on the time elapsed between the two sentences: the longer the delay, the more weight the first update carries in the average.
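
The averaging need not store every intermediate value of w explicitly: a running sum gives the same result. The sketch below is a naive dense version under the same assumptions as above; real implementations typically use a lazy, timestamp-based variant so that sparse updates stay cheap.

```python
import numpy as np

class AveragedPerceptronWeights:
    """Maintains both the current weight vector and its running average."""

    def __init__(self, dim):
        self.w = np.zeros(dim)       # current weights, used during training
        self.w_sum = np.zeros(dim)   # sum of all values taken by w
        self.n_steps = 0

    def update(self, delta):
        """Apply a perceptron update (delta = f(c, t+) - f(c, t-))."""
        self.w += delta

    def tick(self):
        """Fold the current value of w into the running sum; called once
        per training step, whether or not an update was applied."""
        self.w_sum += self.w
        self.n_steps += 1

    def averaged(self):
        """Final averaged weights: the mean of w over all training steps."""
        return self.w_sum / max(self.n_steps, 1)
```
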

Beam search and global training

One major challenge transition-based parsing has to face is the locality of its decisions. Indeed, the dependency structure of a free-text sentence often involves long-distance interactions: interpolated clauses, long dependencies, high-arity nodes, multi-clause sentences, and so on. In such cases, the classifier needs information from a distant context to decide on the next action to take. While the feature templates of Zhang and Nivre (2011) introduce some non-local information, they do not cover all the potentially relevant nodes, which can be arbitrarily far away.

Globalization of local parsers

To overcome this problem, the standard strategy, first applied to transition-based dependency parsing by Johansson and Nugues (2006), is to use beam search: instead of predicting the transition sequence greedily, by selecting the single best action at each step, the parser maintains an agenda of the k best (partial) derivations so far, exploring new transitions from all of them at once. To discard the worst hypotheses in the beam at each step and keep its size at k, derivations are compared on the sum of all the transition scores (local scores) along the sequence.
Thus each partial analysis is assigned a global score, and in the end the best parse is selected as a whole; this narrows the gap with the global decisions made in graph-based parsing.
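
A schematic beam-search decoder might look as follows. This is a sketch under the same assumptions as the earlier snippets; the helper names apply_fn and is_final_fn are illustrative placeholders, and the point is only that hypotheses are ranked on the sum of their local transition scores.

```python
import heapq

def beam_parse(init_config, w, feature_fn, valid_fn, apply_fn, is_final_fn, k=8):
    """Keep the k best partial derivations; return the best complete one.

    A hypothesis is a (global_score, config) pair, where global_score is
    the sum of the local scores w . f(c, t) along the derivation.
    apply_fn(config, t) returns the successor configuration.
    """
    beam = [(0.0, init_config)]
    while not all(is_final_fn(c) for _, c in beam):
        expanded = []
        for score, config in beam:
            if is_final_fn(config):
                expanded.append((score, config))   # finished: carry over as-is
                continue
            for t in valid_fn(config):
                local = float(w @ feature_fn(config, t))
                expanded.append((score + local, apply_fn(config, t)))
        # Prune back to the k best partial derivations.
        beam = heapq.nlargest(k, expanded, key=lambda h: h[0])
    # The best parse is selected as a whole, on its global score.
    return max(beam, key=lambda h: h[0])[1]
```
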

Role of the beam

The effect of beam search on the parsing process can be interpreted in four ways.

As delaying difficult decisions: some decisions cannot be taken locally because they require information that only becomes available later. By keeping most of the paths from this point on in the beam, the parser is able to wait until it knows the future cost of each action: earlier decisions are settled only once later, more informative features receive high scores.

As a means of error recovery: the beam marginally compensates for a model that is not perfectly trained. Indeed, if two actions have very close scores, the difference may result from a training error, in which case it is safer to wait for a later decision whose margin is large enough to be reliable. With greedy parsing, which takes decisions immediately, any error is unrecoverable.

As a solution to the label bias problem (Bottou, 1991; Lafferty et al., 2001): even with a perfectly trained model, it may be mathematically impossible to model some dependency structures, because the globally optimal action is truly locally suboptimal. In such cases, dead-end actions can hide the optimal derivation, which is currently low-scored, and mislead a greedy parser; beam search prevents this effect by exploring the neighbourhood of the locally optimal derivation.

As modeling actual ambiguities: in some cases, sentences are truly ambiguous at the start, even for a human being. When the wrong interpretation remains dominant for a large part of the sentence, the sentence is referred to as a garden path sentence (Frazier, 1979). In this case, a greedy parser will likely stick to the wrong interpretation. Keeping several hypotheses in the beam thus makes it possible to maintain concurrent interpretations of the beginning of the sentence, up to the disambiguating point.

Table of contents:

Abstract
Remerciements
List of Tables
List of Figures
Glossary
1 Introduction 
1.1 Maximizing the usability of existing resources
1.2 Problem statement: multi-(re)source combination
1.3 Dependency parsing as a case study
1.4 Contributions
1.5 Outline
I The state of the art in cross-lingual parsing 
2 Cross-lingual transfer and multilinguality 
2.1 Cross-lingual bridge: an attempt at an exhaustive typology
2.2 State-of-the-art methods for cross-lingual transfer
2.2.1 Meta-methods in data and parameter spaces: a reading grid
2.2.2 Data space transfer
2.2.3 Parameter space transfer
2.2.4 An ambivalent case: direct delexicalized transfer
2.2.5 Common representations
2.3 Recent advances: cross-lingual embeddings and multilingual models
2.4 Evaluation of cross-lingual transfer
2.5 Conclusion: a flourishing field, yet hard to categorize
3 Recent advances in transition-based dependency parsing 
3.1 The dependency parsing formalism
3.2 Transition-based parsing: strengths and limitations
3.3 Transition systems
3.3.1 Stack-based systems
3.3.2 Understanding derivations: action-edge correspondence
3.4 Improving the classifiers: from the averaged perceptron to Stack-LSTMs
3.5 Beam search and global training
3.5.1 Globalization of local parsers
3.5.2 Role of the beam
3.5.3 Structured perceptron
3.5.4 Partial updates
3.5.5 Use in recent parsers
3.6 Dynamic oracles
3.6.1 Issues of static oracles
3.6.2 Training with dynamic oracles
3.6.3 Practical computation of action costs
3.6.4 Use in recent parsers
3.6.5 Differences with beam parsing
3.7 Conclusion: four open avenues
4 Cross-lingual parsing 
4.1 Methods for cross-lingual parsing
4.1.1 Direct transfer
4.1.2 Parser projection
4.1.3 Training guidance
4.1.4 Joint learning
4.2 Cross-treebank consistency
4.3 Evaluation of cross-lingual parsing
4.4 Relations with domain adaptation
4.5 Effective use of cross-lingual resources
4.6 Conclusion: softer requirements, finer combination
II Cross-lingual parsing for low-resourced languages: an empirical analysis 
5 How good are we at cross-lingual parsing? 
5.1 Assessing transfer usefulness in a case study
5.2 Investigating tiny treebanks as an alternative to transfer
5.2.1 Selecting a competitive tiny parser
5.2.2 An interpretable quality scale
5.2.3 Application: parsing capacity of cross-lingual and unsupervised parsers
5.3 The optimistic approach: opposites attract
5.4 Complementarity from a linguistic standpoint
5.4.1 A motivating example: Romanian
5.4.2 A bilingual typology of knowledge
5.4.3 Target-oriented relevance
5.5 Wrap-up: not so good, but hope prevails
6 Heterogeneity in parsing: monolingual and cross-lingual issues 
6.1 Parser usefulness: heterogeneity matters
6.2 Empirical evidence of treebank heterogeneity
6.3 Lexicalization and explaining away
6.3.1 Hands-on measures
6.3.2 Generalization and explaining away effects
6.4 Class hardness
6.4.1 Existing metrics to assess task difficulty
6.4.2 Characterizing difficulties: class hardness
6.4.3 Comparative evaluation
6.5 Interactions of unbalanced classes
6.5.1 Accuracy experiments: knowledge does not flow between similar classes
6.5.2 Feature-level experiments: complex classes result from unstable parameters
6.5.3 Word-level subsampling
6.6 Wrap-up: more is both more and less
III A new framework for cross-lingual transfer 
7 Developing a more flexible parsing system 
7.1 An increased need for flexibility
7.1.1 Non-standard parsing tasks
7.1.2 Additional benefits of dynamic oracles
7.1.3 Dynamic oracles for global training
7.2 Extensions of dynamic oracles
7.2.1 Redefining action cost to cover more ground: non-arc-decomposable systems and non-projective trees
7.2.2 Global dynamic oracles
7.2.3 Framework unification
7.3 PanParser: a flexible parser with a modular architecture
7.4 Partial training: an application to cross-lingual transfer
7.5 Partial transition systems: learning to ignore
7.6 Parsing under constraints: exploiting prior annotations
7.7 Conclusion: a dynamic oracle is all you need
8 Incompatible representations of knowledge 
8.1 Typological differences incur transfer impossibilities
8.1.1 Order-dependent knowledge representation
8.1.2 Transfer failures
8.2 Reshaping training instances
8.2.1 Target-optimal reordering with a language model
8.2.2 Heuristic rewrite rules based on typological knowledge
8.3 Experiments
8.3.1 Experimental setup
8.3.2 Results
8.3.3 A fine-grained analysis
8.4 Qualitative comparison of relevance models
8.5 Conclusion: beyond the proof of concept
9 Cascading models for multi-(re)source selective transfer 
9.1 Transfer cascades
9.1.1 Core method
9.1.2 Relations with the meta-learning literature
9.1.3 Algorithmic refinements
9.2 Concrete measures of relevance
9.2.1 Using a development set
9.2.2 Treebank similarity with a target sample
9.2.3 Approximation with raw target data
9.2.4 Another use of typological knowledge
9.3 Monolingual cascades
9.3.1 Multi-system and multi-domain cascades
9.3.2 Divide-and-conquer cascades
9.4 Experiments in realistic conditions
9.4.1 System components
9.4.2 Monolingual and cross-lingual cascades
9.4.3 Per-treebank results
9.5 Conclusion: the road ahead
IV Conclusion 
10 Summary and perspectives 
Appendices 
A Advanced strategies for global training in parsing
A.1 Restart strategy with global dynamic oracles
A.2 Using global dynamic oracles to correct training biases
A.3 Experiments
A.4 Conclusion
B Word alignment transfer 
B.1 Aligning words
B.2 Word alignments: cross-lingual scenarios
B.3 Concrete methods for transferring alignments
B.4 Experiments
B.5 Conclusion
C Leveraging lexical similarities: zero-resource tagger transfer 
C.1 Perceptron-based PoS tagging
C.2 Modifying feature representations
C.3 Experiments
C.4 Conclusion
Résumé détaillé
List of publications
Bibliography 
