Improving Alignment with Discriminative Learning Techniques for Statistical Machine Translation

Get Complete Project Material File(s) Now! »

Asymmetric One-to-Many Methods

A first family of word alignment models recasts the problem as a sequence labeling task.Each target2 word ej is labeled with a source position i ∈ N. We denote such alignment as a to differentiate it from the unconstrained word alignment A. Formally, a is a sequence of length M of source positions. Similarly to the general case, this alignment can be seen as a function, but this time mapping positions in one sentence to positions in the other a : M → N.
The number of different possible label sequences is MN where M is the length of the target sentence and N is the size of the label set (the source sentence positions). While this number is smaller than the general case for unconstrained alignments (2NM), it is still too large to allow for exhaustive enumeration of all possible alignments.
General sequence labeling (functional, non-injective and non-surjective constraints) means that each target word is aligned to exactly one source word position, resulting in one-to-many alignments. Additional injectivity or bijectivity constraints result in one-to-one alignments. In order to allow target words not to be linked to any particular source word, the codomain of the function is usually augmented with a special null token at position (0). Linking a target word to this particular source position implies unalignment. A source position can be linked to zero or more target positions. No restriction on the distortion of the alignment is imposed, so arbitrary crossing links are permitted.
Sequence labeling constraints thus result in directional, asymmetric, many-to-one alignments. Obtaining many-to-many alignments then requires exchanging the roles of the sentences and recombining two directional alignments. Figure 2.3 show examples of two alignments in opposite directions.

Heuristic Alignments

The simplest method makes the alignment decisions only depend on the similarity between the words of the languages (Smadja, McKeown, and Hatzivassiloglou, 1996; Ker and Chang, 1997; Melamed, 2000). The Dice coefficient (Dice, 1945), log-likelihood ratio (Dunning, 1993) and p-value resulting from statistical significance tests are used to populate the alignment matrix with association scores ci;j. From this association score matrix, the word alignment is obtained by applying a sequence labeling heuristic. For instance, each target word ei is aligned to the source word with the highest association score: aj = arg maxi ci;j. The advantage of these heuristic approaches is their simplicity. However, the choice of the scoring function is arbitrary. Furthermore, the strength of the association is overestimated unless careful adjustment are taken as pointed out by Moore (2004b). Another problem is that alignment decisions are made completely independently from one another, which is clearly 2Source and target are interchangeable.

Supervised Discriminative Sequence Models

Models presented in Section 2.2.2 define the joint probability distribution p(e; ajf) and use it to infer the alignment. They are all trained in an unsupervised way. In this section we consider supervised models for sequence structure prediction.

Maximum entropy models

The simplest approach is to directly estimate, for each target word, the probability the alternative alignment decisions which range over the source positions. This can be done using a popular multi-class classification framework called MaxEnt, of which CRF is a generalization to more complex structures. Ittycheriah and Roukos (2005) propose to model the conditional alignment distribution using a log-linear model: p(aje; f) = 1 Z(; e; f) exp MX =1 p(ajjaj-1) + (1 – )p(ajjej-1 1 ; f).

Symmetric Many-to-Many Methods

Sequence labeling approaches studied in Section 2.2 produce asymmetric one-to-many alignments. However, the one-to-many assumption is over-simplistic and relies on an arbitrary choice of the alignment direction8. A different approach that does not suffer from asymmetry is to predict the binary alignment matrix. The problem is reformulated as follows:
∙ Input is a pair of sentences (e; f) ∈ × .
∙ The output is an unrestricted many-to-many word alignment A ∈ A.
This alignment is usually represented by enumerating the functions A : N ×M → f0; 1g which map the cells (i; j) in the alignment matrix to a binary value Ai;j indicating whether corresponding words are aligned or not: where 1 indicates an active link and 0 an inactive link. 7Zero-mean Gaussian prior with uniform covariance matrix is equivalent to the `2 regularization (Chen and Goodman, 1996).

Symmetrization and Alignment Combination

An increase in alignment model’s expressivity usually comes at the price of intractability, implying approximate heuristic learning and inference prone to search errors. Going beyond IBM model 2 or HMM is an example. An alternative is to combine several simple alignments to obtain a more expressive one.

Symmetrization heuristics

The simplest approach is to merge the two directional alignment functions using a symmetrization heuristic (Och, Tillmann, and Ney, 1999; Koehn, Och, and Marcu, 2003; Och and Ney, 2003). One such heuristic is to take the intersection of the two alignment sets as follows9: A = af!e ∩ ae!f. Intersection alignments matrices are sparse and encode only one-toone relationship between words. However the alignment are usually of high precision due to the agreement of both models. An alternative assumption is that the two alignments contain complementary information and their union is therefore considered instead of their intersection. Many-to-many relationship can be expressed this time and the resulting matrices tend to be highly populated. A higher recall can be achieved at the price of losing in precision.

READ The growth of interest in the cultural dimension of foreign language education

Table of contents :

Remerciements
Contents
Introduction
Current practices in bitext alignment
Issues and challenges
Improving alignments with discriminative techniques
I Bitext Alignment
1 The Alignment Problem: An Overview
1.1 Bitext Alignment
1.2 Translation and Alignment
1.2.1 Identifying the Translation Unit
1.2.1.1 Meaning-language interface
Words and concepts
Word lexical ambiguity
Word order
1.2.1.2 Translation strategy
1.2.2 Translation Units and Alignment Difficulty
1.2.3 Translation Unit and Alignment-Context Bound
1.3 Alignment Granularity
1.3.1 Document Alignment
1.3.2 Sentence Alignment
1.3.3 Sub-sentential Alignment
1.3.3.1 Word alignment
1.3.3.2 Phrase alignment
1.3.3.3 Structure and tree alignment
1.4 Applications
1.5 A Generic Framework for Alignment
1.6 Alignment Space and Constraints
1.6.1 Segment Constraints
1.6.1.1 Contiguity constraints
1.6.1.2 Length constraints
1.6.1.3 Structural constraints
1.6.2 Alignment Constraints
1.6.2.1 Structural constraints
1.6.2.2 Range constraint
1.6.2.3 Functional constraints
1.6.2.4 Bijectivity constraints
1.7 Evaluation Methods
1.7.1 Intrinsic Measures
1.7.1.1 Alignment Error Rate (AER)
1.7.1.2 Balanced F-measure
1.7.1.3 Other word-level measures
1.7.1.4 Phrase-level measures
1.7.2 Extrinsic Measures
1.7.3 Correlation
1.8 Summary
2 Alignment Models
2.1 Word-Based Alignment Models
2.2 Asymmetric One-to-Many Methods
2.2.1 Heuristic Alignments
2.2.2 Unsupervised Generative Sequence Models
2.2.2.1 Conditional Bayesian networks
Parameter estimation
Expectation-Maximization (EM)
IBM model 1
Inference and EM
Limitations
IBM Model 2
Hidden Markov Model (HMM) alignment
Inference and EM
IBM model 3
Inference and EM
IBM model 4 and beyond
Local log-linear parameterization
Discussion
2.2.2.2 Conditional Random Fields
Inference
Unsupervised parameter estimation
2.2.3 Supervised Discriminative Sequence Models
2.2.3.1 Maximum entropy models
2.2.3.2 Conditional Random Fields
Supervised parameter estimation
2.2.3.3 Large-Margin methods
2.3 Symmetric Many-to-Many Methods
2.3.1 Symmetrization and Alignment Combination
2.3.1.1 Symmetrization heuristics
Grow-diag-final-and (GDFA)
Generalizing the symmetrization
Application-driven combination
2.3.1.2 Agreement constraints
2.3.1.3 Discriminative combination
2.3.2 Weighted Matrix Based Methods
2.3.2.1 Minimum Bayes-risk decoding
2.3.2.2 One-to-many constraints
2.3.2.3 One-to-one constraints
2.3.2.4 Alignment as assignment
2.3.2.5 Alignment as matrix factorization
2.3.3 Generative Many-to-Many Models
2.3.4 Global Discriminative Models
2.3.4.1 CRF-based matrix modeling
2.3.4.2 Other models
2.4 Syntactic and Hierarchical Alignments
2.4.1 Inversion Transduction Grammars
2.4.2 parameterization and Learning
2.4.3 Syntactic Constraints
2.4.4 Other Syntax-Based Models
2.5 Phrase-Based Alignment Models
2.5.1 Bisegmentation
2.5.1.1 Generative models
Hidden semi-Markov models
The degeneracy problem
2.5.1.2 Bayesian models
2.5.1.3 Discriminative models
2.5.2 Generalized Phrase Alignment
2.5.2.1 Extraction heuristics
The standard approach
Weighted phrase-based matrix
2.5.2.2 Translation spotting
2.5.2.3 Discriminative models
2.6 Features
2.6.1 Type
2.6.2 Indicators of alignment
2.6.3 Scope
2.7 Summary
3 Phrase based SMT
3.1 Phrase-Based Translation Model
3.2 Modeling and Parameter Estimation
3.2.1 Discriminative Translation Models
3.2.2 Bilexicon Induction
3.2.3 Features
3.2.4 The Phrase Table
3.2.5 Learning in Discriminative Models
3.3 Decoding
3.4 Evaluating Machine Translation
3.5 Summary
II Improving Alignment with Discriminative Learning Techniques for Statistical Machine Translation
Research Statement
4 MaxEnt for Word-Based Alignment Models
4.1 Word Alignment as a Structured Prediction Problem
4.2 The Maximum Entropy Framework
4.3 Minimum Bayes-Risk Decoding
4.4 Parameter Estimation
4.5 The Set of Input Links
4.6 Features
4.6.1 Word Features
4.6.2 Alignment Matrix Features
4.6.3 Partitioning Features
4.7 Stacked Inference
4.7.1 The Stacking Algorithm
4.7.2 A K-fold Selection Process
4.7.3 Stacking for Word Alignment
4.8 Experimental Methodology
4.8.1 Experimental Setup and Data
4.8.2 Arabic Pre-processing
4.8.3 Remappings Alignments
4.9 Results
4.9.1 Comparison to Generative “Viterbi” Alignments
4.9.1.1 Baselines: IBM and HMM models
4.9.1.2 MaxEnt and stacking
4.9.2 Pruning and Oracle Study
4.9.3 Discriminative Training Set Size
4.9.4 Features Analysis
4.9.4.1 First feature group
4.9.4.2 Second feature group
4.9.5 Precision-Recall Balance
4.9.6 Regularization
4.9.7 Search Space and Window Size
4.9.8 Input Alignments Quality
4.9.9 Model and Feature Selection
4.9.10 A Comparison with Weighted Matrix Based Alignments
4.9.10.1 Viterbi IBM and HMM models
4.9.10.2 N-best heuristic
4.9.10.3 PostCAT
4.9.10.4 CRFs
4.9.10.5 MaxEnt
4.10 Error Analysis
4.11 Summary
5 MaxEnt Alignments in SMT
5.1 Phrase Table Construction
5.1.1 A General Framework
5.1.2 Viterbi-Based (Standard) Approach
5.1.3 WAM-based Instantiation
5.1.3.1 Evaluation and counting functions
5.1.3.2 Alignment constraints and selection criteria
5.1.3.3 Translation model scores
5.2 Experiments
5.2.1 Viterbi-Based Extraction
5.2.1.1 Large scale systems
MaxEnt vs. IBM and HMM models
Correlation between AER and BLEU
5.2.1.2 A study of alignment characteristics
5.2.2 Weighted Matrix Based Extraction
5.2.2.1 Results and discussion
MGIZA++
N-best WAM
PostCAT
CRF
Maximum Entropy (MaxEnt)
5.2.2.2 Discussion
5.3 Summary
6 Supervised Phrase Alignment with SCC
6.1 Supervised Phrase-Pair Extraction
6.1.1 Single-Class Classification (SCC)
6.1.2 Phrase Translation Model Training Algorithm
6.1.3 Balancing Precision and Recall
6.2 Learning the Single-Class Classifier
6.2.1 One-Class SVM (OC-SVM)
6.2.2 Mapping Convergence (MC)
6.2.3 ˆPP Measure and Classifier Selection
6.3 Oracle Decoder for Building the Set of Positive Examples
6.4 Feature Functions
6.4.1 Weighted Alignment Matrix (WAM)
6.4.2 Word Alignments (WA)
6.4.3 Bilingual and Monolingual Information (BI, MI)
6.4.4 Statistical Significance (Pval)
6.4.5 Morpho-Syntactic Similarity (MS)
6.4.6 Lexical Probability (LEX)
6.5 Experiments
6.5.1 Data and Experimental Setup
6.5.2 Classification Performance: ˆPP
6.5.3 Translation Performance: BLEU
6.5.3.1 Phrase pairs scoring method
6.5.3.2 Using additional phrase table features
6.5.4 Discussion
6.6 Summary
Conclusion
Contributions
Future Work
Publications by the Author
Bibliography