On-demand system construction with contextual sampling


Computer-Assisted Translation (CAT)

Although SMT has made great progress in recent years, SMT systems are rarely used without any human intervention in the professional translation industry, as they cannot deliver high-quality translations for most translation tasks. A common practice in the industry is to provide SMT output to human translators for post-editing, a strategy that has been shown to be more efficient than translation from scratch in a number of situations [Plitt and Masselot, 2010, Federico et al., 2012]. However, post-editing involves no real interaction: the human translator simply corrects and improves the system's translation output.
Interactive Machine Translation (IMT) was pioneered by projects such as TransType [Langlais et al., 2000], where an SMT system assists the human translator by proposing translation completions that the translator can accept, modify or ignore. IMT was later extended to support more types of interaction [Barrachina et al., 2009, Koehn, 2009] and to integrate the results of the interaction so as to influence future choices of the system. More recently, online learning was introduced into the IMT framework [Ortiz-Martínez et al., 2010] to better exploit the translator's feedback.

Scaling to larger corpora

Typical SMT systems pre-compute all the necessary information in advance, once the (initial) parallel data are available. The size of the resulting translation models grows quickly with the size of the parallel data, outstripping improvements in computing power and hindering research on many types of new models as well as the experimental exploration of their variants. Callison-Burch et al. [2005] and Lopez [2008b] presented a solution for developing SMT systems that borrows from EBMT. The key idea is that translation rules and models are computed on the fly, only as needed for each particular input text. Given an input text d to translate, the set of all potentially useful phrases, denoted [d], is first extracted. For each extracted phrase s ∈ [d], its occurrences C[s] in the training corpus C are then located. Using a pre-existing word alignment A, the translations of each source phrase are extracted from its examples and used to compute its translation model parameters. This process is repeated for all source phrases s in [d], and a translation table is produced and subsequently used by a decoder to translate the input document. More formally, this procedure is sketched in Algorithm 1.
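The control flow of this on-demand procedure can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: the helper names are hypothetical, and each training sentence is assumed to come with its phrase pairs already extracted under the word alignment A (a real system would locate the occurrences C[s] and extract pairs on the fly).

```python
from collections import Counter, defaultdict

def extract_phrases(text, max_len=3):
    """All contiguous phrases of `text` up to max_len tokens: the set [d]."""
    toks = text.split()
    return {" ".join(toks[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(toks) - n + 1)}

def on_demand_table(doc, aligned_corpus, max_len=3):
    """Estimate p(t|s) only for source phrases that appear in `doc`.

    `aligned_corpus` stands in for C and A together: each entry is the
    list of (source phrase, target phrase) pairs extractable from one
    training sentence under its word alignment.
    """
    needed = extract_phrases(doc, max_len)           # the set [d]
    counts = defaultdict(Counter)
    for phrase_pairs in aligned_corpus:
        for s, t in phrase_pairs:
            if s in needed:                          # skip phrases not in the input
                counts[s][t] += 1
    # relative-frequency estimates, as in a standard phrase table
    return {s: {t: c / sum(ts.values()) for t, c in ts.items()}
            for s, ts in counts.items()}
```

Only phrases of the input document ever enter the table, which is what keeps the cost proportional to the input rather than to the full training corpus.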

Incremental alignment

Word alignment estimation is the most computationally expensive process in SMT system development.
The current state-of-the-art word alignment tool, giza++, implements a batch training regime based on the IBM word alignment models. A conventional setting is to run a number of iterations of the IBM 1, HMM, IBM 3 and IBM 4 models in turn, the output of each step being used to initialize the next. This process repeatedly scans the whole parallel corpus to collect statistics and train the IBM alignment models. IBM models 1 and 2 can be learned efficiently in polynomial time, while models 3 and 4 are computationally much more expensive and resort to hill-climbing techniques. The whole process is time-consuming, especially when the parallel corpus is large, so re-training the complete alignment when a proportionally small number of new sentences is added is a waste of resources. Significant computational resources could be saved if word alignments were computed only on the additional data. However, training word alignment models on small amounts of data yields poor models. Generating word alignments for additional data is therefore an important issue for the integration of new data.
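To make the batch EM regime concrete, the first stage of this cascade, IBM Model 1, can be sketched in a few lines. This is a minimal, unoptimized illustration of the E- and M-steps, not giza++'s implementation, which adds the HMM and fertility-based models, pruning, and NULL-word handling on top of this loop.

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Train IBM Model 1 lexical probabilities t[(wt, ws)] = p(wt | ws)
    by EM over a list of (source_tokens, target_tokens) sentence pairs.
    """
    tgt_vocab = {w for _, tgt in bitext for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))    # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                   # expected pair counts
        total = defaultdict(float)                   # expected source counts
        for src, tgt in bitext:
            for wt in tgt:
                # E-step: distribute wt's mass over all candidate source words
                z = sum(t[(wt, ws)] for ws in src)
                for ws in src:
                    c = t[(wt, ws)] / z
                    count[(wt, ws)] += c
                    total[ws] += c
        # M-step: re-estimate translation probabilities from expected counts
        for wt, ws in count:
            t[(wt, ws)] = count[(wt, ws)] / total[ws]
    return t
```

Each iteration reads the entire corpus, which is exactly why re-running the full cascade after a small data update is so wasteful.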
Some previous works, e.g. Lavergne et al. [2011], resorted to a sub-optimal forced alignment approach, in which the alignment model trained on the old data is used to align the additional data. The approach is sub-optimal because the statistics of the additional data are never collected, and thus cannot be used to improve either its own word alignment or the word alignment of the old data. Forced alignment is usually applied when only a small proportion of additional data is added, small enough not to have a significant impact on word alignment quality. To achieve better performance, system developers therefore typically choose to re-train the whole alignment model from scratch on all of the data.
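The essence of forced alignment is that the model is frozen. Under an IBM-1-style lexical table it reduces to a greedy per-word decision, which the following hypothetical sketch illustrates (NULL alignments and the richer distortion and fertility models of IBM 3/4 are omitted for brevity):

```python
def force_align(src_toks, tgt_toks, t):
    """Align each target token to its most likely source token under a
    frozen lexical table t[(tgt_word, src_word)] = p(tgt | src).
    The table is only read, never updated: the statistics of the new
    sentence pair are discarded, which is what makes the approach
    sub-optimal.
    """
    return [max(range(len(src_toks)),
                key=lambda i: t.get((wt, src_toks[i]), 0.0))
            for wt in tgt_toks]
```

Because `t` never changes, aligning new data this way is cheap, but the new evidence can never correct the model.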


Instance weighting

Unlike data selection, which makes a binary, hard decision (include or discard) for each sub-corpus, instance weighting approaches make soft decisions by assigning a weight to each unit: the most relevant units receive relatively high weights, and the least relevant ones low weights, possibly a null weight. The empirical phrase counts are modified using these weights, and the translation feature scores are modified accordingly. Instance weighting has been applied to different unit types at different levels of granularity: sub-corpora, sentence pairs and phrase pairs. Sennrich [2012] incorporated out-of-domain corpora using a weighted combination, where each sub-corpus is associated with a weight λ_i used to combine the sub-corpora in two ways:

Linear interpolation: p(t|s; λ) = Σ_{i=1}^{n} λ_i p_i(t|s), where n is the number of sub-corpora, p_i(t|s) the translation model trained on sub-corpus i, and λ_i the interpolation weight of model i.

Weighted count: p(t|s; λ) = (Σ_{i=1}^{n} λ_i c_i(s, t)) / (Σ_{i=1}^{n} Σ_{t'} λ_i c_i(s, t')), where c_i denotes the count of an observation in sub-corpus i, weighted by λ_i. The main difference from linear interpolation is that this formulation takes into account how well-evidenced a phrase pair is: it distinguishes between lack of evidence and negative evidence, a distinction missing in a naive implementation of linear interpolation.
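The two combination schemes can be contrasted directly. In this sketch (function names and data layout are illustrative, not Sennrich's implementation), each per-corpus model is a nested dict p_i(t|s) and each count table maps (s, t) pairs to raw counts c_i(s, t):

```python
def linear_interpolation(phrase_tables, weights, s, t):
    """p(t|s; lambda) = sum_i lambda_i * p_i(t|s): each sub-corpus model
    votes with its probability, regardless of how much evidence backs it."""
    return sum(w * pt.get(s, {}).get(t, 0.0)
               for w, pt in zip(weights, phrase_tables))

def weighted_count(count_tables, weights, s, t):
    """p(t|s; lambda) from weighted raw counts: a sub-corpus with many
    occurrences of s contributes more evidence, so a single chance
    co-occurrence cannot dominate."""
    num = sum(w * ct.get((s, t), 0) for w, ct in zip(weights, count_tables))
    den = sum(w * c for w, ct in zip(weights, count_tables)
              for (s2, _), c in ct.items() if s2 == s)
    return num / den if den else 0.0
```

With a sub-corpus where s occurs once (translated as t) and another where s occurs 100 times (only once as t), linear interpolation still gives t around half the mass, while the weighted-count estimate stays low because the large corpus supplies strong negative evidence.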

Table of contents :

I Development of Statistical Machine Translation Systems 
1 Overview of Statistical Machine Translation 
1.1 Statistical machine translation
1.2 Word alignment
1.3 Phrase-based probabilistic models
1.3.1 Target language model
1.3.2 Reordering models
1.4 Decoding and tuning
1.4.1 Scoring function
1.4.2 Automatic evaluation
1.4.3 Parameter tuning
1.5 Computer-Assisted Translation (CAT)
1.6 Summary
2 Improving SMT System Development 
2.1 Scaling to larger corpora
2.2 Incremental system training
2.2.1 Incremental alignment
2.2.2 Model integration
2.3 Data weighting schemes
2.3.1 Context-dependent models
2.3.2 Data selection
2.3.3 Instance weighting
2.3.4 Multi-domain adaptation
2.3.5 Translation memory integration
2.4 Summary
II On-demand Development and Contextual Adaptation for SMT
3 On-demand Development Framework of SMT Systems 
3.1 On-demand word alignment
3.1.1 Sampling-based transpotting
3.1.2 Sub-sentential alignment extraction
3.1.3 Differences from the giza++ alignment process
3.2 On-the-fly model estimation
3.3 On-demand system development
3.4 Experimental validation
3.4.1 Data
3.4.2 Validation of on-the-fly model estimation
3.4.3 Validation of on-demand word alignment
3.4.4 Validation of the framework
3.5 Summary
4 Incremental Adaptation with On-demand SMT 
4.1 Translation for communities with on-demand SMT
4.1.1 No cache, no tuning (Config0)
4.1.2 Using a cache (Config1)
4.1.3 Sampling by reusing aligned sentences (Config2)
4.1.4 Plug-and-play data integration (+spec)
4.1.5 Simple online tuning (+online)
4.2 Any-text translation with on-demand SMT
4.3 Summary
5 Contextual Adaptation for On-demand SMT 
5.1 Contextual sampling strategies
5.1.1 N-gram precision
5.1.2 TF-IDF
5.1.3 On-demand system construction with contextual sampling
5.2 Confidence estimation of adapted models
5.3 Experimental validation
5.3.1 Results
5.4 Contextual adaptation of on-demand SMT systems
5.5 Summary
Appendix A Abbreviation 
Appendix B Extracts of Data 
B.1 Newstest
B.2 WMT’14-med
B.3 Cochrane
Appendix C Documents in Any-text Translation 
Appendix D Translation for communities: translation examples 
Appendix E Publications By the Author 

