Unsupervised word discovery
Documenting the lexicon is an important task in endangered language documentation (see Haig et al., 2011, inter alia). Assuming speech can be automatically and reliably transcribed into a sequence of phone-like units (following the methodology of the BULB project, see Section 1.2.2), the word discovery task consists in segmenting this sequence into words. In this document, we refer to this task interchangeably as word discovery or word segmentation. In our context, and for the reasons given in Section 1.2.3, we attempt to perform word segmentation without supervision, or with minimal supervision.
This task can be carried out in a monolingual setting. With bilingual data, however, word discovery can be tightly coupled with word alignment: segmentation can be guided by the alignment between phone-like units on one side of the bilingual corpus and words on the other side; conversely, automatically segmented words can be aligned to their well-resourced counterparts. A formal description of both tasks is given in Section 2.1. For reasons explained in Section 3.1.1, and to make the results presented in different chapters of this thesis comparable, our experiments use graphemic transcriptions made by linguists instead of automatically transcribed phone-like units. We investigated the latter, more realistic, condition in (Godard et al., 2018a,d), as well as in (Ondel et al., 2018), and we refer to this work in several places.
Our work embraces the goals of the BULB project, and more broadly of CLD, to support the work of linguists in the documentation of endangered languages via automatic processing. More specifically, we first aim at benchmarking existing word discovery methods, in order to assess their usefulness for CLD and their applicability within a language documentation scenario. Our goal is then to propose improvements to the most promising techniques, and to make progress towards providing linguists with useful tools in their workflow. In our approach, we examine how, in a low-resource context, word segmentation can be improved with auxiliary information, such as expert knowledge, tonal patterns, or bilingual supervision. At the end of Chapter 2 (Section 2.7), after introducing key concepts and problems more technically, we present and discuss the explicit research questions motivating our work.
Word segmentation and alignment
As discussed in Chapter 1, collecting annotated data is costly, and impractical for meeting the challenges of documenting a large number of endangered languages. Consequently, the work presented in this thesis is concerned with unsupervised, or minimally supervised, automatic processing of the “raw” bilingual data at our disposal after collection (see Section 1.2.2).
Such data, in the BULB project’s methodology, consist of pairs of mutually translated sentences in the unwritten language (henceforth UL) and in the well-resourced language (WL). A sentence in the UL is a sequence of $L$ units, indexed by $l = 1, \ldots, L$, and a sentence $w$ in the WL is a sequence of $I$ units, $w = w_1, \ldots, w_i, \ldots, w_I$. In practice, units in the UL can correspond to transcribed characters, phones, pseudo-phones, phonemes, or even speech frames. Units $w_i$ in the WL, on the other hand, correspond to transcribed words.
Two sides of the same problem
One key step in documenting a UL is to identify (parts of) the lexicon, a central problem addressed in this work. However, to be fully usable by linguists, language learners, ethnologists, etc., discovered lexical units in the UL need to be associated with their counterparts in the WL, and therefore with a proxy for their meaning. We are thus facing two problems:
A segmentation problem, as we need to transform a continuous sequence of units in the UL into words or subword units (see Figure 2.1a).
An alignment problem, as we need to map unknown discovered units in the UL to known units in the WL (see Figure 2.1b).
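To make the two problems concrete, here is a minimal sketch of how their outputs can be represented. It assumes a segmentation is stored as a list of cut positions in the unit sequence and an alignment as a list of index pairs; the function name and the toy sentence (English letters standing in for UL units, French words standing in for WL words) are illustrative only and are not taken from the thesis data.

```python
from typing import List, Tuple


def apply_segmentation(units: List[str], cuts: List[int]) -> List[str]:
    """Split a unit sequence into words; a cut at position b falls before units[b]."""
    words, start = [], 0
    for b in sorted(cuts):
        words.append("".join(units[start:b]))
        start = b
    words.append("".join(units[start:]))
    return words


# Segmentation problem: an unsegmented sequence of units (hypothetical toy data).
units = list("thedogbarks")
cuts = [3, 6]
ul_words = apply_segmentation(units, cuts)

# Alignment problem: pairs (index of UL word, index of WL word).
wl_words = ["le", "chien", "aboie"]
alignment: List[Tuple[int, int]] = [(0, 0), (1, 1), (2, 2)]
aligned_pairs = [(ul_words[j], wl_words[i]) for j, i in alignment]
```

With this representation, segmentation quality can be scored on the cut positions and alignment quality on the index pairs, independently of how either was obtained.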
It is natural to think of the segmentation problem for the UL side as a preprocessing task before one can perform an alignment to the word units in the WL. This approach, depicted in Figure 2.2a, is indeed taken by many researchers in order to align comparable.
Early models for unsupervised string segmentation
We start our review of previous work related to word segmentation with three early approaches: the first using transition statistics, the second introducing a particular use of HMMs, and the third relying on ideas related to data compression. Many variants of these approaches have been subsequently proposed for unsupervised morpheme analysis in the context of the Morpho Challenge (Kurimo et al., 2010).
Harris (1955) pioneered automatic morphology discovery, observing that transitions between morphemes inside a word are less predictable than transitions between phonemes within a morpheme. Counting the number of phonemes that could extend any prefix into another legal prefix in the language – the successor frequency – it is possible to introduce without any supervision a boundary, within a word, at positions that correspond to peaks of that frequency. This approach, and its information-theoretic interpretations using mutual information and entropy measures, proved to be extremely influential. Déjean (1998), in particular, devised an unsupervised morpheme discovery procedure using Harris’ local association statistics during a bootstrapping step, subsequently expanding the morpheme list with morphemes appearing in similar contexts to the ones already discovered.
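The successor-frequency idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not Harris' original procedure: the "language" is approximated by a small hypothetical word list, characters stand in for phonemes, and a boundary is placed at strict local peaks whose value reaches a threshold (`min_frequency`, a parameter introduced here to avoid spurious cuts in very short words).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def successor_table(lexicon: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every prefix seen in the lexicon to the set of symbols that can follow it."""
    successors: Dict[str, Set[str]] = defaultdict(set)
    for word in lexicon:
        for k in range(len(word)):
            successors[word[:k]].add(word[k])
    return successors


def successor_frequencies(word: str, successors: Dict[str, Set[str]]) -> List[int]:
    """Successor frequency after each proper, non-empty prefix word[:k], k = 1..len(word)-1."""
    return [len(successors.get(word[:k], ())) for k in range(1, len(word))]


def segment_at_peaks(word: str, successors: Dict[str, Set[str]],
                     min_frequency: int = 2) -> List[str]:
    """Cut the word after prefixes whose successor frequency is a strict local peak."""
    sf = successor_frequencies(word, successors)
    cuts = []
    for idx, value in enumerate(sf):
        left = sf[idx - 1] if idx > 0 else 0
        right = sf[idx + 1] if idx + 1 < len(sf) else 0
        if value >= min_frequency and value > left and value > right:
            cuts.append(idx + 1)  # boundary falls before word[idx + 1]
    pieces, start = [], 0
    for c in cuts:
        pieces.append(word[start:c])
        start = c
    pieces.append(word[start:])
    return pieces


# Hypothetical toy lexicon (not from the thesis).
lexicon = ["walk", "walks", "walked", "walking", "talk", "talks", "talked", "talking"]
table = successor_table(lexicon)
frequencies = successor_frequencies("walking", table)
segments = segment_at_peaks("walking", table)
```

On this toy lexicon the only peak in "walking" occurs after the prefix "walk", which can be continued by three different symbols, so the word is cut there.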
This strategy can be applied to the word segmentation task, observing that transitions between phonemes at word boundaries are also less predictable than within words. A variant (Saffran et al., 1996) uses “transitional probabilities” between syllables, i.e. the conditional probability of a syllable given the previous syllable, to identify word boundaries. This leads, however, to poor results on realistic corpora, as demonstrated by Brent (1999). Another application of this principle, this time as a preprocessing step, can be found in (Besacier et al., 2006), in an early work pursuing translation from speech (see Section 2.6.1).
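As an illustration of the transitional-probability variant, the following sketch estimates the conditional probability of each syllable given the previous one from a syllable stream, then places a word boundary wherever that probability is a strict local minimum. The local-minimum decision rule is one common way to operationalise the idea and is an assumption here; the syllable stream is a hypothetical toy corpus built from three made-up trisyllabic words.

```python
from collections import Counter
from typing import Dict, List, Tuple


def transitional_probabilities(syllables: List[str]) -> Dict[Tuple[str, str], float]:
    """Estimate P(next syllable | current syllable) from adjacent pairs in the stream."""
    bigrams = Counter(zip(syllables, syllables[1:]))
    firsts = Counter(syllables[:-1])  # how often each syllable starts a bigram
    return {(x, y): c / firsts[x] for (x, y), c in bigrams.items()}


def segment_at_minima(syllables: List[str],
                      tp: Dict[Tuple[str, str], float]) -> List[str]:
    """Insert a boundary at every strict local minimum of the transitional probability."""
    if not syllables:
        return []
    probs = [tp[(x, y)] for x, y in zip(syllables, syllables[1:])]
    words, current = [], [syllables[0]]
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("inf")
        right = probs[i + 1] if i + 1 < len(probs) else float("inf")
        if p < left and p < right:
            words.append("".join(current))
            current = []
        current.append(syllables[i + 1])
    words.append("".join(current))
    return words


# Hypothetical toy stream: three made-up words concatenated in varying order.
toy_words = [["bi", "da", "ku"], ["pa", "do", "ti"], ["go", "la", "bu"]]
order = [0, 1, 2, 0, 2, 1, 0]
stream = [syllable for k in order for syllable in toy_words[k]]

tp = transitional_probabilities(stream)
segmented = segment_at_minima(stream, tp)
```

In this stream every word-internal transition has probability 1 and every transition across a word boundary has probability 0.5, so the minima coincide with the true boundaries. Realistic corpora are far less clean, which is consistent with the poor results reported by Brent (1999).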
Table of contents:
List of Figures
List of Tables
List of Acronyms
1.1 Motivation: language endangerment
1.1.1 Magnitude of the issue
1.1.2 Consequences of language loss
1.1.3 Response of the linguistic community
1.2 Computational language documentation
1.2.1 Recent work
1.2.2 The BULB project
1.3 Scope and contributions
1.3.1 Unsupervised word discovery
1.3.2 Outline of the thesis
1.3.4 Author’s publications
2.1 Word segmentation and alignment
2.1.1 Two sides of the same problem
2.2 Early models for unsupervised string segmentation
2.2.1 Pioneer work
2.2.3 Minimum description length principle
2.3 Learning paradigms
2.3.2 Signatures as finite state automata
2.4 Nonparametric Bayesian models
2.4.1 Stochastic processes
2.4.3 Goldwater’s language models
2.4.4 Nested language models
2.4.5 Adaptor Grammars
2.5 Automatic word alignment
2.5.1 Probabilistic formulation
2.5.2 A series of increasingly complex parameterizations
2.5.3 Parameters estimation
2.5.4 Alignments extraction
2.6 Joint models for segmentation and alignment
2.6.1 Segment, then align
2.6.2 Jointly segment and align
2.7 Conclusion and open questions
3 Preliminary Word Segmentation Experiments
3.1.1 A favorable scenario
3.1.2 Challenges for low-resource languages
3.2 Three corpora
3.2.1 Elements of linguistic description for Mboshi and Myene
3.2.2 Data and representations
3.3 Experiments and discussion
3.3.1 Models and parameters
4 Adaptor Grammars and Expert Knowledge
4.1.1 Using expert knowledge
4.1.2 Testing hypotheses
4.1.3 Related work
4.2 Word segmentation using Adaptor Grammars
4.3.1 Structuring grammar sets
4.3.2 The full grammar landscape
4.4 Experiments and discussion
4.4.1 Word segmentation results
4.4.2 How can this help a linguist?
5 Towards Tonal Models
5.2 A preliminary study: supervised word segmentation
5.2.1 Data and representations
5.2.2 Disambiguating word boundaries with decision trees
5.3 Nonparametric segmentation models with tone information
5.3.1 Language model
5.3.2 A spelling model with tones
5.4 Experiments and discussion
5.4.2 Tonal modeling
6 Word Segmentation with Attention
6.2 Encoder-decoder with attention
6.2.1 RNN encoder-decoder
6.2.2 The attention mechanism
6.3 Attention-based word segmentation
6.3.1 Align to segment
6.3.2 Extensions: towards joint alignment and segmentation
6.4 Experiments and discussion
6.4.1 Implementation details
6.4.2 Data and evaluation
7.1.2 Synthesis of the main results for Mboshi
7.2 Future work
7.2.1 Word alignment
7.2.2 Towards speech
7.2.3 Leveraging weak supervision
7.3 Perspectives in CLD
Summary in French