Get Complete Project Material File(s) Now! »

## Multigrams

The transitional methods we just mentioned focus locally on particular statistics; they do not represent explicitly the words, nor the structure of their sequence, to place boundaries. In contrast, Deligne et al. (2001) expose diﬀerent concepts introduced through a series of papers addressing the problem of segmenting one or multiple streams of symbols in a non-supervised manner. Here, the focus is to learn variable-length dependencies inside strings of phonemes or graphemes.

Deligne and Bimbot (1995) present the “multigram model” in which a sentence (or more generally a stream of symbols) is seen as the concatenation of independent se-quences of words (resp. symbols). Those sequences’ length can vary up to a maximum length n. One way to look at the n-multigram model is to think of it as a n-state Hidden Markov Model with state i emitting only sequences of length i and all transi-tion probabilities being equal (see Figure 2.4). This allows for an eﬃcient training of the model using the EM algorithm and a forward-backward procedure so as to avoid enumerating all the possible segmentations.

This forward-backward procedure slightly diﬀers from the standard approach used for HMM training, in order to take into account the dependency of the number of emis-sions with respect to the segmentation. The multigram model is compared successfully, as a language model, to n-gram models of diﬀerent order. It is also subsequently imple-mented in the context of the unsupervised segmentation of phoneme strings (Bimbot et al., 1995).

The model is then extended in (Deligne et al., 1995) to a joint version that is able to deal with two or more streams of symbols, which are themselves seen as diﬀerent transcodings of the same higher level symbol stream. To make the model tractable, it is again necessary to limit the maximum length of units in both streams. A m; n joint multigram model can then be identified to a HMM where each of the m n states emits pairs of i symbols in one stream, and j symbols in the other. This new joint multigram model is able to learn joint segmentations through many-to-many sequential pairings. In this respect, this last extension of multigrams belongs to the family of joint models studied in Section 2.6. The authors remark that when using this model to learn graphemic and phonetic pairings from words, the extracted joint units often have a morphological interpretation.

**Minimum description length principle**

The minimum description length principle (MDL) was introduced by Rissanen (1989). It rests on the assumption that the model M (among all models in a given set M) that best explains the regularities in some data D will also be the model that allows for the highest compression of the data.10 More formally, where L(M) is the code length (or “description length”) needed to specify model M, and L(D j M) is the code length of data D using the code provided by M.

De Marcken (1996) proposes that, if a lexicon is given, it is possible to learn a locally optimal parsing of the input data via the EM algorithm. In the case of a parsing that involves only concatenation and according to the work of Deligne and Bimbot (1995) (Section 2.2.2), this corresponds to an implementation of the Baum-Welch algorithm, that is, EM with a forward-backward procedure. From there, it is possible to infer probabilities for the lexicon, hence its code length.11 In MDL terms, the main idea consists in positing that, if a lexicon (the model) minimizes both its own description length (i.e. the space needed to encode it) and the description length of the data, then that lexicon is the theory that best explains the data, and should be able to capture some of the principles at work in the language that originated the data.

The diﬃculty, as with most approaches encountered in the MDL framework, is that the search space M (here the set of possible lexicons) can be very large, if not infinite. This makes search heuristics necessary, which will be likely to harm the principled nature of the model, and its interpretability. A strong case, in that respect, is made by the work of Goldwater (2006), discussed in Section 2.4.3.

The first line ensues from the fact that, under M, the optimal code length in bits for D is approximately log P (D j M). 12 Therefore the MDL approach is equivalent to assuming a prior probability for the hypothesized model M, L(M), that decreases exponentially with its code length, while L(D j M) corresponds to the likelihood of the data given this model.

Described in an influential series of papers spanning over a decade, Morfessor13 (and all its variants) has been established as a de facto standard for unsupervised mor-phology learning, especially for modeling agglutinative morphology. Its reliance on the MDL principle is described in (Creutz and Lagus, 2002), where the corpus code length is decomposed, similarly to De Marcken’s approach with words mentioned above, into the code length of a morph14 dictionary, computed as the sum of the morph lengths, and the code length of the corpus with each morph m coded with log P (m) bits. This model is refined in (Creutz, 2003), where the morph generation model is replaced by a unigram of characters, and where a more complex prior on the morph codebook, integrating both length and type distributions, is used. Creutz and Lagus (2004) in-troduce some morphotactics, with distinct hidden states for prefix, stem and suﬃxes, 11The correspondence between code length functions and probability distributions is central to MDL.

12More precisely, d log P (D j M)e, as the logarithm may not be equal to an integer. The building

of such optimal codes is detailed in, e.g., (Cover and Thomas, 2006).

**LEARNING PARADIGMS **

while Creutz and La gus (2005) (see also (Creutz and Lagus, 2007) for a more compre-hensive presentation), allow for segmentation to be performed recursively.15 Kohonen et al. (2010) first attempt to introduce annotated (i.e. segmented) data in conjunction with non-annotated data, using the model of Creutz and Lagus (2005); as in most semi-supervised approaches, the objective function combines two likelihood terms that need to be carefully weighted. Grönroos et al. (2014) describe the latest evolution of Morfes-sor, trying to better combine the benefits of semi-supervision and richer morphotactics.

**Learning paradigms**

In Section 2.1.3, we mentioned that morphology learning can be seen as building paradigms. This is an important approach that we account for briefly here, as this will not be an approach we pursue.

**Signatures**

The work of Goldsmith (2001), as stated in Section 2.1.3, lies somewhere between the learning of morphological segmentation procedures, as well as the identification of morpheme inventories, and a more paradigmatic approach. Signatures, the key concept in this work, can be viewed as a weaker form of linguistic paradigms, consisting of sets of suﬃxes that systematically alternate with a set of stems (see Figure 2.5). The approach is based again on the MDL principle (Section 2.2.3), which is instantiated here as follows: the model is made of sets of stems, suﬃxes, and signatures, which record the possibility that a stem and a suﬃx can actually co-occur. Denoting t a stem, f a suﬃx, w a word, and a signature, the probabilistic model which underlies the compression algorithm can be expressed as: is more ad hoc, a problem often seen in MDL approaches, as mentioned in Section 2.2.3. An interesting question also raised by this work is the completeness of signatures: for long signatures, e.g. a complete verbal paradigm, many sub-signatures also exist in the data (corresponding to partial paradigms); how can they be merged since there is no way to “hallucinate” forms that are not in the original list, as this would not help to compress the data?

**Signatures as finite state automata**

Hu et al. (2005) further expand this trend of research, interpreting Goldsmith’s sig-nature as a finite-state automaton (FSA) (see Figure 2.6) built from character-based alignments between pairs of word forms. These alignments are established using the string edit distance (SED), identifying perfect and imperfect character’s spans, corre-sponding respectively in the FSA to adjacent states with either one transition or two transitions. The FSAs extracted from pairs of words in the corpora (here a Swahili translation of the Bible) are then collapsed, disambiguated, and scored in a manner reminiscent of the MDL.

The work of Dreyer and Eisner (2011) is the culmination of a series of papers (Dreyer et al., 2008; Dreyer and Eisner, 2009) aimed at learning word-based models of morphology (Aronoﬀ, 1976; Blevins, 2006). Under this view of morphology, morphological processes cannot be reduced to the concatenation of segmental strings to a stem; instead, mor-phology should attempt to model the relations between forms within paradigms – a notion that should therefore be given a first class status in this theory.

From a computational perspective, segmentation is no longer the main target. In-stead, the model should be able to: i) cluster related forms within paradigms; ii) learn the mapping between forms and slots in the paradigm; iii) predict the realization of slots that are not observed in the corpus, all this in an almost unsupervised fashion. In the work of Dreyer and Eisner (2011), the supervision consists mostly of an abstract descrip-tion of the paradigm’s cells and of a handful of exemplar paradigms. In a nutshell, this work relies on two main components. The first component is a finite-state probabilistic model for morphologically related forms, which should capture the surface similarity and systematic alternations between forms within a paradigm. The second component is a nonparametric Bayesian model, which takes care of the statistical regularities of the distribution of types, inflections, and forms.

**Nonparametric Bayesian models**

We just mentioned the use of nonparametric Bayesian models to learn morphological paradigms. In another context, the modeling of language acquisition, especially for infants, the seminal work of Goldwater (2006) initiated a rich line of research using nonparametric Bayesian language models. This work demonstrates that these models lead to better word segmentation performance when compared to older unsupervised techniques. They are also attractive for several other reasons:

they are well-formed probabilistic generative models, and therefore interpretable; they define distributions with a non-finite number of possible outcomes, a desirable property when modeling natural language lexicons: these are not closed sets, and it is diﬃcult to make assumptions regarding the size of a lexicon for an unknown language;

they are crucially able to produce power law (“Zipfian”) distributions over words, a universal prior for natural languages; this is achieved using “rich-get-richer” stochastic processes, in which the more frequent an outcome of the process is, the more likely it is to be generated again in the future.

they are able to adapt the number of their parameters17 to the quantity of data available; in other words, these language models are naturally smoothed and less prone to overfitting;

they benefit from robust inference schemes (Gibbs sampling, Variational Infer-ence) and do not require the design of ad hoc search heuristics to be computa-tionally tractable

**Stochastic processes**

We introduce, in this section, a series of stochastic processes necessary to define sev-eral nonparametric Bayesian language models we use later on in this thesis. We keep notations from Goldwater (2006), our main source for this presentation.

Chinese restaurant process An important stochastic process for nonparametric Bayesian language models is the so-called Chinese restaurant process (CRP), which generates partitions of integers. The analogy goes as follows: each customer i (repre-sented as an integer) sequentially enters a restaurant with an infinite number of tables, each table accommodating a potentially infinite number of customers. When customer i enters, an arrangement z i of the previous customers is observed, with K(z i) non-empty tables, each already accommodating nk(z i) customers for k 2 [1; K(z i)]. The customer either seats at a non-empty table with probability P (zi = k j z i), or chooses a new one with probability P (zt = K(z i)+1 j z i). These terms are defined as follows:

with 0, a parameter of the process called the concentration19 parameter. Larger values for this parameter result in a tendency towards opening more new tables, hence a more uniform distribution of customers across the tables, and more clusters in the partition produced. It is also clear that a “rich-get-richer” eﬀect will ensue from this definition of the CRP, and that this eﬀect will get stronger as gets smaller.

In an alternative view of the Dirichlet process, a stick-breaking process makes the fact that the DP generates discrete distributions with a countably infinite support more explicit. In this view, the base distribution of the DP distributes independently the locations of the probability mass function. The parameter, on its end, influences the probability of each of these locations: a series of independent random variables k are drawn sequentially from a Beta(1; ) distribution; 1 breaks the unit “stick”, and this portion is the probability mass for the first location drawn from the base distribution; 2 breaks the remaining portion of the stick, and this defines the probability mass of the second location, etc. This ensures that the total probability mass will be 1. This also gives another intuition as to why small values of will lead to more “concentrated” probability mass and sparser distributions.

Another intuitive way to understand the Dirichlet process is to look at it as a “two-stage” CRP model, in which 1) customers are seated according to a CRP process with a certain concentration parameter as defined by Equation (2.6), and 2) each new opened table is then labelled with a draw from a distribution G0. This two-stage CRP model is equivalent20 to a Dirichlet process with concentration parameter and base distribution G0. Goldwater (2006) calls the CRP in the first step the adaptor, and the distribution G0 in the second step the generator.

Pitman-Yor process A Pitman-Yor process PYP( ; ; G0) is a generalization of the Dirichlet process DP( ; G0). In the two-stage view of the DP, the CRP adaptor is modified in such a way that the conditional probability for the ith customer to seat at table k is defined by:

with 0 < 1, > ( corresponds to the concentration parameter in the standard CRP), and K(z i) the number of tables already occupied when the ith customer enters the restaurant. The new parameter, , of the Pitman-Yor process, gives more control over the shape of the tail of the distributions generated by the process. It allows to “save” some probability mass to augment the likelihood of opening new tables, even as the number of customers grows and tends to decrease the probability to open a new table in the equivalent DP. Hence, the parameter is often called the discount parameter of the PYP.

Application to word generation If G0 (the generator in the two-stage view of the DP) corresponds to a distribution defined over a lexicon, this means that tables in the restaurant will be labeled with words and that each customer entering the restaurant will represent a word token. The CRP (the adaptor, responsible for assigning customers to tables) will, on the other hand, control word frequencies according to a power-law distribution. We will define such language models more formally in Section 2.4.3. Note that PYP language models diﬀer from DP language models only in the definition of the adaptor (Equation (2.6) vs. Equation (2.9)).

Sometimes confusing is that the generator in a DP (or a PYP) can generate du-plicated labels, in other words multiple tables can share the same label in the Chinese restaurant analogy. In fact, the expected number of CRP tables (in a DP) for a type corresponding to n tokens is log n+ (Antoniak, 1974, cited by Goldwater (2006)). For a given number of tokens corresponding to a particular type, the average number of tables labelled by that type grows with . This intuitively agrees with the dispersion eﬀect of already mentioned, with greater values of leading to the opening of more tables.

20The technical explanation can be found in Section 3.6 of (Goldwater, 2006).

**Sampling**

Bayesian approaches are interested in taking into account the full posterior distributions for the parameters of the model instead of a point-wise estimate (like MAP or ML estimates) usually obtained via an EM procedure when supervision data are lacking. Since the posterior distribution is usually impossible to express analytically, sampling algorithms known as Markov Chain Monte Carlo (MCMC) methods are used. Following (Goldwater, 2006), we give the main ideas these sampling methods are built upon.

MCMC methods Such sampling algorithms involve building a Markov chain, with random states Y 1 : : : Y T and transition matrix P, with proper conditions ensuring its convergence to a unique stationary distribution over the states satisfying P = . The states of the Markov chain correspond to an assignment of values to the random variables we want to sample from, and the state space of the Markov chain corresponds to the hypothesis space of the model.

Proper construction of P guarantees will be the distribution we are interested to infer. After convergence at time Tc, 8t Tc, Y t will be a sample of the distribution of interest. The conditions on the Markov chain are the following:

the chain needs to be irreducible (existence of a finite path with non-zero proba-bility between all pairs of states);

the chain also needs to be aperiodic (together with the preceding condition, this defines an “ergodic” chain);

P needs to satisfy P = when is the distribution we want to sample from. This is the “general balance” condition.

Gibbs sampling A particular algorithm build on these principles is Gibbs sampling. If we decompose each state variable Y t into its K components Y1t : : : YKt corresponding to diﬀerent variables in the model, each iteration of this sampler corresponds to K steps. In each of these steps, the kth component Ykt is sampled from its conditional distribution given the current values of all the other components.

To guarantee ergodicity, we need to avoid cases where the conditional probabilities take null values – for example when changing the value of a random variable without also changing the value of another variable could lead to a state with zero probability. To address this problem, it is possible to “block” a Gibbs sampler, sampling a block of variables at once instead of separately, which can also improve convergence speed.

A number of samples are typically ignored at the beginning of the sampling process, during the “burn-in” period necessary for the Markov chain to converge. Importantly, successive samples will be correlated and researchers often retain only a fraction of non-neighbouring samples, or average their results over distinct runs of the Gibbs sampler.

**NONPARAMETRIC BAYESIAN MODELS **

Note that exchangeability is related, but distinct, to the concept of a series of independent and identically distributed (i.i.d.) random variables.

A sequence of variables distributed according to a distribution itself drawn from a Dirichlet Process has this property of exchangeability, which is crucial to perform inference using Gibbs sampling eﬃciently: any assignment of a component can be made under the assumption that this component is the last one in the sequence. This way, one can avoid recomputing counts for the part of the sequence occurring after the currently assigned component.

**Goldwater’s language models**

In this section, we formally present two language models based on the Dirichlet process and introduced by Goldwater et al. (2006a). These models will turn out to be very strong baselines for word segmentation in our experiments (see Chapter 3).

Unigram model The first model proposed by Goldwater et al. (2006a) is a unigram language model. The unigram assumption usually means that the terms from the prod-uct in Equation (2.1) are approximated by P (wi j w1; : : : ; wi 1) , P (wi), and that the probabilities of words appearing in a sequence are independent of each other. In the present case, the independence assumption is conditional to a given draw from a DP.

In Goldwater’s unigram model indeed, the words are distributed according to a draw G1 from a Dirichlet process DP( 1; G0),22 with 1 0 a parameter of the model, and G0 a (uniform) unigram distribution over characters (or phonemes).23 We can write these assumptions as:

Unfortunately, as G1 has an infinite support, it is not possible to sample wi from G1. It is possible, however, to integrate over G1 to obtain the conditional probability of wi given the previously generated words (Blackwell and MacQueen, 1973, cited by Goldwater (2006)):

With these posterior distributions, it is possible to calculate a probability for the segmentation of an unsegmented sequence (of characters or phonemes). The Gibbs sampler proposed by Goldwater considers all possible word boundaries, and successively samples between two hypotheses for each boundary position: one where a boundary is present, and the other where the boundary is absent (all other boundaries in the corpus are kept identical). At the end of an iteration of the sampler, all boundary positions in the corpus have been considered once.

Bigram model The limitations of the unigram model, and its tendency to underseg-ment an input character sequence, as it tries to capture frequently co-occurring bigrams as single words, led Goldwater to develop a bigram language model capturing the ef-fect of the context on word generation. In the traditional language model formulation, a bigram dependency means that the terms in Equation (2.1) are approximated by P (wi j w1; : : : ; wi 1) , P (wi j wi 1). In Goldwater’s formulation, the generation of suc-cessive words wi now relies on a more involved generative process involving a hierarchical Dirichlet process (Teh, 2006), or HDP, that we describe now.

The bigram distributions P (wi j wi 1 = w; Gw), for each word form w in the left context (or history), are distributed according to a draw Gw from a Dirichlet process DP( 2; G1). The base distribution, G1, of this DP is itself drawn from a Dirichlet pro-cess DP( 1; G0). Lastly, identically to the unigram model, G0 is a unigram distribution over characters.

26 Ignoring again the sentence boundaries. Note that, for simplicity, we abuse the notation, as we denote by wi 1 both the random variable and its realization.

**NONPARAMETRIC BAYESIAN MODELS **

with n(wi 1;wi)(w i) the number of bigram tokens (wi 1; wi), nwi 1 (w i) the number of tokens wi 1,27 twi (w i; z i) the number of bigram tables labelled wi, and tall(z i) the total number of bigram tables. Here, z i corresponds to the seating arrangement in the bigram restaurants. Pbacko corresponds to the posterior estimate of base distribution G1, and can be seen as a unigram backoﬀ (see Goldwater, 2006, section 5.5.1).

To understand these equations, it is important to realize that in the specification of the model, each word type w has its own restaurant, and that this “bigram restaurant” corresponds to the distribution of tokens following w (Gw DP( 2; G1) in Equa-tion (2.13)). When a new table is opened in a given bigram restaurant, a label is drawn from G1, which corresponds to the so-called “backoﬀ” restaurant (G1 DP( 1; G0) in Equation (2.13)). All this means that customers in the bigram restaurants correspond to bigram tokens, and that customers in the backoﬀ restaurant correspond to labels on bigram tables. Note that in the bigram model, and contrary to the unigram model, the seating arrangement z i matters to calculate the conditional probability of the words.

Similarly to the unigram model (yet with an increased complexity), a Gibbs sampler can be built to infer segmentations from an unsegmented corpus.

**Nested language models**

The hierarchical nature of Goldwater’s bigram model can be further extended to n-gram dependencies. Mochihashi et al. (2009) adopt this stance, and replace the Dirichlet pro-cess by the more general Pitman-Yor process, proposing a model they call the nested Pitman-Yor language model. Under a n-gram version of this language model, the dis-tribution of words follows the hierarchical scheme:

G0, by a second hierarchical Pitman-Yor process, a spelling model, whose m-gram struc-ture is equivalent to that of their language model. To make their nested model tractable, the authors supplement it with a blocked Gibbs sampler, using an eﬃcient forward fil-tering and backward sampling procedure, as well as a Poisson correction for word length. They report faster inference and better accuracy than Goldwater’s bigram model.

A further extension can be found in (Neubig et al., 2010), in which a similarly nested Pitman-Yor language model is used to learn word segmentation from phoneme lattices.

27 This is also the total number of customers in the bigram restaurant wi 1, in other words all bigram tokens beginning with wi 1.

The main novelty of this work is to reinterpret the model of Mochihashi et al. (2009), i.e. the hierarchical Pitman-Yor language and spelling models, in terms of a weighted finite state acceptor (WFSA) in charge of assigning the proper posterior probability to a given segmentation. This acceptor can be composed with a phoneme lattice encoding acoustic model scores and with a weighted finite state transducer (WFST) transducing any sequence of phonemes into all of its possible segmentations.

Another generalization (Löser and Allauzen, 2016) introduces some morphotactics through word classes: in this model, sentences are produced by a nonparametric Markov model, where both the number of states and the number of types are automatically adjusted based on the available data. Two hierarchical Pitman-Yor processes are also embedded in this architecture: one for controlling the number of classes (states) and one for controlling the number of words. As in Mochihashi et al. (2009), the base distribution is also a hierarchical PYP spelling model.

**Adaptor Grammars**

We conclude this section with the presentation of the Adaptor Grammar (AG) frame-work, built on the concepts of adaptors and generators introduced in Section 2.4.1 (the “two-stage” view of the Dirichlet process). The goal is to learn and infer structure both at word and subword levels, thus combining the learning of morphological structures with the learning of word dependencies, and ultimately learning how to segment sen-tences into words, and words into morphemes. Our presentation follows (Johnson et al., 2007b) and (Johnson and Goldwater, 2009).

Framework Adaptor Grammars (Johnson et al., 2007b) are an extension to proba-bilistic context-free grammars (PCFGs) relaxing the assumption that each subtree of a nonterminal node is generated independently from other subtrees rooted in the same nonterminal.

Using these adaptors in the recursion, it is possible to allow for a symbol’s expansion to depend on the way it has been rewritten in the past. Informally, Adaptor Grammars are able to “cache” entire subtrees expanding nonterminals and provide a choice to rewrite each new nonterminal either as a regular PCFG expansion or as a previously seen expansion. In this respect, Adaptor Grammars can be seen as a nonparametric extension of PCFGs. Moreover, using adaptors based on the Dirichlet process or the Pitman-Yor process, one can build models capturing power-law distributions over trees and subtrees.

Inference The training data in the unsupervised context consists only in terminal strings (yields of trees rooted in the start symbol S). In order to sample posterior distri-butions over analyses produced by a particular Adaptor Grammar based on Pitman-Yor processes, Johnson et al. (2007b) devise a method relying on a Markov chain Monte Carlo algorithm (see Section 2.4.2) together with a PCFG approximation30 of the Adap-tor Grammar. The idea for this approximation is, for each analyzed string, to add to the rules of the “base” PCFG all the production rules corresponding to the yields of the adapted nonterminals in the Adaptor Grammar, given all the analysed strings in the data set, except the currently analysed string. One can sample analyses from this PCFG using the algorithm described in (Johnson et al., 2007a).

**Table of contents :**

**1 Introduction **

1.1 Motivation: language endangerment

1.1.1 Magnitude of the issue

1.1.2 Consequences of language loss

1.1.3 Response of the linguistic community

1.2 Computational language documentation

1.2.1 Recent work

1.2.2 The BULB project

1.2.3 Challenges

1.3 Scope and contributions

1.3.1 Unsupervised word discovery

1.3.2 Outline of the thesis

1.3.3 Contributions

1.3.4 Author’s publications

**2 Background **

2.1 Word segmentation and alignment

2.1.1 Two sides of the same problem

2.1.2 Evaluation

2.1.3 Remarks

2.2 Early models for unsupervised string segmentation

2.2.1 Pioneer work

2.2.2 Multigrams

2.2.3 Minimum description length principle

2.3 Learning paradigms

2.3.1 Signatures

2.3.2 Signatures as finite state automata

2.3.3 Paradigms

2.4 Nonparametric Bayesian models

2.4.1 Stochastic processes

2.4.2 Sampling

2.4.3 Goldwater’s language models

2.4.4 Nested language models

2.4.5 Adaptor Grammars

2.5 Automatic word alignment

2.5.1 Probabilistic formulation

2.5.2 A series of increasingly complex parameterizations

2.5.3 Parameters estimation

2.5.4 Alignments extraction

2.6 Joint models for segmentation and alignment

2.6.1 Segment, then align

2.6.2 Jointly segment and align

2.7 Conclusion and open questions

**3 Preliminary Word Segmentation Experiments**

3.1 Introduction

3.1.1 A favorable scenario

3.1.2 Challenges for low-resource languages

3.2 Three corpora

3.2.1 Elements of linguistic description for Mboshi and Myene

3.2.2 Data and representations

3.3 Experiments and discussion

3.3.1 Models and parameters

3.3.2 Discussion

3.4 Conclusion

**4 Adaptor Grammars and Expert Knowledge**

4.1 Introduction

4.1.1 Using expert knowledge

4.1.2 Testing hypotheses

4.1.3 Related work

4.2 Word segmentation using Adaptor Grammars

4.3 Grammars

4.3.1 Structuring grammar sets

4.3.2 The full grammar landscape

4.4 Experiments and discussion

4.4.1 Word segmentation results

4.4.2 How can this help a linguist?

4.5 Conclusion

**5 Towards Tonal Models **

5.1 Introduction

5.2 A preliminary study: supervised word segmentation

5.2.1 Data and representations

5.2.2 Disambiguating word boundaries with decision trees

5.3 Nonparametric segmentation models with tone information

5.3.1 Language model

5.3.2 A spelling model with tones

5.4 Experiments and discussion

5.4.1 Representations

5.4.2 Tonal modeling

5.5 Conclusion

**6 Word Segmentation with Attention **

6.1 Introduction

6.2 Encoder-decoder with attention

6.2.1 RNN encoder-decoder

6.2.2 The attention mechanism

6.3 Attention-based word segmentation

6.3.1 Align to segment

6.3.2 Extensions: towards joint alignment and segmentation

6.4 Experiments and discussion

6.4.1 Implementation details

6.4.2 Data and evaluation

6.4.3 Discussion

6.5 Conclusion

**7 Conclusion**

7.1 Summary

7.1.1 Findings

7.1.2 Synthesis of the main results for Mboshi

7.2 Future work

7.2.1 Word alignment

7.2.2 Towards speech

7.2.3 Leveraging weak supervision

7.3 Perspectives in CLD