## autoencoder-based approaches

There are nonetheless approaches that try to generate sequences all at once rather than note by note. The methods described in this section encode a whole sequence s into a single point z (its latent representation) in a space of small dimensionality. This latent space can be regarded as the space of all (valid) sequences. A mapping (the decoder) from the latent space to the space of sequences is then introduced in order to generate a sequence from a latent variable z. The encoding and decoding functions are learned jointly: the aim is to reconstruct perfectly the sequence that was encoded. Since the latent space has a smaller dimensionality than the space of all sequences, data-relevant codes and efficient decoding functions must be found. This is the original idea motivating the autoencoder architecture and its refinements [13, 153].
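
The encode/decode/reconstruct loop can be sketched on toy data. The example below is only an illustration under strong assumptions: the "sequences" are random vectors that lie in a 3-dimensional subspace, and the encoder and decoder are linear maps obtained from an SVD (for a linear autoencoder trained with squared error, the principal subspace is an optimal solution). All names (`encode`, `decode`, `latent_dim`, ...) are hypothetical and do not come from the architectures discussed in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sequences": 200 vectors of dimension 16 that lie in a 3-dimensional
# subspace (hypothetical data standing in for encoded musical sequences).
n_sequences, seq_dim, latent_dim = 200, 16, 3
basis = rng.normal(size=(latent_dim, seq_dim))
sequences = rng.normal(size=(n_sequences, latent_dim)) @ basis

# Linear encoder/decoder built from the top principal directions.
mean = sequences.mean(axis=0)
_, _, vt = np.linalg.svd(sequences - mean, full_matrices=False)
components = vt[:latent_dim]  # shape (latent_dim, seq_dim)


def encode(s):
    """Map sequences to their low-dimensional codes z."""
    return (s - mean) @ components.T


def decode(z):
    """Map codes z back to the space of sequences."""
    return z @ components + mean


z = encode(sequences)
reconstruction = decode(z)
reconstruction_error = float(np.mean((sequences - reconstruction) ** 2))
print(z.shape, reconstruction_error)
```

Because the toy data really is 3-dimensional, a 3-dimensional code is enough for near-perfect reconstruction; real musical sequences offer no such guarantee, which is why nonlinear encoders and decoders are used in practice.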

When transposed to the context of sequence generation, the encoding and decoding functions are often implemented using RNNs, which are convenient when dealing with sequential data (see Fig. 15). Sampling from an autoencoder is easy to implement: it suffices to draw a random latent variable z and decode it to obtain a meaningful sequence. However, this sampling scheme is not satisfactory, since nothing guarantees that sequences are sampled with the correct probabilities.
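
The sampling issue can be made concrete with a deliberately tiny example. The hand-written `encode` and `decode` functions below are assumptions made for illustration (they are not a trained model): they reconstruct every training sequence perfectly, yet decoding latent variables drawn from a standard normal distribution does not reproduce the frequencies of the corpus.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical training corpus: sequence "A" is nine times more frequent than "B".
corpus = ["A"] * 900 + ["B"] * 100


def encode(sequence):
    """Hand-written encoder (illustrative assumption): one code per sequence."""
    return -1.0 if sequence == "A" else 1.0


def decode(z):
    """Hand-written decoder: every latent value decodes to a valid sequence."""
    return "A" if z < 0 else "B"


# Reconstruction is perfect on the whole corpus.
assert all(decode(encode(s)) == s for s in corpus)

# Naive sampling: draw z from a standard normal and decode it.
samples = [decode(random.gauss(0.0, 1.0)) for _ in range(10_000)]
data_freq = Counter(corpus)["A"] / len(corpus)
sample_freq = Counter(samples)["A"] / len(samples)
print(f"P(A) in corpus: {data_freq:.2f}, P(A) among decoded samples: {sample_freq:.2f}")
```

The reconstruction objective alone says nothing about how the codes are distributed in the latent space, so the distribution chosen for z at sampling time need not match it.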

### existing approaches on polyphonic music generation

In practice, an interesting model for polyphonic music generation should satisfy three requirements: statistical accuracy (faithfully capturing the statistics of correlations at various ranges, horizontally and vertically), flexibility (coping with arbitrary user constraints), and generalization capacity (inventing new material while staying in the style of the training corpus).

Models proposed so far fail on at least one of these requirements. In [46], the authors propose a chord invention framework. However, this framework is not based on agnostic learning and requires a hand-made ontology. The approach described in [117] consists of a dynamic programming template enriched by constrained Markov chains. This approach generates musically convincing results [116] but is ad hoc and specialized for jazz. Furthermore, by construction, it cannot invent any new voicing (the vertical ordering of the notes in a chord). In [66] and [4], the authors describe an approach using Hidden Markov Models (HMMs) trained on an annotated corpus. This model imitates the style of Bach chorales and the authors report good cross-entropy measures. However, the described model is also unable to produce new voicings: it can only replicate those found in the training corpus. Another related approach is [76], which uses HMMs on specific hand-crafted chord representations to generate homorhythmic sequences. These representations, called General Chord Type (GCT) [22], are based on expert knowledge of common-practice harmony. A drawback of these models is that they are not agnostic, in the sense that they include a priori knowledge about music such as the concepts of dissonance, consonance, tonality or scale degrees.
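
To see why a model whose states or emitted symbols are whole chords cannot invent voicings, consider the following minimal sketch. It is not the method of any of the works cited above: it is a plain first-order Markov chain over chords, with a hypothetical four-voice corpus written as tuples of MIDI pitches. Whatever the sampling path, every generated chord is copied from the corpus.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical corpus: each chord is a tuple of MIDI pitches
# (bass, tenor, alto, soprano).
corpus = [
    (48, 55, 64, 72), (53, 57, 65, 72), (55, 59, 62, 74), (48, 55, 64, 72),
    (45, 57, 64, 72), (53, 57, 65, 72), (55, 59, 62, 74), (48, 55, 64, 72),
]

# First-order Markov chain whose states are whole chords.
transitions = defaultdict(list)
for current, following in zip(corpus, corpus[1:]):
    transitions[current].append(following)


def generate(start, length):
    """Sample a chord sequence by following observed transitions."""
    sequence = [start]
    while len(sequence) < length:
        # Fall back to any corpus chord if the current one has no successor.
        successors = transitions.get(sequence[-1]) or corpus
        sequence.append(random.choice(successors))
    return sequence


generated = generate(corpus[0], 32)
novel_chords = set(generated) - set(corpus)
print(len(generated), novel_chords)
```

The set of novel chords is empty by construction: the alphabet of the chain is exactly the set of chords seen during training.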

Agnostic neural-network-based approaches have been investigated with promising results. We presented the architectures as well as the pros and cons of these models in Sect. 4.1.2.1, to which we refer the reader. In short, the drawback of these models is that they require large and coherent training sets, which are not always available. More importantly, it is not clear how to enforce additional user constraints (flexibility), and their invention capacity has not been demonstrated.

In this chapter we introduce a graphical model based on the maximum entropy principle for learning and generating polyphonic music. Such models have been used for music retrieval applications [123], but never, to our knowledge, for polyphonic music generation. This model requires no expert knowledge about music and can be trained on small corpora. Moreover, generation is extremely fast.
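
For orientation, a pairwise maximum entropy model over a sequence $s = (s_1, \dots, s_N)$ of discrete variables generically takes the exponential-family form below. The symbols $h_i$, $J_{ij}$ and $Z$ are notation assumed here for illustration; the exact parametrization used in this chapter is the one given in Sect. 5.3 and may differ, for instance in which pairs $(i, j)$ are retained.

$$
P(s) = \frac{1}{Z} \exp\Big( \sum_{i} h_i(s_i) + \sum_{i<j} J_{ij}(s_i, s_j) \Big),
\qquad
Z = \sum_{s'} \exp\Big( \sum_{i} h_i(s'_i) + \sum_{i<j} J_{ij}(s'_i, s'_j) \Big).
$$

Among all distributions that reproduce given single-variable and pairwise marginals, the one of maximal entropy belongs to this family.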

We show that this model can capture and reproduce pairwise statistics at possibly long range, both horizontally and vertically. These pairwise statistics are also able, to some extent, to capture implicitly higher-order correlations, such as the structure of 4-note chords. The model is flexible, as it allows the user to post arbitrary unary constraints on any voice. We also show that this model exhibits a remarkable capacity to invent new but “correct” chords. In particular, we show that it produces harmonically consistent sequences using chords which did not appear in the original corpus.
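
The following toy sketch illustrates two of these claims on single chords: a model built only from pairwise statistics can give substantial probability to a chord absent from the corpus, and a unary constraint amounts to conditioning on the value of one voice. It rests on assumptions that are not taken from this chapter: the 3-voice corpus is hypothetical, the pairwise potentials are a heuristic (logarithms of smoothed co-occurrence counts) rather than the result of the training procedure of the model, and probabilities are computed by exhaustive enumeration, which is only feasible at this scale.

```python
import itertools
import math
from collections import Counter

# Hypothetical corpus of 3-voice chords (note names), for illustration only.
corpus = [("C", "E", "G"), ("C", "F", "A"), ("D", "F", "G"), ("C", "E", "A")]
n_voices = 3
alphabet = sorted({note for chord in corpus for note in chord})
voice_pairs = list(itertools.combinations(range(n_voices), 2))

# Pairwise co-occurrence counts between voices i < j.
pair_counts = Counter()
for chord in corpus:
    for i, j in voice_pairs:
        pair_counts[(i, j, chord[i], chord[j])] += 1


def score(chord, smoothing=0.01):
    """Sum of heuristic pairwise potentials (log of smoothed counts).

    This is an assumption made for the sketch, not a trained model.
    """
    return sum(
        math.log(pair_counts[(i, j, chord[i], chord[j])] + smoothing)
        for i, j in voice_pairs
    )


def distribution(constraints=None):
    """Exact distribution over chords, optionally conditioned on unary
    constraints given as {voice_index: note}."""
    constraints = constraints or {}
    candidates = [
        chord
        for chord in itertools.product(alphabet, repeat=n_voices)
        if all(chord[v] == note for v, note in constraints.items())
    ]
    weights = [math.exp(score(chord)) for chord in candidates]
    total = sum(weights)
    return {chord: w / total for chord, w in zip(candidates, weights)}


unconstrained = distribution()
novel_mass = sum(p for chord, p in unconstrained.items() if chord not in corpus)
constrained = distribution({2: "G"})  # force the top voice to G
best_constrained = max(constrained, key=constrained.get)
print(f"probability mass on unseen chords: {novel_mass:.3f}")
print("most probable chord with G on top:", best_constrained)
```

Here the chord `("C", "F", "G")` never occurs in the corpus, but each of its three voice pairs does, so the pairwise potentials support it. Constraining a voice simply restricts the enumeration and renormalizes.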

In Sect. 5.3 we present the model for n-part polyphony generation. In Sect. 5.4.2 we report experimental results on chord invention. In Sect. 5.4.4 we discuss a range of interactive applications in music generation. Finally, in Sect. 5.4.5 we discuss how the “musical interest” of the generated sequences depends on the choice of our model’s hyperparameters.

**Table of contents:**

1 introduction

1.1 Motivations

1.2 Contributions

**i overview**

**2 musical symbolic data**

2.1 Symbolic music notation formats

2.1.1 The modern Western musical notation format

2.1.2 Markup languages

2.1.3 ABC notation

2.1.4 MIDI

2.2 Singularities of the symbolic musical data

2.2.1 Melody, harmony and rhythm

2.2.2 Structure, motives and patterns

2.2.3 Style

2.3 Symbolic Music Datasets

2.3.1 Monophonic datasets

2.3.2 Polyphonic datasets

2.3.3 MIDI file collections

2.3.4 The Chorale Harmonizations by J.S. Bach

**3 challenges in music generation**

3.1 Building representations

3.1.1 Notes

3.1.2 Rhythm

3.1.3 Melodico-rhythmic encoding

3.2 Complexity of the musical data

3.3 Evaluation of generative models

3.4 Generative models for music, what for?

**4 deep learning models for symbolic music generation**

4.1 Sequential Models

4.1.1 Models on monophonic datasets

4.1.2 Polyphonic models

4.2 Autoencoder-based approaches

4.2.1 Variational Autoencoder for MIDI generation

**ii polyphonic music modeling**

**5 style imitation and chord invention in polyphonic music with exponential families**

5.1 Introduction

5.2 Existing approaches on polyphonic music generation

5.3 The model

5.3.1 Description of the model

5.3.2 Training

5.3.3 Generation

5.4 Experimental Results

5.4.1 Style imitation

5.4.2 Chord Invention

5.4.3 Higher-order interactions

5.4.4 Flexibility

5.4.5 Impact of the regularization parameter

5.4.6 Rhythm

5.5 Discussion and future work

**6 deepbach: a steerable model for bach chorales generation**

6.1 Introduction

6.2 DeepBach

6.2.1 Data Representation

6.2.2 Model Architecture

6.2.3 Generation

6.2.4 Implementation Details

6.3 Experimental Results

6.3.1 Setup

6.3.2 Discrimination Test: the “Bach or Computer” experiment

6.4 Interactive composition

6.4.1 Description

6.4.2 Adapting the model

6.4.3 Generation examples

6.5 Discussion and future work

**iii novel techniques in sequence generation**

**7 deep rank-based transposition-invariant distances on musical sequences**

7.1 Introduction

7.2 Related works

7.3 Corpus-based distance

7.3.1 Rank-based distance

7.3.2 Sequence-to-Sequence autoencoder

7.3.3 ReLU non-linearity and truncated Spearman rho distance

7.4 Transformation-invariant distances

7.5 Experimental results

7.5.1 Implementation details

7.5.2 Nearest neighbor search

7.5.3 Invariance by transposition

7.6 Conclusion

**8 interactive music generation with unary constraints using anticipation-rnns**

8.1 Introduction

8.2 Statement of the problem

8.3 The model

8.4 Experimental results

8.4.1 Dataset preprocessing

8.4.2 Implementation details

8.4.3 Enforcing the constraints

8.4.4 Anticipation capabilities

8.4.5 Sampling with the correct probabilities

8.4.6 Musical examples

8.5 Conclusion

**9 glsr-vae: geodesic latent space regularization for variational autoencoder architectures**

9.1 Introduction

9.2 Regularized Variational Autoencoders

9.2.1 Background on Variational Autoencoders

9.2.2 Geodesic Latent Space Regularization (GLSR)

9.3 Experiments

9.3.1 VAEs for Sequence Generation

9.3.2 Data Preprocessing

9.3.3 Experimental Results

9.4 Implementation Details

9.5 Choice of the regularization parameters

9.6 Discussion and Conclusion

**a examples of generated music**

a.1 Non interactive generations

a.2 Interactive generations

**b summary of the thesis**

b.1 Introduction

b.2 Contributions

b.3 Musical data, challenges and critique of the state of the art

b.3.1 Symbolic musical data

b.3.2 The challenges of symbolic music generation

b.3.3 Deep generative models for symbolic music

b.4 Polyphonic music modeling

b.4.1 Exponential families for style imitation and chord invention in polyphonic music

b.4.2 DeepBach: a steerable model for chorale generation

b.5 Novel techniques for sequence generation

b.5.1 Transposition-invariant rank distances for musical sequences

b.5.2 Anticipation-RNN: interactive music generation subject to unary constraints

b.5.3 GLSR-VAE: geodesic latent space regularization for variational autoencoders

**bibliography**