Hierarchical attention modulation


symbolic music representation

Music as symbol

Music can be described as a set of sounds organized in time, which most civilizations have tried to transcribe into a written format since ancient times. It is therefore difficult to precisely date the first appearance of musical symbols, but the earliest score discovered appears to date back to 1400 BC. This music is engraved in cuneiform writing on clay tablets (see Figure 2.1). Even though the interpretation of this notation system is still under debate, it is clear that it provides instructions for performing music. We also learn from its reading that the music was composed in harmonies of thirds and using a diatonic scale (Bosseur, 2005).
In Ancient Greece, the foundations of one of the first notations specifically tailored for transcribing music were laid by the music theorist Alypius. He created two alphabets – one dedicated to vocals and the other to instruments – where the letters represent the notes and are distorted to emphasize musical variations (see Figure 2.1). The use of such a notation already required a significant amount of knowledge that very few people had access to, leading to the creation of a greatly simplified system, composed only of syllables, for the daily practice of music. Here, we can assume that this ushered in the emergence of a schism between so-called "art music" and popular music.
Although taken up and completed by the Romans and then by the Byzantines, this notation saw no significant breakthrough until the beginning of the Middle Ages, with the appearance of the neumes. These symbols take the form of musical figures applied to syllables, encoding not individual notes but simple melodies. At first they did not indicate precise intervals between notes; only grave or acute accents allowed the pitch to be differentiated. Then, in the course of the 10th century, the idea of drawing a line representing a fixed pitch, above and below which the neumes were ordered, made its appearance (see Figure 2.1). Fifty years later, a red line represented the F and a yellow one the Ut; these were the first musical stave lines.
The Middle Ages were marked by two notable periods corresponding to distinct compositional styles. The first one, called "ars antiqua", witnessed the rise of polyphony, where several instruments play simultaneously. At this point, it became necessary to accurately transcribe pitches, for the global harmony between voices, and rhythm, for the coordination between instrumentalists. Therefore, the number of stave lines increased to five, as it is today, except for Gregorian chant, whose smaller pitch range can be handled with only four lines. Single notes were now represented by little black squares arranged on the staff, with or without a tail according to their duration.
Until then, ternary rhythms (which divide time by three, creating a revolving, waltzing pulse) were predominant, since they refer to the Holy Trinity. However, during the "ars nova", the second major artistic trend of the Middle Ages, rhythm began to be theorized, triggering a further improvement in the notation system (Apel, 1961). Indeed, the written note then took different shapes – square, rectangle, diamond – based on its duration. These enhancements democratized binary rhythms, which divide time by two, giving a steady and regular character to the music.
However, we had to wait until the 15th century for a real standardization of music writing, with the invention of the printing process. Square notes gave way to round notes, which are more suitable for engraving with a chisel. We also see the emergence of bar lines, which give rhythm a central role and reinforce the mathematical aspect of music.
In the centuries that followed, Western musical notation became more complex and spread throughout the world, which made it possible to fix in writing traditional musics hitherto transmitted orally. However, this had the consequence of denaturing some of them: in China, for example, music written with European rules sounded much more like Western music than traditional Chinese music.
We can see that the representation of music as symbols has itself been a central question in the history of music. In that sense, musical notation can be thought of as a model which enables us to reason and think about music.
Figure 2.2: Representations for symbolic music learning from scores (a). The piano-roll (b) is the most widespread representation. The MIDI-like (Oore et al., 2018) representation (c) encodes it as a sequence of events, while the NoteTuple (Hawthorne et al., 2018) representation (d) encodes the time offset, MIDI pitch number, velocity and two values for the duration.

Symbolic music representations for computer science

Nowadays, with the advent of the digital era, multiple machine-readable score formats have been developed, the most prominent being the Musical Instrument Digital Interface (MIDI). This type of digital data format allows music to be treated through its symbolic representation, since it is based on a finite alphabet of symbols. A MIDI file encodes the information for the different notes, durations and intensities through numerical messages with a pre-defined temporal quantization (a subdivision of a quarter note). Hence, this format has been widely used in computer music research, as it provides a compact representation of music. Other formats have been developed using, for instance, the Lisp programming language (Assayag et al., 1999) or the eXtensible Markup Language (XML) (Good et al., 2001).
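To give one concrete example of these numerical messages, the Standard MIDI File format stores the time elapsed between events as a variable-length quantity: seven data bits per byte, with the high bit flagging that more bytes follow. A minimal codec sketch (not tied to any particular MIDI library):

```python
def encode_vlq(value):
    """Encode a non-negative delta-time as a MIDI variable-length
    quantity: 7 data bits per byte, high bit set on all but the last."""
    chunks = [value & 0x7F]
    value >>= 7
    while value:
        chunks.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(chunks))

def decode_vlq(data):
    """Decode a variable-length quantity back into an integer."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            break
    return value

print(encode_vlq(200).hex())        # '8148'
print(decode_vlq(b"\x81\x48"))      # 200
```

The largest allowed delta-time, 0x0FFFFFFF, encodes to the four bytes FF FF FF 7F, which is the worked example given in the MIDI file specification.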
In order to fully benefit from the growing computational capabilities of modern machines, these digital scores have to be encoded into more algebraic structures such as vectors and matrices. Some representations are specific to a precise task, like the couple representation (Hadjeres, Pachet, and Nielsen, 2017) used for representing four-voice chorales written by Johann Sebastian Bach. Despite the tremendous generation results obtained with it, we focus in this chapter on more generic proposals which may fit any kind of musical data. We present here the three main representations in the literature.


The most common way to represent polyphonic music is the piano-roll representation. Here, time is discretized with a reference quantum (typically a fraction of the quarter note) to provide a matrix P(n, t) that represents note activations in musical sequences. An example is depicted in Figure 2.2. This representation fits every kind of music, as it properly handles polyphony. However, the resulting matrices are relatively high-dimensional (88 to 128 dimensions per time step, per voice) and highly repetitive due to their discrete nature. Moreover, because of the typically small number of notes played simultaneously, these matrices are usually highly sparse.
For these reasons, this representation raises several issues for learning, which warranted the definition of alternative approaches.
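The dimensionality and sparsity issues above can be made concrete with a minimal sketch (the note list and matrix sizes are illustrative assumptions, not drawn from the thesis):

```python
import numpy as np

def piano_roll(notes, n_pitches=128, n_steps=16):
    """Binary piano-roll matrix P[pitch, time] built from
    (midi_pitch, onset_step, duration_steps) triples."""
    P = np.zeros((n_pitches, n_steps), dtype=np.int8)
    for pitch, onset, duration in notes:
        P[pitch, onset:onset + duration] = 1
    return P

# A C major triad held for four steps, then a lone G for two steps.
P = piano_roll([(60, 0, 4), (64, 0, 4), (67, 0, 4), (67, 4, 2)])
print(P.shape)               # (128, 16)
print(int(P.sum()), P.size)  # 14 active cells out of 2048 (~0.7% dense)
```

Even this short four-note fragment activates fewer than 1% of the matrix cells, which illustrates why learning directly on piano-rolls can be wasteful.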


The first alternative representation was proposed in Oore et al., 2018. This MIDI-like approach relies on an event-based vocabulary composed of four main MIDI events. The NOTE_ON event (composed of 128 possible sub-events) indicates the start of the corresponding MIDI note. Similarly, the NOTE_OFF event signifies the end of a played note. The TIME_SHIFT event is composed of 125 sub-events and moves the time step forward in increments of 8 ms, up to 1 second. Finally, the SET_VELOCITY event counts 32 sub-events, which change the velocity applied to all subsequent notes until the next velocity event. Hence, the resulting representation of an input piece is a variable-length sequence of discrete events taken from this vocabulary. An example is shown in Figure 2.2. This representation can handle any form of music, with polyphony and a variable number of voices or time signatures. However, as the MIDI-like representation relies on the idea of time shifts, all the attributes corresponding to a given note (velocity, note-on and note-off) may be encoded at very distant positions in the sequence, which can again raise issues for learning systems.
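A simplified encoder in the spirit of this vocabulary can be sketched as follows (it re-emits a SET_VELOCITY event for every note rather than only on changes, and assumes all times fall on the 8 ms grid; both are simplifying assumptions):

```python
def to_midi_like(notes):
    """Encode (pitch, onset_ms, duration_ms, velocity) notes as a
    MIDI-like event sequence: NOTE_ON/NOTE_OFF, TIME_SHIFT in 8 ms
    increments capped at 1 s, and 32 quantized velocity bins."""
    # Timed events; at equal times, NOTE_OFF < SET_VELOCITY < NOTE_ON.
    timed = []
    for pitch, onset, dur, vel in notes:
        timed.append((onset, 1, ('SET_VELOCITY', vel * 32 // 128)))
        timed.append((onset, 2, ('NOTE_ON', pitch)))
        timed.append((onset + dur, 0, ('NOTE_OFF', pitch)))
    events, now = [], 0
    for t, _, ev in sorted(timed):
        while t > now:  # emit TIME_SHIFTs of at most 1 second each
            step = min(t - now, 1000)
            events.append(('TIME_SHIFT', step // 8))
            now += step
        events.append(ev)
    return events

# Two half-second notes played one after the other.
events = to_midi_like([(60, 0, 512, 100), (64, 512, 512, 100)])
print(events[:4])
# [('SET_VELOCITY', 25), ('NOTE_ON', 60), ('TIME_SHIFT', 64), ('NOTE_OFF', 60)]
```

Note how the NOTE_ON and NOTE_OFF of the first note are already separated by a TIME_SHIFT; in long polyphonic pieces this separation grows arbitrarily, which is the weakness discussed above.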



To alleviate this particular issue, the NoteTuple representation (Hawthorne et al., 2018) was recently proposed. In this method, each note is represented by a tuple composed of four attributes, namely the time offset from the previous note, pitch, velocity and duration. The encoding of each attribute is categorical, with its own vocabulary instead of a large shared one (as in the MIDI-like representation). As the time offset and duration vocabularies can potentially be very large, both are separated into major and minor tick fields. The time offset attribute counts 13 major and 77 minor tick values, representing 0 through 10 seconds, while the duration attribute counts 25 major and 40 minor tick values. The result is a tuple containing six elements for each note. For notes played simultaneously, tuples are listed in order of increasing pitch, as shown in the example of Figure 2.2.
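A sketch of this tuple encoding, assuming an illustrative 10 ms minor tick (the actual tick resolutions are a detail of the original paper, not stated above):

```python
def split_ticks(ms, n_minor, minor_ms=10):
    """Split a time value (ms) into (major, minor) tick fields,
    assuming a 10 ms minor tick (an illustrative resolution)."""
    ticks = ms // minor_ms
    return ticks // n_minor, ticks % n_minor

def to_note_tuples(notes):
    """Encode (onset_ms, pitch, velocity, duration_ms) notes as
    six-element tuples in the spirit of NoteTuple:
    (offset_major, offset_minor, pitch, velocity, dur_major, dur_minor)."""
    tuples, prev_onset = [], 0
    for onset, pitch, vel, dur in sorted(notes):  # time, then rising pitch
        off_major, off_minor = split_ticks(onset - prev_onset, 77)  # 13 x 77
        dur_major, dur_minor = split_ticks(dur, 40)                 # 25 x 40
        tuples.append((off_major, off_minor, pitch, vel, dur_major, dur_minor))
        prev_onset = onset
    return tuples

# A C-E dyad followed 800 ms later by a single G.
print(to_note_tuples([(0, 64, 80, 500), (0, 60, 80, 500), (800, 67, 80, 400)]))
# [(0, 0, 60, 80, 1, 10), (0, 0, 64, 80, 1, 10), (1, 3, 67, 80, 1, 0)]
```

Unlike the MIDI-like encoding, every attribute of a note now sits in the same tuple, so no note-level information ends up at distant sequence positions.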

Musical spaces

One of the core research questions in computer music remains finding an adequate representation for the relationships between musical objects. Indeed, the transcription of music as symbols usually fails to provide information about harmonic or timbral relationships. In that sense, we can loosely say that the goal would be to find a target space which could exhibit such properties between musical entities. Hence, finding representations of musical objects as spaces has witnessed a flourishing interest in the scientific community (Ukkonen, Lemström, and Mäkinen, 2003; Bigo, Giavitto, and Spicher, 2011; Typke, Veltkamp, and Wiering, 2004; Uitdenbogerd and Zobel, 1999). Many of the formalizations proposed over the past decades have an algebraic nature that allows the study of combinatorial properties and the classification of musical structures. Here, we draw a distinction between two types of representations: the rule-based and the agnostic approaches.
In the rule-based stream of research, several types of spaces have been developed since the Pythagoreans. In the 17th century, Marin Mersenne uncovered many algebraic and geometric structures in classical music through his circular representation of the pitch space (Mersenne, 1972; Mesnage, 1997; Vieru, 1995). Many years later, Henry Klumpenhouwer presented a new space for representing music, called K-nets (Klumpenhouwer, 1991). This approach revealed structural aspects of music through the many isographies of the networks (Lewin, 1994; Perle, 1996; Lewin, 1990). Finally, we can cite the well-known Tonnetz, invented by Euler in the 18th century (Euler, 1739), where symbolic pitches are geometrically organized in a Euclidean space defined by infinite axes associated with particular musical intervals. Examples of these different spaces are shown in Figure 2.3.
From a mathematical point of view, all these representations are equivalent methods of formalizing the structural properties of the equal-tempered system (i.e. the division of the octave into twelve equal intervals). They provide a novel and powerful tool for the analysis of harmonic progressions. Their combinatorial nature has given rise to many composition treatises and techniques, which are valued for the pedagogical benefits they offer to students by transmitting knowledge of the manipulation of musical materials (Bigo et al., 2015). Moreover, these models have proven their ability to enhance human creativity, since they have been used in contemporary music, for example in the composition of the so-called Hamiltonian songs (Bigo and Andreatta, 2014). These were created by following the 124 possible Hamiltonian cycles – which refer in graph theory to paths passing through every node exactly once and ending precisely where they start – present in the Tonnetz (Albini and Antonini, 2009).
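The notion of a Hamiltonian cycle can be illustrated on a simple pitch-class version of the Tonnetz, whose edges connect pitch classes a third or a fifth apart (a sketch only; the 124 cycles cited above are counted on Albini and Antonini's graph, which this simplified model does not reproduce):

```python
def neighbors(pc):
    """Tonnetz-style neighbours of a pitch class: a minor third (3),
    major third (4) or perfect fifth (7) away, in either direction."""
    return {(pc + i) % 12 for i in (3, 4, 7, -3, -4, -7)}

def is_hamiltonian_cycle(path):
    """Check that `path` visits all 12 pitch classes exactly once and
    that every consecutive pair (wrapping around) is a Tonnetz edge."""
    if sorted(path) != list(range(12)):
        return False
    return all(path[(i + 1) % 12] in neighbors(path[i]) for i in range(12))

# The circle of fifths is one such cycle: every step is an interval of 7.
fifths = [(i * 7) % 12 for i in range(12)]
print(is_hamiltonian_cycle(fifths))           # True
print(is_hamiltonian_cycle(list(range(12))))  # False: semitones are not edges
```

The checker makes the definition concrete: a Hamiltonian song traverses such a cycle, sounding every pitch class once before returning home.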
There are two main benefits to this type of rule-based approach. First, once the model is built, it can be straightforward to analyze some of its properties (based on the defined sets of rules). Second, we can also understand the scope in which the model should be efficient, based on its construction. But, by definition, a rule-based approach represents the particular vision of the designer who thought out and crafted the corresponding rule sets. Hence, the corresponding musical spaces will provide a given set of interactions. It is interesting to ask whether we could develop a more empirical discovery of these spaces, which could provide more generic musical relationships. Such spaces could exhibit properties in musical scores in ways we would never have thought of. In doing so, we could then find new relevant features and metric relationships between musical entities and develop innovative applications. Hence, in the following, we consider that the important properties of a space are not necessarily its dimensions (as in the rule-based approaches), but rather the metric relationships or distances between objects inside this space (as in the agnostic approach that we seek to develop).
However, we remain conscious of the limitations of such agnostic spaces. Indeed, they are still indirectly the product of our design of the learning algorithms. Furthermore, they might be highly dependent on the dataset used for their construction. Finally, there might be no direct way to analyze their properties or prove their efficiency.

machine learning tools

In the following section, we propose an overview of all the machine learning concepts that have been used during this thesis. First, we propose a formal description of what is called machine learning, by laying the necessary bases for an algorithm to be able to learn. Then, we detail the training procedure and the notions of artificial neurons and neural networks. Finally, we develop the architectures and specificities of all models and techniques useful for our work.


Formal description

Learning can be defined as the process of acquiring, modifying or reinforcing knowledge by discovering new facts and theories through observations (Gross, 2015). To succeed, learning algorithms need to grasp the generic properties of different types of objects by "observing" a large number of examples. These observations are collected inside training datasets that supposedly contain a wide variety of examples. There exist three major types of learning:
• Supervised learning: Inferring a function from labeled training data. Every sample in the dataset is provided with a corresponding ground-truth label.
• Unsupervised learning: Trying to find hidden structure in unlabeled data. This leads to the important difference with supervised learning that correct input/output pairs are never presented. Moreover, there is no simple evaluation of the model's accuracy.
• Reinforcement learning: Acting to maximize a notion of cumulative reward. This type of learning was inspired by behaviorist psychology. The model receives a positive reward if it outputs the right object and a negative one otherwise.
In computer science, a typical example problem is learning how to classify elements by observing a set of labeled examples. Hence, the final goal is to be able to find the class memberships of given objects (e.g. a sound played either by a piano, an oboe or a violin). Mathematically, we can define the classification learning problem as follows.
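A sketch of the standard formulation (the notation here is generic, not necessarily that of the thesis):

```latex
Given a training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where
$\mathbf{x}_i \in \mathcal{X}$ is an observation and $y_i \in \mathcal{Y} =
\{1, \dots, K\}$ its class label, classification amounts to finding a function
$f_{\theta} \colon \mathcal{X} \to \mathcal{Y}$ that generalizes to unseen
data, i.e. that minimizes the expected risk
\[
  \mathcal{R}(\theta) = \mathbb{E}_{(\mathbf{x}, y)}
    \big[ \ell\big(f_{\theta}(\mathbf{x}), y\big) \big],
\]
for a loss $\ell$, approximated in practice by the empirical risk over
$\mathcal{D}$.
```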

Table of contents:

1 introduction 
1.1 Motivations
1.2 Dissertation organization and main contributions
2 overview 
2.1 Introduction
2.2 Symbolic music representation
2.2.1 Music as symbol
2.2.2 Symbolic music representations for computer science
2.2.3 Musical spaces
2.3 Machine learning tools
2.3.1 Basics
2.3.2 Specific tools
2.4 Embedding spaces
2.4.1 Apparition and formalism
2.4.2 Successful models
2.4.3 Space representation
2.5 Symbolic musical spaces
2.5.1 Prediction-based
2.5.2 VAE-based
2.6 Conclusion
3 prediction-based framework 
3.1 Introduction
3.2 CNN-LSTM model
3.2.1 Motivations
3.2.2 Architecture
3.2.3 Hierarchical attention modulation
3.2.4 Data and training
3.3 Method evaluation
3.3.1 Prediction results
3.3.2 Embedded data visualization
3.4 Conclusion
4 vae-based framework 
4.1 Introduction
4.2 Motivation
4.3 Polyphonic music representations
4.3.1 The signal-like representation
4.3.2 Benchmark
4.4 Spaces evaluation
4.4.1 Musical analysis
4.4.2 Results
4.5 Conclusion
5 applications 
5.1 Introduction
5.2 Composers classification
5.2.1 Settings
5.2.2 Discussion
5.3 Creativity support tool
5.3.1 Attribute vector arithmetic
5.3.2 Interpolation
5.3.3 Discussion
5.4 Conclusion
6 conclusion 
6.1 Summary and discussion
6.2 Future works
6.3 Overall conclusion

