Architecture of the Convolutional Neural Network


Handcrafted systems

Early Music Information Retrieval (MIR) systems encoded domain knowledge (audio, auditory perception and musical knowledge) by hand-crafting signal processing and statistical models. Data were used, at most, to manually tune some parameters (such as filter frequencies or transition probabilities). Starting from digital signal processing fundamentals, we present here the different knowledge-based tools used for rhythm description, tempo estimation and genre classification. We also present the main works dedicated to these tasks, which are among the most explored in the field. For more detail or a more precise view of the state of the art, good overviews have been published: (Gouyon et al., 2006; Zapata and Gómez, 2011; Peeters, 2011) for rhythm description and (Aucouturier and Pachet, 2003; Ramírez and Flores, 2019) for music genre classification.

Signal Fundamentals

We present here only the main elements of signal processing used in this thesis. For a more detailed description, many reference books or articles can be consulted, such as (Rabiner and Gold, 1975) and, more recently, (Müller et al., 2011) for a specific application to music.
discrete-time signal. The basic element of any computer music operation is the audio signal, which carries the acoustic information that we perceive as sound. It allows the transmission, the storage, the creation and, in the case of MIR, the analysis of music. Music is a continuous-time audio signal: for a given signal, there is a function $f : \mathbb{R} \to \mathbb{R}$ such that each point in time $t \in \mathbb{R}$ has an amplitude $f(t) \in \mathbb{R}$. Due to the inherent constraints of digital systems, the signal cannot be processed in its continuous form; it has to be discretized (i.e. converted to a finite number of points). To do so, two operations are applied to the audio signal: sampling and quantization. Sampling can be done equidistantly (T-sampling): given a continuous-time signal $f$ and a positive real number $T > 0$, the discrete-time signal $x$ is defined as a function $x : \mathbb{Z} \to \mathbb{R}$ by

$$x(n) := f(n \cdot T) \qquad (2\text{–}1)$$

for $n \in \mathbb{Z}$. Here $T$ is the sampling period; its inverse gives the sampling rate in Hertz (Hz):

$$F_s = 1/T \qquad (2\text{–}2)$$
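As an illustration of Equations 2–1 and 2–2, the following minimal sketch samples a continuous-time function at equidistant points. The 440 Hz sinusoid, the 44.1 kHz sampling rate and the 16-bit uniform quantizer are assumptions chosen for the example; they are not taken from the text.

```python
import numpy as np

def t_sample(f, T, n_values):
    """Equidistant sampling: x(n) = f(n * T) for each integer index n."""
    n = np.asarray(n_values)
    return f(n * T)

def quantize(x, n_bits=16):
    """Illustrative uniform quantizer mapping [-1, 1] to signed integers."""
    max_int = 2 ** (n_bits - 1) - 1
    return np.round(np.clip(x, -1.0, 1.0) * max_int).astype(np.int32)

def f(t):
    """Hypothetical continuous-time signal: a 440 Hz sinusoid."""
    return np.sin(2 * np.pi * 440.0 * t)

Fs = 44100.0            # sampling rate in Hz (assumed value)
T = 1.0 / Fs            # sampling period in seconds, since Fs = 1 / T
n = np.arange(1024)     # sample indices
x = t_sample(f, T, n)   # discrete-time signal x(n) = f(n * T)
x_q = quantize(x)       # quantized amplitudes
```

Here `x[n]` holds the amplitude of `f` at time `n * T`, so one second of audio corresponds to `Fs` consecutive samples.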

Temporal representations

Analyzing the rhythmic structure often means studying the periodicity of events related to the rhythm, in this case the onsets. Likewise, estimating the tempo from an OEF requires determining its dominant pulse, i.e. the periodic element with the highest energy. The comb filter bank, the DFT and the Autocorrelation Function (ACF) are periodicity representations, while other representations such as the similarity matrix, the scale transform and the modulation spectrum can also be used to represent rhythm.
comb filter banks. Scheirer (1998) proposes the use of band-pass filters combined with resonating comb filters and peak picking to estimate the dominant pulse positions and hence the tempo. Klapuri, Eronen, and Astola (2006) also use resonating comb filter banks driven by band-wise accent signals. The main extension they propose is the tracking of multiple metrical levels.
dft. The DFT (presented in Section 2.3.1) has also been used for tempo estimation (Xiao et al., 2008; Grosche and Müller, 2009). It has been applied to the OEF in (Holzapfel and Stylianou, 2011; Peeters, 2011) or to other representations (Klapuri, Eronen, and Astola, 2006) for rhythm description.
acf. The ACF is the most commonly used periodicity function. It is calculated as follows:

$$\mathrm{ACF}(m) = \sum_{n=0}^{N-1} x(n)\, x(n - m) \qquad (2\text{–}7)$$
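To make the ACF concrete, here is a small sketch that computes it for a toy onset-energy function and reads a tempo off the lag with the highest value. Treating $x$ as zero outside its support, the tempo search range (60 to 200 BPM), the frame rate and the impulse-train input are all assumptions made for this example, not details from the works cited above.

```python
import numpy as np

def acf(x):
    """ACF(m) = sum_n x(n) * x(n - m), with x taken as zero outside 0..N-1."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # x[m:] holds x(n) for n = m..N-1 and x[:N - m] holds the matching x(n - m).
    return np.array([np.dot(x[m:], x[:N - m]) for m in range(N)])

def dominant_tempo(oef, frame_rate, bpm_min=60.0, bpm_max=200.0):
    """Return the tempo (BPM) whose lag has the highest ACF value in the range."""
    r = acf(oef)
    lag_min = int(np.ceil(60.0 * frame_rate / bpm_max))
    lag_max = int(np.floor(60.0 * frame_rate / bpm_min))
    lags = np.arange(lag_min, lag_max + 1)
    best_lag = lags[np.argmax(r[lags])]
    return 60.0 * frame_rate / best_lag

# Toy onset-energy function: one impulse every 0.5 s, sampled at 100 frames/s.
frame_rate = 100.0
oef = np.zeros(1000)
oef[::50] = 1.0
tempo = dominant_tempo(oef, frame_rate)
```

In this toy case the impulses are 50 frames apart, so the strongest lag in the search range corresponds to 120 BPM. Restricting the lag range is one simple way to limit the octave ambiguity between a lag and its multiples.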

Musical genre handcrafted features

Early automatic genre classification systems were based on a twofold procedure: a step of handcrafted feature extraction and a step of classification (relying on a machine learning algorithm). For feature extraction, a vector of low-level descriptors is computed on an audio signal cut into frames using the STFT (Section 2.3.1). As mentioned earlier, the classification of the musical genre is based on descriptors related to the characteristic concepts of music: the rhythm, the timbre and the pitch. We introduced numerous features related to rhythm in the previous section. We present below an exhaustive list of timbre and pitch related features devoted to automatic genre classification.
timbre related features. The features related to timbre are the most used for automatic genre classification because they describe the spectral distribution of the signal. In other words, timbre features extracted for every frame characterize the sources (instruments) in the music. Among those, the MFCC, which approximate the human auditory system, are the most used (Tzanetakis and Cook, 2000; Pye, 2000; Deshpande, Nam, and Singh, 2001). Other works use spectral centroid (Tzanetakis and Cook, 2002; Lambrou et al., 1998), spectral flux, zero crossing rate and spectral roll-off (Tzanetakis and Cook, 2002).
pitch related features. Ermolinskiy, Cook, and Tzanetakis (2001) use pitch histogram feature vectors to represent the harmony of an audio signal. Wakefield (1999) developed a "chromagram" that describes the harmonic content of the music and can be used to determine the pitch range of the audio signal. Chroma features are the main features related to pitch. They enable the modeling of melody and harmony, assuming that humans perceive different pitches as similar if they are separated by an octave (Müller, 2015).
For good overviews of early automatic genre classification methods, refer to (Aucouturier and Pachet, 2003), where the extracted features are described in detail. A more recent overview of these methods is presented in (Ramírez and Flores, 2019), which also covers the works that used automatic feature learning and deep learning.
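Two of the frame-level timbre descriptors named above, the spectral centroid and the zero crossing rate, can be sketched in a few lines. This is an illustrative implementation under assumed settings (Hann window, 1024-sample frames, hop of 512, a synthetic 1 kHz test tone); the cited works may use different definitions or parameters.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Cut a 1-D signal into overlapping frames (no padding at the end)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def spectral_centroid(frames, sr):
    """Magnitude-weighted mean frequency of each frame, in Hz."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / np.maximum(mag.sum(axis=1), 1e-12)

def zero_crossing_rate(frames):
    """Fraction of consecutive sample pairs whose signs differ, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# Synthetic test signal (assumption): one second of a 1 kHz tone at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000.0 * t + 0.3)

frames = frame_signal(x, frame_len=1024, hop=512)
centroid = spectral_centroid(frames, sr)   # one value per frame, in Hz
zcr = zero_crossing_rate(frames)           # one value per frame, in [0, 1]
```

For a pure tone the centroid sits close to the tone frequency and the zero crossing rate is about twice the frequency divided by the sampling rate. Stacking such per-frame values gives the kind of low-level descriptor vector that is then passed to a classifier.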


Data-driven systems

Data-driven systems use machine learning to acquire knowledge from medium to large-scale datasets. A step of feature extraction from the data is often applied upstream of the training step. It makes the data more suitable for learning, depending on the task to be learned. In this section, we first present the machine learning fundamentals, describing only the tools used in this thesis. Then we list some of the works using machine learning and deep learning to perform the tasks of automatic tempo estimation and genre classification.

Machine learning fundamentals

Machine learning is a research field that, as its name suggests, encompasses the computer algorithms designed to learn. Contrary to a classical algorithm, developed according to a certain logic, machine learning methods are expected to learn this logic from the data they process. For a given task, they automatically target the patterns contained in the data.
Supervised learning is a paradigm in machine learning that uses labeled data and where the desired output for a given input is specified. It is opposed to unsupervised learning, for which data labels are not available. Other so-called semi-supervised learning methods can also be used if the data collection is partially labeled. The methods we have developed in this thesis (Chapters 4, 5 and 6) all rely on supervised learning.

Supervised learning

principle. A labeled data item is an input $x_i$ tagged with the output response $y_i$ that we want the model to find automatically. The learning step of the model then consists in identifying the patterns in the input data that lead to the desired output. Thus, given a set of input/output pairs $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{|S|}, y_{|S|})\}$, we want to find a function $f_S$ that captures the relationships between $x$ and $y$ using controllable parameters $\theta$:

$$f_S(x_i, \theta) = \hat{y}_i \approx y_i \qquad (2\text{–}8)$$

where $\hat{y}_i$ is the output of the model. The training pairs $(x_i, y_i)$ are drawn from an unknown joint probability distribution $p(x, y)$. The goal is to approximate $p(x, y)$ with $f_S$ by knowing only $S$ and adjusting the parameters $\theta$. Thus, based only on the prior knowledge $S$, the main objective of a trained model is to take unseen input data and correctly determine its output with Equation 2–8. This process is commonly called prediction. With the annotated dataset available, it is common to assume that the annotations are relevant and therefore that $y_i$ is the correct label of the input $x_i$. However, these annotations, often assigned manually, are subject to potential errors, as we will see in Chapter 3.
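The principle above can be sketched with a deliberately simple model. In this example the model family (a straight line with two parameters), the synthetic data distribution and the least-squares fitting procedure are all assumptions chosen for illustration; they are not the models used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, theta):
    """Hypothetical model f(x, theta): a line with intercept and slope."""
    return theta[0] + theta[1] * x

def fit(x_train, y_train):
    """Adjust theta so that f(x_i, theta) is close to y_i in the least-squares sense."""
    A = np.stack([np.ones_like(x_train), x_train], axis=1)
    theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return theta

# Training set S: pairs (x_i, y_i) drawn from a synthetic p(x, y),
# here y = 2x + 1 plus Gaussian noise (assumed for the example).
x_train = rng.uniform(-1.0, 1.0, 200)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.1, 200)

theta = fit(x_train, y_train)     # learning: adjust the parameters on S only

x_new = np.array([0.5])           # unseen input
y_hat = model(x_new, theta)       # prediction, in the sense of Equation 2-8
```

The fitted `theta` is determined only from `S`, and `y_hat` is the model's answer for an input that was not part of the training pairs.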
variance. Having access to the labels of the data makes it possible to evaluate a trained model by computing its accuracy. However, this does not necessarily reflect real-world performance. The variance error measures the variation in performance across the different sets that we can draw from $p(x, y)$. The variance decreases as the size and representativeness of the training data $S$ increase. It is formalized as:

$$\mathrm{Variance} = E\big[(f_S(\theta) - E[f(\theta)])^2\big] \qquad (2\text{–}9)$$

where $E[f(\theta)]$ is the expected performance of the model definition function $f(\theta)$ across all possible draws from $p(x, y)$ and $f_S(\theta)$ the actual performance on $S$.
bias. The bias describes how close $f(\theta)$ is to the unknown function $f$ that best describes the joint probability distribution $p(x, y)$. It is formalized as:

$$\mathrm{Bias} = E[f(\theta)] - f \qquad (2\text{–}10)$$
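Equations 2–9 and 2–10 can be estimated empirically by repeatedly drawing training sets and refitting a model. The sketch below makes several assumptions that are not in the text: the quantities are measured on the prediction at a single query input `x0`, the underlying function is a sinusoid, and the model family is a polynomial fitted by least squares.

```python
import numpy as np

def true_function(x):
    """Assumed target function standing in for the unknown f."""
    return np.sin(np.pi * x)

def draw_training_set(rng, size, noise=0.2):
    """Draw one training set S from a synthetic p(x, y)."""
    x = rng.uniform(-1.0, 1.0, size)
    y = true_function(x) + rng.normal(0.0, noise, size)
    return x, y

def bias_and_variance(degree, train_size, x0=0.5, n_draws=300, seed=0):
    """Monte Carlo estimate of bias and variance of the prediction at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_draws)
    for k in range(n_draws):
        x, y = draw_training_set(rng, train_size)
        coeffs = np.polyfit(x, y, degree)        # model fitted on this draw of S
        preds[k] = np.polyval(coeffs, x0)
    mean_pred = preds.mean()                     # estimate of E[f(theta)]
    variance = np.mean((preds - mean_pred) ** 2) # Equation 2-9
    bias = mean_pred - true_function(x0)         # Equation 2-10
    return bias, variance

# Same model family, small versus large training sets.
bias_small, var_small = bias_and_variance(degree=3, train_size=20)
bias_large, var_large = bias_and_variance(degree=3, train_size=200)

# A more constrained family (a line) on the large training sets.
bias_line, var_line = bias_and_variance(degree=1, train_size=200)
```

In this setting the variance shrinks when the training sets grow, in line with the statement above, while the straight-line model keeps a larger bias than the cubic one however much data it sees.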

Table of contents:

1 introduction
1.1 Context – Electronic Dance Music
1.1.1 A definition
1.1.2 History and taxonomy
1.1.3 Electronic/Dance Music Musical Characteristics
1.2 Dissertation Organization and main contributions
2 fundamentals and state of the art
2.1 Introduction
2.2 Core Definitions
2.2.1 Rhythm
2.2.2 Musical genres
2.3 Handcrafted systems
2.3.1 Signal Fundamentals
2.3.2 Rhythm handcrafted features
2.3.3 Musical genre handcrafted features
2.4 Data-driven systems
2.4.1 Machine learning fundamentals
2.4.2 Data-driven tempo estimation
2.4.3 Data-driven genre classification
2.5 Electronic/Dance Music in Music Information Retrieval
2.6 Conclusion
3 datasets
3.1 Introduction
3.2 Commonly used datasets
3.3 Electronic Dance Music Datasets
3.4 Discussion
4 deep rhythm
4.1 Introduction
4.2 Motivations
4.2.1 Harmonic representation of rhythm components
4.2.2 Adaptation to a deep learning formalism
4.3 Harmonic Constant-Q Modulation
4.3.1 Computation
4.3.2 Visual identification of tempo
4.4 Deep Convolutional Neural Network
4.4.1 Architecture of the Convolutional Neural Network
4.4.2 Training
4.5 Aggregating decisions over time
4.5.1 Oracle Frame Prediction
4.5.2 Attention Mechanism
4.6 Evaluation
4.6.1 Tempo Estimation
4.6.2 Rhythm-oriented genre classification
4.7 Conclusion
5 deep rhythm extensions
5.1 Introduction
5.2 Complex Deep Rhythm
5.2.1 Why complex representation/convolution?
5.2.2 Complex HCQM
5.2.3 Complex Convolution
5.2.4 Evaluation
5.3 Multitask Learning
5.3.1 Why multitask learning?
5.3.2 Multitask Deep Rhythm
5.3.3 Evaluation
5.4 Multi-Input network
5.4.1 Why multi-input network?
5.4.2 Multi-input Network
5.4.3 Evaluation
5.5 Conclusion
6 metric learning deep rhythm
6.1 Introduction
6.2 Metric learning principles
6.3 Losses
6.3.1 Evolution of metric learning losses
6.3.2 Triplet loss
6.4 Triplet Loss Deep Rhythm
6.4.1 Architecture
6.4.2 Training
6.5 Evaluation and Analysis
6.5.1 Datasets
6.5.2 Embedding space
6.5.3 Classifiers
6.5.4 Classification Results
6.6 Conclusion
7 conclusion
7.1 Summary and main contributions
7.2 Future Works
7.3 Overall conclusion
bibliography