Computational Methods of Information Geometry

Get Complete Project Material File(s) Now! »

To computational information geometry

The research field of computational information geometry gathers a broad commu-nity around the development and application of computational methods that rely on theoretical constructs from information geometry. This community notably inter-sects the communities of machine learning and of signal processing. In particular, many techniques from machine learning and signal processing rely on the use of statistical models or distance functions to analyze and process the data. It is there-fore a natural approach to elaborate computational methods based on information geometry, from the perspective of statistical manifolds or information metrics and divergences, and from the interplay between these notions.
Several authors have undertaken this approach with various purposes, such as studying theoretical aspects of Boltzmann machines [Amari et al. 1992], neural networks [Amari 1995], natural gradient learning [Amari 1998], robust estimation through minimization of divergences [Basu et al. 1998, Eguchi and Kano 2001], mean-field approximation [Tanaka 2000], hierarchies of probability distributions [Amari 2001], turbo codes [Ikeda et al. 2004], diﬀusion kernels [Laﬀerty and Lebanon 2005], prior selection [Snoussi 2005]. The information-geometric approach has also proved beneficial in a variety of applications such as data clustering and mining with α-divergences [Hero et al. 2002, Schwander and Nielsen 2011], data embedding and dimensionality reduction with the Fisher information [Carter et al. 2009, 2011], shape analysis with information metrics [Peter and Rangarajan 2006, 2009], blind source separation with independent component analysis in the space of estimating functions [Amari and Cardoso 1997, Amari 1999], or with robust estimation based on minimization of divergences [Mihoko and Eguchi 2002, Eguchi 2009].
In this context, certain divergences have been employed extensively. This is in particular the case of Bregman divergences and of their extensions, because of their links with convex optimization through convex duality, and with statistical exponen-tial families through dually flat spaces. These divergences have notably been used to develop novel computational methods, often generalizing standard algorithms and schemes to a vast family of distance measures or related statistical models. Famous examples include the generalization of principal component analysis to exponential families based on the minimization of Bregman divergences [Collins et al. 2002], the extension of hard and soft clustering with consideration of k-means and ex-pectation-maximization within a unifying framework for exponential families and Bregman divergences [Banerjee et al. 2005], and the extension of least squares and normality assumptions in regression to generalized linear models with exponential families [Wainwright and Jordan 2008].
These divergences have also been employed in a variety of techniques such as boost-ing methods [Murata et al. 2004] and their relations to weighted clustering [Nock and Nielsen 2006], clustering with approximation guarantees [Nock et al. 2008], sur-rogates for learning [Nock and Nielsen 2009], matrix factorizations [Dhillon and Sra 2006, Dhillon and Tropp 2008], low-rank kernel learning [Kulis et al. 2009], mixture learning [Fujimoto and Murata 2007], simplification and hierarchical representations of mixtures of exponential families [Garcia and Nielsen 2010], contextual re-ranking [Schwander and Nielsen 2010], shape retrieval [Liu et al. 2010, 2012], distance learning for semi-supervised clustering [Wu et al. 2012].
They have also proved relevant for the generalization of several standard computa-tional geometry algorithms that were originally designed for the Euclidean distance, including nearest neighbor search [Cayton 2008, Nielsen et al. 2009a,b], range search [Cayton 2009], centroid computation [Nielsen and Nock 2009b, Nielsen and Boltz 2011, Vemuri et al. 2011], smallest enclosing balls [Nock and Nielsen 2005, Nielsen and Nock 2005, 2008, 2009a], Voronoi diagrams [Nielsen et al. 2007, Boissonnat et al. 2010, Nielsen and Nock 2011].
More generally, Bregman divergences, and other information divergences including Csiszár divergences, have revealed of key importance in statistical approaches to machine learning and signal processing. This had already been put in perspective in the early paper of Basseville [1989]. The literature on these issues has considerably expanded in the recent years and an up-to-date and thorough review is presented by Basseville [2013].

Outcomes of the thesis

Hereafter, we outline the directions of this thesis and summarize our main contri-butions in the realm of computational information geometry and signal processing. We also describe the publications and communications that arose out of this work.

Outline and main contributions

In the present work, we propose two independent algorithmic schemes that fall within the framework of computational information geometry. Although these methods naturally fit within the general domains of machine learning and signal processing, our initial motivations actually arise from two problems in audio signal processing, that of audio segmentation and that of polyphonic music transcription. Furthermore, we are deeply concerned with online machine listening, and we seek to design real-time systems to solve the two mentioned problems.
In this context, we address the problem of real-time audio segmentation by intro-ducing novel computational methods for sequential change detection with exponen-tial families. Concerning real-time polyphonic music transcription, we develop novel schemes for non-negative matrix factorization with convex-concave divergences. As discussed above, the two proposed algorithmic schemes are nonetheless of indepen-dent interest and directly applicable in other areas of statistical machine learning and signal processing. Therefore, the main body of this manuscript is organized into two parts, reporting respectively the theoretical contributions of the compu-tational methods developed on the one hand, and the applicative contributions of these methods to audio signal processing on the other hand. The outline of the thesis is shown in Figure 1 and can be discussed as follows.
In Chapter 1, we introduce the theoretical preliminaries on information geometry that are necessary to the developments of Chapter 2 and of Chapter 3. The chapter is further divided into two parallel sections corresponding to the mathematical con-structs required respectively for the two subsequent and independent chapters. We first present important results on exponential families of probability distributions in relation to convex duality and dually flat information geometry. These results are employed in Chapter 2 to develop computational methods for sequential change detection with exponential families. We then focus on introducing relevant no-tions about separable information divergences on the space of discrete positive mea-sures, including Csiszár divergences, Bregman divergences and their generalizations through Jeﬀreys-Bregman and Jensen-Bregman divergences, as well as α-divergences or β-divergences and their generalization through skew (α, β, λ)-divergences. This is employed in Chapter 3 to elaborate computational methods for non-negative matrix factorization with convex-concave divergences.
In Chapter 2, we elaborate on the novel computational methods for sequential change detection with exponential families. To the best of our knowledge, it is the first time that the celebrated problem of change detection is investigated in the light of information geometry. We follow a standard approach where change detec-tion is considered as a statistical decision problem with multiple hypotheses and is solved using generalized likelihood ratio test statistics. A major drawback of pre-vious work in this context is to consider only known parameters before change, or to approximate the exact statistics when these parameters are actually unknown. This is addressed by introducing exact generalized likelihood ratios with arbitrary estimators, and by expanding them for exponential families. By showing tight links between the computation of these statistics and of maximum likelihood estimates, we derive a generic scheme for change detection with exponential families under common scenarios with known or unknown parameters and arbitrary estimators. We also interpret this scheme within the dually flat information geometry of expo-nential families, hence providing both statistical and geometrical intuitions to the problem, and bridging the gap between statistical and distance-based approaches to change detection. The scheme is finally revisited through convex duality, leading to an attractive scheme with closed-form sequential updates for the exact general-ized likelihood ratio statistics, when both the parameters before and after change are unknown and are estimated by maximum likelihood. This scheme is applied in Chapter 4 to devise a general and unifying system for real-time audio segmentation.
In Chapter 3, we elaborate on the novel computational methods developed for non-negative matrix factorization with convex-concave divergences. We notably for-mulate a generic and unifying framework for non-negative matrix factorization with convex-concave divergences. This framework encompasses many common informa-tion divergences, such as Csiszár divergences, certain Bregman divergences, and in particular all α-divergences and β-divergences. A general optimization scheme is developed based on variational bounding with surrogate auxiliary functions for al-most arbitrary convex-concave divergences. Monotonically decreasing updates are then obtained by minimizing the auxiliary function. The proposed framework also permits to consider symmetrized and skew divergences for the cost function. In particular, the generic updates are specialized to provide updates for Csiszár di-vergences, certain skew Jeﬀreys-Bregman divergences, and skew Jensen-Bregman divergences. This leads to several known multiplicative updates as well as novel multiplicative updates for α -divergences, β-divergences, and their symmetrized or skew versions. These results are also generalized by considering the family of skew (α, β, λ)-divergences. This is applied in Chapter 5 to design a real-time system for polyphonic music transcription.
In Chapter 4, we investigate the problem of audio segmentation. We notably de-vise a generic and unifying framework for real-time audio segmentation, based on the methods for sequential change detection with exponential families developed in Chapter 2. A major drawback of previous works in the context of audio segmen-tation, is that they consider specific signals and homogeneity criteria, or assume normality of the data distribution. Other issues arise from the potential computa-tional complexity and non-causality of the schemes. The proposed system explicitly addresses these issues by controlling the information rate of the audio stream to detect changes in real time. The framework also bridges the gap between statistical and distance-based approaches to segmentation through the dually flat geometry of exponential families. We notably clarify the relations between various standard approaches to audio segmentation, and show how they can be unified and general-ized in the proposed framework. Various applications are showcased to illustrate the generality of the framework, and a quantitative evaluation is performed for musical onset detection to demonstrate how the proposed approach can leverage modeling in complex problems.
In Chapter 5, we investigate the problem of polyphonic music transcription. We notably elaborate a real-time system for polyphonic music transcription by employ-ing the computational methods for non-negative matrix factorization with convex-concave divergences developed in Chapter 3. We consider a supervised setup based on non-negative decomposition, where the music signal arrives in real time to the system and is projected onto a dictionary of note spectral templates that are learned oﬄine prior to the decomposition. An important drawback of existing approaches in this context is the lack of controls on the decomposition. This is addressed by using the parametric family of (α, β)-divergences, and by explicitly interpreting their relevancy as a way to control the frequency compromise in the decomposition. The proposed system is evaluated through a methodological series of experiments, and is shown to outperform two state-of-the-art oﬄine systems while maintaining low computational costs that are suitable to real-time constraints.
We conclude the manuscript with a general discussion and draw perspectives for future work. It is our hope that the presented contributions will bring interesting insights and directions for future research in the realm of audio signal processing, and more generally of machine learning and signal processing, in the relatively young but nonetheless prolific field of computational information geometry.

READ Electromechanical Behavior of Dielectric Elastomers

Related publications and communications

The present thesis has led to several publications and communications. This includes one book chapter, one journal article, three conference articles, two conference ab-stracts, one conference tutorial, and seven invited talks. These publications and communications are detailed below.
The work on sequential change detection with exponential families and its appli-cations to real-time audio segmentation was first presented as an article with oral presentation at a national peer-reviewed conference:
A. Dessein and A. Cont. Segmentation statistique de flux audio en temps-réel dans le cadre de la géométrie de l’information. In 23e Colloque du Groupe de Recherche et d’Études du Traitement du Signal (GRETSI), Bordeaux, France, September 2011a. [Dessein and Cont 2011a].
In this early version, we considered change detection in exponential families as a purely geometrical problem where the observations fall into Bregman balls onto the underlying statistical manifold, and are segmented according to the ball radii. This was applied to audio change detection based on spectral distributions for segmen-tation of polyphonic music into note slices. The ideas developed in this paper then matured as formulated in the thesis, that is, as an information-geometric problem where the statistical tests based on generalized likelihood ratio statistics are shown to be intrinsically linked with maximum likelihood estimation and with the canon-ical Kullback-Leibler divergences between the estimated distributions, or Bregman divergences on their parameters, in the diﬀerent change hypotheses. This leveraged other audio applications where the previous approach failed, such as silence and activity segmentation based on energy features, as well as segmentation into speech and music, or into diﬀerent speakers, based on timbral characteristics. Parts of this work have been accepted as an article in an international peer-reviewed journal:
A. Dessein and A. Cont. An information-geometric approach to real-time audio segmentation. IEEE Signal Processing Letters, 20(4):331–334, April 2013a. [Dessein and Cont 2013a].
A complementary paper has also been accepted as an article with oral presentation in an international peer-reviewed conference with edited book proceedings:
A. Dessein and A. Cont. Online change detection in exponential fam-ilies with unknown parameters before and after change. In F. Nielsen and F. Barbaresco, editors, Geometric Science of Information: First In-ternational Conference, GSI 2013, Paris, France, August 28-30, 2013, Proceedings, volume 8085 of Lecture Notes in Computer Science, pages 633–640. Springer, Berlin/Heidelberg, Germany, 2013b. [Dessein and Cont 2013b].
While the former article mostly focuses on audio processing applications with just the necessary theory to develop a real-time scheme for audio segmentation, the latter is rather targeted to the community of information geometry, and introduces more theoretical aspects as well as other applications to two well-known time series of the literature in geophysics and finance.
Concerning non-negative matrix factorization with convex-concave divergences and its applications to real-time polyphonic music transcription, our contributions were first presented as an article with poster presentation at an international peer-reviewed conference:
A. Dessein, A. Cont, and G. Lemaitre. Real-time polyphonic music tran-scription with non-negative matrix factorization and beta-divergence. In 11th International Society for Music Information Retrieval Conference (ISMIR), pages 489–494, Utrecht, Netherlands, August 2010a. [Dessein et al. 2010a].
In this preliminary work, we employed the β-divergences with no discussion about convergence guarantees. We also presented this work as an extended abstract with code submission at an evaluation campaign in the reference international competi-tion for tasks in music information retrieval:
A. Dessein, A. Cont, and G. Lemaitre. Real-time polyphonic music tran-scription with non-negative matrix factorization and beta-divergence. In 6th Music Information Retrieval Evaluation eXchange (MIREX), Ut-recht, Netherlands, August 2010b. [Dessein et al. 2010b].
During this contest, our system achieved results comparable to the state-of-the-art by finishing second out of six systems for the task of note transcription in polyphonic music, while actually being the only real-time system in competition. The system later evolved to include convergence guarantees for critical safety, in particular to ensure the monotonic decrease of the cost function along the iterations. This ex-tension has been accepted as a peer-reviewed book chapter in a collective work by world-leading researchers on matrix information geometry:
A. Dessein, A. Cont, and G. Lemaitre. Real-time detection of overlapping sound events with non-negative matrix factorization. In F. Nielsen and R. Bhatia, editors, Matrix Information Geometry, chapter 14, pages 341– 371. Springer, Berlin/Heidelberg, Germany, 2013. [Dessein et al. 2013].

Outcomes of the thesis

In this work, we also explored alternatives to the β-divergences by using a con-vex Euclidean cost function with sparsity penalty via non-negative sparse coding and `1-norm regularization. In addition, we considered other tasks of sound detec-tion with overlapping sources, namely drum transcription and environmental sound recognition. These parallel developments are however not described in the thesis for the sake of conciseness.
During the three years of the thesis, we also investigated several other applica-tions in audio signal processing and music information retrieval using computational methods of information geometry. In particular, we employed existing algorithms for parameter estimation in mixture models based on exponential families, as well as clustering and proximity search with Bregman divergences, for sample applica-tions in audio quantization, indexing, structuring and querying. These applications were however not developed as thoroughly as that of audio segmentation and poly-phonic music transcription, and are at present more at a proof-of-concept level. We therefore chose not to discuss them in this thesis and to focus instead on the more ac-complished work, with both theoretical and practical contributions, that is described above. Nonetheless, our researches on information geometry for audio signal pro-cessing have been presented in their general framework at several scientific venues. We notably presented a tutorial at an international peer-reviewed conference:
A. Dessein and A. Cont. Applications of information geometry to audio signal processing. In 14th International Conference on Digital Audio Eﬀects (DAFx), Paris, France, September 2011b. [Dessein and Cont 2011b].
We also had an abstract with poster presentation at a non-peer-reviewed national conference for doctoral students:
A. Dessein. La géométrie de l’information pour l’analyse et le traitement des flux audio. In 5es Journées Jeunes Chercheurs en Audition, Acous-tique musicale et Signal audio (JJCAAS), Marseille, France, November 2009b. [Dessein 2009b].
Last but not least, we presented our researches orally in various invited talks:
A. Dessein. Introduction to the tools of information geometry for the manipulation of audio streams. In Séminaire Mathématiques, Musique et relations avec d’autres disciplines (MaMuX), Institut de Recherche et Coordination Acoustique/Musique, Paris, France, October 2009a. [Des-sein 2009a].
A. Dessein. Some applications of non-negative matrix factorization and of information geometry in audio signal processing. In Research Seminar, RIKEN Brain Science Institute, Tokyo, Japan, October 2010a. [Dessein 2010a].

Table of contents :

Abstract
Preface
Introduction
Context of the thesis
From information geometry theory
To computational information geometry
Outcomes of the thesis
Outline and main contributions
Related publications and communications
I. Computational Methods of Information Geometry
1. Preliminaries on Information Geometry
1.1. Exponential families of probability distributions
1.1.1. Basic notions
1.1.2. First properties
1.1.3. Convex duality
1.1.4. Maximum likelihood
1.1.5. Dually flat geometry
1.2. Separable divergences on the space of discrete positive measures
1.2.1. Basic notions
1.2.2. Csiszár divergences
1.2.3. Skew Jeffreys-Bregman divergences
1.2.4. Skew Jensen-Bregman divergences
1.2.5. Skew (, , )-divergences
2. Sequential Change Detection with Exponential Families
2.1. Context
2.1.1. Background
2.1.2. Motivations
2.1.3. Contributions
2.2. Statistical framework
2.2.1. Multiple hypothesis problem
2.2.2. Test statistics and decision rules
2.3. Methods for exponential families
2.3.1. Generic scheme
2.3.2. Case of a known parameter before change
2.3.3. Case of unknown parameters before and after change
2.3.4. Generic scheme revisited through convex duality
2.3.5. Case of unknown parameters and maximum likelihood
2.4. Discussion
3. Non-Negative Matrix Factorization with Convex-Concave Divergences
3.1. Context
3.1.1. Background
3.1.2. Motivations
3.1.3. Contributions
3.2. Optimization framework
3.2.1. Cost function minimization problem
3.2.2. Variational bounding and auxiliary functions
3.3. Methods for convex-concave divergences
3.3.1. Generic updates
3.3.2. Case of Csiszár divergences
3.3.3. Case of skew Jeffreys-Bregman divergences
3.3.4. Case of skew Jensen-Bregman divergences
3.3.5. Case of skew (, , )-divergences
3.4. Discussion
II. Real-Time Applications in Audio Signal Processing
4. Real-Time Audio Segmentation
4.1. Context
4.1.1. Background
4.1.2. Motivations
4.1.3. Contributions
4.2. Proposed approach
4.2.1. System architecture
4.2.2. Change detection
4.3. Experimental results
4.3.1. Segmentation into silence and activity
4.3.2. Segmentation into music and speech
4.3.3. Segmentation into different speakers
4.3.4. Segmentation into polyphonic note slices
4.3.5. Evaluation on musical onset detection
4.4. Discussion
5. Real-Time Polyphonic Music Transcription
5.1. Context
5.1.1. Background
5.1.2. Motivations
5.1.3. Contributions
5.2. Proposed approach
5.2.1. System architecture
5.2.2. Non-negative decomposition
5.3. Experimental results
5.3.1. Sample example of piano music
5.3.2. Evaluation on multiple fundamental frequency estimation
5.3.3. Evaluation on multiple fundamental frequency tracking
5.4. Discussion
Conclusion
Summary of the present work
From sequential change detection to audio segmentation
From non-negative decomposition to polyphonic music transcription
Perspectives for future work
Change detection and audio segmentation
Non-negative matrix factorization and music transcription
General directions of investigation
Bibliography