Subject and Contributions of this Thesis
In this thesis, we follow the above-mentioned lines of work by investigating several aspects of temporality in deep unsupervised representation learning and generative modeling. Our contributions tackle novel applications and issues, mainly articulated around the role of dynamical systems in the improvement and understanding of unsupervised deep learning methods.
We more specifically consider three main research directions:
• general-purpose scalable representation learning for time series;
• the combination of generative modeling and representation learning within a dynamical system framework for the forecasting of complex spatiotemporal data such as natural videos and physical phenomena;
• the theoretical and empirical study of the training dynamics of a standard generative model through the lens of differential equations.
We introduce and summarize our contributions in the following, and then detail the organization of this document.
General-Purpose Unsupervised Representation Learning for Time Series
While supervised learning and forecasting for time series are active and prolific research directions, unsupervised representation learning for this data type remains an under-explored problem. Yet, developing general-purpose unsupervised methods for time series is important given the scarcity of human-labeled data and the cost of acquiring it in most applications, in particular those involving industrial and real-life data. The challenging and noisy nature of real-life time series also makes it preferable for representation learning methods to apply to series of unequal and large lengths. Still, existing unsupervised representation learning methods remain limited with respect to these considerations, and lack strong and thorough experimental evaluation.
Tackling these issues, we propose in this thesis a general-purpose, scalable unsupervised representation learning method that can handle time series of unequal and large lengths. To this end, we train a deep neural network encoder that outputs a fixed-length representation regardless of the size of the input time series, thanks to a novel triplet loss relying on time-based positive and negative sampling. The efficiency and flexibility of the chosen encoder, based on dilated convolutions, coupled with the triplet loss requiring no decoder, ensure the generality and scalability of the proposed method. We then assess the quality of the learned representations on downstream tasks on standard datasets, thereby showing their transferability and general applicability across different data domains and tasks.
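To make the idea of time-based sampling more concrete, the following is a minimal sketch and not the implementation used in this thesis. It assumes that the positive example is a subseries of the reference subseries, that negatives are subseries of randomly chosen series, and that the loss takes a logistic (word2vec-style) form. The names `toy_encoder`, `sample_triplet` and `triplet_loss` are hypothetical, and `toy_encoder` is a hand-crafted stand-in for the dilated convolutional encoder.

```python
import numpy as np

def toy_encoder(x):
    """Hypothetical stand-in encoder: maps a series of any length to 3 features."""
    return np.array([x.mean(), x.std(), x.max() - x.min()])

def sample_triplet(series_list, rng, n_neg=2):
    """Assumed time-based sampling: the positive lies inside the reference,
    the negatives are subseries of randomly chosen series."""
    y = series_list[rng.integers(len(series_list))]
    # Reference: random subseries of y (length at least 2).
    ref_len = rng.integers(2, len(y) + 1)
    ref_start = rng.integers(0, len(y) - ref_len + 1)
    ref = y[ref_start:ref_start + ref_len]
    # Positive: random subseries of the reference.
    pos_len = rng.integers(1, ref_len + 1)
    pos_start = rng.integers(0, ref_len - pos_len + 1)
    pos = ref[pos_start:pos_start + pos_len]
    # Negatives: random subseries of randomly chosen series.
    negs = []
    for _ in range(n_neg):
        z = series_list[rng.integers(len(series_list))]
        neg_len = rng.integers(1, len(z) + 1)
        neg_start = rng.integers(0, len(z) - neg_len + 1)
        negs.append(z[neg_start:neg_start + neg_len])
    return ref, pos, negs

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    return -np.logaddexp(0.0, -t)

def triplet_loss(encoder, ref, pos, negs):
    """Assumed logistic triplet loss: pull the positive towards the reference,
    push the negatives away. No decoder is involved."""
    r = encoder(ref)
    loss = -log_sigmoid(r @ encoder(pos))
    for neg in negs:
        loss -= log_sigmoid(-(r @ encoder(neg)))
    return float(loss)
```

Because the encoder returns a vector of fixed size whatever the input length, the same loss applies unchanged to series of unequal lengths.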
This contribution, detailed in Part II of this document, led to the following publication in an international conference.
Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi (2019). “Unsupervised Scalable Representation Learning for Multivariate Time Series”. In: Advances in Neural Information Processing Systems. Ed. by Hanna Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Roman Garnett. Vol. 32. Curran Associates, Inc., pp. 4650–4661.
Dynamical Systems and Representation Learning for Complex Spatiotemporal Data
Moving on from general-purpose representation learning, we then study more specifically representation learning for complex structured spatiotemporal data. Such data arise in large-scale applications involving the observation of moving human subjects, objects and physical entities; typical examples include videos and physical phenomena. They still constitute a challenge for neural networks due to their complexity and the high computational power needed to handle them, thus motivating advances that could improve existing models with reasonable resource requirements.
We consider two main types of data: videos and physical phenomena. Videos have numerous applications with respect to autonomous systems, including robotics and self-driving cars. They necessitate predictive models which should generate realistic images and take into account the inherent stochasticity of the observed phenomena.
The applicability of these models highly depends on their representation learning abilities. Indeed, the latter are essential for downstream tasks like planning and action recognition, allowing autonomous systems to benefit from small-scale representations of the environment. The prediction of physical phenomena, possibly less random but more chaotic, is a recent application field of deep learning that still struggles to achieve results equivalent to more classical prediction methods relying on physical models. Representation learning is especially interesting for this application, as it helps to understand the prediction mechanisms of data-driven approaches for partially observable data.
Therefore, in this thesis, we explore representation learning for this type of sequence via generative modeling and forecasting. For both considered applications, we design temporal generative prediction models for spatiotemporal data relying on learning meaningful and disentangled representations. We show that the long-term predictive performance and representation learning abilities of these models mutually benefit from each other. An influential modeling choice in this regard is the inspiration from dynamical systems for the design of the proposed temporal evolution models. More precisely, our models are based on discretizations of differential equations parameterized by neural networks, which we show to be particularly adapted to the learning of continuous-time dynamics typically involved in videos and physical phenomena.
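As an illustration of what discretizing a differential equation parameterized by a neural network can look like, here is a minimal sketch under stated assumptions, not the models of this thesis. It uses an explicit Euler scheme, in which a residual update y + dt * f(y) is one integration step of dy/dt = f(y). The helper names `make_mlp_dynamics` and `euler_rollout` are hypothetical, and the small randomly initialized network stands in for a trained one.

```python
import numpy as np

def make_mlp_dynamics(dim, hidden, rng):
    """Hypothetical one-hidden-layer network f defining dy/dt = f(y)."""
    W1 = rng.normal(scale=0.5, size=(hidden, dim))
    W2 = rng.normal(scale=0.5, size=(dim, hidden))

    def f(y):
        return W2 @ np.tanh(W1 @ y)

    return f

def euler_rollout(f, y0, n_steps, dt=1.0):
    """Explicit Euler discretization: each step is the residual update
    y_next = y + dt * f(y). Returns an array of shape (n_steps + 1, dim)."""
    ys = [np.asarray(y0, dtype=float)]
    for _ in range(n_steps):
        ys.append(ys[-1] + dt * f(ys[-1]))
    return np.stack(ys)

# Usage: roll a latent state forward with a coarse and a fine step size.
rng = np.random.default_rng(0)
f = make_mlp_dynamics(dim=4, hidden=16, rng=rng)
coarse = euler_rollout(f, np.ones(4), n_steps=10, dt=1.0)
fine = euler_rollout(f, np.ones(4), n_steps=20, dt=0.5)
```

With dt = 1 the update reduces to a standard residual connection. Choosing a smaller dt samples the same underlying continuous-time dynamics more finely, which is one way to read the link between residual models and continuous-time modeling.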
These contributions, developed in Part III of this document, were presented in the following two international conference publications.
Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari (July 2020). “Stochastic Latent Residual Video Prediction”. In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, pp. 3233–3246.
Jérémie Donà, Jean-Yves Franceschi, Sylvain Lamprier, and Patrick Gallinari (2021). “PDE-Driven Spatiotemporal Disentanglement”. In: International Conference on Learning Representations.
Study of Generative Adversarial Networks via their Training Dynamics
After highlighting the valuable role of dynamical systems for deep generative predictive models, we then leverage differential equations within a novel theoretical framework to analyze and explain the training dynamics of a popular but still poorly understood generative model: Generative Adversarial Networks (GANs).
We point out a fundamental flaw in previous theoretical analyses of GANs that leads to ill-defined gradients for the discriminator. Indeed, within these frameworks that neglect its architectural parameterization as a neural network, the discriminator is insufficiently constrained to ensure the existence of its gradients. This oversight raises important modeling issues as it makes these analyses incompatible with standard GAN practice using gradient-based optimization. We overcome this problem, which impedes a principled study of GAN training, by taking into account the discriminator’s architecture and training within our framework.
To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel (NTK) in order to model its inductive biases as a neural network. We thereby characterize the trained discriminator for a wide range of losses by expressing its training dynamics with a differential equation. From this, we establish general differentiability properties of the network that are necessary for a sound theoretical framework of GANs, making ours closer to GAN practice than previous analyses.
Thanks to this adequacy with practice, we gain new theoretical and empirical insights about the generated distribution’s flow during training, advancing our understanding of GAN dynamics. For example, we find that, under the integral probability metric loss, the generated distribution minimizes the maximum mean discrepancy given by the discriminator’s NTK with respect to the target distribution. We empirically corroborate these results via a publicly released analysis toolkit based on our framework, unveiling intuitions that are consistent with current GAN practice and opening new perspectives for better and more principled GAN models.
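For readers unfamiliar with the maximum mean discrepancy (MMD), the following minimal sketch computes the standard biased sample estimate of the squared MMD between two sets of points for a given kernel. It is an illustration only: a Gaussian (RBF) kernel is used here as a simple stand-in, whereas the result above concerns the discriminator's NTK. The function names `rbf_kernel` and `mmd2` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b.
    Assumption: used in place of an NTK purely for illustration."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(x, y, kernel=rbf_kernel):
    """Biased estimate of the squared MMD:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())

# Usage: the estimate vanishes for identical samples and grows as they separate.
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 2))
same = mmd2(target, target)
shifted = mmd2(target + 3.0, target)
```

In this reading, generated samples that move so as to decrease such a quantity get closer to the target samples in the geometry induced by the kernel.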
This contribution, explained in Part IV of this document, corresponds to the following preprint and is currently under review at an international conference.
Jean-Yves Franceschi, Emmanuel de Bézenac, Ibrahim Ayed, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari (2021). “A Neural Tangent Kernel Perspective of GANs”. In: arXiv: 2106.05566.
Outline of this Thesis
This document is organized as follows. Chapter 2 explains the state of the literature and the necessary background for the exposition of our contributions. The latter are presented in Chapters 3 to 6: Chapter 3 describes the proposed unsupervised representation learning method for time series, Chapters 4 and 5 respectively introduce our video prediction model and our predictive spatiotemporal disentangling method, and Chapter 6 details our analysis of GAN training dynamics. Finally, Part V, with Chapters 7 and 8, concludes this document with a discussion of the perspectives offered by the exposed contributions. Supplementary material for Chapters 3 to 6 is given in Appendices A to D.
Table of contents:
1.2. Subject and Contributions of this Thesis
1.2.1. General-Purpose Unsupervised Representation Learning for Time Series
1.2.2. Dynamical Systems and Representation Learning for Complex Spatiotemporal Data
1.2.3. Study of Generative Adversarial Networks via their Training Dynamics
1.2.4. Outline of this Thesis
2. Background and Related Work
2.1. Neural Architecture for Sequence Modeling
2.1.1. Recurrent Neural Networks
2.1.2. Neural Differential Equations
ODEs and PDEs
Differential Equations and Neural Networks
ODEs and Neural Network Optimization
Convolutional Neural Networks
2.2. Unsupervised Representation Learning for Temporal Data
2.2.1. Contrastive Learning
2.2.2. Learning from Autoencoding and Prediction
Learning Methods
Disentangled Representations
2.3. Deep Generative Modeling
2.3.1. Families of Deep Generative Models
Variational Autoencoders
Generative Adversarial Networks
Other Categories
2.3.2. Sequential Deep Generative Models
Temporally Aware Training Objectives
Stochastic and Deterministic Models for Sequence-to-Sequence Tasks
Latent Generative Temporal Structure
II. Time Series Representation Learning
3. Unsupervised Scalable Representation Learning for Time Series
3.2. Related Work
3.3. Unsupervised Training
3.4. Encoder Architecture
3.5. Experimental Results
Univariate Time Series
Multivariate Time Series
3.5.2. Evaluation on Long Time Series
3.6.1. Behavior of the Learned Representations Throughout Training
3.6.2. Influence of K
3.6.3. Discussion of the Choice of Encoder
III. State-Space Predictive Models for Spatiotemporal Data
4. Stochastic Latent Residual Video Prediction
4.2. Related Work
4.3.1. Latent Residual Dynamic Model
4.3.2. Content Variable
4.3.3. Variational Inference and Architecture
4.4.1. Evaluation and Comparisons
4.4.2. Datasets and Prediction Results
Stochastic Moving MNIST
KTH Action Dataset
BAIR Robot Pushing Dataset
4.4.3. Illustration of Residual, State-Space and Latent Properties
Generation at Varying Frame Rate
Disentangling Dynamics and Content
Interpolation of Dynamics
Autoregressivity and Impact of the Encoder and Decoder Architecture
5. PDE-Driven Spatiotemporal Disentanglement
5.2. Background: Separation of Variables
5.2.1. Simple Case Study
5.2.2. Functional Separation of Variables
5.3. Proposed Method
5.3.1. Problem Formulation Through Separation of Variables
5.3.2. Fundamental Limits and Relaxation
5.3.3. Temporal ODEs
5.3.4. Spatiotemporal Disentanglement
5.3.5. Loss Function
5.3.6. Discussion of Differences with Chapter 4’s Model
5.4.1. Physical Datasets: Wave Equation and Sea Surface Temperature
5.4.2. A Synthetic Video Dataset: Moving MNIST
5.4.3. A Multi-View Dataset: 3D Warehouse Chairs
5.4.4. A Crowd Flow Dataset: TaxiBJ
IV. Analysis of GANs’ Training Dynamics
6. A Neural Tangent Kernel Perspective of GANs
6.2. Related Work
6.3. Limits of Previous Studies
6.3.1. Generative Adversarial Networks
6.3.2. On the Necessity of Modeling Discriminator Parameterization
6.4. NTK Analysis of GANs
6.4.1. Modeling Inductive Biases of the Discriminator in the Infinite-Width Limit
6.4.2. Existence, Uniqueness and Characterization of the Discriminator
6.4.3. Differentiability of the Discriminator and its NTK
6.4.4. Dynamics of the Generated Distribution
6.5. Fine-Grained Study for Specific Losses
6.5.1. The IPM as an NTK MMD Minimizer
6.5.2. LSGAN, Convergence, and Emergence of New Divergences
6.6. Empirical Study with GAN(TK)²
6.6.1. Adequacy for Fixed Distributions
6.6.2. Convergence of Generated Distribution
6.6.3. Visualizing the Gradient Field Induced by the Discriminator
Qualitative Analysis of the Gradient Field
6.7. Conclusion and Discussion
7. Overview of our Work
7.1. Summary of Contributions
7.4. Other Works
8.1. Unfinished Projects
8.1.1. Adaptive Stochasticity for Video Prediction
8.1.2. GAN Improvements via the GAN(TK)² Framework
New Discriminator Architectures
New NTK-Based GAN Model
8.2. Future Directions
8.2.1. Temporal Data and Text
8.2.2. Spatiotemporal Prediction
Merging the Video and PDE-Based Models
Scaling Models
Relaxing the Constancy of the Content Variable
8.2.3. NTKs for the Analysis of Generative Models
Analysis of GANs’ Generators
Analysis of Other Models