From GMM to Concentration through GANs
The starting point for obtaining the results of the previous section, and more fundamentally for analyzing the behavior of ML algorithms, is the design of so-called deterministic equivalents, which essentially encode the behavior of large random matrices. We will recall in the next chapter fundamental random matrix theory results, along with some applications to simple machine learning models such as spectral clustering with the Gram matrix and classification with a linear classifier.
So far, the assumption considered on the data to design such deterministic equivalents is a GMM model (see Definition 1), as developed by Benaych and Couillet [BGC16], whose main result we recalled in Theorem 2.5. However, real data (e.g., images) are unlikely to be close to simple Gaussian vectors, and one therefore needs a more realistic model to describe them. Following the well-known quote of R. Feynman, "What I cannot create, I do not understand", it is fundamentally important to be able to create real data in order to fully understand their nature. To this end, generative models are of particular interest. In particular, since the advent of Generative Adversarial Nets [GPAM+14], it is now possible to create neural network models that can generate complex data structures. Some examples of images artificially generated by the BigGAN model [BDS18] are depicted in Figure 1.3. In particular, GAN-generated data are constructed by applying successive Lipschitz transformations to high-dimensional Gaussian random vectors. Schematically,

Real data ≈ GAN data = F_L ∘ ⋯ ∘ F_1(z), with z ∼ N(0, I_d),

where the F_i's are essentially Lipschitz transformations.
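This composition can be mimicked with a toy generator: a stack of random affine-plus-ReLU layers, each rescaled to be 1-Lipschitz via its spectral norm. This is only a minimal sketch (the width 64, the depth 4 and the spectral normalization are illustrative assumptions, not an actual GAN architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def lipschitz_layer(d_out, d_in, rng):
    """A random affine + ReLU layer, rescaled so its Lipschitz constant is at most 1."""
    W = rng.standard_normal((d_out, d_in))
    W /= np.linalg.norm(W, 2)                    # spectral norm 1 => affine part is 1-Lipschitz
    b = rng.standard_normal(d_out)
    return lambda x: np.maximum(W @ x + b, 0.0)  # ReLU is itself 1-Lipschitz

# "Generator": the composition F_L ∘ ... ∘ F_1 of Lipschitz maps
layers = [lipschitz_layer(64, 64, rng) for _ in range(4)]

def generator(z):
    for F in layers:
        z = F(z)
    return z

x = generator(rng.standard_normal(64))           # z ~ N(0, I_d) -> "concentrated" output
```

Since each layer is 1-Lipschitz, so is their composition, which is precisely what makes the output a concentrated vector.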
On the other hand, the fundamental concentration of measure phenomenon [Led05a] states that Lipschitz transformations of Gaussian vectors are concentrated vectors, i.e., they satisfy the following concentration property: a random vector x ∈ E is said to be concentrated if, for any 1-Lipschitz function f : E → ℝ, for q > 0 there exist C, σ > 0 such that

∀t > 0, P{ |f(x) − E f(x)| ≥ t } ≤ C e^{−(t/σ)^q}.
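This dimension-free behavior is easy to observe numerically. The quick sketch below (sample sizes are arbitrary) estimates the fluctuations of the 1-Lipschitz observable f(x) = ‖x‖ for x ∼ N(0, I_d): the standard deviation stays of order 1 for every d, even though E‖x‖ grows like √d:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = ||x|| is 1-Lipschitz; for x ~ N(0, I_d) its fluctuations around
# its mean do not grow with d (dimension-free observable diameter).
stds = {}
for d in (10, 100, 2000):
    samples = np.linalg.norm(rng.standard_normal((5000, d)), axis=1)
    stds[d] = samples.std()
    print(f"d={d:5d}  E||x|| ~ {samples.mean():7.2f}  std ~ {stds[d]:.3f}")
```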
Consequently, GAN data are concentrated random vectors by construction, and this class of random vectors therefore constitutes a more appropriate and convincing statistical model for realistic data than simple Gaussian vectors. The main objective of this thesis is to exploit the framework of concentrated vectors developed by Louart and Couillet [LC18b], relying on the earlier works of El Karoui [EK+10b], in order to analyze the behavior of various ML methods; this is particularly motivated by the fact that GAN data belong to this class of random vectors. In the following section, we provide some key results from this thesis, which mainly concern the study of large Gram and kernel matrices under the concentration assumption (see Subsections 1.3.1-1.3.2), and further applications related to neural networks through the study of the Softmax layer in Subsection 1.3.3.
Some ML methods under Concentration
In this section we summarize the main findings of this thesis, which essentially describe the behavior of standard ML methods under the concentration assumption on data. Precisely, we will assume throughout the following subsections that the data are distributed as a Mixture of Concentrated Vectors, as per the following definition.
Definition 2 (Mixture of Concentrated Vectors (MCV) [LC18b]). Let X = [x₁, …, xₙ] ∈ M_{p,n} be a data matrix consisting of n random vectors distributed in k different classes C₁, …, C_k, such that the data classes are characterized by their moments: for xᵢ ∈ C_ℓ,

E[xᵢ] = m_ℓ,  E[xᵢ xᵢ⊤] = Σ_ℓ + m_ℓ m_ℓ⊤.
In particular, the data matrix X satisfies a concentration assumption in the sense that, for any 1-Lipschitz function f : M_{p,n} → ℝ, with M_{p,n} endowed with the Frobenius norm ‖·‖_F, for q > 0 there exist C, σ > 0⁶ independent of p and n such that

∀t > 0, P{ |f(X) − E f(X)| ≥ t } ≤ C e^{−(t/σ)^q}.
Remark 1.5 (On the concentration of GAN random vectors). We will see in Chapter 3 that GAN random vectors notably have the same concentration as standard Gaussian vectors, which follows from the fact that GAN generator networks have controlled Lipschitz norms; i.e., for Gaussian N(0, I_d) inputs (as commonly considered), whose observable diameter does not depend on the dimension d, the observable diameter of the GAN's outputs does not increase with the data dimension. Moreover, the concentration of the data matrix X implies the concentration of its column vectors xᵢ, since X ↦ X jᵢ = xᵢ is a 1-Lipschitz transformation, where jᵢ ∈ ℝⁿ is the canonical basis vector defined by (jᵢ)_j = δ_{ij}.
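For illustration, here is a toy sampler for a two-class mixture of concentrated vectors: each column is a Lipschitz transform (entrywise tanh, an arbitrary choice) of a Gaussian vector, shifted by a class-dependent mean. The dimensions and means are illustrative assumptions, not tied to any result in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 1000

# Hypothetical class-dependent means m_1, m_2
means = [np.zeros(p), np.full(p, 2.0 / np.sqrt(p))]
labels = rng.integers(0, 2, size=n)

def g(z):
    return np.tanh(z)      # entrywise tanh is 1-Lipschitz

# Each column x_i = m_ell + g(z_i) is a concentrated vector with
# class-dependent first and second moments, as in Definition 2.
X = np.column_stack([means[k] + g(rng.standard_normal(p)) for k in labels])
```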
In addition to the concentration assumption on data, we will further assume the classical random matrix theory regime, as per the following assumption.
Assumption 1 (Growth rate). As p → ∞, assume
1. p/n → c ∈ (0, ∞); |C_ℓ|/n → c_ℓ ∈ (0, 1).
2. The number of classes k is bounded.
⁶ σ is called the observable diameter.
Behavior of Gram Matrices
The first contribution of this thesis concerns the analysis of the behavior of the Gram matrix, defined as G = (1/p) X⊤X, under the concentration assumption in Definition 2 and the high-dimensional regime in Assumption 1. This contribution is presented in more detail in Chapter 3 and is based on the following works:
(C1) MEA. Seddik, C. Louart, M. Tamaazousti, R. Couillet, "Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures", International Conference on Machine Learning (ICML'20), Online, 2020.
(C1') MEA. Seddik, M. Tamaazousti, R. Couillet, "Pourquoi les matrices aléatoires expliquent l'apprentissage ? Un argument d'universalité offert par les GANs", Colloque francophone de traitement du signal et des images (Gretsi'19), Lille, France, 2019.
As will be recalled in Chapter 2, the spectral behavior of G can be analyzed through its resolvent
R(z) = (G + z Iₙ)⁻¹.
The main result of this contribution is to provide a deterministic equivalent (see Definition 6) for R(z), which is given by the following theorem.
Theorem 1.1, along with the deterministic equivalent of the sample covariance matrix developed in [LC18b], notably generalizes the result of [BGC16] to the class of concentrated vectors. A remarkable outcome of this result is that the behavior of the Gram matrix G depends strictly on the first and second statistical moments {m_ℓ}_{ℓ=1}^k and {Σ_ℓ}_{ℓ=1}^k of the data, thereby implying a universal behavior of ML methods that rely on the Gram matrix, regardless of the data distribution. This universal behavior has notably been verified (see Chapter 3) using GAN-generated images, as they satisfy the concentration assumption by design.
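This universality can be checked numerically. In the sketch below (sizes arbitrary), two zero-mean, identity-covariance datasets with very different entry distributions (Gaussian vs. Rademacher) produce nearly identical Gram spectra:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 1000

# Same first two moments (zero mean, identity covariance), different laws:
X_gauss = rng.standard_normal((p, n))
X_rad = rng.choice([-1.0, 1.0], size=(p, n))    # Rademacher entries

eig_g = np.linalg.eigvalsh(X_gauss.T @ X_gauss / p)
eig_r = np.linalg.eigvalsh(X_rad.T @ X_rad / p)
# In high dimension the two empirical spectra essentially coincide,
# as only the first two moments matter.
```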
Behavior of Kernel Matrices
The second contribution of this thesis concerns the analysis of large kernel matrices under the concentration hypothesis. The results of this part are presented in Section 4.1 and are particularly based on the following work:
(C2) MEA. Seddik, M. Tamaazousti, R. Couillet, "Kernel Random Matrices of Large Concentrated Data: The Example of GAN-generated Images", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'19), Brighton, United Kingdom, 2019.
Specifically, we have analyzed in this work the behavior of large kernel matrices of the form

K = { f( ‖xᵢ − xⱼ‖² / p ) }_{i,j=1}^n,

where f : ℝ₊ → ℝ₊ is supposed to be three times continuously differentiable in the vicinity of τ = (2/p) tr( ∑_{ℓ=1}^k c_ℓ Σ_ℓ ), and the data matrix X = [x₁, …, xₙ] ∈ M_{p,n} is assumed to be concentrated in the sense of Definition 2. Under these assumptions, and further the non-trivial growth rate assumptions from Subsection 1.1.2, our first result states that between- and within-class data are "equidistant" in high dimension, independently of the classes: specifically, for some δ > 0, with probability at least 1 − δ, all the squared distances ‖xᵢ − xⱼ‖²/p (i ≠ j) concentrate around τ.
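This "equidistance" is easily observed numerically. In the sketch below (a single class with Σ = I_p, so that τ = 2; sizes are arbitrary), all pairwise squared distances ‖xᵢ − xⱼ‖²/p cluster tightly around τ:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 200

X = rng.standard_normal((p, n))                            # one class, Sigma = I_p
norms = (X ** 2).sum(axis=0) / p
D = norms[:, None] + norms[None, :] - 2 * (X.T @ X) / p    # D[i,j] = ||x_i - x_j||^2 / p

tau = 2.0                                                  # (2/p) tr(Sigma) for Sigma = I_p
off = D[~np.eye(n, dtype=bool)]                            # off-diagonal entries
print(off.min(), off.max())                                # both close to tau = 2
```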
Following the same approach as [EK+10b, CBG+16], the kernel matrix K can therefore be Taylor-expanded entry-wise, leading to an approximation of it, in spectral norm, by a spiked random matrix model of the form

K = P_f + f′(τ) Z⊤Z + f″(τ) W + o_p(1),
where P_f is a low-rank informative matrix that contains information about the classes through the data statistics {m_ℓ}_{ℓ=1}^k and {Σ_ℓ}_{ℓ=1}^k, and also depends on the kernel function f through its first and second derivatives at τ. The remaining matrix terms are non-informative full-rank noise matrices, which exhibit a spreading of eigenvalues in the spectrum of K; in particular, the behavior of the term f′(τ) Z⊤Z is described by the deterministic equivalent from the previous subsection. See Section 4.1 for more details. As for the Gram matrix, K exhibits a universal behavior, since its random matrix equivalent depends strictly on the class statistics {m_ℓ}_{ℓ=1}^k and {Σ_ℓ}_{ℓ=1}^k.
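The entry-wise Taylor expansion behind this spiked model can be sketched directly: with a smooth kernel function (here f(t) = e^{−t/2}, an illustrative choice) and single-class data, the first-order expansion of K around τ already matches K up to a small, vanishing error:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 200

X = rng.standard_normal((p, n))                            # one class, Sigma = I_p, tau = 2
norms = (X ** 2).sum(axis=0) / p
D = norms[:, None] + norms[None, :] - 2 * (X.T @ X) / p    # squared distances / p

f = lambda t: np.exp(-t / 2)                               # illustrative smooth kernel
tau = 2.0
K = f(D)

# First-order entry-wise expansion around tau, as in the spiked approximation:
K_lin = f(tau) + (-0.5 * f(tau)) * (D - tau)               # f'(t) = -f(t)/2 for this f
err = np.abs(K - K_lin)[~np.eye(n, dtype=bool)].max()      # second-order remainder
```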
Beyond Kernels to Neural Networks
Kernel matrices appear naturally as a backbone of kernel methods. For instance, it has been shown in [Lia19] that the MSE of random neural networks, the misclassification error rate of kernel ridge regression, and the performance of kernel/random-feature-based spectral clustering methods all depend explicitly on the eigenspectrum, or on a certain functional, of a particular random kernel/nonlinear Gram matrix. In this subsection, the aim is to go beyond kernel methods in order to analyze ML methods that have an implicit relationship with the input data, i.e., ML methods that are implicitly determined by (convex) optimization problems. In particular, relying on [MLC19], which studies the behavior of logistic regression in the RMT regime under Gaussian assumptions on the data, we push this study forward by assuming a k-class concentration model (see Definition 2) and therefore considering the more general Softmax classifier. Specifically, this subsection briefly presents our findings concerning the analysis of the Softmax classifier under the concentration hypothesis and the high-dimensional regime. This contribution is detailed in Section 5.1 and is based on the following work:

(C4) MEA. Seddik, C. Louart, R. Couillet, M. Tamaazousti, "The Unexpected Deterministic and Universal Behavior of Large Softmax Classifiers", AISTATS 2021.
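As a minimal sketch of the setting (dimensions, class means, learning rate and iteration count are all illustrative choices, not the thesis's exact model), a Softmax classifier is a multinomial logistic regression trained on the cross-entropy loss:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 100, 500, 3

# Hypothetical k-class Gaussian mixture (rows of X are the samples)
means = 4.0 / np.sqrt(p) * rng.standard_normal((k, p))
y = rng.integers(0, k, size=n)
X = means[y] + rng.standard_normal((n, p))
Y = np.eye(k)[y]                                  # one-hot labels

# Softmax classifier trained by plain gradient descent on cross-entropy
W = np.zeros((p, k))
for _ in range(500):
    logits = X @ W
    P = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    P /= P.sum(axis=1, keepdims=True)
    W -= 0.1 / n * X.T @ (P - Y)                  # gradient step

accuracy = ((X @ W).argmax(axis=1) == y).mean()
```

The object analyzed in Section 5.1 is precisely the (implicitly defined) weight matrix W at the optimum of this convex problem.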
Table of contents:
1.1 Illustrative High-dimensional Examples
1.1.1 Large Sample Covariance Matrices
1.1.2 Large Kernel Matrices
1.2 From GMM to Concentration through GANs
1.3 Some ML methods under Concentration
1.3.1 Behavior of Gram Matrices
1.3.2 Behavior of Kernel Matrices
1.3.3 Beyond Kernels to Neural Networks
1.3.4 Summary of Section 1.3
1.4 Outline and Contributions
2 Random Matrix Theory & Concentration of Measure Theory
2.1 Fundamental Random Matrix Theory Notions
2.1.1 The resolvent matrix notion
2.1.2 Stieltjes transform and spectral measure
2.1.3 Cauchy’s integral and statistical inference
2.1.4 Deterministic and random equivalents
2.2 Fundamental Random Matrix Theory Results
2.2.1 Matrix identities and key lemmas
2.2.2 The Marčenko-Pastur Law
2.2.3 Random matrices of mixture models
2.3 Connections with machine learning through spiked models
2.3.1 Simple example of unsupervised learning
2.3.2 Simple example of supervised learning
2.4 Extensions with concentration of measure theory
2.4.1 The notion of concentrated vectors
2.4.2 Resolvent of the sample covariance matrix
3 Universality of Large Random Matrices
3.1 GAN Data are Concentrated Data
3.2 Random Gram Matrices of Concentrated Data
3.2.2 Model and Main Results
3.2.2.1 Mixture of Concentrated Vectors
3.2.2.2 Behavior of the Gram matrix of concentrated vectors
3.2.2.3 Application to GAN-generated Images
3.2.3 Central Contribution
4 Random Kernel Matrices of Concentrated Data
4.1 Kernel Spectral Clustering
4.1.2 Model and Main Results
4.1.2.1 Behavior of Large Kernel Matrices
4.1.2.2 Application to GAN-generated Images
4.1.3 Central Contribution and perspectives
4.2 Sparse Principal Component Analysis
4.2.2 Model and Main Results
4.2.2.1 Random Matrix Equivalent
4.2.2.2 Application to sparse PCA
4.2.2.3 Experimental Validation
4.2.3 Central Contribution and Perspectives
5 Beyond Kernel Matrices, to Neural Networks
5.1 A random matrix analysis of Softmax layers
5.1.2 Model setting: the Softmax classifier
5.1.3 Assumptions & Main Results
5.1.3.1 Concentration of the weights vector of the Softmax classifier
5.1.3.2 Experimental validation
5.1.4 Central Contribution
5.2 A random matrix analysis of Dropout layers
5.2.2 Model and Main Results
5.2.2.1 Deterministic equivalent
5.2.2.2 Generalization Performance of a-Dropout
5.2.2.3 Training Performance of a-Dropout
5.2.4 Central Contribution and Perspectives
6 Conclusions & Perspectives
6.2 Limitations and Perspectives
A Synthèse de la thèse en Français
B Practical Contributions
B.1 Generative Collaborative Networks for Super-resolution
B.1.2 Proposed Methods
B.1.2.1 Proposed Framework
B.1.2.2 Existing Loss Functions
B.1.3 Experiments for Single Image Super-Resolution
B.1.3.1 Proposed Methods
B.1.3.2 Evaluation Metrics
B.1.4 Central Contribution and Discussions
B.2 Neural Networks Compression
B.2.2 Proposed Methods
B.2.2.1 Setting & Notations
B.2.2.2 Neural Nets PCA-based Distillation (Net-PCAD)
B.2.2.3 Neural Nets LDA-based Distillation (Net-LDAD)
B.2.4 Central Contribution and Discussions
C.1 Proofs of Chapter 3
C.1.1 Setting of the proof
C.1.2 Basic tools
C.1.3 Main body of the proof
C.2 Proofs of Chapter 4
C.2.1 Proofs of Section 4.1
C.2.2 Proofs of Section 4.2
C.2.2.1 Proof of Theorem 4.3
C.2.2.2 Proof of Theorem 4.4
C.3 Proofs of Chapter 5
C.3.1 Proofs of Section 5.1
C.3.2 Proofs of Section 5.2
C.3.2.1 Proof of Theorem 5.5
C.3.2.2 Proof of Theorem 5.6