Machine learning with neural networks and mean-field approximations


This chapter provides the fundamental concepts in machine learning that will be used throughout this thesis. A comprehensive reference is [49]. We also take this chapter as an opportunity to introduce the current challenges in the understanding of deep learning. With these first elements of context, we briefly review the historical connections between mean-field methods coming from statistical physics and neural networks used for machine learning.

Neural networks for machine learning

Machine learning is traditionally divided into three classes of problems: supervised, unsupervised and reinforcement learning. For all of them, the advent of deep learning techniques, relying on deep neural networks, has brought great leaps forward in terms of performance and opened the way to new applications. Nevertheless, the remarkably efficient machinery of these algorithms remains full of theoretical puzzles. In this section we quickly cover key concepts in supervised and unsupervised machine learning that will be useful to understand the context of this thesis and to motivate its contributions.

Supervised learning

Learning under supervision
Supervised learning aims at discovering systematic input-to-output mappings from examples. Classification is a typical supervised learning problem: from a set of pictures of cats and dogs labelled accordingly, the goal is to find a function able to predict, for any new picture, the species of the displayed pet.
In practice, the training set is a collection of P example pairs D = {x^{(k)}, y^{(k)}}_{k=1}^P from an input data space X ⊆ R^N and an output data space Y ⊆ R^M. Formally, they are assumed to be i.i.d. samples from a joint distribution p(x, y). The predictor h is chosen by a training algorithm from a hypothesis class, a set of functions from X to Y, so as to minimize the error on the training set. This error is formalized as the empirical risk

\hat{R}(h) = \frac{1}{P} \sum_{k=1}^{P} \ell\left(y^{(k)}, h(x^{(k)})\right),

where \ell is a loss function measuring the discrepancy between the prediction h(x^{(k)}) and the true output y^{(k)}.
Neural network training
Given an architecture defining h_\theta, the supervised learning objective is to minimize the empirical risk \hat{R} with respect to the parameters \theta. This optimization problem lives in the space of parameters, whose number can range from tens to millions. The idea underlying the majority of training algorithms is to perform a gradient descent (GD) from a random initialization of the parameters:
\theta_0 \sim p_{\theta_0}(\theta_0),   (1.10)
\theta_{t+1} \leftarrow \theta_t - \eta \nabla_\theta \hat{R} = \theta_t - \frac{\eta}{P} \sum_{k=1}^{P} \nabla_\theta \ell\left(y^{(k)}, h_{\theta_t}(x^{(k)})\right).   (1.11)
The parameter η is the learning rate, controlling the size of the step taken in the direction of the negative gradient at each iteration. The computation of the gradients can be performed in time scaling linearly with the depth by applying the derivative chain rule, leading to the back-propagation algorithm [49]. A popular alternative to gradient descent is stochastic gradient descent (SGD), where the sum over the gradients for the entire training set is replaced by the sum over a small number of samples, randomly selected at each step [121, 19].
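To make the training loop concrete, here is a minimal sketch in Python of SGD with back-propagation for a network with one hidden layer, in the spirit of (1.10)–(1.11). The toy data, architecture and all names are illustrative choices, not the specific models studied in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: P examples, input dim N, output dim M, hidden dim H
P, N, M, H = 200, 10, 1, 32
X = rng.standard_normal((P, N))
Y = np.sin(X.sum(axis=1, keepdims=True))        # arbitrary smooth target

# Random initialization theta_0 ~ p(theta_0), cf. (1.10)
W1 = rng.standard_normal((N, H)) / np.sqrt(N)
W2 = rng.standard_normal((H, M)) / np.sqrt(H)

def forward(x):
    """One hidden layer with tanh activation."""
    a = np.tanh(x @ W1)
    return a, a @ W2

eta, B = 0.05, 20                               # learning rate, mini-batch size
for t in range(500):
    idx = rng.choice(P, size=B, replace=False)  # SGD: random mini-batch
    x, y = X[idx], Y[idx]
    a, y_hat = forward(x)
    # Squared loss l(y, h(x)) = ||y - h(x)||^2 / 2; gradients by the chain rule
    delta2 = (y_hat - y) / B                    # gradient at the output layer
    gW2 = a.T @ delta2                          # back-propagate to W2
    delta1 = (delta2 @ W2.T) * (1 - a**2)       # through the tanh non-linearity
    gW1 = x.T @ delta1                          # back-propagate to W1
    W1 -= eta * gW1                             # gradient step, cf. (1.11)
    W2 -= eta * gW2
```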
During the training iterations, one typically monitors the training error (another name for the empirical risk given a training data set) and the validation error. The latter corresponds to the empirical risk computed on a set of points held out from the training set, the validation set, to assess the generalization ability of the model, either during training or in order to select hyperparameters such as the value of the learning rate. A posteriori, the performance of the model is judged from the generalization error, which is evaluated on the never-seen test set. While two different training algorithms (e.g. GD vs SGD) may both achieve zero training error, they may differ in the level of generalization they typically reach.
Open questions and challenges
Building on the fundamental concepts presented in the previous paragraphs, practitioners managed to bring deep learning to unanticipated levels of performance in the automatic processing of images, speech and text (see [79] for a review). Still, many of the greatest successes in the field of neural networks were obtained using ingenious tricks, while many fundamental theoretical questions remain unresolved at a very basic level.
Regarding the optimization first, (S)GD training generally discovers parameters close to zero risk. Yet gradient descent is guaranteed to converge to the neighborhood of a global minimum only for a convex function, and is otherwise expected to get stuck in a local minimum. The efficiency of gradient-based optimization is therefore a priori a paradox, given that the empirical risk \hat{R} is non-convex in the parameters \theta. Second, the representation capacity of deep learning models is also poorly understood. Results in the literature that relate the size and architecture of a network to a measure of its ability to learn are too far from realistic settings to guide the choices of practitioners. On the one hand, traditional bounds in statistics, considering worst cases, appear overly pessimistic. On the other hand, historical statistical physics analyses of learning, briefly reviewed in Section 1.2, only concern simple architectures and synthetic data. This lack of theory results in potentially important waste: in terms of time lost by engineers in trial and error to optimize their solutions, and in terms of electrical resources used to train and re-train possibly oversized networks and to store potentially unnecessarily large training data sets. Lastly, the generalization ability of deep neural networks trained by gradient descent is still unaccounted for. The size of training data sets is limited by the cost of human labelling. Thus training a deep and wide network amounts in practice to fitting a model with millions of degrees of freedom to a comparatively small number of data points. Nevertheless, this does not systematically lead to overfitting. The resulting neural networks can make good predictions both on inputs seen during training and on new inputs.
The success of deep learning, beyond these apparent theoretical puzzles, certainly lies in the interplay of advantageous properties of training algorithms, of the neural network hypothesis class and of structures in typical data (e.g. real images, conversations). Disentangling the roles of these different ingredients is a very active line of research (see [153] for a review), to which this thesis is a modest contribution. In particular, Chapter 5 touches on the question of the generalization capability of neural networks.

Unsupervised learning

Density estimation and generative modelling
The goal of unsupervised learning is to directly extract structure from data. Compared to the supervised learning setting, the training data set is made of a set of example inputs D = {x^{(k)}}_{k=1}^P without corresponding outputs. A simple example of unsupervised learning is clustering, consisting in the discovery of unlabelled subgroups in the training data. Most unsupervised learning algorithms either implicitly or explicitly adopt a probabilistic viewpoint and implement density estimation. The idea is to approximate the true density p(x), from which the training data was sampled, by the closest (in various senses) element among a family of parametrized distributions over the input space {p_\theta(.), \theta \in R^{N_\theta}}. The selected p_\theta is then a model of the data. Structural properties of the data can sometimes be inferred from it, such as dependencies according to its factorization (see Section 2.1.1). If the model p_\theta is easy to sample, it can also be used to generate new inputs comparable to the training data points – which leads to the terminology of generative models. In this context, unsupervised deep learning exploits the representational power of deep neural networks to create sophisticated candidates p_\theta.
The details of the optimization algorithm here depend on the specification of p_\theta. As we will see, the likelihood itself is often intractable, and learning consists in a gradient ascent on, at best, a lower bound, or otherwise an approximation, of the likelihood.
A few years ago, an alternative strategy called adversarial training was introduced [50], leading to generative models producing samples of remarkable quality. However, the training of the models appearing in this thesis, presented in the following paragraphs, will fall under the classical maximum-likelihood paradigm.
Deep Variational Autoencoders
A variational autoencoder (VAE) [71, 120] defines a density pθ obtained by propagating a simple distribution through a deep neural network. It can be formalized by introducing a latent variable z ∈ RN and a deep neural network hθ similar to (1.5) of input dimension N:
z \sim p_z(z),   (1.14)
x \sim p_\theta(x|z) = p_{\mathrm{out}}(x|h_\theta(z)),   (1.15)
where p_z is typically a factorized distribution on R^N that is easy to sample (e.g. a standard normal distribution), and p_{\mathrm{out}}(.|h_\theta(z)) is for instance a multivariate Gaussian distribution with mean and covariance that are functions of h_\theta(z). The computation of the likelihood of one training sample x^{(k)} then requires marginalizing over the latent variable z,

p_\theta(x) = \int \mathrm{d}z \; p_{\mathrm{out}}(x|h_\theta(z)) \, p_z(z).   (1.16)
This multidimensional integral cannot be performed analytically in the general case. It is also hard to evaluate numerically, as it does not factorize over the dimensions of z, which are mixed by the neural network h_\theta. Yet a lower bound on the log-likelihood can be defined by introducing a tractable conditional distribution q(z|x) that will play the role of an approximation of the intractable posterior distribution p_\theta(z|x) implicitly defined by the model:
\log p_\theta(x) = \int \mathrm{d}z \, q(z|x) \log p_\theta(x)   (1.17)
 = \int \mathrm{d}z \, q(z|x) \log \frac{p_\theta(x, z)}{p_\theta(z|x)}
 = \int \mathrm{d}z \, q(z|x) \log \left[ \frac{p_\theta(x, z)}{q(z|x)} \times \frac{q(z|x)}{p_\theta(z|x)} \right]
 = \int \mathrm{d}z \, q(z|x) \left[ -\log q(z|x) + \log p_\theta(x, z) \right] + \mathrm{KL}\left(q(z|x) \,\|\, p_\theta(z|x)\right)
 \geq \int \mathrm{d}z \, q(z|x) \left[ -\log q(z|x) + \log p_\theta(x, z) \right] = \mathrm{LB}(q, \theta, x),   (1.18)
where the last inequality comes from the fact that a KL-divergence is always non-negative; it is replaced by an equality if and only if q(z|x) = p_\theta(z|x). Provided q has a tractable expression and is easy to sample, the remaining lower bound LB(q, \theta, x) can be approximated by a Monte Carlo method. Maximum likelihood learning is then approached by the maximization of the lower bound LB(q, \theta, x), which requires in practice to parametrize the tractable posterior q = q_\phi as well, typically with a neural network. In order to adjust the parameters \theta and \phi through gradient descent, a last reformulation is nevertheless necessary. The expectation over q_\phi estimated via Monte Carlo indeed involves the \phi parameters. To disentangle this dependence, z \sim q_\phi(z|x) must be rewritten as z = g(\epsilon; \phi, x) with \epsilon \sim p_\epsilon(\epsilon) independent of \phi. This so-called reparametrization trick allows the computation of the gradients of LB(q_\phi, \theta, x) with respect to \theta and \phi.
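As an illustration of the objective, the following Python sketch computes a one-sample Monte Carlo estimate of LB(q_\phi, \theta, x) with the reparametrization trick. It assumes a Gaussian encoder q_\phi(z|x) with mean \mu_\phi(x) and diagonal covariance, and a Gaussian decoder of unit variance; the affine maps standing in for the neural networks, and all names, are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Nz, Nx = 2, 5                                   # latent and input dimensions

# Illustrative "networks": simple affine maps standing in for h_theta etc.
W_dec = rng.standard_normal((Nx, Nz))           # decoder weights (theta)
W_mu = rng.standard_normal((Nz, Nx))            # encoder mean weights (phi)
log_sigma = np.zeros(Nz)                        # encoder log std (phi)

def log_normal(v, mean, var):
    """Log density of a factorized Gaussian."""
    return -0.5 * np.sum((v - mean) ** 2 / var + np.log(2 * np.pi * var))

def elbo_estimate(x):
    # q_phi(z|x): Gaussian with mean mu(x) and std sigma
    mu, sigma = W_mu @ x, np.exp(log_sigma)
    eps = rng.standard_normal(Nz)               # eps ~ p(eps), independent of phi
    z = mu + sigma * eps                        # reparametrization z = g(eps; phi, x)
    log_q = log_normal(z, mu, sigma ** 2)       # log q_phi(z|x)
    log_prior = log_normal(z, 0.0, 1.0)         # log p_z(z)
    log_lik = log_normal(x, W_dec @ z, 1.0)     # log p_out(x | h_theta(z))
    return log_lik + log_prior - log_q          # one-sample estimate of LB (1.18)

x = rng.standard_normal(Nx)
print(elbo_estimate(x))
```

With an automatic-differentiation framework, gradients of this estimate with respect to \theta and \phi pass through z = g(\epsilon; \phi, x), which is precisely what the reparametrization makes possible.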
In this thesis we will mention variational autoencoders in the perspectives of Chapter 4 and Chapter 6. We will discuss a different proposition to approximate the likelihood and the alternative Bayesian learning objective involving prior knowledge on the parameters θ.
Restricted and Deep Boltzmann Machines
The models described in the preceding paragraphs comprised only feed forward neural networks. In feed forward neural networks, the state or value of successive layers is determined by following the recursion (1.7) – (1.9), in one pass from inputs to outputs. Boltzmann machines instead involve undirected neural networks, which consist of stochastic neurons with symmetric interactions. The probability associated with the state of a neuron is a function of the states of the neighboring neurons, themselves reciprocally functions of the state of the first neuron. Sampling a configuration therefore requires an equilibration in place of a simple forward pass.
A Restricted Boltzmann Machine (RBM) [3, 139] with M hidden neurons defines a joint distribution over an input (or visible) layer x \in \{0, 1\}^N and a hidden layer t \in \{0, 1\}^M,

p_\theta(x, t) = \frac{1}{Z} \, e^{\,a^\top x + b^\top t + x^\top W t}, \qquad \theta = \{W, a, b\},   (1.19)
where Z is the normalization factor, similar to the partition function of statistical physics. The parametric density model over inputs is then the marginal p_\theta(x) = \sum_{t \in \{0,1\}^M} p_\theta(x, t). Identically to VAEs, RBMs can represent sophisticated distributions at the cost of an intractable likelihood.
Indeed, the summation over 2^{M+N} terms in the partition function cannot be simplified by an analytical trick and is only realistically doable for small models.
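For a toy RBM, the sum can still be carried out explicitly, which makes the exponential cost of normalizing (1.19) visible; a minimal sketch, with illustrative sizes:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N, M = 4, 3                                     # tiny visible and hidden layers
W = 0.1 * rng.standard_normal((N, M))
a, b = np.zeros(N), np.zeros(M)

# Partition function: explicit sum over all 2^(N+M) configurations,
# infeasible as soon as N + M grows beyond a few tens.
Z = sum(
    np.exp(a @ np.array(x) + b @ np.array(t) + np.array(x) @ W @ np.array(t))
    for x in product([0, 1], repeat=N)
    for t in product([0, 1], repeat=M)
)
print(Z)
```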
This intractability remains a priori an issue for the learning by gradient ascent of the log-likelihood. Given an input training point x^{(k)}, the derivatives of \log p(x^{(k)}) with respect to the trainable parameters are
\nabla_a \log p(x^{(k)}) = x^{(k)} - \langle x \rangle,   (1.20a)
\nabla_b \log p(x^{(k)}) = \langle t \rangle_{x^{(k)}} - \langle t \rangle,   (1.20b)
\nabla_W \log p(x^{(k)}) = \langle x \, t^\top \rangle_{x^{(k)}} - \langle x \, t^\top \rangle,   (1.20c)
where we introduce the notation \langle \cdot \rangle for the average over the RBM measure (1.19) and \langle \cdot \rangle_{x^{(k)}} for the average over the RBM measure conditioned on x = x^{(k)}, also called the clamped RBM measure. To approximate the gradients, sampling from the clamped measure is actually easy, as the hidden neurons are independent when conditioned on the inputs; in other words, p(t|x) is factorized. Yet sampling from the complete RBM measure is more involved and requires for example the equilibration of a Markov chain. This is a costly operation that would need to be repeated at each parameter update of the gradient ascent. Instead, it is commonly approximated by contrastive divergence (CD) [55], which approximates the \langle \cdot \rangle averages in (1.20) by the final state of a Markov chain Monte Carlo run initialized at x^{(k)} and updated for a small number of steps.
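A minimal sketch of a CD-k update in Python, assuming binary units so that the conditionals of (1.19) factorize as p(t_j = 1|x) = \sigma(b_j + (W^\top x)_j) and p(x_i = 1|t) = \sigma(a_i + (W t)_i); names and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

N, M, eta, k = 6, 4, 0.01, 1                    # CD-k with k Gibbs steps
W = 0.01 * rng.standard_normal((N, M))
a, b = np.zeros(N), np.zeros(M)

def cd_update(x_data):
    """One contrastive-divergence parameter update from a training input."""
    global W, a, b
    # Clamped term: p(t|x) factorizes, so <t>_{x(k)} is available in closed form
    p_t_data = sigmoid(b + x_data @ W)
    # Negative term: short Markov chain initialized at the data point
    x = x_data.copy()
    for _ in range(k):
        t = (rng.random(M) < sigmoid(b + x @ W)).astype(float)
        x = (rng.random(N) < sigmoid(a + W @ t)).astype(float)
    p_t_model = sigmoid(b + x @ W)
    # Gradient ascent on the CD approximation of (1.20)
    W += eta * (np.outer(x_data, p_t_data) - np.outer(x, p_t_model))
    a += eta * (x_data - x)
    b += eta * (p_t_data - p_t_model)

cd_update((rng.random(N) < 0.5).astype(float))
```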
RBMs were the first effective generative models using neural networks. They found applications in various domains including dimensionality reduction [57], classification [78], collaborative filtering [125], feature learning [27], and topic modeling [58]. Used for an unsupervised pre-training of deep neural networks layer by layer [56, 14], they also played a crucial role in the take-off of supervised deep learning.
Furthermore, they can be generalized to Deep Boltzmann Machines (DBMs) [123], where several hidden layers are stacked on top of each other. The measure defined by a DBM of depth L is
p_\theta(x, t) = \frac{1}{Z} \, e^{\,a^\top x + x^\top W^{(1)} t^{(1)} + \sum_{l=1}^{L-1} t^{(l)\top} W^{(l+1)} t^{(l+1)} + \sum_{l=1}^{L} b^{(l)\top} t^{(l)}},   (1.21)
\theta = \{a, b^{(l)}, W^{(l)} \,;\; l = 1 \cdots L\}.   (1.22)
Similarly to (1.20), the gradient ascent of the log-likelihood with respect to the parameters involves averages over the DBM measure and the corresponding clamped measure. In this case, both of them involve an equilibration to be approximated by a Markov chain. The contrastive divergence algorithm can be generalized to circumvent this difficulty, but the approximation of the gradients appears even less controlled.
Open questions and challenges
Generative models involving neural networks, such as VAEs, GANs and RBMs, have great expressive power at the cost of not being amenable to exact treatment. Their training, and sometimes even their sampling, requires approximations. From a practical standpoint, whether these approximations can be made either more accurate or less costly is an open direction of research. Another important related question is the evaluation of the performance of generative models [122]. To start with, the objective function of training is very often itself intractable (e.g. the likelihood of a VAE or an RBM), and beyond this objective, the unsupervised setting does not define a priori a test task to measure the performance of the model. Additionally, unsupervised deep learning inherits some of the theoretical puzzles already discussed in the supervised learning section. In particular, assessing the difficulty of representing a distribution and selecting a sufficient minimal model and/or training data set seems today far out of reach of theory.
In this thesis, unsupervised learning models will be discussed in particular in Chapter 4. A novel deterministic framework for RBMs and DBMs with a tractable learning objective is derived, allowing their representational power to be exploited efficiently.


A brief history of mean-field methods for neural networks

The mathematical description of learning implies dealing with a large set of correlated random variables: the data, the network parameters, and the resulting neuron activations. The major hurdle in our theoretical understanding of algorithms based on deep neural networks is the difficulty of performing computations involving these high-dimensional multivariate distributions. This is one aspect of the so-called curse of dimensionality. Here, exact computations are almost never possible and numerical treatments are unrealistically costly. Finding good approximations is therefore crucial. Mean-field methods can be understood as one approximation strategy. Originally developed in the statistical physics literature, their applications to machine learning, and in particular to neural networks, already have a long history and a record of significant contributions.
In the 80s and 90s, a series of works pioneered the analysis of learning with neural networks through the statistical physics lens. By focusing on simple models with simple data distributions, and relying on the mean-field method of replicas, these papers managed to predict quantitatively important properties such as capacities – the number of training data points that could be memorized by a model – or learning curves – the generalization error (or population risk) of a model as a function of the size of the training set. This effort was initiated by the study of the Hopfield model [5], an undirected neural network providing associative memory [60]. The analysis of feed forward networks with simple architectures followed (among which [44, 45, 104, 156, 95, 106, 37, 96, 7]). Physicists, accustomed to studying natural phenomena, fruitfully brought the tradition of modelling to their investigation of learning. Their approach contrasted with the focus of machine learning theorists on 'worst case' guarantees, specific to a hypothesis class and required to hold for any data distribution (e.g. Vapnik-Chervonenkis dimension and Rademacher complexity). The originality of their approach, along with the heuristic character of the derivations of mean-field approximations, may nonetheless explain the minor impact of their theoretical contributions in the machine learning community at the time.
Before the deep learning era, mean-field methods probably had a greater influence on the practice of unsupervised learning through density estimation, where, as we saw, approximations are almost always necessary. In particular, the simplest method of naive mean-field, which will also be our first example in Chapter 3, was easily adopted and even extended by statisticians (see e.g. [154] for a recent textbook and [16] for a recent review). The belief propagation algorithm is another example of a mean-field method well known to machine learners – it was actually discovered in both communities. Yet these applications rarely involved neural networks and rather relied on simple probabilistic models such as mixtures of elementary distributions. They also did not take full advantage of the latest, simultaneous developments in the statistical physics of the mean-field theory of disordered systems.
In this context, the inverse Ising problem has been a notable exception. The underlying question, rooted in theoretical statistical physics, is to infer the parameters of an Ising model given a set of equilibrium configurations. This corresponds, in the machine learning jargon, to the unsupervised learning of the parameters of a Boltzmann machine (without hidden units). The corresponding Boltzmann distribution, with pairwise interactions, is remarkable, and not only to physicists, as it is the least biased model under the assumption of fixed first and second moments, in the sense that it maximizes the entropy. For this problem, physicists led an efficient cross-fertilization and proposed dedicated developments of advanced mean-field methods for applications in other fields, in particular in biophysics (see [100] for a recent review). Nevertheless, these works almost never considered the case of Boltzmann machines with hidden units, more common in the machine learning community, especially in the early days of deep learning.
The language barrier between communities is undoubtedly a significant hurdle delaying the transfer of developments from one field to the other. In machine learning, the potential of the most recent progress in mean-field approximations was already advocated in a pioneering workshop mixing communities in 1999 [105]. Yet their first widely-used application is possibly the Approximate Message Passing (AMP) algorithm for compressed sensing in 2009 [34] – related to the belief propagation algorithm and even more closely to the TAP equations known in the physics of disordered systems. Meanwhile, there have been much tighter connections between developments in statistical physics and algorithmic solutions in the context of Constraint Satisfaction Problems (CSPs). The very first popular application of advanced mean-field methods outside of physics, beyond naive mean-field and belief propagation, is probably the survey propagation algorithm [91] in 2002. It borrows from the 1RSB cavity method (which will not be mentioned in this dissertation) to solve efficiently certain types of CSPs.
The great leap forward in the performance of machine learning with neural networks brought by deep learning algorithms, along with the multitude of theoretical and practical challenges it has opened, re-ignited the interest of physicists. While the power of advanced mean-field theories was mostly unexploited in this context at the time this thesis started, this strategy has yielded over the last couple of years a number of contributions that would be difficult to list exhaustively (some of them were recently reviewed in [21]). In these works, applying mean-field ideas to deep neural networks has served different objectives and required different simplifying assumptions, such as considering an infinite size limit, random (instead of learned) weights, or vanishing learning rates. Hence, there is no such thing as one mean-field theory of deep neural networks, and these contributions are rather complementary pieces of a great puzzle. In this thesis in particular, we investigated mean-field insights in models being explicitly trained, either on real or synthetic data sets. To this end we notably exploited some recent progress in the statistical physics treatment of disordered systems. In the following chapters of this First Part we formally introduce the fundamental frameworks and mean-field tools underlying the research presented in the Second Part.

Table of contents:

I Background 
1 Machine learning with neural networks and mean-field approximations (Context and motivation) 
1.1 Neural networks for machine learning
1.1.1 Supervised learning
1.1.2 Unsupervised learning
1.2 A brief history of mean-field methods for neural networks
2 Statistical inference and statistical physics (Fundamental theoretical frameworks) 
2.1 Statistical inference
2.1.1 Statistical models representations
2.1.2 Some inference questions in neural networks for machine learning
2.1.3 Challenges in inference
2.2 Statistical physics of disordered systems
3 Selected overview of mean-field treatments: free energies and algorithms (Techniques) 
3.1 Naive mean-field
3.1.1 Variational derivation
3.1.2 When does naive mean-field hold true?
3.2 Thouless Anderson and Palmer equations
3.2.1 Outline of the derivation
3.2.2 Illustration on the Boltzmann machine and important remarks
3.2.3 Generalizing the Georges-Yedidia expansion
3.3 Belief propagation and approximate message passing
3.3.1 Generalized linear model
3.3.2 Belief Propagation
3.3.3 (Generalized) approximate message passing
3.4 Replica method
3.4.1 Steps of a replica computation
3.4.2 Assumptions and relation to other mean-field methods
3.5 Extensions of interest for this thesis
3.5.1 Streaming AMP for online learning
3.5.2 A special case of GAMP: Cal-AMP
3.5.3 Algorithms and free energies beyond i.i.d. matrices
3.5.4 Model composition
II Contributions 
4 Mean-field inference for (deep) unsupervised learning 
4.1 Mean-field inference in Boltzmann machines
4.1.1 Georges-Yedidia expansion for binary Boltzmann machines
4.1.2 Georges-Yedidia expansion for generalized Boltzmann machines
4.1.3 Application to RBMs and DBMs
4.1.4 Adaptive-TAP fixed points for binary BM
4.2 Applications to Boltzmann machine learning with hidden units
4.2.1 Mean-field likelihood and deterministic training
4.2.2 Numerical experiments
4.3 Application to Bayesian reconstruction
4.3.1 Combining CS and RBM inference
4.3.2 Numerical experiments
4.4 Perspectives
5 Mean-field inference for information theory in deep supervised learning 
5.1 Mean-field entropy for multi-layer models
5.1.1 Extension of the replica formula to multi-layer networks
5.1.2 Comparison with non-parametric entropy estimators
5.2 Mean-field information trajectories over training of deep networks
5.2.1 Tractable deep learning models
5.2.2 Training experiments
5.3 Further investigation of the information theoretic analysis of deep learning
5.3.1 Mutual information in noisy trainings
5.3.2 Discussion
6 Towards a model for deep Bayesian (online) learning 
6.1 Cal-AMP revisited
6.1.1 Derivation through AMP on vector variables
6.1.2 State Evolution for Cal-AMP
6.1.3 Online algorithm and analysis
6.2 Experimental validation on gain calibration
6.2.1 Setting and update functions
6.2.2 Offline results
6.2.3 Online results
6.3 Matrix factorization model and multi-layer networks
6.3.1 Constrained matrix factorization
6.3.2 Multi-layer vectorized AMP
6.3.3 AMP for constrained matrix factorization
6.4 Towards the analysis of learning in multi-layer neural networks
Conclusion and outlook 
Appendix 
A Vector Approximate Message Passing for the GLM
B Update functions for constrained matrix factorization
Bibliography 
