Physics joins probability: all you need for optimal estimation – A selection of high-dimensional problems


Critical treatment of previous approaches

In this section, we briefly sketch the computation of [KKM+16], and detail which of the approximations used in that work are not valid in the thermodynamic limit. The authors derived the free entropy of this problem, in the Bayes-optimal setting, in two different ways, namely via the replica method and via belief propagation (BP) equations. We believe that both of their approaches are based on incorrect assumptions.
These incorrect hypotheses are also present in the derivation of the BiG-AMP (and BiG-VAMP) algorithms (cf. [PSC14a, PSC14b, ZZY20] among others). We therefore believe that these algorithms are likewise unable to compute exactly the asymptotics of the marginal probabilities in this problem.
Let us now describe both approaches taken in [KKM+16], and explain how the assumptions behind them fail. We focus primarily on the replica analysis, and briefly describe the message-passing approach.

TAP equations and PGY expansion

In this section, we detail the Plefka-Georges-Yedidia (PGY) expansion applied to extensive-rank matrix factorization. We mainly focus on detailing the method and its results for the non-symmetric Model FX. In Sections 3.3.1 to 3.3.3 we describe the PGY expansion approach to deriving the TAP equations, and we discuss their relation to the previous approaches described in Section 3.2. We end this chapter with Section 3.3.4, in which we generalize our findings to the symmetric Model XX⊺.

Main contributions of this chapter

In this chapter we go beyond the aforementioned literature in two main directions. Our first contribution is a proof of the replica-symmetric formula for the free entropy, which was conjectured in the statistical physics literature. This proof relies on an adaptive interpolation method [BM19a, BKM+19], which allows us to put several of these results on a rigorous basis.
Secondly, we design an Approximate-Message-Passing-type algorithm (recall that we introduced such algorithms in Section 1.4) that is able to achieve the optimal generalization error for a wide range of parameters. The study of AMP, which is widely believed to be optimal among all polynomial-time algorithms in the above setting [DJM13, DM15, ZK16, BPW18], unveils, in the case of the committee machine with a large number of hidden neurons, the existence of a large hard phase. In this phase, learning is information-theoretically possible, leading to a generalization error decaying asymptotically as $O(K/\alpha)$ (in the $\alpha = \Theta(K)$ regime), but AMP fails and provides only a poor generalization error that does not go to zero as $\alpha$ increases. This strongly suggests that no efficient algorithm exists in this hard region, and therefore that there is a computational gap in learning in such neural networks. In other problems where a hard phase was identified, its study boosted the development of algorithms able to match the predicted thresholds, and we hope that this study will contribute in this regard.
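For the reader's convenience, and as a reminder rather than the exact setting of this chapter (which allows for more general output channels), the classical committee machine with $K$ hidden units and sign activations maps an input $x \in \mathbb{R}^n$, through weights $W \in \mathbb{R}^{K \times n}$, to the prediction
$$\hat{y}(x) = \mathrm{sign}\Big( \sum_{k=1}^{K} \mathrm{sign}\big( W_k \cdot x / \sqrt{n} \big) \Big),$$
and $\alpha = m/n$ denotes the number of training samples per input dimension.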

Picking from the toolbox

Now that we have fixed our model of interest, we can pick our favorite tools from the statistical physics toolbox to study optimal learning. In this chapter, we leave aside the TAP approach (the reader interested in the latter can refer to Chapters 2 and 3) and focus instead on two other methods.
• First, we leverage the replica method, which we introduced in Section 1.3.1. Recall that the aim of this method is to compute the large-$n$ limit of the free entropy $f_n$ of Model 4.1, i.e. the log-normalization of the posterior written in eq. (4.5). It allows us to obtain an explicit (conjectural) expression for $f_n$ in the high-dimensional limit $n, m \to \infty$ with $\alpha = m/n$ fixed, called the replica-symmetric (RS) formula. One can then naturally ask: can we prove this conjecture? While proving the replica method itself seems out of reach, in this chapter we manage to prove its prediction by means of probabilistic methods: this is summarized by Theorem 4.1, which is the main theoretical result of this chapter. For this reason, while we first derived the results of Theorem 4.1 with the replica method, we focus in this chapter on its probabilistic proof; the replica computation is detailed in Appendix B.1.
• The second tool, complementary to the first, is the use of message-passing algorithms to assess the optimal algorithmic performance (among a large class of first-order methods, see Section 1.4.2). We introduced these approaches in Section 1.4, and we will extend them to the present setting in Section 4.3; a minimal sketch of this type of analysis is given right after this list.
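As a minimal illustration of how a message-passing analysis tracks algorithmic performance, the sketch below iterates the Bayes-optimal state-evolution recursion of a much simpler toy problem, rank-one (spiked Wigner) estimation with a standard Gaussian prior, for which the overlap obeys $m_{t+1} = \lambda m_t / (1 + \lambda m_t)$. This is not the committee-machine AMP of Section 4.3, only a hedged sketch of the same methodology, with function names of our own choosing.

    def state_evolution(snr, n_iter=200, m0=1e-6):
        """Bayes-optimal state evolution for rank-one (spiked Wigner) estimation
        with a standard Gaussian prior: m_{t+1} = snr * m_t / (1 + snr * m_t)."""
        m = m0
        for _ in range(n_iter):
            m = snr * m / (1.0 + snr * m)
        return m  # asymptotic overlap reached by AMP from an uninformative start

    # Below the transition (snr < 1) the overlap vanishes; above it (snr > 1)
    # the recursion converges to the nontrivial fixed point m = 1 - 1/snr.
    for snr in (0.5, 2.0):
        print(snr, state_evolution(snr))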
We hope that this chapter will illustrate to the reader that, by leveraging our toolbox, one can gain a precise understanding of the optimal computational and statistical performance in a broad class of inference models, and that it can be instrumental in gaining a deeper comprehension of learning in neural networks.

Main theorem: the replica-symmetric formula

Some notation – Recall that $\mathcal{S}_K^+$ is the set of positive semi-definite real symmetric $K \times K$ matrices. For any $M \in \mathcal{S}_K^+$, we can uniquely define its square root $\sqrt{M} = M^{1/2}$. For $N \in \mathcal{S}_K^+$, let us define the set $\mathcal{S}_K^+(N) \equiv \{ M \in \mathcal{S}_K^+ \ \mathrm{s.t.}\ N - M \in \mathcal{S}_K^+ \}$. Note that $\mathcal{S}_K^+(N)$ is both convex and compact.
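As a small numerical companion to these definitions (a sketch only, assuming numpy is available; the helper names are ours), membership in $\mathcal{S}_K^+(N)$ can be tested by checking that the eigenvalues of $M$ and $N - M$ are non-negative, and the square root $M^{1/2}$ can be computed by eigendecomposition.

    import numpy as np

    def is_psd(A, tol=1e-10):
        """Check that a real symmetric matrix is positive semi-definite."""
        return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

    def in_S_K_plus_of_N(M, N, tol=1e-10):
        """Membership in S_K^+(N) = {M in S_K^+ such that N - M in S_K^+}."""
        return is_psd(M, tol) and is_psd(N - M, tol)

    def sqrt_psd(M):
        """Unique positive semi-definite square root M^{1/2} via eigendecomposition."""
        w, V = np.linalg.eigh(M)
        w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues due to round-off
        return (V * np.sqrt(w)) @ V.T

    # Example: with N = 2*I_K and M = I_K, N - M = I_K is PSD, so M lies in S_K^+(N).
    K = 3
    N, M = 2.0 * np.eye(K), np.eye(K)
    print(in_S_K_plus_of_N(M, N), np.allclose(sqrt_psd(M) @ sqrt_psd(M), M))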

Table of contents:

I Statistical physics approach to inference models and neural networks 
1 The statistical physics toolbox 
1.1 Elements of Bayesian statistical inference
1.1.1 A motivating example: classifying data
1.1.2 Inference problems in high dimension and the Bayes-optimal setting
1.1.3 Generalized Linear Models
1.1.4 Gibbs-Boltzmann measure and the free entropy
1.1.5 Estimators
1.2 Intuitions from the physics of spin glasses
1.2.1 Why spin glasses?
1.2.2 Important concepts of spin glass theory
1.2.3 Some classical spin glass models
1.3 Static approximations to the free energy
1.3.1 Replica theory and replica symmetry breaking
1.3.2 Thouless-Anderson-Palmer approach
1.4 From physics to algorithms
1.4.1 Belief propagation (BP)
1.4.2 Approximate Message Passing (AMP): derivation and consequences
1.4.3 Three approximations for non-Gaussian inference problems
1.5 Some rudiments of probability and random matrix theory
1.5.1 Random matrix ensembles and asymptotic spectra
1.5.2 Large deviations
1.5.3 High-dimensional “spherical” integrals
2 Revisiting high-temperature expansions 
2.1 Organization of the chapter and main results
2.2 Plefka-Georges-Yedidia expansion step-by-step
2.2.1 Pedagogical derivation for a spherical SK-like model
2.2.2 Generalization to a bipartite model
2.3 PGY expansion for inference models
2.3.1 PGY expansion in generic models of pairwise interactions
2.3.2 High-temperature expansions and message-passing algorithms
2.4 Diagrammatics and free cumulants
2.4.1 Expectation of simple cycles and free cumulants
2.4.2 The expectation of generic diagrams
2.4.3 Concentration of the diagrams: a second moment analysis
2.4.4 Higher-order moments in the diagrammatics
2.4.5 Extension to bipartite models
2.4.6 A note on i.i.d. matrices
3 Towards exact solution of extensive-rank matrix factorization 
3.1 Introduction
3.1.1 Definition of extensive-rank matrix factorization
3.1.2 Organization of the chapter and summary of the results
3.2 Critical treatment of previous approaches
3.3 TAP equations and PGY expansion
3.3.1 Sketch of the computation
3.3.2 The series at order 2 and the approximation of [KKM+16]
3.3.3 Going to higher orders: open directions
3.3.4 Symmetric matrix factorization
II Physics joins probability: all you need for optimal estimation – A selection of high-dimensional problems 
4 The physics of learning in a two-layers neural network 
4.1 Introduction: the committee machine
4.1.1 Classical physics predictions
4.1.2 Main contributions of this chapter
4.2 Main theoretical results
4.2.1 General probabilistic model
4.2.2 Picking from the toolbox
4.2.3 Main theorem: the replica-symmetric formula
4.3 Investigating computational-to-statistical gaps
4.3.1 Approximate Message-Passing
4.3.2 From two to more hidden neurons, and the specialization transition
4.4 Proof of the replica formula: adaptive interpolation
4.4.1 Interpolating estimation problem
4.4.2 Overlap concentration and fundamental sum rule
4.4.3 A technical lemma and an assumption
4.4.4 Matching bounds: adapting the interpolation path
5 Generative models, or how to exploit the structure in the data 
5.1 Generative models for spiked matrix estimation
5.1.1 Introduction: exploiting data structure
5.1.2 Inference model: spiked matrix estimation
5.1.3 Generative models for the data
5.1.4 Summary of main results
5.2 Analysis of optimal estimation
5.2.1 Mutual information: the replica method rigorous once again
5.2.2 Optimal performance and statistical thresholds: phase diagrams
5.2.3 Algorithmic optimal estimation
5.2.4 State evolution equations
5.3 LAMP: a spectral algorithm for generative priors
5.3.1 Linearizing the AMP equations
5.3.2 Random matrix perspective on the spectral methods
5.3.3 Application to real data recovery
5.4 Random matrix analysis of the transition
5.4.1 The bulk of eigenvalues: proof of Theorem 5.2
5.4.2 BBP-like transition: proof of Theorem 5.3
6 Phase retrieval: theoretical transitions and efficient algorithms 
6.1 The phase retrieval problem
6.2 Optimal estimation in GLMs with structured data
6.2.1 Replica free entropy and how to prove it
6.2.2 Algorithmic point of view: the G-VAMP algorithm
6.3 Weak and perfect recovery transitions
6.3.1 Weak recovery: beating a random guess
6.3.2 Perfect recovery for Gaussian signals in noiseless phase retrieval
6.3.3 Surprising consequences and open questions
6.4 Efficient algorithms: constructing optimal spectral methods
6.4.1 Universality of the optimal method
6.4.2 Linearized vector approximate message passing
6.4.3 The Bethe Hessian: TAP revisited
6.4.4 Unifying the approaches
6.5 Statistical and algorithmic analysis: numerical experiments
6.5.1 Optimal algorithms and computational gaps
6.5.2 Spectral methods: cheap and efficient
6.5.3 Real image reconstruction
III Towards a topological approach to high-dimensional optimization 
7 The complexity of high-dimensional landscapes 
7.1 Counting complexity: the Kac-Rice formula
7.1.1 How to “count” the complexity of a landscape?
7.1.2 The area formula
7.1.3 The Kac-Rice formula
7.1.4 The complexity of the pure spherical p-spin model
7.2 Kac-Rice for inference models: main results
7.3 Proof of the annealed complexity
7.3.1 Applying the Kac-Rice formula
7.3.2 The complexity at finite n
7.3.3 Concentration and large deviations
7.4 Towards a numerical solution?
7.4.1 The logarithmic potential of μ,[]
7.4.2 Heuristic derivation of simplified fixed point equations
7.5 The quenched complexity and the replica method
7.5.1 Computing the p-th moment
7.5.2 Decoupling replicas and the p → 0 limit
8 An excursion to large deviations in random matrix theory 
8.1 Why the large deviations of the eigenvalues?
8.1.1 The landscape of generalized linear models and variants
8.1.2 PCA for correlated data
8.1.3 Organization of the chapter
8.2 Large deviations of extreme eigenvalues of generalized sample covariance matrices
8.2.1 Some formal definitions and assumptions
8.2.2 Main result
8.3 Monte-Carlo simulations
8.4 Derivation of the rate function
8.4.1 General idea behind the method
8.4.2 Tilting the measure: a first attempt
8.4.3 Beyond the transition: a second tilting
8.4.4 Going further: the complex case and the left tail of the large deviations
Afterword
Bibliography 
PhD publications
Numerical codes of PhD publications
Other references
Appendices
A Technicalities of the Plefka-Georges-Yedidia expansion 
A.1 Order 4 of the expansion for a spherical model
A.2 Generalizations of the diagrammatics
A.3 PGY for extensive-rank matrix factorization
A.4 The expansion for symmetric extensive-rank matrix factorization
B Details of replica computations 
B.1 Replica calculation for the committee machine
B.2 Replica computation for generic GLMs
C Proving the replica formula: details in the committee machine 
C.1 Positivity of some matrices
C.2 Properties of the auxiliary channels
C.3 Setting in the Hamiltonian language
C.4 Free entropy variation: Proof of Proposition 4.3
C.5 A few technical lemmas
D Technical results of Part II 
D.1 Generalization error in the committee machine
D.2 Large K limit in the committee machine
D.3 RMT analysis of the spiked matrix model
D.4 State evolution of spectral methods with generative prior
D.5 Derivation of thresholds in phase retrieval
D.6 Details of the spectral methods analysis for phase retrieval
E Details of the topological approach 
E.1 The quenched complexity calculation
E.2 Details of proof for the annealed complexity
E.3 The large deviations in the white Wishart case
E.4 The phase transition in the rate function
E.5 Technicalities on spherical integrals
E.6 Simplifying the rate function
