Universality of phase transitions in noiseless linear regression

Get Complete Project Material File(s) Now! »

The replica method for spin glass models

Introduction to spin glass models

Spin glass models were originally introduced to describe some metal alloys such as copper-magnesium or gold-iron. They diﬀer from standard ferromagnets where magnet spins align; or paramagnet where magnets spins are anti-aligned: in spin glasses, both behaviors compete to produce a non-regular pattern. The magnetic elements are modeled by N variables, that we call spins. Each spin is described by a scalar parameter Si, and we define S = {Si}i=1,…,N . Those spins interact with each others in a given way, and this interaction will influence how they behave together. Spin glasses are in fact very complex to study and show various phenomena of interest, both in-equilibrium and oﬀ-equilibrium. A nice and broad introduction to spin glasses in statistical physics can be found in [107].
The interaction between spins are modeled by the coupling variables J = {Jij}i,j=1,..,N which are distributed according to some probability distribution P (J). The spins are subjected to local fields h = {hi}i=1,…,N . For a given set J, the system is described by the Ising Hamiltonian, already mentioned in the introduction
A typical phenomenon in spin glass systems is frustration: depending on the sign of couplings Jij, two spins will tend to be aligned or anti-aligned to minimize the energy of the system. However, a given spin might be under two opposing constraints from two of its neighbours, and it is not clear which configuration is the most energetically favorable, as shown in Fig. 1.2. From the Hamiltonian
Let us take a step back and check where we are: we have computed the average free energy, as a function of one parameter q0. The state of our system is described by the minimum of this free energy: we only need to minimize it with respect to q0. To reach this stage, we have done several moves that could possibly lead to a mistake: we have reverted limits, chosen a particular ansatz, and assumed an analytic continuation of free energy for n → 0.
Solving (1.35) for h = 0 shows that q0 = 0 is the only solution for β < 1, but another non-trivial solution exists for β > 1. Since β plays the role of an inverse temperature, it means that the system undergoes a phase transition between a regime of high-temperature, where it has a paramagnetic phase, and a regime of low temperature, the spin glass phase. At h 6= 0, there is no phase transition.
We could stop here and be happy with our result, but since we have done several fishy steps, we need to be extra careful and double-check it. In fact, something does go wrong with our calculation, when we take the limit n → 0. We already see that taking n → 0 makes Φn a function of a negative number of parameters, which does not mean much, but we need some amount of faith – and some amount of craziness – to go through replica calculations. To evaluate the correctness of our solution we can probe the stability of the fixed point, by computing the which is well defined for integer values of n, and then taking n → 0 to see how it aﬀects them. If matrix Q is a minimum of Φn, the eigenvalues of the Hessian matrix must all be strictly positive at this point. However, a detailed analysis in the replica-symmetric ansatz [39] provides all eigenvalues of the Hessian matrix, and shows that for any value of h, one of them becomes negative as n → 0 if the temperature is suﬃciently low. Therefore our solution is not valid anymore as the fixed-point becomes unstable and is therefore not a global minimum, and we need to pick a diﬀerent ansatz for low temperature regime. Another way of seeing that something went wrong is to compute the entropy of the system. The entropy is positive by definition, being the logarithm of the number of configurations, however taking the RS ansatz we find a negative entropy at zero temperature. To understand how we can choose an appropriate ansatz for the matrix Q and what it would mean, we will have to dig deeper into the physical meaning of overlap parameters.

An overview of replica symmetry breaking

Pure states and ergodicity breaking

Until now, we simply went with the flow of the calculation, but let us see what statistical physics tell us about the behavior of this disordered spin system. By minimizing the free energy, we are looking for the Gibbs equilibrium state reached by the system. In fact, this equilibrium state can be seen as a mixture of several pure states, which themselves cannot be split: they form a basis of all states. In a pure state, correlations between spins need to vanish as the distance between them goes to infinity, which is a priori not the case for the equilibrum state. At low temperature, the system undergoes a breaking of ergodicity: only a subpart of the space of configurations is explored. In that case, several pure states coexist, and we can rewrite the average h•i on the Hamiltonian
Finally, we see that the fraction of elements equal to q in the matrix Q gives the average probability that two pure states of the system have overlap q. Hence, the parameters have a clear physical meaning, and correspond to possible overlaps among pure states. In the RS ansatz, we only have a single possible overlap q0 = Qab for all a 6= b, which means that only one pure state exists, and the overlap q0 is the self-overlap of this single pure state. At low temperature, as ergodicity breaking occurs, several pure states appear and hence give rise to several overlap values. Bearing this in mind, we can try to formulate another ansatz for the matrix Q.
1-step replica symmetry breaking
Now that we see the link between the overlap matrix Q and overlaps between pure states, we can turn to a more sophisticated ansatz. The puzzle is still very hard to solve: we have no idea how many pure states there are, what their self-overlap is, how many configurations they include… Once again, we will try to design a somewhat naive ansatz and see how well it performs (notably by comparing it to simulations), hoping that it will be well enough to encapsule the physical reality. The simplest – although not that simple – ansatz is the first step of replica symmetry breaking (1RSB), which was introduced by Parisi [123, 124]. Say that among the n replicas, there exists n/m groups of m replicas each. Of course, n needs to be a multiple of m for this to make sense. Let us further assume that the configurations within one state all have the same overlap q1, and that configurations within two diﬀerent states all have the same overlap q0. Basically, the phase space is divided into clusters, each cluster being a state containing a number of configurations, and all clusters having the same internal overlap and mutual overlap, as shown in Fig. 1.3. The replicas reproduce this structure. For instance, if we take n = 9 and m = 3, the 1RSB matrix Q reads
Since we are striving to keep some meaning into the quantities we are manipulating, we would like to keep P (q) a probability, hence smaller than 1, which implies that 0 ≤ m ≤ 1. Besides, we would like 0 ≤ q0 ≤ q1 since two configurations belonging to the same state should be “closer”. Our recipe is now complete: we need to plug this ansatz inside the replica calculation, to write the free energy as a function of q0, q1, m; and to extremize it to reach a set of saddle-point equations. There are still some diﬃculties along the way, but we will not detail them here since we were mainly interested in explaining the idea between replica calculations. However, it is worth noting that the result from the 1RSB ansatz is very close to numerical experiments, and much more satisfactory than the RS wrong result in the low temperature spin glass phase.
Beyond 1RSB
While the 1RSB ansatz performs well for the SK model, it is clear that is still a simplified way to model the phase space. We could keep repeating the same plan by splitting clusters into sub-clusters, and adding a new overlap q2: this is two-step replica symmetry breaking. in practice, the computation keeps getting harder, and it is not necessarily useful to increase complexity since the 2RSB scheme does not always show a significant diﬀerence from the 1RSB result.
Although somewhat hard to use at first try, the replica method is a very powerful tool to reach a heuristic solution for many problems. While it clearly lacks mathematical rigor, a huge eﬀort was made in order to prove replica results in several settings, and we will come back to this later in this thesis. The take-away message is that the replica method manages to capture some very deep and fundamental properties of disordered systems, and to incorporate them accurately in its structure: it is no wonder that it has been used so consistently in the past 40 years, and is still a stepping stone to formulate many questions, without completely unveiling its secrets. Let us stray away from the replica method for some time: it will show up again in a few paragraphs as a way of tackling some inference problems.

READ Iron Ore Reduction with Coal/Carbon

An introduction on statistical inference

In A study in scarlet, by Sir Arthur Conan Doyle, Sherlock Holmes makes his first appearance as Dr Watson reads one of his articles. The latter states that “From a drop of water, a logician could infer the possibility of an Atlantic or a Niagara without having seen or heard one of the other. (…) By a man’s finger nails, by his coat-sleeve, by his boot, by his trouser knees, by the callosities of his forefinger and thumb, by his expression, by his shirt cuﬀs – by each of these things a man’s calling is plainly revealed. That all united should fail to enlighten the competent enquirer in any case is almost inconceivable.” Watson does not buy it, and indignantly comments “What ineﬀable twaddle!” We may want to agree: Holmes’ deductions often seem phenomenal and somewhat far-fetched. Holmes then argues that this is no empty gibberish, since detectives often come to him, “lay all the evidence before [him],and [he is] generally able, by the help of [his] knowledge of the history of crime, to set them straight.” The reason why Holmes’ findings might not always convincing is the little amount of information that he has. The more data he has on a person, the more believable his deductions become. Nevertheless, he points out that he is able to draw conclusions thanks to his “special knowledge”. What makes Holmes such a gifted investigator is his shrewd use of inference.
Statistical inference describes the process of deducing information about a distribution of underlying data, from the knowledge of observations. Let us see how to formulate this in mathematical terms, taking up the approach of Bayesian inference. We consider a set of variables x = {xi}i=1,…,N and measurements y = {yµ}µ=1,…,M that contain some information on the variables.
• We assume that the data is distributed according to P (x), the prior distribution, that we may only know partially. Note that the data does not eﬀectively have to be a random variable, we merely use the probability notation to describe a belief about the values of x.
• The way in which the observations are generated or derive from the data is P (y|x), the likelihood distribution.
• The conditional probability of having data x given observations y is P (x|y), the posterior distribution.
They are linked through the well-known Bayes formula P (x|y) = P (y|x)P (x) . (1.52)
For Holmes and Watson, the variables could be information about the suspect’s life and where-abouts, while the observations are all the clues that are so brilliantly interpreted by Sherlock. The “special knowledge” he mentioned lies in distributions P (x) and P (y|x), and his deductions are a proxy for P (x|y). Apart from profiling a culprit, inference problems show up in many fields: machine learning, social science, information theory, biology, signal processing… They can be very complex and challenging, fuel a large amount of research, and applications are ev-erywhere. Statistical inference is particularly exciting in our modern era of big data: for many problems we can obtain a huge amount of information, thousands, millions of measurements. How accurate would Sherlock Holmes become if he also had access to the social media accounts, the GPS localization, and the message history of the suspect! Even Watson would not do too bad… We will hence focus on the large-dimensional thermodynamic limit: we will consider M, N → ∞ but will often keep their ratio α ≡ M/N of order 1.
Teacher-student scenario
Within the Bayesian inference set-up, some problems can be written in the teacher-student scenario.
i) We assume that a teacher generates the variables x0 from a probability distribution PT (x0), then the measurements are obtained through a likelihood probability PT (y|x0). The original variables x0 are called the ground truth.
ii) A student then starts from the observations y, and has some knowledge through distribu-tions PS(x) and PS(y|x), and would like to recover x0.

Table of contents :

Introduction
1 The replica method and the inverse Ising problem with sparse weights
1.1 The replica method for spin glass models
1.1.1 Introduction to spin glass models
1.1.2 Goal of the replica method
1.1.3 Example of replica-symmetric calculation: the Sherrington-Kirkpatrick model
Averaging the replicated partition function
Taking the limit N ! 1 first
Choosing an ansatz
Taking the limit n ! 0
1.1.4 An overview of replica symmetry breaking
Pure states and ergodicity breaking
Physical parameters
Physical meaning of the overlap matrix Q
1-step replica symmetry breaking
Beyond 1RSB
1.2 An introduction on statistical inference
Teacher-student scenario
Estimators
Bayes optimal versus mismatched case
1.3 Inverse Ising problem with sparse teacher couplings
1.3.1 Introduction of the inverse Ising problem
The maximum likelihood estimator
The pseudo likelihood and local estimators
Teacher-student scenario
Statistical mechanics analysis: general framework
Revisiting the fully-connected case
1.3.2 Details of the sparsely-connected case
Difficulty of the sparse case and oracle estimator
Ansatz on the estimator
Properties of h and ha Average free energy in the sparsely-connected case
1.3.3 Properties of the direct problem
Marginal distribution of the teacher model
Inverse correlation function
1.3.4 Applicable range of the ansatz
Zero gradient conditions for Validity of the ansatz on tree-like graphs
1.3.5 Numerical experiments RR graph case ER graph case
Square lattice case for comparison
Comparison with interaction screening
1.3.6 Discussion and limits
2 Universality of phase transitions in noiseless linear regression
2.1 Belief propagation
2.1.1 Factor graphs
2.1.2 An easy example: the Ising chain
2.1.3 BP on tree-like networks
BP equations
Simplification for pairwise models
Useful quantities
Bethe free energy
2.1.4 BP on loopy graphs
2.2 Theoretical background on linear regression
2.2.1 Introduction to linear regression
2.2.2 Sparse linear regression
Why sparse?
Bayes-optimal versus `1 reconstruction
Two types of matrices
Replica analysis for i.i.d. and right rotationally invariant matrices
Theoretical phase transitions
2.2.3 Approximate message passing algorithms
2.3 Universal transitions in noiseless compressed sensing
2.3.1 Equivalence between right rotationally invariant and Gaussian i.i.d. matrices
2.3.2 Universal transitions for structured matrices
Some types of structured matrices
Numerical results
2.3.3 Discussion and limits of the universality
3 Asymptotic errors for convex penalized linear regression
3.1 Definition of the problem
3.2 Statistical physics result: the replica formula
3.3 Vector approximate passing and its state evolution
3.3.1 MAP formulation of Vector approximate message passing
3.3.2 Equality of ˆx and VAMP’s fixed point
3.3.3 State evolution of VAMP
3.3.4 Equivalence of state evolution and replica equations
3.4 A simplified algorithm: Oracle VAMP
3.4.1 Idea of the proof
3.4.2 Definition of Oracle VAMP
3.4.3 Strong convexity and smoothness of a convex function
3.4.4 Lipschitz constants of Oracle VAMP’s operators
3.5 Smoothed problem and its convergence
3.5.1 Definition of the modified problem
3.5.2 Bounds on variance parameters A1 and A2
3.5.3 A note on non-separable denoisers
3.5.4 Convergence bound on Oracle VAMP
3.6 Analytic continuation and end of the proof
3.7 Applications and numerical experiments
3.7.1 Linear regression with row orthogonal matrices
3.7.2 Overparametrization and double descent
4 Asymptotic errors for convex generalized linear models
4.1 Introduction of the problem
4.2 Statistical physics result: the replica formula
4.2.1 Replica free energy
4.2.2 Sketch of proof
4.3 MLVAMP and its state evolution
4.3.1 MAP formulation of two-layer VAMP
4.3.2 Equality of ˆx and MLVAMP’s fixed point
4.3.3 State evolution of MLVAMP and its fixed point
4.4 Oracle MLVAMP as a dynamical system
4.4.1 Definition of Oracle MLVAMP
4.4.2 Compact form of Oracle MLVAMP
4.4.3 Recast of Oracle VAMP as a linear system
4.4.4 Lipschitz constants and constraint matrices
4.4.5 Bounds on variance parameters
4.5 Smoothed problem and end of the proof
4.5.1 Convergence of the smoothed problem
4.5.2 Analytic continuation
4.6 Numerical experiments
4.6.1 Matrix parameters and singular values
4.6.2 Regularization: elastic net
4.6.3 Loss functions
5 Rademacher complexity and replica free energy
5.1 Convergence bounds on the generalization gap
5.1.1 A friendly introduction to the generalization problem
5.1.2 The Vapnik-Chervonenkis theorem on uniform convergence
5.1.3 The Rademacher complexity
Rademacher bound on uniform convergence
Recovering the VC uniform convergence bound
5.2 Rademacher complexities on some hypothesis classes for i.i.d. data
5.2.1 Linear model
5.2.2 Perceptron model
5.3 The statistical physics approach
5.3.1 The Gardner capacity for classification problems
5.3.2 The Rademacher complexity and the ground state energy
5.3.3 A flavor of understanding of Rademacher bounds on generalization
5.4 Consequences and bounds for simple models
5.4.1 Ground state energies of the perceptron
5.4.2 Computing the ground state energy with the replica method
Reminder on the replica ansatz and calculation
General expressions of RS and 1RSB free energy for Gaussian i.i.d. data
Spherical perceptron
Binary perceptron
5.4.3 Teacher-student scenario versus worst case Rademacher
5.4.4 Committee machine with Gaussian weights
5.4.5 Extension to rotationally invariant matrices