Statistical inference and linear estimation problems for the physicist layman

Get Complete Project Material File(s) Now! »

Parametric versus non parametric inference

Another important distinction is between parametric and non parametric problems. In the non parametric setting [8,19], no or very few prior knowledge is assumed about the object to infer, so no prior model is assumed to performthe inference. For example in prediction, the estimate ˆ f of f is chosen among all possible functions that are « smooth » enough, with only parameter being the level of smoothness ¸. This is a very large space, denoted by S (¸). Non parametric fitting is generally performed by applying a regularizing kernel to the observations.
The obtained processing function ˆ f is thus « just » a smoother version of the observations and it thus does not require any particular a priori structure or shape for f . On the opposite, parametric inference [8,10] makes strong assumptions about the structure of the infered object. For example, in the linear regression problem, ones assumes that y Æ f (X) Æ f|X, where f is a vector of coefficients linking the vector of observations to the matrix X, where each column is an input. The inference task is thus here to estimate these coefficients. Thus f is now chosen among a way more restricted space which is here RN, instead of S (¸). One could even more constrain the functionnal space of the model, requiring for example that only a small fraction of the coefficients of f are different from zero, i.e. that f is sparse. In this example, it is clear that calling f the process or the signal is irrelevant.
A natural question that arise from this discussion is: Why would anyone constrain the functionnal space in which the processing function is chosen ? that can be rephrased as: Why would anyone prefere parametric to non parametric inference ? Of course more degrees of freedom in the choice of f or s allows a priori for a better fit of the data but a more flexible model lower its interpretability i.e. it makes difficult the task of interpreting the relations between the data and observations. In contrast, a quite constrained sparse linear model is directly interpretable: the observations depend (approximately at least) only on a small subset of the inputs components, identified by the non-zero coefficients of the estimate of f.
This can be very useful in a medical application. For example, we could gather observations [yi ]N 1
2 {1,0}N telling if yes or not the patient i has some given cancer. The lines of the input matrix X could correspond to conditions possibly correlated to this particular cancer such as health features or habits: is the person smoking, obese, doing sport regularly, a man, an O blood type, etc : xki 2 {1,0} tells if yes or not the i th patient verifies the condition k. Then, if the inference outputs a sparse vector of coefficientsˆf such that y Æˆf|X, one gets deep insights about which features participate or not to this cancer (the features corresponding to the non zero coefficients ofˆf), and with which weight (the amplitudes of the non zero coefficients ofˆf): here model interpretability is absolutely essential to identify the true causes of the cancer. But going back to the trader example that want to make accurate predictions about future stock prices, he does not not really care about the interpretability, only good predictions matter. In this case, non parametric inference can be more appropriate.
Another disadvantage of non-parametric inference is its high potential of overfitting the noise. It means that it is a difficult task to estimate the proper smoothing parameter ¸ of the observations: a too smooth model will have a very poor predictive potential whereas a too rough model will fit the noise in the data instead of the data itself which again will generate a bad model. This is probably one of the most fundamental problem in any statistical learning problem, reffered as the bias-variance tradeoff [8].

The bias-variance tradeoff: what a good estimator is ?

A fundamental question in any statistical estimation problem is: Can we quantify the error we will make using a given estimator for the quantity we wish to infer ? This is in general very difficult to answer in a practical setting, actually most of the theoretical part of this thesis will be dedicated to this specific question. In spite of that, it appears that in such problems, we can always differentiate three distincts sources of error, namely the bias, the variance and the irreducible error. The problem is of course to quantify them. Let us demonstrate what they are and how they can be interpreted, so that we can assert what are the best results we can hope to obtain. We first restrict the fully general model (3.2) to additive noise channels which is of interest in the present thesis and quite a general model in the continuous framework: Pout (˜y) Æ ˜yÅ». We assume first that we want to infer the processing function. The appropriate object to quantify this estimation error is called the prediction risk : Rp :Æ Ey ¡ jjy¡ ˆyjj22 ¢ (3.3) ÆÇ Ey¹ ¡ (y¹ ¡ ˆ y¹)2¢ È.

Another source of error: the finite size effects

There are two ways to cancel the reducible error: to have access to an infinite number of observations generated from the same system or having directly access to the data generating model. For example if one wants to infer the signal s, Pout , f , and all the parameters of the problem µ including those of the signal µs in (3.2) must be known. But as the data is finite N Ç1, even when the model is perfectly known and thus the reducible error is inexistant, there can remain another source of error related to the algorithmthat performs inference: the finite size effects that induce a finite size error ²(N), where limN!1²(N) Æ 0. In the present thesis, the inference algorithm that we will use is the approximate message-passing algorithm derived and discussed in sec. 4.3. It is based on « law of large numbers-like » arguments and is only rigorous in the limit N !1. Thus when using it on finite size systems, the algorithm becomes an approximation and this can induce finite size errors.

The tradeoff between statistical and computationnal efficiency

With the explosion of the size of the data sets generated inmodern scientific, medical, social and economical applications, a major point to consider in todays inference techniques is the tradeoff between statistical and computationnal efficiency. A statistically efficient algorithm is able to infer the desired quantity accurately while remaining robust in spite of the precense of noise. This can be quantified by error estimators such as the MSE (3.12). A computationnally efficient algorithmhas a low complexity, i.e. it requires a number of fundamental operations to performits task that scales « nicely » with the size of the input and output data. Lets precise this point by introducing the very basics of complexity theory, without the goal to be complete nor rigorous.

A quick detour in complexity theory and worst case analysis : P 6Æ NP ?

Complexity theory aims at grouping into complexity classes the set of all problems that can be solved by computers. It exists a full zoology of such complexity classes. Let us formalize a bit this idea. Let us denote a generic problemby ª. For example, considering the core of this thesis, the AWGN corrupted linear estimation problem, ª would be given by ªÆ « find s from the generic model (3.18) ». Ã denotes a given realization or instance of ª which would be in the present case a particular realization of the randomnoise vector », of the sensing matrix F and of the measured signal s in (3.18).

Table of contents :

Remerciements
Abstract (English/Français)
List of figures
List of symbols and acronyms
I Main contributions and structure of the thesis
1 Main contributions
1.1 Combinatorial optimization
1.1.1 Study of the independent set problem, or hard core model on random regular graphs by the one step replica symmetry breaking cavity method 3
1.2 Signal processing and compressed sensing
1.2.1 Generic expectation maximization approximatemessage-passing solver for compressed sensing
1.2.2 Approximate message-passing for approximate sparsity in compressed sensing
1.2.3 Influence of structured operators with approximate message-passing for the compressed sensing of real and complex signals
1.2.4 « Total variation » like reconstruction of natural images by approximate message-passing
1.2.5 Compressive fluorescence microscopy images reconstruction with approximate message-passing
1.3 Coding theory for real channels
1.3.1 Error correction of real signals corrupted by an approximately sparse Gaussian noise
1.3.2 Study of the approximatemessage-passing decoder for sparse superposition codes over the additive white Gaussian noise channel
2 Structure of the thesis and important remarks
II Fundamental concepts and tools
3 Statistical inference and linear estimation problems for the physicist layman
3.1 What is statistical inference ?
3.1.1 General formulation of a statistical inference problem
3.1.2 Inverse versus direct problems
3.1.3 Estimation versus prediction
3.1.4 Supervised versus unsupervised inference
3.1.5 Parametric versus non parametric inference
3.1.6 The bias-variance tradeoff: what a good estimator is ?
3.1.7 Another source of error: the finite size effects
3.1.8 Some important parametric supervised inference problems
3.2 Linear estimation problems and compressed sensing
3.2.1 Approximate fitting versus inference
3.2.2 Sparsity and compressed sensing
3.2.3 Why is compressed sensing so useful?
3.3 The tradeoff between statistical and computationnal efficiency
3.3.1 A quick detour in complexity theory and worst case analysis : P 6Æ NP ? .
3.3.2 Complexity of sparse linear estimation and notion of typical complexity
3.4 The convex optimization approach
3.4.1 The LASSO regression for sparse linear estimation
3.4.2 Why is the `1 norma good norm?
3.4.3 Advantages and disadvantages of convex optimization
3.5 The basics of information theory
3.5.1 Incertitude and information: the entropy
3.5.2 The mutual information
3.5.3 The Kullback-Leibler divergence
3.6 The Bayesian inference approach
3.6.1 The method applied to compressed sensing
3.6.2 Different estimators for minimizing different risks: Bayesian decision theory
3.6.3 Why is the minimummean square error estimator themore appropriate ? A physics point of view
3.6.4 Solving convex optimization problems with Bayesian inference
3.7 Error correction over the additive white Gaussian noise channel
3.7.1 The power constrained additive white Gaussian noise channel and its capacity
3.7.2 Linear coding and the decoding problem
4 Mean field theory, graphical models andmessage-passing algorithms
4.1 Bayesian inference as an optimization problem and graphical models
4.1.1 The variational method and Gibbs free energy
4.1.2 The mean field approximation
4.1.3 Justification of mean field approximations bymaximumentropy criterion
4.1.4 The maximum entropy criterion finds the most probable model
4.1.5 Factor graphs
4.1.6 The Hamiltonian of linear estimation problems and compressed sensing
4.1.7 Naive mean field estimation for compressed sensing
4.2 Belief propagation and cavities
4.2.1 The canonical belief propagation equations
4.2.2 Understanding belief propagation in terms of cavity graphs
4.2.3 Derivation of belief propagation from cavities and the assumption of Bethe measure
4.2.4 When does belief propagation work
4.2.5 Belief propagation and the Bethe free energy
4.2.6 The Bethe free energy in terms of cavity messages
4.2.7 Derivation of belief propagation fromthe Bethe free energy
4.3 The approximate message-passing algorithm
4.3.1 Why is the canonical belief propagation not an option for dense linear systems over reals?
4.3.2 Why message-passing works on dense graphs?
4.3.3 Derivation of the approximate message-passing algorithm from belief propagation
4.3.4 Alternative « simpler » derivation of the approximate message-passing algorithm
4.3.5 Understanding the approximate message-passing algorithm
4.3.6 How to quickly cook an approximate message-passing algorithm for your linear estimation problem
4.3.7 The Bethe free energy for large dense graphs with i.i.d additive white Gaussian noise
4.3.8 Learning of themodel parameters by expectation maximization
4.3.9 Dealing with non zero mean measurement matrices
5 Phase transitions, asymptotic analyzes and spatial coupling
5.1 Disorder and typical complexity phase transitions in linear estimation
5.1.1 Typical phase diagram in sparse linear estimation and the nature of the phase transitions
5.2 The replica method for linear estimation over the additive white Gaussian noise channel
5.2.1 Replica trick and replicated partition function
5.2.2 Saddle point estimation
5.2.3 The prior matching condition
5.3 The cavity method for linear estimation: state evolution analysis
5.3.1 Derivation starting from the approximatemessage-passing algorithm .
5.3.2 Alternative derivation starting from the cavity quantities
5.3.3 The prior matching condition and Bayesian optimality
5.4 The link between replica and state evolution analyzes: derivation of the state evolution from the average Bethe free entropy
5.4.1 Different forms of the Bethe free entropy for different fixed points algorithms.
5.5 Spatial coupling for optimal inference in linear estimation
5.5.1 The spatially-coupled operator and the reconstruction wave propagation
5.6 The spatially-coupled approximate message-passing algorithm
5.6.1 Further simplifications for random matrices with zeromean and equivalence withMontanari’s notations
5.7 State evolution analysis in the spatially-coupled measurement operator case .
III Signal processing
6 Compressed sensing of approximately sparse signals
6.1 The bi-Gaussian prior for approximate sparsity
6.1.1 Learning of the priormodel parameters
6.2 Reconstruction of approximately sparse signals with the approximate messagepassing algorithm
6.2.1 State evolution of the algorithm with homogeneous measurementmatrices
6.2.2 Study of the optimal reconstruction limit by the replica method
6.3 Phase diagrams for compressed sensing of approximately sparse signals
6.4 Reconstruction of approximately sparse signals with optimality achieving matrices
6.4.1 Restoring optimality thanks to spatial coupling
6.4.2 Finite size effects influence on spatial coupling performances
6.5 Some results on real images
6.6 Concluding remarks
7 Approximate message-passing with spatially-coupled structured operators
7.1 Problem setting
7.1.1 Spatially-coupled structured measurement operators
7.1.2 The approximate message-passing algorithm for complex signals
7.1.3 Randomization of the structured operators
7.2 Results for noiseless compressed sensing
7.2.1 Homogeneous structured operators
7.2.2 Spatially-coupled structured operators
7.3 Conclusion
8 Approximate message-passing for compressive imaging
8.1 Reconstruction of natural images in the compressive regime by « total-variationminimization »- like approximate message-passing
8.1.1 Proposed model
8.1.2 The Hadamard operator and the denoisers
8.1.3 The learning equations
8.1.4 Numerical experiments
8.1.5 Concluding remarks
8.2 Image reconstruction in compressive fluorescence microscopy
8.2.1 Introduction
8.2.2 Experimental setup and algorithmic setting
8.2.3 A proposal of denoisers for the reconstruction of point-like objects measured by compressive fluorescence microscopy
8.2.4 Optimal Bayesian decision for the beads locations
8.2.5 The learning equations
8.2.6 Improvement using small first neighbor mean field interactions between pixels
8.2.7 Reconstruction results on experimental data
8.2.8 Concluding remarks and open questions
IV Coding theory
9 Approximatemessage-passing decoder and capacity-achieving sparse superposition codes
9.1 Introduction
9.1.1 Related works
9.1.2 Main contributions of the present study
9.2 Sparse superposition codes
9.3 Approximate message-passing decoder for superposition codes
9.3.1 The fast Hadamard-based coding operator
9.4 State evolution analysis for random i.i.d homogeneous operators with constant power allocation
9.5 State evolution analysis for spatially-coupled i.i.d operators or with power allocation
9.5.1 State evolution for power allocated signals
9.6 Replica analysis and phase diagram
9.6.1 Large section limit of the superposition codes by analogywith the random energy model
9.6.2 Alternative derivation of the large section limit via the replica method . 221
9.6.3 Results from the replica analysis
9.7 Optimality of the approximate message-passing decoder with a proper powerallocation
9.8 Numerical experiments for finite size signals
9.9 Concluding remarks
xv
Contents
10 Robust error correction for real-valued signals via message-passing decoding and
spatial coupling 239
10.1 Introduction
10.2 Compressed sensing based error correction
10.2.1 Performance of the approximate message-passing decoder
10.3 Numerical tests and finite size study
10.4 Discussion
Bibliography 259