
## MARKOV RANDOM FIELDS

Graphical models are a powerful paradigm for multivariate statistical modeling. They make it possible to encode information about the conditional dependencies of a large number of interacting variables in a compact way, and provide a unified view of inference and learning problems in areas as diverse as statistical physics, computer vision, coding theory or machine learning (see e.g. [86, 151] for reviews of applications). In this first chapter, we motivate and introduce in section 1.1 the formalism of undirected graphical models, often called Markov random fields (MRFs). We then focus in section 1.3 on the particular case of pairwise MRFs, which will be particularly important in the following. The analyses of this dissertation will apply to models drawn from certain random graph ensembles, and section 1.4 is devoted to a review of some of their basic properties. Computing the marginals of pairwise MRFs is computationally hard, and we will systematically resort to mean-field approximations, reviewed in section 1.5. We next turn to a presentation of the Ising model and its variants, and give in section 1.6 a high-level description of their phase diagram, which will guide our intuition in the following. We finally outline in section 1.7 the general strategy we will apply to tackle the various problems in this dissertation.

We note that while MRFs are particularly well-suited to represent non-causal probabilistic interactions between variables, they are not the only example of graphical models. In particular, we will not address Bayesian networks, which are defined on directed acyclic graphs. The reader is referred to [86] for a comprehensive introduction to graphical models.

**MOTIVATION**

One of the first goals of a graphical model is to give a high-level, compact representation of a joint probability distribution (JPD). Far from being of purely academic concern, this problem has a considerable practical importance when dealing with the distribution of even a moderate number of correlated random variables. Imagine for instance having to represent the JPD of 300 binary variables. In general, such a JPD is specified by 2^300 numbers — the probabilities of the different assignments of all the variables. This already requires


Figure 1.1 – Graphical model associated with the Ising chain (1.1). After conditioning on the value of the spin σ_{i0}, the JPD splits into two independent parts, with the addition of local fields h_{i0−1} and h_{i0+1} acting on the neighbors of σ_{i0}.

more bits than the total number of atoms in the observable universe (about 10^80 ≈ 2^266). In practical applications, it is common to face millions or billions of variables, with hundreds of possible values for each of them. For instance, in computer vision, the probabilistic analysis of a single megapixel image in grayscale requires dealing with the JPD of more than a million variables (the pixels), with 256 values for each one of them (their intensities).

Fortunately, it is often the case that there is some kind of structure in the distribution of the variables, which can be stated in terms of (conditional) independence properties. As a simple example, consider the case of a one-dimensional Ising model on a chain of length n ∈ N, defined by the JPD

P(σ) = (1/Z) exp( ∑_{i=1}^{n−1} J_{i,i+1} σ_i σ_{i+1} ) , (1.1)

where σ = (σ_i)_{i∈[n]} ∈ {±1}^n is a set of binary spins, and the J_{i,i+1} ∈ R for i ∈ [n−1] are called couplings. Z is a normalization called the partition function. Let us consider the effect of conditioning on a given spin σ_{i0}. We define two fields h_{i0−1} = J_{i0−1,i0} σ_{i0} and h_{i0+1} = J_{i0,i0+1} σ_{i0}, which summarize the influence of σ_{i0} on its neighbors. We can then write

P(σ | σ_{i0}) = P(σ_1, . . . , σ_{i0−1} | σ_{i0}) P(σ_{i0+1}, . . . , σ_n | σ_{i0}) , (1.2)

where

P(σ_1, . . . , σ_{i0−1} | σ_{i0}) ∝ exp( ∑_{i=1}^{i0−2} J_{i,i+1} σ_i σ_{i+1} + h_{i0−1} σ_{i0−1} ) , (1.3)

and symmetrically

P(σ_{i0+1}, . . . , σ_n | σ_{i0}) ∝ exp( ∑_{i=i0+1}^{n−1} J_{i,i+1} σ_i σ_{i+1} + h_{i0+1} σ_{i0+1} ) . (1.4)

This property, depicted on figure 1.1, means that the spins to the left of σ_{i0} in the chain are independent of the spins to the right of σ_{i0}, when conditioned on σ_{i0}. It is a special case of a more general global Markov property, which we state in the next section.
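The conditional independence property (1.2) can be checked numerically by brute-force enumeration. The following sketch uses a hypothetical small instance (n = 5 spins with arbitrary couplings); the names `weight` and `marginal` are illustrative helpers, not notation from the text:

```python
import itertools
import math

# Hypothetical small instance: n = 5 spins, arbitrary couplings J_{i,i+1}.
n = 5
J = [0.8, -0.5, 1.2, 0.3]  # J[i] couples sigma_i and sigma_{i+1} (0-indexed)

def weight(sigma):
    """Unnormalized Boltzmann weight exp(sum_i J_i sigma_i sigma_{i+1})."""
    return math.exp(sum(J[i] * sigma[i] * sigma[i + 1] for i in range(n - 1)))

configs = list(itertools.product([-1, 1], repeat=n))
Z = sum(weight(s) for s in configs)          # partition function
P = {s: weight(s) / Z for s in configs}

# Condition on the middle spin sigma_{i0} (index 2) taking the value +1.
i0, v = 2, 1
Pv = sum(p for s, p in P.items() if s[i0] == v)

def marginal(fixed):
    """Probability of the partial assignment `fixed` (dict index -> value),
    conditioned on sigma_{i0} = v."""
    return sum(p for s, p in P.items()
               if s[i0] == v and all(s[i] == x for i, x in fixed.items())) / Pv

# Left block (sigma_0, sigma_1) and right block (sigma_3, sigma_4) are
# independent given sigma_{i0}: P(left, right | .) = P(left | .) P(right | .).
left, right = {0: 1, 1: -1}, {3: -1, 4: 1}
joint = marginal({**left, **right})
assert abs(joint - marginal(left) * marginal(right)) < 1e-12
```

The assertion holds for any choice of couplings, since conditioning on σ_{i0} cuts the chain into two non-interacting halves.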

Markov random fields exploit such conditional independence properties of a JPD to provide a compact, graphical representation of the distribution. In the following section, we introduce MRFs as a generalization of the Ising chain example, and state their independence properties.

### GENERAL MARKOV RANDOM FIELDS

Throughout the dissertation, we let σ = (σ_i)_{i∈[n]} ∈ X^n denote n random variables, jointly distributed according to the JPD P. We will assume that the alphabet X is finite, although many of the results of the next section can be naturally extended to real random variables. Our first aim is to find a compact representation of P that highlights the conditional dependencies between the n variables.


**Definition and Markov property**

In the Ising chain example of the previous section, global correlations between spins arise as a consequence of local interactions between neighboring spins on the chain. Such a structural property can be generalized, and more complex interactions can be modeled by replacing the chain with a general graph G = (V, E), where V = [n] is the set of vertices, representing the random variables σi for i ∈ [n], and the set of edges E represents direct interactions between the random variables. We allow for interactions involving more than two variables by defining compatibility functions on the cliques of the graph.

A clique of G is a fully connected subgraph of G, i.e. a subset C ⊂ V of the vertices such that for any i, j ∈ C, (ij) ∈ E. We associate with each clique C a compatibility function, or potential, ψ_C : X^|C| → R_+, where |C| is the number of vertices in the clique C. Generalizing example (1.1), we define a Markov random field (MRF), also called undirected graphical model, as a distribution of the form

P(σ) = (1/Z) ∏_{C∈𝒞} ψ_C((σ_i)_{i∈C}) , (1.5)

where 𝒞 is the set of cliques of the graph G, and the partition function Z ensures that the distribution is normalized. Note that we recover the Ising chain example of the previous section by taking G to be the chain graph with edge set E = {(i, i+1)}_{i∈[n−1]}, and noticing that the cliques of G consist of pairs of neighboring spins. The corresponding compatibility functions are, for i ∈ [n−1],

ψ_{i,i+1}(σ_i, σ_{i+1}) = exp( J_{i,i+1} σ_i σ_{i+1} ) . (1.6)
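The factorization (1.5) can be computed directly by brute force on a small instance. The following sketch uses a hypothetical three-variable model with one pairwise clique, one single-vertex clique, and one triangle clique; the potentials are arbitrary illustrative choices:

```python
import itertools
import math

# Hypothetical example: three binary variables, cliques (0,1), (2,), (0,1,2),
# each carrying an arbitrary positive compatibility function psi_C.
potentials = {
    (0, 1): lambda s0, s1: math.exp(0.7 * s0 * s1),
    (2,): lambda s2: math.exp(0.2 * s2),
    (0, 1, 2): lambda s0, s1, s2: 1.5 if s0 == s1 == s2 else 1.0,
}

def unnormalized(sigma):
    """Product of clique potentials, as in the factorization (1.5)."""
    w = 1.0
    for clique, psi in potentials.items():
        w *= psi(*(sigma[i] for i in clique))
    return w

configs = list(itertools.product([-1, 1], repeat=3))
Z = sum(unnormalized(s) for s in configs)   # partition function
P = {s: unnormalized(s) / Z for s in configs}
assert abs(sum(P.values()) - 1.0) < 1e-12   # Z normalizes the distribution
```

The exponential cost of this enumeration (|X|^n terms in Z) is precisely what motivates the approximate inference methods of section 1.5.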

Figure 1.2 – Left: graphical model associated with the factorization P(σ) ∝ ψ_123(σ_1, σ_2, σ_3) ψ_34(σ_3, σ_4) ψ_4567(σ_4, σ_5, σ_6, σ_7). Right: its factor graph representation.

The definition (1.5) is justified by the following global Markov property, which generalizes the conditional independence property (1.2) of the Ising chain. Let A, B, S ⊂ V be three disjoint sets of nodes of the graph G. Assume that S separates A and B, in the sense that any path in the graph G linking a node in A and a node in B must include a node in S. Then any MRF, i.e. any distribution of the form (1.5), verifies

P((σ_i)_{i∈A∪B} | (σ_i)_{i∈S}) = P((σ_i)_{i∈A} | (σ_i)_{i∈S}) P((σ_i)_{i∈B} | (σ_i)_{i∈S}) . (1.7)

The factorization (1.5) can therefore efficiently encode complex interactions and independence properties between a large number of variables. In fact, under a small additional assumption, it is possible to show that a JPD that verifies the global Markov property with respect to a graph G must factorize in the form (1.5). This is the content of the Hammersley-Clifford theorem which we recall here.

Theorem 1.2.1 (Hammersley and Clifford, 1971). Let P be a distribution over the set X^n verifying the global Markov property (1.7) with respect to a graph G. Assume that P(σ) > 0 for any σ ∈ X^n. Then there exist compatibility functions ψ_C indexed by the cliques of the graph G such that P factorizes into the form (1.5).

A consequence of this theorem is that, as an alternative to the defini-tion (1.5), (positive) MRFs can equivalently be defined as distributions verifying the global Markov property, a definition often chosen in the literature. In fact, the Hammersley-Clifford theorem also implies the equivalence between the global Markov property and so-called local Markov properties ([86]), any of which can be used as a definition for MRFs.

**Factor graph representation**

One of the main advantages of MRFs is that they can be conveniently represented graphically (see left panel of figure 1.2 for an example). However, to recover the factorization associated with a given graphical representation, one has to look for the cliques of the graph, which can be cumbersome, especially when many variables are involved. An alternative visual representation of the same factorization property is given by factor graphs, which explicitly depict the interactions between variables. A factor graph is formally defined as a bipartite graph G = (V, F, E) where V is the set of variable nodes, F is the set of factor nodes, representing the potentials, and E is the set of edges. To represent a MRF of the form (1.5) as a factor graph, we associate with each variable σ_i (i ∈ [n]) a variable node (represented as a circle on figure 1.2), and with each potential ψ_C a factor node (represented as a square on figure 1.2). We then link each variable to the factors that depend on it.
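As a minimal sketch, the factor graph of the factorization shown in figure 1.2 can be stored as a bipartite adjacency map from factor nodes to the variables in their scope; the factor names and the `neighbors` helper below are illustrative, not standard notation:

```python
# Factor graph of P ∝ psi_123 psi_34 psi_4567: each factor node is mapped
# to the tuple of variable nodes it is connected to.
factors = {
    "psi_123": (1, 2, 3),
    "psi_34": (3, 4),
    "psi_4567": (4, 5, 6, 7),
}

def neighbors(variable):
    """Factor nodes adjacent to a variable node in the bipartite graph."""
    return [f for f, scope in factors.items() if variable in scope]

# Variable 4 sits in two factors, variable 1 in a single one.
assert neighbors(4) == ["psi_34", "psi_4567"]
assert neighbors(1) == ["psi_123"]
```

Message-passing algorithms such as belief propagation (section 1.5) operate precisely on this adjacency structure, exchanging messages along the variable-factor edges.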

**First examples**

Factor graphs give an explicit and convenient representation of the conditional independencies of a MRF, and are extensively used in many fields. Let us illustrate their use on two examples, represented graphically on figure 1.3. The first example is drawn from statistical physics, and is called the p-spin model. It is a generalization of the Ising model (1.1) with p-body interactions (p > 2) between the spins, instead of only pairwise interactions. The probability of a spin configuration σ ∈ {±1}^n in the 3-spin model on a chain is given by

P(σ) = (1/Z) exp( ∑_{i=2}^{n−1} J_i σ_{i−1} σ_i σ_{i+1} ) . (1.8)

The factor graph associated with this distribution is depicted on fig-ure 1.3.

A second example, from the field of theoretical computer science, is that of satisfiability. A satisfiability problem (or SAT problem for short) is specified by n boolean variables (σ_i)_{i∈[n]} ∈ {true, false}^n, and a formula, defined as the logical AND of a certain number of clauses that the variables must satisfy. An example of a SAT formula over n = 6 variables is

(σ_1 ∨ ∨ σ_4) ∧ (σ_3 ∨ ) ∧ (σ_4 ∨ σ_5 ∨ ) , (1.9)

where σ̄_i denotes the negation of the variable σ_i. The problem of finding an assignment σ of the variables that satisfies a given SAT formula is equivalent to finding the configurations with non-vanishing probability in an associated MRF. For instance, the MRF associated with the SAT formula (1.9) is

P(σ) = (1/Z) 1(σ_1 ∨ ∨ σ_4) 1(σ_3 ∨ ) 1(σ_4 ∨ σ_5 ∨ ) . (1.10)

The associated factor graph is shown on figure 1.3.
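The equivalence between satisfying assignments and non-vanishing-probability configurations can be checked by brute force. The sketch below uses a hypothetical complete 3-clause formula of the same shape as (1.9), namely (σ_1 ∨ σ̄_2 ∨ σ_4) ∧ (σ_3 ∨ σ̄_5) ∧ (σ_4 ∨ σ_5 ∨ σ̄_6); the clause encoding and helper names are illustrative:

```python
import itertools

# Each clause is a tuple of literals (variable index, required truth value):
# a clause is satisfied when at least one of its literals matches.
clauses = [((1, True), (2, False), (4, True)),
           ((3, True), (5, False)),
           ((4, True), (5, True), (6, False))]

def indicator(clause, x):
    """1 if the clause is satisfied by assignment x (dict var -> bool), else 0."""
    return int(any(x[v] == sign for v, sign in clause))

def weight(x):
    """Unnormalized MRF weight: product of clause indicators, as in (1.10)."""
    w = 1
    for c in clauses:
        w *= indicator(c, x)
    return w

assignments = [dict(zip(range(1, 7), bits))
               for bits in itertools.product([False, True], repeat=6)]
satisfying = [x for x in assignments if weight(x) == 1]

# Exactly the satisfying assignments carry non-vanishing probability.
assert all(all(indicator(c, x) for c in clauses) for x in satisfying)
assert len(satisfying) > 0  # this particular formula is satisfiable
```

Counting satisfying assignments then amounts to computing the partition function of this MRF, which is why SAT counting is a canonical inference problem.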

These two examples barely scratch the surface of the extensive use of factor graphs in many problems. Whenever the problem at hand requires performing inference, e.g. computing the marginals or the modes of a high dimensional distribution, the first step in finding a solution is usually to draw a factor graph representation of the problem, and to use it to derive approximate inference algorithms, e.g. belief propagation (see section 1.5). This approach is standard in problems as diverse as error-correcting codes [129], compressed sensing [106], computer vision [119], and many more [86, 151].

In the following, we will be mainly interested in the special case of pairwise MRFs, both in the interest of simplicity and because the original contributions of this dissertation are naturally expressed in terms of pairwise models. In the next section, we motivate the use of pairwise models, and show that they can in fact be used to represent any MRF.

#### PAIRWISE MARKOV RANDOM FIELDS

**Definition and examples**

An interesting particular case of the general MRF (1.5) is obtained by setting trivial compatibility functions ψ_C((σ_i)_{i∈C}) = 1 for all cliques of size |C| > 2. On a given graph G = ([n], E), the resulting model, called a pairwise MRF, reads

P(σ) = (1/Z) ∏_{i∈[n]} φ_i(σ_i) ∏_{(ij)∈E} ψ_{ij}(σ_i, σ_j) , (1.11)

where we have separated the single-variable potentials (φ_i)_{i∈[n]}, associated with cliques of size |C| = 1, from the pairwise potentials (ψ_{ij})_{(ij)∈E}, associated with cliques of size |C| = 2. Note that for convenience, we are slightly abusing our notation by indexing the pairwise potentials, defined on each edge e ∈ E, by indices i, j ∈ [n] such that e = (ij). Since the edge (ij) is the same as the edge (ji), for equation (1.11) to be consistent regardless of the labeling of the edges, the pairwise potentials must verify the following symmetry

ψ_{ij}(σ_i, σ_j) = ψ_{ji}(σ_j, σ_i) ∀(ij) ∈ E, σ_i, σ_j ∈ X . (1.12)

When the distribution (1.11) is strictly positive for any assignment σ, it admits the following exponential representation

P(σ) = (1/Z) exp( − ∑_{(ij)∈E} ε_{ij}(σ_i, σ_j) − ∑_{i∈[n]} ε_i(σ_i) ) , (1.13)

where the energies ε_i, ε_{ij} for i ∈ [n] and (ij) ∈ E are defined as

ε_i(σ_i) = − log φ_i(σ_i) , ε_{ij}(σ_i, σ_j) = − log ψ_{ij}(σ_i, σ_j) . (1.14)
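The equivalence between the potential form (1.11) and the Boltzmann form (1.13), via the definitions (1.14), can be verified term by term on a small instance. The sketch below uses a hypothetical 3-node path graph with arbitrary fields and couplings:

```python
import itertools
import math

# Hypothetical pairwise MRF on a 3-node path graph with edges (0,1), (1,2):
# single-variable potentials phi_i and pairwise potentials psi_ij.
phi = {i: (lambda s, h=h: math.exp(h * s)) for i, h in enumerate([0.3, -0.1, 0.5])}
psi = {(0, 1): lambda a, b: math.exp(0.9 * a * b),
       (1, 2): lambda a, b: math.exp(-0.4 * a * b)}

def energy(sigma):
    """Total energy: eps_i = -log phi_i and eps_ij = -log psi_ij, as in (1.14)."""
    e = sum(-math.log(phi[i](sigma[i])) for i in range(3))
    e += sum(-math.log(f(sigma[i], sigma[j])) for (i, j), f in psi.items())
    return e

def product_form(sigma):
    """Unnormalized weight as the product of potentials, as in (1.11)."""
    w = 1.0
    for i in range(3):
        w *= phi[i](sigma[i])
    for (i, j), f in psi.items():
        w *= f(sigma[i], sigma[j])
    return w

# The Boltzmann weight exp(-E) of (1.13) matches the product of potentials.
for s in itertools.product([-1, 1], repeat=3):
    assert abs(math.exp(-energy(s)) - product_form(s)) < 1e-9
```

Both forms share the same partition function Z, so the two parametrizations define the same distribution.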

The choice of the word “energy” highlights the connection with statistical physics. Indeed, expression (1.13) is nothing but the Boltzmann distribution of the so-called Potts model of statistical physics [159].

We will represent pairwise models of the form (1.11) by simply drawing the graph G = ([n], E), in which each node (pictured as a white circle) represents a variable, and each edge (ij) ∈ E carries a pairwise potential ψij. The single-variable potentials φi for i ∈ [n] are pictured as black circles connected to a single variable node. We stress that while the graph G may contain cliques of size larger than 2, they are not associated with potentials in a pairwise MRF model. For instance, the triangle graph in the margin represents the factorization P(σ) ∝ ψ12(σ1, σ2)ψ13(σ1, σ3)ψ23(σ2, σ3)φ3(σ3) , (1.15) which does not include a term of the form ψ123(σ1, σ2, σ3).

The Potts model (1.13) is popular in computer vision [21, 22, 54], where it is used to denoise a partially observed image by encouraging nearby pixels to have similar colors. In this case, the graph G is taken to be a regular lattice where each node, representing a pixel, is connected with its four adjacent pixels in the image. The single-variable energies ε_i(σ_i) are chosen to be minimized when σ_i = σ̂_i, where σ̂_i is the observed value of the pixel. One simple example is

ε_i(σ_i) = −h 1(σ_i = σ̂_i) , (1.16)

where h > 0 can be interpreted physically as a local field, quantifying the confidence that we have in the observed value of the pixel. The pairwise energies ε_{ij}(σ_i, σ_j) act as smoothing terms, and are chosen to be minimized when σ_i = σ_j, e.g.

ε_{ij}(σ_i, σ_j) = −J 1(σ_i = σ_j) , (1.17)

where J > 0 is a coupling controlling the strength of the smoothing.

The resulting graphical model is depicted on figure 1.4.
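A minimal sketch of the denoising energies (1.16) and (1.17) on a tiny grayscale grid follows; the 2×2 grid, the observed values, and the choices h = J = 1 are illustrative assumptions:

```python
# Hypothetical 2x2 observed image; strong smoothing (J) relative to the
# data term (h) can favor overriding an outlier pixel.
h, J = 1.0, 1.0
observed = [[3, 3], [3, 7]]   # observed pixel values sigma_hat

def total_energy(config):
    """Sum of data terms (1.16) and smoothing terms (1.17) over the
    4-neighbor lattice of the grid."""
    e = 0.0
    rows, cols = len(config), len(config[0])
    for r in range(rows):
        for c in range(cols):
            e += -h * (config[r][c] == observed[r][c])   # data term (1.16)
            if r + 1 < rows:                             # vertical edge (1.17)
                e += -J * (config[r][c] == config[r + 1][c])
            if c + 1 < cols:                             # horizontal edge (1.17)
                e += -J * (config[r][c] == config[r][c + 1])
    return e

# With this J, the fully smooth image has lower energy (higher probability)
# than the noisy observation itself: the outlier pixel gets corrected.
assert total_energy([[3, 3], [3, 3]]) < total_energy(observed)
```

Denoising then amounts to finding (or sampling near) the minimum-energy configuration, i.e. the mode of the Potts distribution.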

Pairwise models of the form (1.11) provide a principled approach to the probabilistic analysis of data in which we expect correlations to arise as a consequence of two-body interactions between random variables. An interesting application in biology is the so-called direct coupling analysis [109] of co-evolution in protein sequences.

**Table of contents :**

**i inference on graphs **

**1 markov random fields **

1.1 Motivation

1.2 General Markov random fields

1.2.1 Definition and Markov property

1.2.2 Factor graph representation

1.2.3 First examples

1.3 Pairwise Markov random fields

1.3.1 Definition and examples

1.3.2 Maximum entropy principle

1.3.3 From general to pairwise Markov random fields

1.4 Random graph ensembles

1.4.1 The Erdős–Rényi model

1.4.2 The stochastic block model

1.4.3 The labeled stochastic block model

1.5 Mean-field approximations

1.5.1 Intractability of exact inference

1.5.2 Variational mean-field

1.5.3 The Bethe approximation and belief propagation

1.6 The Ising model: some physics

1.6.1 A different approach to mean-field approximations

1.6.2 The ferromagnetic Ising model

1.6.3 The Curie-Weiss model

1.6.4 The Ising spin glass

1.6.5 The planted spin glass

1.6.6 The Hopfield model

1.7 General approach of this work

**ii spectral inference **

**2 factorized and symmetric models **

2.1 Phase transition in belief propagation

2.1.1 Factorized models

2.1.2 Trivial fixed point of belief propagation

2.1.3 Phase transition

2.2 Symmetric models

2.2.1 Tensor decomposition and the non-backtracking matrix

2.2.2 Reduction to an Ising model

2.2.3 An example

2.2.4 A linearized belief propagation algorithm

2.3 The free energy point of view

2.3.1 The Ihara-Bass formula

2.3.2 The Bethe Hessian

2.3.3 Free energy Hessians

2.4 Conclusion

**3 spectral properties of the non-backtracking operator and the bethe hessian on random graphs **

3.1 The non-backtracking operator

3.1.1 Non-backtracking operator of the symmetric labeled stochastic block model

3.1.2 Cavity approach to the stability of the paramagnetic fixed point

3.1.3 Non-backtracking spectrum of the labeled stochastic block model

3.1.4 Cavity approach to the spectral density of the non-backtracking operator

3.2 The Bethe Hessian

3.2.1 Cavity approach to the spectral density of the Bethe Hessian on the labeled stochastic block

model

3.2.2 Ihara-Bass type formulas

3.3 Conclusion

**4 case study: the planted spin glass, or censored block model **

4.1 The non-backtracking operator

4.2 The Bethe Hessian

4.3 Algorithms

4.4 Spectral properties of the non-backtracking operator

4.5 Conclusion

**iii some applications **

**5 community detection and the ising ferromagnet**

5.1 The non-backtracking operator

5.2 The Bethe Hessian

5.3 Algorithm and numerical simulations

5.4 Conclusion

**6 randomized clustering from sparse measurements **

6.1 Motivation and definition of the problem

6.2 Algorithms

6.2.1 Belief propagation

6.2.2 The non-backtracking operator

6.2.3 The Bethe Hessian

6.3 Numerical results

6.4 Spectral properties of the non-backtracking operator

6.5 Conclusion

**7 randomized semi-supervised clustering **

7.1 Algorithm and guarantee

7.1.1 Algorithm for 2 clusters

7.1.2 Model and guarantee

7.1.3 More than 2 clusters

7.2 Numerical simulations

7.3 Proof of theorem 7.1.1

7.4 Proof of theorem 7.1.2

7.5 Conclusion

**8 matrix completion and the hopfield model **

8.1 Problem definition and relation to other work

8.2 Algorithm and motivation

8.2.1 The MaCBetH algorithm

8.2.2 Motivation from a Hopfield model

8.3 Analysis of performance in detection

8.3.1 Analysis of the phase transition

8.3.2 Computation of the spectral density

8.4 Numerical tests

8.5 Conclusion

**9 spectral bounds on the ising ferromagnet **

9.1 Introduction

9.1.1 Significance and prior work

9.1.2 Assumptions and definitions

9.2 Main results

9.2.1 Bound on the partition function

9.2.2 Bound on the susceptibility

9.2.3 Relation to belief propagation and susceptibility propagation

9.3 Proofs

9.3.1 High temperature expansion

9.3.2 Proof of theorem 9.2.1

9.3.3 Proof of theorem 9.2.2

9.4 Conclusion

**10 outlook: learning and the natural gradient **

10.1 Learning in exponential models

10.2 Natural gradient

10.3 Mean-field approximation to the natural gradient

10.4 Conclusion

**concluding remarks **

**appendix **

**bibliography**