Consistency of the maximum likelihood and variational estimators in a dynamic stochastic block model

Get Complete Project Material File(s) Now! »

Some statistical uses of random graphs

As mentioned before, graphs can be used in many contexts, to tackle various problems and using many different methods. Their use goes back as far as the late 1930’s with Moreno and Jennings (1938) in the context of social relationships (sociometry). We give a few (not exhaustive) examples. Descriptive analysis One can simply be interested in some structural properties or characteristics of an observed graph in order to answer questions about it. For example, to try to answer the question « Do friends (i.e. neighbours in the graph) of a given individual tend to be friends of one another? », we can compute transitivity measures. Such measures can be the transitivity coefficient, counting the proportions of triangles (i.e. sets of three nodes connected by three edges) among connected triplets (i.e. sets of three nodes connected by at least two edges). A high transitivity coefficient means that the connected subgraphs of order three tend to form triangles, i.e. « the friend of your friend tends to be your friend ». More generally we can describe the tendency of the nodes in a network to form cliques or groups of highly connected nodes based on clustering measures. A clustering coefficient can be defined in many ways, for example based on a local density measure, defined for any node i as the proportion of present edges among possible edges between the neighbours of node i. Taking the mean of these local density measures for every node gives a clustering coefficient. The transitivity coefficient introduced above can also be used as a clustering coefficient.
Another question can be « How important in the network is a given node? », important meaning that it has a lot of connections, and this characteristic can be described by centrality measures. Many centrality measures have been defined, and can be based for example on the degrees of the nodes or on the distance between a node and all the other nodes.
Some connectivity measures can also be defined, such as the number of connected components or their relative order with respect to the whole graph. The average shortest path length, answering the question « What is the typical distance between two nodes? », can in particular determine if the observed network exhibit the small world property. This property, based on the suggestion of Milgram (1967) that we are separated from any other person on the planet by at most roughly six other people, refers to the fact that in some networks (even large ones), the distance between any two nodes is relatively small. It has been shown to hold for many types of networks by Watts and Strogatz (1998).
Comparison to a null model Using generative models of graph, we can assess network topology. We can compare the observed network with graphs generated from random models, for some chosen features. The models used can be for example the simple Erdős-Rényi graph model (see Section 1.3) which assumes that every edge has the same probability of presence, or a model of random graphs with preserved degree sequence (see the fixed degree sequence configuration model in Section 1.3). The compared characteristics can be for example the reciprocity (as in Moreno and Jennings (1938) who compared the fraction of reciprocated links in their network with a random model), a clustering coefficient or the importance of occurrence of certain motifs in the graphs (for example triangles, cliques or cycles of a given size). It is possible to compute empirical p-values to perform hypothesis testing for the characteristic of interest, with H0 the null hypothesis corresponding to the null model. See for example Zweig et al. (2016) for more details and references for the use of null models for random graphs.

Barabási–Albert model (Preferential Attachment)

As mentioned in the previous section, the Barabási–Albert model, based on a preferential attachment mechanism, is a generative network growth model, the graph growing at each step of the algorithm. It was introduced by Barabási and Albert (Barabási and Albert, 1999) and motivated by the growth of the World Wide Web.
The generation of a graph is as follows. We start with an initial graph G0 = (V0, E0). Then, at each step, a new node of degree m ≥ 1 is added to the network. It is connected to m existing nodes with a probability that is proportional to the degrees of the existing nodes.

Exponential Random Graph Models (ERGM)

ERGMs, also called p∗ models, (see Strauss and Ikeda (1990); Wasserman and Pattison (1996); Hunter and Handcock (2006); Snijders et al. (2006), and also Frank and Strauss (1986) who introduced Markov random graphs, a particular sub-class of exponential random graph models) are based on exponential families of distributions. For a survey of this model, one can refer to Robins et al. (2007a); Wasserman and Robins (2005); Goldenberg et al. (2010), or to van der Pol (2019) for a more recent reference. This model assumes that the probability of a graph (i.e. of its adjacency matrix X) can be explained by a statistic S(X) as follows Pθ(X) = c(1θ) exp θ⊤S(X) .
where c(θ) is a normalising constant, that we generally cannot compute in practice. ERGMs are flexible models in the sense that they can be based on different types of statistics, according to the type of pattern or behaviour we are interested in. For example S can be or can include density-related statistics (for example the number of edges), degree-based statistics, number of triangles or cycles, etc. These models are in particular widely used in social networks (Robins et al., 2007a,b; Lusher et al., 2013). Indeed, an important feature of ERGMs is that the edges are dependent, which is appropriate to model social networks, in which it is not reasonable to assume independence of the relations. Note that such a model can be defined for both directed and undirected graphs. Note that an advantage of the ERGM is that it allows the inclusion of covariates (on the nodes), also called actor attributes (Robins et al., 2001), influencing the nodes connection behaviour. It is also possible to add covariates on pairs of nodes. An extended version of the ERGM can also be defined for weighted graphs (Robins et al., 1999; Desmarais and Cranmer, 2012; Krivitsky, 2012), and for multigraphs (Pattison and Wasserman, 1999).
When interested in the estimation of the parameter for the ERGM, it can be done by approximating the maximum likelihood, but there is a number of issues with this method, namely degeneracy (or near degeneracy) issues (Handcock, 2003; Handcock et al., 2003; Hunter et al., 2008; Rinaldo et al., 2009; Schweinberger, 2011; Chatterjee et al., 2013). Handcock et al. (2003) defines this issue as occurring when only a few graphs do not have a very low probability, these graphs often being the full graph and the empty graph. Then, such models are not interesting for modeling real networks, and in addition, degeneracy is often associated with poor properties of estimation methods based on the likelihood, such as Markov chain Monte Carlo (MCMC) procedures.
An approximation of the maximum likelihood is required, because it is not tractable in this model. Some methods have been introduced, such as the use of the pseudolikelihood (Besag, 1975; Strauss and Ikeda, 1990) where the joint distribution is replaced by the product of the conditional distributions, but this method has been shown to not behave well, depending on the network (Wasserman and Robins, 2005; Robins et al., 2007b; Van Duijn et al., 2009).
Another type of methods for this purpose, perhaps more used, is the use of MCMC techniques (Snijders, 2002; Hunter and Handcock, 2006; Handcock, 2003) in order to approximate the maximum likelihood estimator, for example by estimating likelihood ratios, using a Metropolis-Hastings algorithm to generate a sample of networks simulated from the ERGM. As mentioned earlier, such methods can behave badly, converging to degenerate graphs7 or failing to converge. New faster sampling techniques have been introduced (Stivala et al., 2020), allowing to perform these MCMC procedures more efficiently and on larger networks. Other methods for large graphs include computing estimators on snowball samples (Goodman, 1961) from the network (Pattison et al., 2013; Stivala et al., 2016).
Some methods have been introduced to tackle the difficulties due to degeneracy, for example the use of curved ERGMs (Hunter and Handcock, 2006; Hunter, 2007) which generalises the ERGM (see Efron (1975, 1978)). In particular Snijders et al. (2006) introduced a particular specification of ERGMs for the analysis of social networks, with new statistics to represent structural properties such as transitivity and heterogeneity of degrees, and which solves some of the problems of degeneracy. Caimo and Friel (2011) propose a MCMC algorithm in a Bayesian framework and give empirical evidence that the method quickly converges. Schweinberger and Handcock (2015), Schweinberger and Stewart (2020) and Schweinberger (2020) propose ERGMs with local dependence, adding an additional structure to the network by assuming that the graph nodes can be partitioned into subgraphs, and that dependence exists within subgraphs but not between subgraphs. They obtained consistency results in that context. Some other theoretical results have been obtained, such as Mukherjee (2020) who obtained a sufficient criterion for non-degeneracy for ERGMs on sparse graphs.

Node clustering techniques

In this section, we will review some techniques used to cluster vertices in a graph into groups of nodes with similar connection behaviour, and give some theoretical or empirical results for these techniques. Node clustering in networks has been intensively studied and we will not be able to be exhaustive, but will present some of the widely used methods. In particular, a lot of work has been done on community detection, i.e. when considering groups of vertices that have a high within-group connectivity and a low between-group connectivity, as on the left of Figure 1.4. Some methods can identify a more general structure with groups of nodes sharing a similar connection behaviour towards other nodes (in the same class or in any other class). For example, as in the middle of Figure 1.4, where we observe two classes, a group of two hubs and a class of peripheral nodes, or as in the right of Figure 1.4 where we observe groups of nodes interacting mainly with nodes from other groups. Note that the term community is often used to talk about these more general classes, but we will stick to the strict definition of a community in this work. We will then refer to community detection when doing clustering for community structures and simply to clustering for the clustering of more general structures.

READ The Road to Strongly Interacting Bose Gases

Maximum likelihood estimation in the SBM with a Variational EM algorithm

In the SBM, we can be interested in estimating the parameters of the model in order to describe our network. We can obtain them by using a modified version of the Expectation-Maximisation (EM) algorithm based on a variational approximation of the distribution of the latent variables given the observations. If we also want to obtain a clustering, we can then obtain the latent groups by using a maximum a posteriori estimation.
In this model, we cannot compute the Maximum Likelihood Estimator (MLE) except for very small values of n, because it involves a summation over all the Qn possible latent configurations. Indeed, the log-likelihood is ℓ(θ) := log Pθ(X) = log Pθ(X, Z = z) = log elog Pθ(X,Z=z) z∈ 1,Q n z∈ 1,Q n. where log Pθ(X, Z = z) = log Pθ(X | Z = z) + log Pθ(Z = z) = ZiqZjl [Xij log πql + (1 − Xij) log(1 − πql )] 1≤q,l≤Q 1≤i<j≤n + Q n Ziq log αq, (1.4.2).
defining Ziq = 1Zi=q for every i and q. We neither can use the EM algorithm (often used in latent variables models) to approximate it because it involves the computation of the conditional distribution of the latent variables given the observations which is not tractable. A common solution is to use the Variational Expectation-Maximisation (VEM) algorithm that optimises a lower bound of the log-likelihood (see for example Daudin et al. (2008)).
EM algorithm Let us first describe briefly the EM algorithm in a general case. This is an iterative algorithm introduced by Dempster et al. (1977), used to approximate maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables. Assume that we observe the variable X = (X1, . . . , Xn), that Z = (Z1, . . . , Zn) is a set of latent (i.e. unobserved) variables taking their values in a finite set and that θ is a vector of unknown parameters. We want to maximise the log-likelihood ℓ(θ) = log Pθ(X, Z = z).

Time-evolving networks

The first part of this work is devoted to dynamic (or time-evolving) networks, where the role or behaviour of the nodes in the network and the relationships between them are allowed to change over time. See Holme (2015) for an introduction to dynamic networks. These types of networks arise in many domains. Some obvious examples of dynamic networks are human contacts or proximity networks (obtained by recording when two people are close to each other) (Barrat and Cattuto, 2013), communication networks (such as e-mails or phone calls between people) (Saramäki and Moro, 2015; Ebel et al., 2002) or social networks (such as friendship networks). Another example is the transportation networks (for instance based on airlines connection or public transportation). Such networks also arise in neuroscience (for example networks representing the temporal correlations between brain regions based on functional magnetic resonance imaging (fMRI) data) (Sizemore and Bassett, 2018) or more generally in biology.
Such types of networks have been widely studied and can take many different forms. For example, we can consider discrete or continuous time, the interactions can be instantaneous or have a duration, we can assume that the nodes are present the entire observation time (and only the edges are evolving) or not, etc. We describe such networks in the following, distinguishing two main types, discrete-time (which will be our interest in this work) and continuous-time networks. We also present briefly some existing models for dynamic networks. The model we are interested in, that is a discrete dynamic version of the SBM, will be described in Section 1.6, and some more discrete dynamic graph models based on the (static) SBM are presented in the introduction of Chapter 2. One can also refer to Kim et al. (2018) for a review of latent variables dynamic network models.

Discrete-time dynamic networks

A possibility when studying dynamic networks is to work in discrete time, i.e. to look at snapshots, or aggregated relational data over time ranges, in order to get a sequence of graphs. The data can be aggregated by splitting the observation period into T intervals and considering that an edge is present in the interval t ∈ 1, T if it is present at any time in this interval to obtain a sequence of binary graphs. We can also consider a sequence of weighted graphs by doing as follows. In the case of instantaneous interactions between nodes, the edge between two nodes in the interval t ∈ 1, T can represent the number of interactions observed in this interval. In the case of interactions of variable lengths, the edge between two nodes in the interval t ∈ 1, T can represent the time of the interaction observed in this interval. Aggregating data obviously leads to a loss of information, and is not adapted to any kind of network.
An adapted representation of discrete-time dynamic networks is a sequence of graphs, as in Figure 1.6. See also Figure 1.7 for some representations of real world dynamic networks.
It is obviously important to take into account the evolutionary behaviour of the graphs, instead of just studying separate snapshots as unique graphs, the graphs being dependent over time. A lot of models for discrete-time dynamic networks have been introduced. A common approach is to define dynamic extensions of existing (static) models. As mentioned before, discrete dynamic graph models based on the (static) SBM are introduced in Section 1.6 and in the introduction of Chapter 2. Note that some variants of the static SBM have also been extended to the dynamic case, for example the MMSBM (see for instance Xing et al. (2010)). Such models can be used for node clustering, as their static versions. Extensions of the ERGM have also been introduced for dynamic networks. For example, Hanneke and Xing (2007); Hanneke et al. (2010) introduce the Temporal Exponential Random Graph Model (TERGM), for the purposes of studying social network evolution. In the TERGM, a Markov assumption is made on the evolution of the network, assuming that the graph at time t given the graph at time t − 1 follows an ERGM, the statistic S(Xt, Xt−1) involving Xt and Xt−1, the adjacency matrices at times t and t − 1. The statistics involving both Xt and Xt−1 can include stability15, reciprocity16 or transitivity17, these measures being particularly relevant in the context of social networks. See also Krivitsky and Handcock (2014), who introduced the Separable Temporal Exponential Random Graph Model (STERGM), adding an assumption of separability between the formation and duration of edges, obtaining a more convenient model. These models can be used for clustering (for example Lee et al. (2020) introduce a model-based clustering method for time-evolving networks based on a finite mixture of discrete time exponential-family random graph models).
Dynamic extensions of the LPM have also been introduced (Sarkar and Moore, 2006; Sewell and Chen, 2015; Friel et al., 2016; Sewell and Chen, 2016). For example, the model of Sarkar and Moore (2006) assumes that the nodes can move in the latent space between two time steps, with a Markov assumption on the movement of the nodes, and assuming that the observed graphs are independent given the locations of the nodes. Friel et al. (2016) introduce a model for the analysis of bipartite networks, extending the model of Sarkar and Moore (2006) by adding temporal evolution through Markovian dependence on the model parameters and on the edges (the edges are not conditionally independent anymore). Sewell and Chen (2016) extends the LPM for dynamic weighted graphs.

Table of contents :

1 Introduction
1.1 Graphs: definitions and notations
1.2 Some statistical uses of random graphs
1.3 Random graph models
1.3.1 Erdős-Rényi graph model
1.3.2 Configuration model
1.3.3 Barabási–Albert model (Preferential Attachment)
1.3.4 Exponential Random Graph Models (ERGM)
1.3.5 Stochastic Block Model (SBM)
1.3.6 Latent Block Model (LBM)
1.3.7 Latent Position Model (LPM)
1.3.8 Graphon and W-graph model
1.4 Node clustering techniques
1.4.1 Spectral clustering
1.4.2 Modularity
1.4.3 Other community detection methods
1.4.4 Model-based clustering
1.4.5 Choice of the number of classes
1.4.6 Theoretical results
1.5 Time-evolving networks
1.5.1 Discrete-time dynamic networks
1.5.2 Continuous-time dynamic networks
1.6 Dynamic SBM with Markov membership evolution
1.6.1 Label switching and identifiability in the dynamic SBM
1.6.2 Estimation
1.6.3 Contributions in the dynamic SBM
1.7 Markov Random Fields (MRF)
1.7.1 Definition
1.7.2 Autologistic model and Potts model
1.7.3 Strength of interaction and phase transition
1.7.4 Simulation with a Gibbs sampler
1.7.5 Likelihood estimation and approximations
1.7.6 Hidden Markov random field
1.7.7 EM with mean field or mean field like approximation
1.7.8 Other methods
1.7.9 Choice of the number of classes for hidden MRF
1.8 Space-evolving networks
1.8.1 Contributions in the spatial SBM
2 Consistency of the maximum likelihood and variational estimators in a dynamic stochastic block model
2.1 Introduction
2.2 Model and notation
2.2.1 Dynamic stochastic block model
2.2.2 Assumptions
2.2.3 Finite time case
2.2.4 Likelihood
2.3 Consistency of the maximum likelihood estimator
2.3.1 Connectivity parameter
2.3.2 Latent transition matrix
2.4 Variational estimators
2.4.1 Connectivity parameter
2.4.2 Latent transition matrix
2.5 Proofs of main results
2.5.1 Proof of Theorem 2.3.1
2.5.2 Proof of Corollary 2.3.1
2.5.3 Proof of Theorem 2.3.2
2.5.4 Proof of Theorem 2.3.3
2.5.5 Proof of Corollary 2.3.3
2.5.6 Proof of Theorem 2.4.1
2.5.7 Proof of Corollary 2.4.1
2.5.8 Proof of Theorem 2.4.2
3 Estimation of parameters in a space-evolving graph model based on Markov random fields
3.1 Introduction
3.2 Model and notation
3.2.1 Definition of the model
3.2.2 Assumptions
3.3 Identifiability
3.4 Estimation
3.4.1 Likelihood
3.4.2 Maximum likelihood approach
3.4.3 EM algorithm
3.4.4 Mean-field like approximation
3.4.5 Simulated EM Algorithm
3.4.6 Step 1: Simulation of a configuration for the mean-field like approximation
3.4.7 Step 2: EM iteration
3.4.8 Initialisation and stopping criterion of the algorithm
3.5 Illustration of the method on synthetic datasets
3.6 Proofs
Conclusions and perspectives
References