Get Complete Project Material File(s) Now! »

## Sampled data in the SBM framework

The sampled data is an n×n matrix with entries in {0, 1, NA}. It corresponds to the adjacency matrix Y where unobserved dyads have been replaced by NA’s. More formally, let R be the n × n sampling matrix recording the data sampled during this process, such that Rij = 1 if Yij is observed and 0 otherwise; also define Do = {(i, j) : Rij = 1}, Dm = {(i, j) : Rij = 0}, Yo = {Yij : (i, j) 2 Do} and Ym = {Yij : (i, j) 2 Dm} to denote the sets of variables respectively associated with the observed and missing data. The number of nodes n is assumed to be known. The sampling design is the description of the stochastic process that generates R. It is assumed that the network exists before the sampling design acts upon it. Moreover, the sampling design is fully characterized by the conditional distribution p (R|Y), the parameters of which are such that and live in a product space × . Hence the joint probability density function of the observed data satisfies p, (Y,R) = Z Z p(Yo,Ym,Z)p (R|Yo,Ym,Z)dYmdZ. (2.1).

Simplifications may occur in (2.1) depending on the sampling design, leading to the three usual types of missingness (MCAR, MAR and NMAR). This typology depends on the relations between the adjacency matrix Y, the latent structure Z and the sampling R, so that the missingness is characterized by four directed acyclic graphs displayed in Figure 2.1.

**NMAR inference: the general case**

In contrast to the MAR case, conducting inference on the observed dyads only may bias the estimates in the NMAR case. In fact, all observed data (including the sampling matrix R in addition to Yo) must be taken into account. The likelihood of the observed data is thus log p, (Yo,R) and the corresponding completed likelihood has the following decomposition: log p, (Yo,R,Ym,Z) = log p (R|Yo,Ym,Z) + log p(Yo,Ym,Z), (2.7).

where an explicit form of p (R|Yo,Ym,Z) requires further specification of the sampling. The joint distribution p(Yo,Ym,Z) has a form similar to the MAR case. Now, the approximation is required both on latent blocks Z and missing dyads Ym to approximate p(Z,Ym|Yo). We suggest a variational distribution where complete independence is forced on Z and Ym, using a multinomial (resp. Bernoulli ) distribution for Z (resp. for Ym): ˜p,(Z,Ym) = ˜p (Z) ˜p(Ym) = Y i2N m(Zi·; i) Y i,j)2Dm b(Yij ; ij), (2.8).

**MAR condition**

Algorithm 1 for MAR samplings is tested on affiliation networks with 3 blocks. The number of blocks is assumed to be known. For this topology the probability of connection within a block is and is ten times stronger than the probability of connections between nodes from different blocks. We generate networks with n = 200 nodes and marginal probabilities of belonging to blocks = (1/3, 1/3, 1/3). The sampling design is chosen as a random-dyad sampling with a varying . The difficulty is controlled by two parameters: the sampling effort and the overall connectivity in matrix , defined by c = P q` q`q`, which is directly related to the choice of : the lower the , the sparser the network and the harder the inference. The simulation is repeated 500 times for each configuration (c, ). Figure 2.2 displays the results in terms of estimation of and of classification recovery, for varying connectivity c and sampling effort . Our method achieves good performances even with a low sampling effort provided that the connectivity is not too low.

**NMAR condition**

Under NMAR conditions we conduct an extensive simulation study by considering various network topologies (namely affiliation, star and bipartite), the connectivity matrix of which are given in Figure 2.3. We use a common tuning parameter to control the connectivity of the networks in each topology: the lower the , the more contrasted the topology.

Among the three schemes developed in Section 2.3, we investigate thoroughly the double standard sampling, for which we exhibit a large panel of situations where the gap is large between the performances of the algorithms designed for MAR or NMAR cases. Other sampling designs are explored in the supplementary materials.

Simulated networks have n = 100 nodes, with varying in {0.05, 0.15, 0.25}. Prior probabilities are chosen specifically for affiliation, star and bipartite topologies, respectively (1/3, 1/3, 1/3), (1/6, 1/3, 1/6, 1/3) and (1/4, 1/4, 1/4, 1/4). The exploration of the sampling parameters = (0, 1) is done on a grid [0.1, 0.9] × [0.1, 0.9] discretized by steps of 0.1.

Algorithm 2 is initialized with several random initializations and spectral clustering. In Figure 2.4, the estimation error is represented as a function of the difference between the sampling design parameters (0, 1): the closer this difference to zero, the closer to the MAR case. As expected, Algorithm 1 designed for MAR only performs well when 1. Results show a good ability of the VEM to estimate these parameters. As expected, performances deteriorate for uncontrasted topologies with low sampling rate.

### Seed exchange network in the region of Mount Kenya

In a context of subsistence farming, studies which investigate the relationships between crop genetic diversity and human cultural diversity patterns have shown that seed exchanges are embedded in farmers’ social organization. Data on seed exchanges of sorghum in the region of Mount Kenya were collected and analyzed in Labeyrie et al. (2016, 2014). The sampling is node-centered since the exchanges are documented by interviewing farmers who are asked to declare to whom they gave seeds and from whom they receive seeds. Since an interview is time consuming, the sampling is not exhaustive. A limited space area was defined where all the farmers were interviewed. The network is thus collected with missing dyads since information on the potential links between two farmers who were cited but not interviewed is missing. With the courtesy of Vanesse Labeyrie, we analyzed the Mount Kenya seed exchange network involving 568 farmers among which 155 were interviewed.

Although other farmers in this region might be connected to non-interviewed farmers, we focus on this closed network of 568 nodes. Since we only know that the sampling is node-centered, we fit SBM under the three node-centered sampling designs presented in Section 2.2 (star (MAR), class and star degree sampling). The ICL criterion is minimal for 10 blocks under the star degree sampling and for 11 blocks under the class degree sampling. The clusterings between the SBMs obtained with either class or star degree sampling remain close from each other (ARI: 0.6) and both unravel a strong community structure. The model selected by ICL for MAR sampling is composed by 11 blocks. The ARIs between MAR clustering and the two other clusterings are lower (around 0.4). Finally, note that interviewed and non-interviewed farmers are mixed up in the blocks of the three selected models. The ICL criteria computed for the three sampling designs are a slightly in favor of the MAR sampling.

On top of network data, categorical variables are available for discriminating the farmers such as the ntora2 they belong to (10 main ntoras plus 1 grouping all the others) and the dialect they speak (4 dialects). In Figure 2.6, we compute ARIs between the ntoras (left panel), the dialects (right panel) and the clusterings obtained with the SBM under the three node-centered sampling designs for a varying number of blocks. Even though the ARIs remain low, the clusterings from class or star degree sampling seem to catch a non negligible fraction of the social organization, larger than the one caught by the clustering from the MAR sampling. These two categorical variables, reflecting some aspects of the social organization, could partially explain the structure of the exchange network.

#### Simulations for star degree and class samplings

With node-centered samplings such as star-degree and class samplings, it is more difficult to find configurations different from MAR sampling designs: with dyad-centred doublestandard design, the sampling parameters were directly related to the probability of an edge conditionally on the observation of the corresponding dyad, which is no longer the case for the node-centered designs. The sampling designs being hardly different from a MAR design, the NMAR inference does not show much improvement compared to the MAR inference. Still, we exhibit some interesting situations where star degree and class samplings deserve an appropriate treatment, presented herein. We simulate networks with n = 100 nodes under an affiliation topology with intra-community probability (resp. inter community probability) equal to 0.5 (resp. 0.05) and = (0.25, 0.5, 0.25). Sampling parameters are chosen such that = (a, b) = (−3.6, 0.1) for star degree sampling, which makes nodes with highest degrees preferably selected. In class sampling, these parameters are set to = (1, 2, 3) = (0.75, 0.5, 0.05), which makes nodes from the largest block and from a small block preferably selected while the other small block is under-sampled. In Figure 2.8, the estimation errors and the ARI are pictured for both cases. The sampling rates (i.e. rates of observed dyads over total number of dyads) lie in the intervals [0.558, 0.844] for class sampling and [0.162, 0.622] for star degree sampling. These two intervals arise from the values of the parameter explored for these two designs. We compare the performances of Algorithm 2 to an oracle (when inference is conducted via a classical VEM algorithm on a fully observed network) and with Algorithm 1. When facing NMAR condition, Algorithm 2 shows a slight improvement over Algorithm 1 even if it remains far from the oracle.

**Table of contents :**

**1 Introduction **

1 Motivations : des données manquantes dans les réseaux

2 État de l’art

3 Contributions de la thèse

**2 Variational inference of Stochastic Block Model from sampled data **

1 Introduction

2 Statistical framework

3 Variational Inference

4 Simulation study

5 Importance of accouting for missing values in real networks

6 Conclusion

7 Supplementary

**3 Consistency and Asymptotic Normality of Stochastic Block Models Estimators from Sampled Data **

1 Introduction

2 Statistical framework

3 Complete-observed Model

4 Main Result

5 Proof Sketch

6 Variational and Maximum Likelihood Estimates

7 Discussion

8 Acknowledgment

9 Supplementary

10 Technical results

11 Main Results

12 Sub-exponential random variables

13 Likelihood ratio of assignments

14 General technical results

**4 SBM with covariates and missing values **

1 Introduction

2 Stochastic Block Models with covariates and sampling models

3 Statistical inference

4 Illustrations

5 Conclusion

**5 missSBM: An R Package for Handling Missing Values in the Stochastic Block Model **

1 Introduction

2 Statistical Framework

3 Structure of the Package

4 Guidelines for Users

5 Illustration: the 2007 French political blogosphere

**6 Conclusion et perspectives **

1 Conclusion

2 Perspectives

**Bibliography **