Network topology’s influence on epidemic processes
When studying epidemic spread on static networks, it is of particular interest to understand how the topological structure of the network influences the course of the epidemic [46, 75, 49, 68, 64, 94, 81, 45]. To this end, simulations on networks with particular structural properties are often compared to simulations on random networks. Such comparisons have shown that the nature of the degree distribution has a strong influence on the epidemic threshold and the overall outcome of the epidemic. In particular, networks with a scale-free degree distribution facilitate epidemic outbreaks: in these networks the epidemic threshold is reduced, and epidemics propagate faster than in random networks [75, 76, 54, 65]. This has a direct consequence for immunization strategies. Random vaccination on scale-free networks is inefficient, as it often removes only nodes with limited importance for the spreading process, while targeted immunization can easily lead to a complete disintegration of the network by removing the hubs, i.e. nodes with high degree and consequently high importance [2, 23].
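The contrast between random and targeted immunization can be illustrated with a minimal sketch. It is not the procedure used in the cited works: the graph generator is a simple preferential-attachment construction chosen here for illustration, and the function names `preferential_attachment_graph`, `immunize` and `largest_component` are hypothetical. Immunized nodes are treated as removed, and the size of the largest remaining connected component serves as a rough proxy for how far an epidemic could still reach.

```python
import random
from collections import defaultdict


def preferential_attachment_graph(n, m, rng):
    """Illustrative scale-free-like graph: each new node links to m existing
    nodes chosen with probability proportional to their degree."""
    adj = defaultdict(set)
    stubs = []  # every node appears once per incident link
    for u in range(m + 1):  # start from a small clique
        for v in range(u):
            adj[u].add(v)
            adj[v].add(u)
            stubs += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs += [new, t]
    return adj


def immunize(adj, fraction, targeted, rng):
    """Choose nodes to immunize: highest degree first, or uniformly at random."""
    nodes = list(adj)
    if targeted:
        nodes.sort(key=lambda v: len(adj[v]), reverse=True)
    else:
        rng.shuffle(nodes)
    return set(nodes[:int(fraction * len(nodes))])


def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best
```

Calling `largest_component(g, immunize(g, 0.1, True, rng))` and the same with `targeted=False` on a graph `g` from `preferential_attachment_graph` allows the two strategies to be compared for a given immunized fraction.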
Another property of many social networks is a dense community structure. While communities can facilitate spreading on a limited local scale, they globally hinder the diffusion of information. If, on top of the community structure, weights are correlated with the topology in such a way that inter-community links have low weight and intra-community links have high weight, this effect of trapping information in communities is enhanced. If community structures are strong, it is therefore a good immunization strategy to vaccinate individuals who bridge different communities. Similarly, clustering slows down spreading and can reduce the epidemic size and R0.
Furthermore, when information on the temporal structure of the network is available, comparing simulations on dynamic networks with particular temporal properties to simulations on networks with randomized dynamics reveals the influence of those temporal properties. In particular, the burstiness of contact patterns slows down epidemic spread, while correlations between events can sometimes facilitate the propagation of epidemics.
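One common way to obtain a network with randomized dynamics is to permute the timestamps among the contact events. The sketch below assumes events of the form `(timestep, i, j)`; the function name `shuffle_timestamps` is hypothetical, and this particular null model is only one of several possible choices, not necessarily the one used in the cited studies. It keeps the aggregated topology, the number of events per link and the overall activity timeline, but destroys burstiness on individual links and correlations between consecutive events.

```python
import random


def shuffle_timestamps(events, rng):
    """Randomly reassign the existing timestamps to the existing (i, j) pairs.

    events: list of (timestep, i, j) tuples.
    Note: two events on the same link may end up in the same timestep.
    """
    times = [t for t, _, _ in events]
    rng.shuffle(times)
    return sorted((t, i, j) for t, (_, i, j) in zip(times, events))
```

Running the same epidemic simulation on the original and on the shuffled event list then isolates the contribution of the temporal ordering.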
The subsequent chapters treat the following subjects. Chapter 2 gives a short summary of the datasets used. Limitations of the data and the effects of decisions concerning the data representations, such as the choice of the minimal timestep length of temporal networks, are also discussed. In chapter 3, we investigate the effect of the dynamics of the network on spreading processes on the network. We focus on the interplay between the timescales of the data and of the process on the data, as well as on the finite-time effects of the data. In chapter 4, we follow two directions to simplify the data representation. On the one hand, we look at the optimal aggregation time of the temporal network; on the other hand, we try to simplify the aggregated networks by grouping nodes together. Here we introduce a contact matrix of distributions, which allows us to keep some of the heterogeneity of the links, even though single nodes adopt group properties and lose their individuality. In chapter 5, we consider the efficacy of immunization schemes which can be derived from networks based on different data representations with different levels of detail. As we have a wide choice of data representations, we try to find a method which uses the maximum level of detail of the data in order to choose the optimal nodes for immunization, and we discuss its limitations. In chapter 6, we test how much data is necessary to make predictions for immunization schemes and how reliable such predictions are. We also test the applicability of generalized data representations to other situations and discuss its limits. Finally, in chapter 7, we look at the relation between distances on static and dynamic networks, at the distribution of temporal distances, and at the distribution of the number of intermediary nodes in spreading processes. We give a short conclusion in chapter 8.
Network science has experienced a new surge of interest with the availability of large datasets. The discovery that many empirical networks describing systems relevant in diverse contexts share essential properties, like a scale-free or at least very broad distribution of degrees [22, 7], raised hopes that the theory of complex networks facilitates a general and unified theory of complex systems.
The study of the role of dynamical properties of temporal networks could only recently profit from detailed datasets. This opens the way for advanced, data-backed research on many open questions concerning the interaction of people, their dynamical contact patterns, the importance of single individuals on the spread of epidemics, temporal distances in networks and many more.
Dynamical (and topological) properties of real contact networks and their influence on dynamical processes on the network can be studied on the temporal data in order to understand which features of the dynamical properties have the most significant influence on specific dynamical processes on the network. These features can then be extracted, compared for different datasets and used in order to construct models of temporal networks [90, 78].
Data is also needed to inform existing models. For example, models for epidemic simulations can use census data, age data, data on human mobility and airline transportation data [99, 5]. Depending on the model, different degrees of precision of the data are needed. Precise data with high resolution is still rare, and most models therefore use more general data. Comparing models informed by data representations containing various levels of detail can give insight into how much information and which level of precision of the data are needed in order to obtain results at a specific level of precision.
The data used here has a very high resolution. This allows us to compare results obtained from precise data with results based on more general data, especially general static data, and to discuss the differences due to various levels of detail in dynamic and individual information. The data was collected in the framework of the SocioPatterns collaboration, which we introduce in Sec. 2.1.
Even though very precise data is a very good proxy for reality, it still has many limits. When a network is formed, decisions need to be taken about what is considered a node and what an edge. These decisions, like the choice of a threshold for the creation of edges, can greatly influence the topology of the network and the outcome of simulations. The right network representation for the given problem or question is therefore essential. Thus, in every data collection a bias is introduced through the choice of the network representation and the selection of the information that is included (see Sec. 2.3). Furthermore, datasets include errors. These can be eliminated by cleaning, but cleaning at the same time introduces an expectation bias, as outliers are discarded when they do not meet the expected criteria of the corresponding distribution (see Sec. 2.4). In addition, datasets represent only a limited window of reality. We discuss the effects of their limitation in time in Sec. 3.4, in size in Sec. 2.5 and in resolution in Sec. 2.6.
The data sets used in this thesis were collected by the SocioPatterns collaboration. They comprise face-to-face contact data between individuals at different venues. The data sets which we will use come from two settings: conferences and hospitals. Participants were equipped with radio-frequency identification (RFID) tags, which emit and receive signals in a peer-to-peer fashion. The emitted radio packets contain a unique identifier for the device and the time of emission. The tags register contacts autonomously whenever two participants face each other at a distance below 1 to 1.5 meters. The angle of detection is about 120°. As radio signals are absorbed by the water in the human body, the device can only efficiently emit signals towards the front of the body, which greatly reduces the risk of false-positive contacts between people who are in proximity but not facing each other.
The temporal resolution of the contact data is very high, as contacts are registered continuously. However, detection of contacts is not instantaneous. The RFID tags alternate between emit and receive cycles. A packet emitted during the emit cycle of one tag can only be registered by another tag that is in its receive cycle. Thus, it can take some time before a contact is registered. The contact data was therefore discretized into 20-second timesteps, which guaranteed with a probability of 99% that an actual contact was recorded. Also, meaningful contacts were assumed not to last much less than 20 seconds.
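The discretization step can be sketched as follows. The input format, a list of raw detections `(time_in_seconds, tag_a, tag_b)`, and the function name `discretize` are assumptions made for illustration; the actual SocioPatterns processing pipeline is not described here. A pair of tags is considered in contact during a 20-second timestep if at least one detection falls into that window.

```python
STEP = 20  # timestep length in seconds, as used for the data in this thesis


def discretize(raw_detections, step=STEP):
    """Map raw detections (time_in_seconds, tag_a, tag_b) onto timesteps.

    Returns a sorted list of unique (timestep_index, i, j) events with i < j,
    so repeated detections of a pair within one window count only once.
    """
    events = set()
    for t, a, b in raw_detections:
        i, j = (a, b) if a < b else (b, a)
        events.add((int(t // step), i, j))
    return sorted(events)
```

The resulting event list `(timestep, i, j)` is the representation assumed by the other sketches in this chapter.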
Further details about the collected data and the method of data collection can be found at the SocioPatterns website (www.sociopatterns.org) and in related papers [85, 20, 11].
The data used throughout this work are face-to-face contact data from different venues, collected by the SocioPatterns collaboration. The data vary in many properties, like the number of participants, the density of links or the number of days. Depending on the setting, some data also include meta-information about the participants. For instance, the participants in the hospital data sets can be classified as Assistants, Doctors, Nurses, Patients or Caregivers. In order to understand the processes on the networks better, it is helpful to know some of the structural and dynamical properties of the data. In the following, a short overview of some important aspects of the data is given. We use two types of data sets: data from conferences and data from hospitals. The conferences were the congress of the Société Française d'Hygiène Hospitalière (sfhh), the European Semantic Web Conference (eswc) and the ACM Hypertext Conference (ht). The hospital data came from the children's hospital Ospedale Bambino Gesù (obg) in Rome and from a pediatric ward in a hospital in Lyon (lyon2011 and lyon2012). The length of the data set and the number of participants for each data set are given in Tab. 2.1.
In Fig. 2.1 the number of participants which are in contact with other participants is given for each timestep, as well as the number of connections which are active at each instant. This activity shows strong daily patterns. Coffee breaks and lunch breaks are marked by high peaks of activity.
Figure 2.1: Number of active nodes and links per timestep. The left column shows the conference data sets "sfhh", "ht" and "eswc", the right column the hospital data sets "obg", "lyon2012" and "lyon2011".
Table 2.1: For each dataset, the type of the data (conference or hospital), the year in which it was collected, the time in days over which it was collected and the number of participants are given. References are provided for data sets which are publicly available or for which analyses have been published.
Table 2.2: E: number of events; L: number of links in the fully aggregated network, i.e. the number of connections among nodes; ld: link density; W: total time of all contacts (in seconds); w: average weight per link, corresponding to W/(T L), where T is the length of the data set in seconds; d: average degree; C: average clustering coefficient.
During the night, by contrast, no contacts take place. Depending on the dataset, weekly variations can also be noticed. For instance, there is a strong drop in activity during the weekend in the "lyon2012" dataset, and the first and last days of the "eswc" conference show lower activity. The percentage of nodes which are in contact at any time also varies between the datasets. While in the "obg" dataset no more than about 15% of the participants are ever in contact at the same time, in the "sfhh" dataset this share exceeds 30% at peak times. The networks also differ in the number of distinct contacts per person, the average number of events per unit time and other properties, as listed in Tab. 2.2.
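Several of the quantities of Tab. 2.2 follow directly from the list of contact events. The sketch below assumes events of the form `(timestep, i, j)` with 20-second timesteps and takes the link density as L divided by the number of possible node pairs, which is an assumption about the convention used. The function name `summary` is hypothetical, and the number of events E and the clustering coefficient C are left out for brevity.

```python
from collections import defaultdict


def summary(events, n_nodes, n_steps, step=20):
    """Aggregate statistics of a temporal contact network.

    events: iterable of unique (timestep, i, j) tuples.
    n_nodes: number of participants N; n_steps: length of the data set in timesteps.
    """
    time_per_link = defaultdict(int)
    for _, i, j in events:
        time_per_link[frozenset((i, j))] += step  # contact time in seconds

    L = len(time_per_link)                 # links in the aggregated network
    W = sum(time_per_link.values())        # total contact time in seconds
    T = n_steps * step                     # length of the data set in seconds
    return {
        "L": L,
        "ld": L / (n_nodes * (n_nodes - 1) / 2),  # assumed: L over possible pairs
        "W": W,
        "w": W / (T * L),                  # average weight per link, W/(T L)
        "d": 2 * L / n_nodes,              # average degree
    }
```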
Degree vs. Strength
Nodes can be classified according to their degree and strength. If the contact time were identical for each link, degree and strength would be linearly correlated. This is, however, not the case. As can be seen in Fig. 2.2, strength increases with degree, but faster than linearly: participants with many distinct contacts also spend more time on average per contact. This superlinear dependence can be observed in many data sets and hints at the presence of superspreaders, individuals with many and intense contacts. Furthermore, especially in the hospital data sets, the relation between degree and strength also depends on the class to which nodes belong. A particular case is the children's hospital ("obg"), where young patients were attended by their parents or other caregivers. For these classes, the relation between degree and strength is fundamentally different from other classes, as a low number of distinct contacts is set against high strength, resulting from few but very intense relations.
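A minimal sketch of how degree and strength can be obtained from the event list, and how the superlinear dependence can be quantified, is given below. Strength is taken here as the total contact time of a node in seconds (not normalized by the length of the data set), and the exponent is estimated by a plain least-squares fit in log-log space; both are illustrative choices, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def degree_and_strength(events, step=20):
    """Return {node: (degree, strength)} from unique (timestep, i, j) events.

    degree: number of distinct contact partners;
    strength: total contact time of the node in seconds.
    """
    neighbours = defaultdict(set)
    strength = defaultdict(int)
    for _, i, j in events:
        neighbours[i].add(j)
        neighbours[j].add(i)
        strength[i] += step
        strength[j] += step
    return {v: (len(neighbours[v]), strength[v]) for v in neighbours}


def loglog_slope(pairs):
    """Least-squares slope of log(strength) against log(degree).

    A slope above 1 indicates a superlinear dependence of strength on degree.
    """
    xs = [math.log(k) for k, _ in pairs]
    ys = [math.log(s) for _, s in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For a data set, `loglog_slope(degree_and_strength(events).values())` gives a single exponent; fitting each role separately would expose the class dependence described above.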
Figure 2.2: For each data set (conferences on the left, hospitals on the right), the strength of the nodes is plotted against the degree. The nodes of the hospital data sets have meta-information concerning the role they occupy. The nodes of the "lyon2011" and "lyon2012" data sets can be differentiated into Assistants, Nurses, Doctors, Patients and others, where others are people with diverse roles. The "obg" data set includes Assistants, Nurses, Doctors, Patients and Caregivers. Patients were children in this hospital. Caregivers were parents or mentors who accompanied the child.
Figure 2.3: For all data sets are shown: (a) contact-time distribution, (b) inter-contact time distribution, (c) waiting-time distribution, (d) distribution of total contact time per link, which is proportional to the weight distribution, when divided by the respective length of the dataset.
The weight distribution is very broad. It is related to the distribution of total contact time per link. In Fig. 2.3(d) the distributions of total contact time per link for the six data sets are plotted. While the distributions for the conference data sets are very similar, the distributions for the hospital data sets diverge slightly for low contact times. Especially in the "lyon2012" dataset, there are comparatively few links with low contact times. This could be due to the fact that contacts in the hospital are less regulated by chance. Many contacts follow the schedule of the hospital, for example when nurses or doctors visit the patients on a regular basis, or when they meet the same set of other nurses or assistants. In conferences, short and unique contacts are more likely to happen by chance, as participants can mix freely during breaks. In the appendix, the weight distribution for contacts among and between different classes is shown for the "obg" dataset.
The contact-time distribution, on the other hand, is very similar for all data sets and resembles a scale-free distribution (see Fig. 2.3(a)). The distribution of the time that passes between two contact events among the same two neighbors also has a long tail, but a small peak appears at a duration of one day, or rather a small dip at about 12 hours. As contacts are more likely to happen during the day, the duration between two contacts is less likely to be on the order of twelve hours and more likely to be one entire day, even though in general longer times between two contacts are less likely and short inter-contact times are abundant (see Fig. 2.3(b)). The waiting-time distribution (Fig. 2.3(c)) is also broad, with a peak due to the day-night patterns.
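Contact durations and inter-contact times can be extracted from the discretized events as sketched below. The sketch assumes events of the form `(timestep, i, j)`, treats consecutive timesteps on the same link as one contact, and measures the inter-contact time from the end of one contact to the start of the next on the same link. The function name `contact_and_gap_lengths` is hypothetical, and the waiting-time distribution is not covered.

```python
from collections import defaultdict


def contact_and_gap_lengths(events, step=20):
    """Return (contact durations, inter-contact times) in seconds.

    events: iterable of (timestep, i, j); consecutive timesteps of the same
    pair are merged into one contact.
    """
    steps_per_link = defaultdict(set)
    for t, i, j in events:
        steps_per_link[frozenset((i, j))].add(t)

    durations, gaps = [], []
    for steps in steps_per_link.values():
        ordered = sorted(steps)
        run_start = prev = ordered[0]
        for t in ordered[1:]:
            if t != prev + 1:  # the contact ended after timestep `prev`
                durations.append((prev - run_start + 1) * step)
                gaps.append((t - prev - 1) * step)
                run_start = t
            prev = t
        durations.append((prev - run_start + 1) * step)  # close the last contact
    return durations, gaps
```

Histograms of the two returned lists, on logarithmic axes, correspond to the kind of distributions shown in Fig. 2.3(a) and (b).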
Limitations of data
The data collected via RFID tags has a great advantage over data collected by questionnaires. Contacts are registered directly and are therefore not subjectively biased. This method also avoids some of the shortcomings of traditional data collection via questionnaires, like informant inaccuracy or the fixed-choice effect. However, other limitations cannot be eliminated. Datasets are a limited version of reality. They only show one instance of reality, limited in time, space and resolution. Datasets inevitably need to have a beginning and an end. Here, the time was limited to a few days. Also, only one specific place was surveyed, in this case a hospital or conference. As soon as participants leave the area and take off their badges, contacts are no longer registered. Participation in the experiments was voluntary, and some individuals declined to wear a badge. If these individuals also share behavioural properties, this could have introduced a sampling bias (see also Sec. 2.5). Some participants did not handle their RFID tags properly, which led to some spurious contacts (see Sec. 2.4). Because the data sets extend over only a few days, simulations which last longer than the data set require an artificial repetition or extension of the data.
Furthermore, data is always collected for a purpose, focusing only on those parts of reality which serve this purpose. Only one specific detail is considered and everything thought unnecessary is disregarded. In order to better understand communication patterns between people, many different datasets can be taken into consideration, each one adequate to answer a different question: telephone calls, email contacts, interaction in online forums, face-to-face contacts. To simulate the global spread of epidemics, census data, data on human mobility and travel data can be used.
Face-to-face contacts are a good proxy for the probability of transmission of diseases like influenza, which are transmitted via aerosols or droplets in close contact [98, 38, 108]. To be more specific, simulating influenza transmission via small aerosol particles only requires information on room co-presence, as these particles stay airborne for some time, whereas larger particles settle quickly and require close contact for transmission. Nevertheless, other transmission pathways are possible as well. The transmission of mainly fomite-mediated diseases, for example, might not be simulated successfully from face-to-face contacts alone, as no information on touching or shared objects is registered.
Furthermore, a possible dependence of the transmission probability on inter-personal distance might not be proportional to the distance-dependent detection probability of signals. The radio signals emitted by the RFID tags can have different predefined strengths, with ranges from 1-2 meters up to several meters. The present data was registered with weak radio signals, limiting the data to true face-to-face contacts at a distance of 1-2 meters, which are a reasonable proxy for human communication. Influenza transmission can, however, also happen at larger distances through sneezing or small-particle aerosols.
Other aspects which play a role in disease transmission, like information about the immunity of participants due to prior exposure, are not captured by the available data, mainly because they are unknown. Except for their contact activity, people are considered identical. Where meta-information is given, like information about the roles in the case of the hospital, which could be correlated with frequent exposure and immunity, this information is not used to assign different properties to people. Another aspect of reality which cannot be captured in our simulations is the change of behaviour people undergo when faced with the possibility of infection vis-à-vis an infectious individual. During the data collection no individuals with respiratory illnesses were registered. However, all of these aspects can be considered as being part of the stochasticity of transmission and the choice of transmission parameters.
A short note about cleaning
Once the data have been stored, they need to be checked for errors and spurious contacts. Some participants do not handle the equipment as intended, taking off their badges or sometimes even leaving them next to other badges, which leads to continuously registered contacts. Some RFID tags could have been faulty, not detecting or emitting radio signals in a reliable way. Also, when badges are distributed at the beginning of the experiment or collected at the end, contacts can erroneously be registered. These spurious contacts show up as abnormal behaviour in the data sets. High and narrow peaks of activity at the beginning or end of the data set, as well as outliers in the distributions of contact times, are an indication of spurious signals. Also, an unusually high number of simultaneous contacts per node can be a sign of false positives. In some datasets, therefore, parts of the data were stripped off, some nodes were removed completely, and some were removed partly if the number of simultaneous contacts exceeded a threshold of 6. In the "obg" dataset, only contacts which last less than one hour are considered.
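The threshold on simultaneous contacts can be expressed as a simple filter. The sketch below is one possible reading of the partial removal described above, not the exact cleaning procedure applied to the data: every event involving a node that has more than 6 contacts in the same timestep is dropped. The function name `drop_overloaded` is hypothetical, and the events `(timestep, i, j)` are assumed to be unique.

```python
from collections import Counter


def drop_overloaded(events, max_simultaneous=6):
    """Remove events that involve a node with too many contacts in one timestep.

    events: list of unique (timestep, i, j) tuples.
    """
    load = Counter()  # number of contacts of each node in each timestep
    for t, i, j in events:
        load[(t, i)] += 1
        load[(t, j)] += 1
    return [
        (t, i, j)
        for t, i, j in events
        if load[(t, i)] <= max_simultaneous and load[(t, j)] <= max_simultaneous
    ]
```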
However, deciding which registered behaviour is abnormal and due to a technical or human error, and which registered behaviour is real, is often difficult, as no additional information is given to decide whether an outlier in the data is due to an uncommon event or due to an error in handling the material. Therefore, cleaning data will not only erase errors but also introduce a small expectation bias.
The observed interactions are just a subset of global interactions, since people in the venue were not separated from the rest of the world. They could leave the venue, go home or to a hotel at night and meet other people with whom they could interact. Furthermore, participation was voluntary, and some people chose not to participate. Therefore, participants in the experiment were only a subset of the individuals present. However, the fraction of people who agreed to participate was rather large in the data sets used. In the case of the "lyon2011" and "lyon2012" datasets, more than 90% of the individuals in the ward agreed to participate in the study. In the "obg" dataset, after cleaning, about 65% of the individuals are registered in the data. In the "sfhh" data, only about 30% of the conference participants agreed to wear RFID tags; in the "ht09" data it was about 75%. In addition, some contact events occurring during a measurement might not have been registered. However, the probability of registering an event which lasted at least 20 seconds within a 20-second timestep was very high, so this does not play a major role here. Nevertheless, the detection of a face-to-face contact is subject to a threshold related to the distance between participants. Thus, many potential contact events taking place at a larger distance are not registered, which changes the topology of the network.
The effects of incomplete data can be modeled through different sampling methods. An incomplete set of participants, for example, can be modeled by node sampling. If node sampling does not happen randomly, for example if people are more likely to participate in a study when one of their friends participates, then people with more friends will be more likely to participate, which introduces a sampling bias. This will inevitably change network properties like the degree distribution. Similarly, it is possible that people are more likely to participate if a certain percentage of their community participates, which can have the effect that entire communities are excluded from the sample. But even random sampling does not leave the network unchanged. For example, a degree distribution which is scale-free in the complete network might not be scale-free in the sampled network. Other properties of the network change as well when the network is sampled [71, 50, 48, 26, 92, 91, 20]. In which way these properties change depends on the sampling method. We will only consider incomplete data through node sampling. The related change of network properties has a direct effect on the outcome of processes on the network. The effects of incomplete data are inherently the same as the effects of node removal, for example through attacks on nodes in a network [23, 24] or through immunization [77, 25].
We try to gain insight into the size of this effect on the outcome of epidemic processes on the network data. To this end we simulate an SIR model on random samples of different sizes. In Fig. 2.4 we see the results of sampling on the epidemic spreading for the "sfhh", the "obg" and the "lyon" datasets. The sampling method used is comparable to the vaccination of random individuals. As the distribution of the final number of cases is bimodal, removing nodes from the network has two effects: it decreases the average number of final cases for all runs that reach a certain percentage of the network (here we generally choose 10%), and it increases the number of runs for which the outcome of the epidemic stays below this percentage. For the "sfhh" network, the attack rate (AR), i.e. the final size of the epidemic divided by the number of nodes in the network, already decreases visibly with the random removal of very few nodes. Thus, in strongly sampled networks the fraction of the network that is reached by the epidemic is underestimated. This effect, that the percentage of infected participants decreases with the removal of single nodes, is most likely stronger for sparse networks than for densely connected networks.
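The sampling experiment can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulation code behind Fig. 2.4: the SIR process runs once over the events `(timestep, i, j)` without repeating the data set, `beta` is the infection probability per contact and timestep, `mu` the recovery probability per timestep, and the attack rate is normalized by the number of sampled nodes. The function names are hypothetical.

```python
import random


def sir_temporal(events, nodes, seed_node, beta, mu, rng):
    """Discrete-time SIR on a temporal network; returns the final number of cases.

    Only events between two nodes contained in `nodes` are used, so passing a
    subset of the participants models node sampling.
    """
    state = {v: "S" for v in nodes}
    state[seed_node] = "I"
    by_step = {}
    for t, i, j in events:
        if i in state and j in state:
            by_step.setdefault(t, []).append((i, j))
    if by_step:
        for t in range(min(by_step), max(by_step) + 1):
            newly_infected = set()
            for i, j in by_step.get(t, []):
                for src, dst in ((i, j), (j, i)):
                    if state[src] == "I" and state[dst] == "S" and rng.random() < beta:
                        newly_infected.add(dst)
            for v in state:  # recoveries of nodes infectious during this step
                if state[v] == "I" and rng.random() < mu:
                    state[v] = "R"
            for v in newly_infected:
                state[v] = "I"
    return sum(s != "S" for s in state.values())


def attack_rate_on_sample(events, nodes, keep_fraction, beta, mu, rng):
    """Attack rate of one SIR run on a uniformly random node sample."""
    n_keep = max(1, round(keep_fraction * len(nodes)))
    kept = rng.sample(sorted(nodes), n_keep)
    seed_node = rng.choice(kept)
    return sir_temporal(events, kept, seed_node, beta, mu, rng) / len(kept)
```

Repeating `attack_rate_on_sample` many times for each value of `keep_fraction` yields the distribution of outcomes, from which the runs above and below the 10% level can be separated.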
The data used here is discretized into network snapshots at different time instants. The choice of the minimum timestep size proves to be quite important. We simulate the choice of different timestep sizes by aggregating the network with different aggregation timesteps. By doubling the aggregation timestep size, we merge two consecutive network snapshots into one. Whenever these two network snapshots contain the same nodes but with different links, the new merged network snapshot contains nodes with higher degree. The timestep length chosen in the discretization of the dataset therefore has a direct influence on the distribution of the average degree of all nodes per timestep. In Fig. 2.5(a) the dependence of the average of this measure over all timesteps in the complete temporal graph on the length of the aggregation timesteps is shown. Longer aggregation increases the number of contacts which happen at the same time and thus the average instantaneous degree of the network at each timestep. In order to make results more consistent, nights were removed and data were stripped to a length which is a power of two.
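The aggregation of snapshots and its effect on the instantaneous degree can be sketched as follows. The sketch assumes events of the form `(timestep, i, j)` and averages the degree over the nodes that are active in a snapshot and over non-empty snapshots only; whether Fig. 2.5(a) uses exactly this normalization is an assumption. The function names are hypothetical.

```python
from collections import defaultdict


def aggregate(events, factor):
    """Merge `factor` consecutive snapshots into one.

    A link present in any of the merged snapshots is present once in the result.
    """
    return sorted({(t // factor, min(i, j), max(i, j)) for t, i, j in events})


def mean_instantaneous_degree(events):
    """Average degree of active nodes, averaged over all non-empty snapshots."""
    links = defaultdict(set)
    for t, i, j in events:
        links[t].add((min(i, j), max(i, j)))
    per_step = []
    for step_links in links.values():
        active = {v for link in step_links for v in link}
        per_step.append(2 * len(step_links) / len(active))
    return sum(per_step) / len(per_step)
```

Applying `aggregate` with factors 2, 4, 8 and so on, and evaluating `mean_instantaneous_degree` each time, reproduces the kind of dependence on the aggregation timestep discussed above.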
When the network is aggregated into larger timesteps, the minimum connection time of contacts is also increased. For contacts which last for the length of one timestep, this means at least a doubling in contact length. Furthermore, several consecutive short contacts between the same two nodes can merge into a longer contact, and a large fraction of contacts is extended by one or even two timesteps, depending on their starting time and duration. The distribution of contact length depending on the aggregation timestep is shown in Fig. 2.6(a) for the "lyon2012" dataset. All contact lengths increase significantly when aggregation timesteps are doubled.
Table of contents:
1.1 Why networks?
1.2 What are networks?
1.3 How to classify networks?
1.4 Dynamic processes on networks
1.4.1 Epidemic spreading
1.5 Network topology’s influence on epidemic processes
2 Data collection
2.2 The datasets
2.2.2 Degree vs. Strength
2.2.3 Contact-dynamics distributions
2.3 Limitations of data
2.4 A short note about cleaning
2.5 Incomplete samples
2.6 Discrete timesteps
3 Epidemic simulation on temporal network data
3.1 Activity fluctuations
3.2 Influence of starting time
3.3 Effect of nights
3.4 Finite time
3.5 Model networks
4 Data representation
4.1 Time resolution
4.2 Structural resolution
4.2.1 Choice of groups
4.2.2 Heterogeneity of weights
4.2.3 Daily networks
4.2.4 Influence of roles
5 Immunization on dynamic networks and data representations
5.1 Influence of the data representations
5.2 Immunization strategies on static data representations
5.3 Effect of a limited time window
5.4 Time dependence of ranking efficiency
5.5 Immunization strategies on dynamic networks: significance
6.1 Degree ranking
6.2 Data-based predictions of epidemic spread
6.2.1 Comparing datasets
6.2.2 Effect of data variability on epidemic predictions
7.1 Static distance vs. dynamic distance
7.2 Temporal path lengths and infection-path lengths
7.2.1 Discrete vs continuous
7.2.2 Influence of link density
7.2.3 Influence of the weight distribution
7.3 Distance on face-to-face contact networks