Get Complete Project Material File(s) Now! »

## Email-exchange Networks

The datasets in this group are based on the exchange of emails. In this group, the entities are individuals or more precisely email accounts. The interactions correspond to one email account sending an email to another account. When a single email is sent to several accounts, this is represented by several interactions.

The Enron dataset represents the emails exchanged by 150 employees of the Enron company [Shetty and Adibi, 2005]. The dataset was obtained by the Federal Energy Regulatory Commission during its investigation in Enron’s collapse. The 150 employees exchanged over 47, 000 emails during three years. For each email exchanged, information on the senders, receivers, and the moment it was sent is available.

Similarly to the Enron dataset, the Radoslaw dataset represents the exchange of emails between the employees of a mid-size company [Michalski et al., 2011]. The company contained 168 employees, who exchanged over 80, 000 emails during a period of 9 months in 2010. Like Enron, for each email, it records information on the senders, receivers, and the moment they were sent. This dataset comes from a leak of the emails exchanged between the members of the Democratic National Committee, which is the formal governing body for the United States Democratic Party [Kunegis, 2013]. The leak occurred in 2016 and the email dump contained all email data exchanged

by the 1891 members. This corresponds to around 40, 000 email exchanges from September 2013 to May 2016. Here, we only kept the sender, receivers, and the moment each email was sent.

### Temporal Eigenvector

This method was introduced by [Taylor et al., 2017]. Unlike the two previous methods that base the importance of a node on the notion of the shortest path between nodes, this third method is a spectral one, and thus is based on paths of all lengths. The network is split into periods representing each an equal duration of the network. For each period, the authors construct an adjacency matrix A. The adjacency matrix at time t is given by At. They combine these matrices into a large matrix called supra-centrality matrix. Its size is NT NT, where N is the number of nodes and T is the number of time periods. The diagonal contains the adjacency matrices constructed for each period.

The remaining of the supra-centrality matrix contains an identity matrix weighted by w parameter. See figure 3.3 for an illustration.

This weight w couples each node with itself over the different periods. This allows the representation of the waiting times, where a node keeps information till it can pass it forward. When w ! 0+, the layers are uncoupled and a node does not store the information over several layers. When w ! ¥, the layers (periods) are strongly linked, the weights on these inter-layer links becoming higher than those inside each period. Thus, the inter-layer links become more important than the intra-layer links. In other words, paths that consist of storing the information with same node for a long time become more important than those that consist of passing the message forward right away. With this supra-centrality matrix, the eigenvector centrality can be computed. This computation would consider all the temporal aspects of the network such as the diffusion of information over time. It also attributes each node a centrality value for each period. We consider this method with a period equal to I and denote it as Temporal Eigenvector.

#### Coverage Centrality

The fourth method and considers the temporal aspect in a different manner and was proposed by [Takaguchi et al., 2016]. They represent a temporal network as a static network where each temporal node consists of a pair of a structural node (original node of the network) and a timestamp t. For example, a node u that interacts with other nodes at t1 and t2 will be represented by two temporal nodes (u, t1) and (u, t2). These two nodes are then linked, to simulate the retaining of information within the same node for a period of time. The second type of links represents the true interactions of the network. In a graph, if the path (u, v, t) exists, it is represented by a two link, one from node (u, t) to (v, t +1) and a second one from node (v, t) to (u, t +1). See Figure 3.4 for illustration.

Building on this representation, the authors introduce two notions of centrality. First, temporal coverage centrality represents the importance of a temporal node (u, t) by the fraction of structural nodes for which a shortest path pass through the node (u, t). In certain cases, a node can have several opportunities to pass the message, and all these opportunities produce the same result (the same departure and arrival time). In observe that node B has three occasions (at t1, t2 and t3) to pass forward the information. If B’s presence at t1 is removed all the the paths from A to D are destroyed. However, if the B is removed at t2 (white node in the figure), a path still exist via B at instant t3. While, the presence of B at t3 is the last chance for B to forward the message. The authors argue that instants where the node can be removed without altering the diffusion need to be detected. Thus, they introduce the temporal boundary coverage centrality, which gives higher importance to nodes on the boundary (node B at t1 and t3) than the nodes that can be removed (node B at t2).

Our preliminary results showed that both centralities reacted in the same way and the difference in results is not significant. Thus, we consider the temporal coverage centrality in this chapter. We denote it by Coverage.

**Global Importance**

A node that is frequently attributed a high rank intuitively can, be considered as a globally important node. To provide a global perspective on the importance of a given node, we study the number of times the node is attributed a high or a low rank.

To compute this number, first, we need to be define a high and low ranks. We start by defining three regions in the ranking vector called the top, middle and bottom region. A rank in the top region is considered as a high rank, while a rank in the bottom region is considered as low. Formally, for a network of n nodes, a node with a rank higher than bn 0.75c is considered to be in the top region and thus, to be highly important. While a node with a rank lower than bn 0.25c is considered to be in the bottom region and therefore, to have a low rank. Ranks in between are considered to be in the middle region. As we compute the centrality and ranks for the nodes every I-th second, the number of times each node is in the top (resp. bottom) region can be represented by a duration that we denote Durtop (resp. Durbot). To compute these durations, we consider that a node is present in the top region or bottom region from the instant where we compute the centrality to the following computing instant. Given a node u, if R(u) is the sequence of ranks of u and the i-th element of R(u) denoted by ri gives the node’s rank at the i-th instant, we formally define Durtop(u) and Durbot(u) as: Durtop(u) = I jfi k 1, ri bn 0.75cgj .

**Table of contents :**

**1 State of the Art **

1.1 Static Centralities

1.1.1 Shortest path based centralities

1.1.2 Spectral Centralities

1.2 Temporal Centralities

1.2.1 Online Algorithms

1.2.2 Shortest path based centrality adaptations

1.2.3 Spectral centrality adaptations

1.2.4 Conclusion

1.3 Approximation

1.3.1 Static graphs

1.3.2 Evolving graphs

**2 Datasets **

2.1 Email-exchange Networks

2.2 Co-occurrence networks

2.3 Social Networks

2.4 Motion networks

2.5 Conclusion

**3 Temporal Centralities under scrutiny **

3.1 Centralities

3.1.1 Temporal Closeness

3.1.2 Snapshot

3.1.3 Temporal Eigenvector

3.1.4 Coverage Centrality

3.2 Comparison protocol

3.2.1 Ranking

3.2.2 Kendall Tau correlation

3.2.3 Difference in ranks

3.2.4 Global Importance

3.3 Results

3.3.1 Global Observation

3.3.2 Impact on individual nodes

3.3.3 Identifying globally important nodes

3.4 Discussion

3.4.1 Average centrality

3.4.2 Absence of importance

3.4.3 Ranking methods

3.4.4 Conclusion

**4 Approximation and Identification **

4.1 Less computation

4.1.1 Conclusion

4.2 Identifying important nodes

4.2.1 Strategies

4.2.2 Best combination

4.2.3 Results

4.2.4 Ordering

4.2.5 Conclusion

4.3 Approximation

4.3.1 Exponential function

4.3.2 Sigmoid function

4.3.3 Conclusion

4.4 Overall protocol

4.4.1 Methodology

4.4.2 In practice

4.5 Discussion

4.5.1 Absence of importance

4.5.2 Conclusion

**5 Ego-betweenness **

5.1 Ego-centric vision

5.1.1 Ego-graph and ego-dynamic graph .

5.1.2 Paths

5.2 Ego-betweenness

5.2.1 Definition

5.2.2 Computation

5.3 Dynamical betweenness under scrutiny .

5.3.1 Common base

5.3.2 Datasets

5.3.3 Comparison tools

5.4 Results

5.5 Conclusion

**6 Conclusion and perspectives **

Appendices

**A Approximation and Identification **

A.1 Relative Error per multiple of Iop

**B P2PTV Multi-channel Peers Analysis **

B.1 Introduction

B.2 Related Works

B.3 DataSet

B.4 Global analysis

B.4.1 Tracking exchanges of video content .

B.4.2 Presence multi-channel peers

B.5 Exploiting sliding time windows

B.5.1 Different peer behavior

B.5.2 Different super-peers behavior

B.6 Extending the analyses .

B.6.2 Comparisons of the datasets

B.7 Conclusion