Identifying Roles in an IP Network with Temporal and Structural Density

Get Complete Project Material File(s) Now! »

Description of the algorithm

One may trivially enumerate all maximal cliques in a graph as follows. One maintains a set M of previously found cliques (maximal or not), as well as a set S of cancomputing didate cliques. Then for each clique C in S, one removes C from S and searches for nodes outside C connected to all nodes in clique C, thus obtaining new cliques (one for each such node) larger than C. If one finds no such node, then clique C is maximal and it is part of the output. Otherwise, if the newly found cliques have not already been found (i.e., they do not belong to M), then one adds them to S and M. The set S is initialized with the trivial cliques containing only one node, and all maximal cliques have been found when S is empty. The set M is used for memorization, and ensures that one does not examine the same clique more than once. In [Johnson et al., 1988] the authors use this framework to enumerate all maximal cliques of agraph in lexicographic order.
Our algorithm for finding D-cliques in a link stream L = (T,V, E) (Algorithm 1) relies on the same scheme. We initialize the set S of candidate D-cliques and the set M of all found D-cliques with the trivial D-cliques (fa, bg, [t, t]) for all (t, a, b) in E (Line 2). Then, until S is empty (while loop of Lines 3 to 24), we pick an element (X, [b, e]) in S (Line 4) and search for nodes v outside X such that (X [ fvg, [b, e]) is a D-clique (Lines 6 to 10). We also look for a value b0 < b such that (X, [b0, e]) is a D-clique (Lines 11 to 16), and likewise a value e0 > e such that (X, [b, e0]) is a D-clique (Lines 17 to 22). If we find such a node, such a b0 or such an e0, then D-clique C is not maximal and we add to S and M the new D-cliques larger than C we just found (Lines 10, 16 and 22), on the condition that they had not already been seen (i.e., they do not belong to M). Otherwise, C is maximal and is part of the output (Line 24).

Computational complexity

In order to investigate the complexity of our algorithm, let us denote by n = jVj the number of nodes and m = jEj the number of links in L. First notice that the number of elements in the configuration space built by the algorithm is bounded by the number of subsets of V times the number of sub-intervals of T.
Moreover, for all D-clique C = (X, [b, e]) in the configuration space, there exists a link (b, u, v) or a link (b + D, u, v) in E. Indeed, the initial trivial D-cliques are in the first case, and all D-cliques obtained from them are also in this case until Line 16 is applied. The D-cliques built after this are in the second case. Likewise, there exists a link (e, u, v) or a link (e 􀀀 D, u, v) in E. Therefore, the number of possible values for b and e for any D-clique in the configuration space is proportional to the number of time instants at which a link occurs, which is bounded by the number of links m. The number of sub-intervals of T corresponding to a D-clique in the configuration space is therefore in O(m2). This bound is reached in the worst case, for instance if the stream is a sequence of links occurring once every D time interval.
The trivial bound O(2n) for the number of subsets of V is also reached in the worst case, for instance if there is a link between all pairs of nodes at the same time: the algorithm will enumerate all subsets of V. Therefore, the number of elements in the configuration space is in O(2nm2). This leads to the space complexity of our algorithm: it is proportional to the space needed to store the configuration space, which is in O(2nnm2) since each element may be stored in O(n) space.

Impact of data structure

Notice that Algorithm 1 makes no assumption on the order in which elements of S are processed, which corresponds to the way we explore the configuration space. In particular, if S is a first-in-first-out structure (a queue), the algorithm performs a BFS of the configuration space; if it is a last-in-first-out structure (a stack) then it performs a DFS. The execution time is essentially the same in all cases. The size of S may vary, but the space complexity of the algorithm is dominated by the size of M, that does not change. Still, the data structure impacts the order in which D-cliques are found.

Traffic captures

We use the data gathered on June 25, 2013 as part of the experiment « A Day in the Life of the Internet » (DITL). The data was collected (using port mirroring on a router) on a transit link between WIDE and the upstream ISP, see Figure 8.1. Due to the nature of the capture, the resulting network is intrinsically bipartite: traffic between two WIDE machines (or between two « Internet » machines) cannot be seen at this router, and is not captured. Thanks to expert knowledge from the maintainers of the MAWI repository, we can precisely know which machines are in and out of WIDE.

Table of contents :

I A basic language for link streams
1 Context
1.1 Models inducing information loss
1.2 Models inducing no information loss
1.2.1 Static graphs from sequences of interactions
1.2.2 Temporal networks
1.2.3 Time-Varying Graphs
1.3 Conclusion
2 Basics
2.1 Link streams
2.2 Sub-links and sub-streams
2.3 Line stream
2.4 Induced graphs
2.5 Conclusion
3 Density-based notions
3.1 Density, neighborhood and degree
3.2 Node clusters
3.3 Cliques
3.4 Clustering coefficient and transitivity ratio
3.5 Conclusion
4 Paths-based notions
4.1 Paths
4.2 Connectivity
4.3 Centralities
4.3.1 Closeness centrality
4.3.2 Betweenness centrality
4.4 k-closure
4.5 Conclusion
5 D-analysis of link streams
5.1 D-analysis and instantaneous links
5.2 Density
5.3 Degree
5.4 Clusters
5.5 Conclusion
6 Computing cliques in link streams
6.1 Definitions
6.2 Algorithm
6.2.1 Description of the algorithm
6.2.2 Proof of correctness
6.2.3 Computational complexity
6.3 Experiments
6.3.1 Dataset
6.3.2 Impact of data structure
6.3.3 Examining D-cliques
6.4 Conclusion
7 Conclusion and perspectives
II Application to IP traffic
8 Context
8.1 Sources of data
8.1.1 CAIDA
8.1.2 MAWI
8.1.3 USC ANT
8.1.4 KDD Cup
8.2 Analysis of IP traffic
8.2.1 Graph-based
8.2.2 Time-series analysis
8.2.3 Machine-learning
8.3 The MAWI dataset
8.3.1 Traffic captures
8.3.2 MAWILab event database
9 MAWI traffic as a link stream
9.1 Traffic modelling
9.2 Bipartite link streams
9.2.1 Density and clustering coefficient
9.2.2 Projection
9.2.3 Redundancy
9.3 Conclusion
10 Analysis of IP traffic
10.1 Our approach
10.1.1 Event characterization
10.1.2 Event identification through manual inspection
10.1.3 Hierarchy of features for the analysis of IP traffic
10.1.4 Analysis of IP traffic process
10.2 Results
10.2.1 Preliminary features
10.2.2 Basic features on the link stream
10.2.3 Comparison with MAWILab
10.3 Conclusion
11 Conclusions and perspectives
III Conclusions and perspectives
12 Summary and Contributions
13 Perspectives
13.1 Generalizing link streams
13.1.1 Computational considerations
13.1.2 Algorithmic considerations
13.2 Modules and dense groups
13.2.1 Modular decomposition
13.2.2 Dense groups and communities
13.2.3 Generation of synthetic link streams
13.2.4 Visualization
13.2.5 Event detection in link streams
13.2.6 Applications
13.3 Prediction
Appendices
A Analysis of the temporal and structural features of threads in a mailing-list
A.1 Introduction
A.2 Dataset
A.3 Framework and notations
A.4 Basic statistics
A.5 Interactions within threads
A.5.1 Density of threads
A.5.2 Intra-thread density
A.6 Relations between threads
A.6.1 Inter-thread density
A.6.2 Graphs between threads
A.6.3 Quotient stream
A.7 Conclusion
B Identifying Roles in an IP Network with Temporal and Structural Density
B.1 Introduction
B.2 Notion of D-density
B.2.1 Framework
B.2.2 D-density of links
B.2.3 D-density of streams and sets of nodes
B.3 Identifying roles
B.3.1 Our datasets
B.3.2 Identifying relevant D
B.3.3 Neighborhoods and clustering coefficient
B.3.4 Interpretation
B.4 Conclusion