Processing huge graphs on a single machine is ineﬃcient due to the limited memory size and processing power. Therefore, one solution is that these graphs have to be partitioned across diﬀerent computing nodes and processed in parallel. Exchanging large amount of data between these computing nodes is expensive, so it is important to reduce the communication between them. Good quality partitioning is achieved by focusing on two objectives: balancing the computation load among diﬀerent computing nodes and reducing the communication cost between them. This problem of dividing the graph among diﬀerent computing nodes keeping the communication cost minimum and balancing the load is called balanced graph partitioning. It is an N P -hard problem .
Assumptions: We have explained the communication cost with respect to the message-passing model  in which the vertices of the graph communicate by send-ing messages. This communication is not necessary for all graph processing algo-rithms as there are algorithms that do not require the vertices to communicate, but in order to explain the communication cost we consider the message-passing model.
There are two main approaches for graph partitioning, namely: Edge partitioning and Vertex partitioning. Diﬀerent graph partitioning algorithms have been developed based on these approaches.
In vertex partitioning, the vertices of a graph are divided into k equal sized partitions such that the edges between the partitions are kept minimum. This is also referred as edge-cut partitioning.
Edge- Cut Definition: For a graph G = (V, E), having E edges and V vertices; E\ÕE is the set of edges such that the graph GÕ = (V, E\EÕ) is disconnected. Here, EÕ is the edge-cut.
For understanding the edge-cuts consider an input graph being partitioned, during partitioning if a vertex is assigned to one partition and its neighbor to another, then the edge between them forms an edge-cut. As can be seen in Figure 1.1(a) the vertices are placed in diﬀerent partitions, and the lines between the partitions are the edge-cuts.
The aim of good vertex partitioning algorithms is to reduce the edge-cuts and balance the computation load. In a message-passing model the neighboring vertices of a graph communicate by sending messages. When the neighbors of a vertex belong to diﬀerent partitions, then the cost of sending messages between the partitions is called the cut-cost. These messages can be sent with or without aggregation. Aggregating the messages means that diﬀerent messages that are supposed to be sent to the neighbors present in a diﬀerent partition are combined, and an aggregate of these messages is created. This aggregate message is later sent to the partition. As shown in Figure 2.1 the messages of vertex b and c are aggregated and sent to the vertex h in the other partition. Whereas, for no aggregation all the messages are sent separately to the neighbors belonging to the other partition; as in Figure 2.1 the vertices b and c send messages separately to the vertex h.
Figure 2.1: No Aggregation and Aggregation of Messages in Vertex Partitioning
In case of no aggregation, the cut-cost for a vertex to its neighbors is equal to the number of its neighbors placed in diﬀerent partitions. In Figure 2.1 this cost is two for the vertex h, as the vertex h gets two messages from its neighbors b and c. However, in- case of aggregation, the cut-cost of for a vertex is equal to the number of partitions in which its neighbors are placed. In Figure 2.1 this cost is one as the neighbors of the vertex h are present in one partition, other than the partition where h is present.
If a graph is partitioned in a way that there are large number of edge-cuts, then this creates a lot of communication cost between the partitions due to the fact that large number of messages are exchanged between the partitions. Therefore, the aim of a good partitioner is to keep the edge-cuts minimum.
A relatively new technique for graph partitioning was proposed in ; it is called edge partitioning . In edge partitioning the edges of a graph are divided among k equal sized partitions such that the vertices that are cut between the partitions are kept minimum. This is also referred as vertex-cut partitioning.
Vertex-Cut Definition: For a graph G = (V, E), having E edges and V vertices;
V \ÕV is the set of vertices with EÕ set of edges incident to them, such that the graph
GÕ = (V \V Õ, E\EÕ) is disconnected. Here, V Õ is the vertex-cut.
Each edge contains two vertices, called the end vertices. The end vertices indicate the source and the destination vertex for the edge. To understand the vertex-cut consider that during partitioning of a graph if an edge is assigned to one partition, and another edge having same end vertex is assigned to another partition, then a vertex-cut is formed between the partitions. For the vertex-cuts, the vertex copies or replicas are created in diﬀerent partitions depending upon the distribution of their edges among the partitions. As in Figure 1.1(b) the edges are partitioned in diﬀerent partitions, and the dotted lines between the partitions are the vertex-cuts. Vertex cut shows the link between two copies of the same vertex (v) maintained in diﬀerent partitions.
The vertex sends messages for synchronization of its states to the partitions contain-ing its copies. Therefore, synchronizing the state of the vertex with copies present in diﬀerent partitions introduces a communication cost called as the cut-cost between the partitions. The messages can be aggregated or sent without aggregation. We assume that one copy of the vertex act as the master vertex, which collects the messages from its neighbors in other partitions. In Figure 2.2 the vertex v in the partition 1 act as the master vertex; it collects messages from the vertex b and g present in the partition 2. Aggregation of messages means that the messages from the vertex b and g are combined and then sent to the vertex v, whereas for no aggregation these messages are sent separately.
Figure 2.2: No Aggregation and Aggregation of Messages in Edge Partitioning
In case of no aggregation, the communication cost between the copies of a vertex is equal to the number of its neighbors in the partitions other than the one containing the master vertex. In Figure 2.2 this cost is two as the vertex v has two neighbors in the partition 2. However, with aggregation this cost is equal to the number of partitions containing the vertex copies, which in the above case is one as there is only one partition containing the copy of the vertex v other than the one containing the master vertex.
A large number of vertex-cuts increase the communication cost for the vertex having its replicas in diﬀerent partitions. The aim of good vertex partitioning algorithms is to reduce the vertex-cuts and balance the computation load among the computing nodes.
Before explaining the partitioning algorithms in the next section, it is important to understand the power-law  of graphs, since it impacts the partitioning problem. According to the graph theory research  on natural graphs, most of the real world graphs like the World Wide Web, social network graphs, communication graphs, biological system graphs and many others have a degree distribution that follows the power-law. Therefore, we evaluated our partitioning algorithms on power-law graphs with an aim to partition natural graphs eﬃciently by minimizing the communication and the computation cost. Power graphs are diﬃcult to partition due to their highly skewed nature [20,22]. The challenges faced in partitioning power-law graphs are mentioned in detail in . As far as we know, the Power Graph Greedy Vertex-Cut  algorithm is one of the most eﬃcient algorithm to partition the power-law graphs, our custom Degree based graph partitioning algorithm is based on this algorithm for partitioning power-law graphs eﬃciently using the vertex-cut approach.
According to the power-law, for a given vertex V, the probability of this vertex having the degree d is given by d≠– (2.1) where, – = positive constant.
The constant – controls the skewness of the degree of the graph. To give an intuitive idea of power-law graphs, think of a social network where celebrities have more followers or friends than other people, but the number of common people exceeds the number of celebrities. This means there are more nodes with a low degree than the ones with a high degree.
Partitioning Power-Law Graphs
In this section we compare the vertex partitioning technique with the edge partitioning technique on power-law graphs. Our thesis implements both these approaches for understanding how these techniques work on natural graphs.
The traditional vertex partitioning approach is not suitable for power-law graphs because tools like [28,29], that create balance edge-cut partitions, perform ineﬃciently [20,21,22] on power-law graphs. The reason for this is that in vertex partitioning, the edge-cuts create a network and a storage overhead because a copy of the adjacency information (the information of an edge between the partitions along with the source and the destination vertex for that edge) is maintained in both partitions. Some approaches like  maintain a ghost vertex, which is a local copy of the vertex, and the edge data for each edge-cut. As shown in Figure 2.3 two ghost vertices and the edge data is maintained in the partitions. In case of a change in the vertex or the edge data, the change must be communicated to all the partitions containing the vertex and the edge data.
The PowerGraph  abstraction proposed an edge partitioning approach for natural graphs. In this proposed approach the edges are stored only once in the partitions, so a change in the edge data does not need to be communicated across the partitions. However, the vertex copies are maintained in diﬀerent partitions; therefore a change in the vertex must be copied to all partitions containing the vertex copies as shown in Figure 2.4.
Figure 2.4: Vertex Partitioning and Vertex Copies
According to the vertex-cut approach proposed in the PowerGraph  abstraction, it would be better to partition the high-degree vertices, as they are less in number to reduce the replication. However, partitioning the low-degree vertices will increase the replication of vertices due to their large quantity. Our degree based partitioning method uses the degree information of the vertices to partition the high-degree vertices.
Large graphs, like social media graphs, can be processed eﬃciently in a distributed set-up because it is hard to process them on a single commodity machine due to its limited processing power and storage capacity. Therefore, for distributed processing, these graphs need to be partitioned across several computing nodes. They can be partitioned using vertex partitioning or edge partitioning. Traditional partitioning methods were oﬄine, but our work is based on the implementation of partitioning methods that are online. The motivation behind using online partitioning methods is that oﬄine partitioning methods like METIS  need to observe the whole graph before partitioning. Thus, creating a high computation cost, whereas the online partitioning methods work on-the-fly reducing the computation cost. These online partitioning methods are based on the stream partitioning  approach. It partitions the data as it arrives only by knowing the current state of the data instead of knowing about the data that will arrive in the future. This technique makes computations faster. In our case as we partition graph streams, so the input is in the form of a vertex stream or an edge stream. This method of partitioning is called streaming graph partitioning . We have implemented partitioning algorithms for the graph streams, the partitioning is done on-the-fly and in one-pass to reduce the computation cost.
To the best of our knowledge, the algorithms we chose to implement are some of the eﬃcient vertex and edge stream partitioning algorithms in terms of reducing the communication and the computation cost across the computing nodes. Vertex partitioning algorithms include: Linear Greedy  and Fennel . Edge partitioning algorithms include: Least Cost Incremental , Least Cost Incremental Advanced  and our own variation of Power Graph Greedy Vertex-Cuts  called as Degree based partitioner. We are interested to know how these diﬀerent partitioning techniques perform on the graph streams by evaluating their partitioning quality metrics like the cut-costs and load balancing.
Assumptions: We consider a streaming graph which is represented by G = (V, E), having the total number vertices n and the total number of edges m. The vertices and the edges of a graph arrive in the form of stream, which is partitioned using the parti-tioning algorithms. All of these algorithms are one-pass algorithms; they take decision on-the-fly, providing low-latency. Furthermore, once a vertex or an edge is assigned to a partition, it cannot be reassigned to another, making the assignment irrevocable. Reassigning the vertex or edge increases the communication cost as the data needs to be transferred to other partitions in case of re-assignment.
Algorithms for Vertex Stream
In this section we give the details of the algorithms for partitioning vertex streams. The input is in form of a vertex stream, where each vertex has a unique vertex ID, and its neighboring vertices’ IDs.
This algorithm is regarded as the most eﬃcient one in terms of having less edge-cuts, from the algorithms that were first introduced for streaming graph partitioning . It follows a greedy approach for partitioning; the vertices, as they arrive, are sent to the partition which has most of its neighbors. There is also a penalty factor involved based on the load of the partition for load balancing.
Table of contents :
1.1 Problem Statement
1.4.1 Observation and Requirements Gathering
1.4.2 Design and Development
1.4.3 Testing and Evaluation
1.5 Structure of Thesis
2 Graph Partitioning
2.1 Partitioning Techniques
2.1.1 Vertex Partitioning
2.1.2 Edge Partitioning
2.2 Power-Law Graphs
2.2.1 Partitioning Power-Law Graphs
2.3 Partitioning Algorithms
2.3.1 Algorithms for Vertex Stream
22.214.171.124 Linear Greedy
2.3.2 Algorithms for Edge Stream
126.96.36.199 Least Cost Incremental
188.8.131.52 Least Cost Incremental Advanced
184.108.40.206 Degree Based Partitioner
2.4 Feature Comparison
3.1 Data Stream Processing
3.1.1 Data Stream Processing Models
3.1.2 Data Stream Approximation Strategies
3.2 Graph Stream Processing
3.2.1 Graph Stream Models
3.2.2 Graph Stream Representations
3.3 Apache Flink
3.3.1 Flink as Data Processing Engine
3.3.2 Flink Streaming API
3.3.3 Flink Graph Processing API
3.3.4 The Graph Streaming API for Flink
220.127.116.11 Implemented Algorithms
4.1 Stream Order
4.2 Vertex Stream
4.3 Edge Stream
4.4.1 Vertex Stream Partitioning Algorithms
4.4.2 Edge Stream Partitioning Algorithms
5.1 Input Data Streams
5.2 Experimental Setup
5.3 Partitioning Algorithms
5.3.1 Execution Time
5.3.3 Replication Factor
5.3.4 Load Balancing
5.4.1 Size of Aggregate States
5.5 Evaluation Summary
6.1 Future Work