The Virtual Network Function Chaining Problem


Stream Processing Platforms Comparison

Distributed real-time stream processing is a recent topic that is attracting considerable attention from researchers. Hence, performance evaluations and comparisons between stream processing systems are still fairly unexplored in the scientific literature. Hesse and Lorenz compare the Apache Storm, Flink, Spark Streaming, and Samza platforms [34]. The comparison is restricted to a description of the architecture of each platform and its main elements. Gradvohl et al. analyze and compare the MillWheel, S4, Spark Streaming, and Storm systems, focusing on the fault tolerance aspect of processing systems [35]. These two papers are restricted to conceptual discussions, without experimental performance evaluation. Landset et al. summarize the tools used for processing big data [36], presenting the architecture of stream processing systems. However, their major focus is on batch processing tools that rely on MapReduce techniques. Colucci et al. show the practical feasibility and good performance of distributed stream processing systems for monitoring Signaling System number 7 (SS7) in a Global System for Mobile communications (GSM) machine-to-machine (M2M) application [37]. They analyze and compare the performance of two stream processing systems: Storm and Quasit, a prototype from the University of Bologna. Their main result is the demonstration that Storm can process, in real time, a large amount of data from a mobile application.
Nabi et al. compare Apache Storm with the IBM InfoSphere Streams platform in an e-mail message processing application [38]. The results show that InfoSphere outperforms Apache Storm in terms of throughput and CPU utilization. However, InfoSphere is an IBM proprietary system and its source code is unavailable. Lu et al. propose a benchmark [39] as a first step toward the experimental comparison of stream processing platforms. They measure the latency and throughput of Apache Spark and Apache Storm. The paper does not provide results for Apache Flink or for the behavior of the systems under failure.
Dayarathna and Suzumura [40] compare the throughput, CPU and memory consumption, and network usage of the stream processing systems System S, S4, and the Event Stream Processor Esper. These systems differ in their architectures: System S follows the manager/workers model, S4 adopts a decentralized symmetric actor model, and Esper is software running on top of a stream processor. Although the benchmark-based analysis is interesting, almost all the evaluated systems have been discontinued or no longer have significant popularity. Unlike most of the above-mentioned papers, we focus on open-source stream processing systems that are currently available, namely Apache Storm, Flink, and Spark Streaming [41, 42]. We aim at describing the architectural differences of these systems and providing experimental performance results, focusing on throughput and parallelism in a threat detection application over a dataset created by the authors.

Real-Time Threat Detection

Some proposals use the Apache Storm stream processing tool to perform real-time anomaly detection. Du et al. use the Flume and Storm tools for traffic monitoring to detect anomalies, performing detection with the k-NN algorithm [43]. The article presents some performance results, but it lacks an evaluation of the detection accuracy, and the tool only receives data from a centralized node, ignoring data from distributed sources. The work of Zhao et al. uses Kafka and Storm, as in the previous work, for the detection of network anomalies [44], characterizing flows in the NetFlow format. He et al. propose a real-time combination of the distributed processing platforms Hadoop and Storm for anomaly detection, using a variant of the k-NN algorithm as the detection algorithm [45]. The results show good real-time performance, but without any threat reaction or prevention process. Mylavarapu et al. propose to use Storm as a stream processing platform for intrusion detection [46].
Dos Santos et al. use a combination of the Snort IDS and OpenFlow to create Of-IDPS. The Snort IDS is used as a detection tool, while OpenFlow actions perform the mitigation or prevention of detected attacks [47]. An evolution of Of-IDPS was proposed to develop an Autonomic Computing (AC) system that automatically creates security rules in Software Defined Networking (SDN) switches [6]. Rules are created by applying a machine learning algorithm to Snort IDS alerts and OpenFlow logs. The machine learning method used in this work is FP-Growth, which finds frequent item sets, also called association rules. Schuartz et al. propose a distributed system for threat detection in big data traffic [48]. Apache Storm and the Weka machine learning tool are used to analyze the KDD-99 dataset. The system is based on the lambda big data architecture, which combines batch and stream processing.
Stream processing platforms have also been used in security initiatives. Apache Metron is a security analysis framework based on big data processing. The Metron architecture consists of data acquisition, consumption, distributed processing, enrichment, storage, and visualization layers. The key idea of this framework is to allow the correlation of security events from different sources. To this end, the framework employs distributed data sources such as sensors in the network, action logs of active network security elements, and telemetry sources. The framework also relies on a historical base of network threats from Cisco. Apache Spot is a similar project, still in incubation, that uses telemetry and machine learning techniques for packet analysis to detect threats. Its creators state that the main difference from Apache Metron is the ability to use common open data models for networking. Stream4Flow uses Apache Spark with the Elastic Stack for network monitoring. The prototype serves as a visualization of network parameters; Stream4Flow [49], however, has no intelligence to perform anomaly detection. Hogzilla is an intrusion detection system (IDS) with support for Snort, sFlow, GrayLog, Apache Spark, HBase, and libnDPI, which provides network anomaly detection. Hogzilla also provides network traffic visualization.
The proposed CATRACA tool, like Metron, aims to monitor large volumes of data using flow processing. The CATRACA tool is implemented as a Virtualized Network Function (VNF) in the Open Platform for Network Function Virtualization (OPNFV) environment. CATRACA focuses on real-time packet capture, feature selection, and machine learning, and can be combined with an action mechanism for the immediate blocking of malicious flows. Thus, the CATRACA tool acts as a virtualized network intrusion detection and prevention function that reports flow summaries and can be linked to other virtualized network functions [50], as defined by Service Function Chaining (SFC) and the Network Service Header (NSH).

Virtual Network Function

Machine learning is used for attack detection in virtualized environments [51, 52]. Azmandian et al. present an application based on machine learning to automatically detect malicious attacks on typical server workloads running on virtual machines. The key idea is to perform feature selection with the Sequential Floating Forward Selection (SFFS) algorithm, also known as Floating Forward Search, and then classify the attacks with the K-Nearest Neighbors (KNN) and Local Outlier Factor (LOF) machine learning algorithms. The system runs on a single physical machine in a VirtualBox environment. Li et al. present Cloudmon [52], a Network Intrusion Detection System Virtual Appliance (NIDS-VA), or virtualized NIDS. Cloudmon enables dynamic resource provisioning and live placement of NIDS-VAs in Infrastructure as a Service (IaaS) cloud environments. The work uses the Snort IDS and the Xen hypervisor for virtual machine deployment. Moreover, Cloudmon uses a fuzzy model and global resource scheduling to avoid idle resources in a cloud environment. The proposal employs the conventional signature-based Snort IDS to detect misuse and focuses on resource allocation. BroFlow covers the detection and mitigation of Denial of Service (DoS) attacks. Its sensors run in virtual machines under the Xen hypervisor, and the proposal includes a mechanism for optimal sensor distribution in the network [16]. An attack mitigation solution based on Software Defined Networking complements the proposal, focusing on DoS attack detection with an anomaly algorithm implemented in the Bro IDS.
CATRACA is proposed as a virtualized network function on the Open Platform for Network Function Virtualization (OPNFV) that provides a threat detection facility. The function employs open-source tools to detect threats in real time using flow processing and machine learning techniques.
The problem of specific sensor placement is studied by Chen et al., who propose a technique based on Genetic Algorithms (GA) [53] for sensor placement. The proposed algorithm uses as heuristic the minimization of the number of sensors and the maximization of the detection rate. Bouet et al. also use GA as an optimization technique for the deployment of virtual Deep Packet Inspection (DPI) sensors [54]. Their proposal minimizes the number of sensors and the load analyzed by each sensor; however, being based on GA, it requires a long processing time to obtain results and does not guarantee convergence [55]. We model and propose a heuristic for the optimization of VNF sensor placement, reducing the number of sensors while maximizing network coverage [56, 57].
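As a rough illustration of this class of placement heuristic (not the exact algorithm of [53, 54, 56, 57]), the Python sketch below greedily selects sensor locations that cover the largest number of still-uncovered network paths; the topology, the set of monitored paths, and the notion of coverage are illustrative assumptions.

```python
import networkx as nx

def greedy_sensor_placement(graph, paths):
    """Greedy sketch: at each step, pick the node that covers the most
    still-uncovered paths, until every path traverses at least one sensor.
    `paths` is a list of node sequences (e.g., shortest paths to monitor)."""
    uncovered = {tuple(p) for p in paths}
    sensors = set()
    while uncovered:
        # Count how many uncovered paths each candidate node would cover.
        gain = {n: sum(1 for p in uncovered if n in p) for n in graph.nodes}
        best = max(gain, key=gain.get)
        if gain[best] == 0:
            break  # remaining paths cannot be covered by any node
        sensors.add(best)
        uncovered = {p for p in uncovered if best not in p}
    return sensors

# Hypothetical usage: monitor all pairwise shortest paths of a small topology.
g = nx.barabasi_albert_graph(20, 2, seed=1)
paths = [nx.shortest_path(g, src, dst) for src in g for dst in g if src < dst]
print(greedy_sensor_placement(g, paths))
```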


Service Chaining

Virtual Network Function chaining is currently a trending research topic. Several studies address the optimization problem of placing a set of VNFs [22–24]. Addis et al. propose a mixed integer linear programming formulation to solve the VNF placement optimization from the Internet Service Providers (ISPs) point of view [23]. In a similar way, Bari et al. use Integer Linear Programming to optimize the cost of deploying a new VNF, the energy cost of running a VNF, and the cost of forwarding traffic to and from a VNF [22]. A Pareto optimization is used for placing chained VNFs in an operator's network with multiple sites, based on requirements of both the tenants and the operator [24].
Other works propose the optimal placement of specific VNFs [16, 27, 58]. A virtual Deep Packet Inspection (vDPI) placement is proposed by Bouet et al. to minimize the cost faced by the operator [58]. In a previous work [16], we proposed the placement of an Intrusion Detection and Prevention System (IDPS) through a heuristic that maximizes the traffic passing through each node. In another previous work [27], we proposed a heuristic to optimize the placement of distributed network controllers in a Software Defined Networking environment. Nevertheless, none of these works considers the trade-off between the customers' requests and the infrastructure provider's availability.
Estimating resource usage for optimizing allocation has been proposed in many other contexts. Sandpiper [59] is a resource management tool for datacenters that focuses on managing the allocation of virtual machines over a physical infrastructure. Another proposal that estimates resource usage for allocating virtual machines in a datacenter is Voltaic [60], a management system for cloud computing that aims to ensure compliance with Service Level Agreements (SLAs) and optimize the use of computing resources.
In Section 6.3, we propose four heuristics: heuristics that minimize the delay between source and destination nodes for the best customer Quality of Experience (QoE), a heuristic that minimizes the resource usage on the network nodes to increase Infrastructure Provider (IP) benefits, and a heuristic that uses the most central nodes first to improve both customer QoE and IP benefit. We compare the four proposed heuristics with a greedy algorithm and test them over a real service provider topology [61].
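A minimal sketch of the centrality-first idea follows (illustrative only, not the exact heuristic of Section 6.3): candidate nodes are ranked by betweenness centrality and the VNFs of a chain are placed on the most central nodes that still have spare capacity; the topology and the capacity model are hypothetical.

```python
import networkx as nx

def centrality_first_placement(graph, chain_length, capacity):
    """Illustrative sketch: place the VNFs of a chain on the nodes with the
    highest betweenness centrality that still have spare capacity.
    `capacity` maps node -> number of VNFs it can still host (hypothetical)."""
    ranking = sorted(nx.betweenness_centrality(graph).items(),
                     key=lambda item: item[1], reverse=True)
    placement = []
    for node, _ in ranking:
        while capacity.get(node, 0) > 0 and len(placement) < chain_length:
            placement.append(node)
            capacity[node] -= 1
        if len(placement) == chain_length:
            break
    return placement

# Hypothetical usage on a small service-provider-like topology.
g = nx.barabasi_albert_graph(15, 2, seed=42)
caps = {n: 1 for n in g}  # assume each node hosts at most one VNF
print(centrality_first_placement(g, chain_length=3, capacity=caps))
```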

Threat Detection using Stream Processing

In this chapter, we present a threat detection prototype using stream processing. First, we present the main data processing techniques. Then, we introduce the stream processing paradigm. Next, we describe and compare the main open-source stream processing platforms in order to select the most suitable one for our network analytics tool. Finally, we present CATRACA, a network monitoring and threat detection tool that uses stream processing and machine learning techniques.

Methods of Data Processing

Stream processing makes it possible to extract value from moving data, as batch processing does for static data. The purpose of stream processing is to enable real-time, or near-real-time, decision making by providing the ability to inspect, correlate, and analyze data as it flows through the processing system. Examples of scenarios that require stream processing are: traffic monitoring applications for computer network security; social networking applications such as Twitter or Facebook; financial analysis applications that monitor flows of stock data reported by stock exchanges; credit card fraud detection; inventory control; military applications that monitor readings from sensors worn by soldiers, such as blood pressure, heart rate, and position; manufacturing processes; energy management; among others. Many scenarios require the capability of processing millions or even hundreds of millions of events per second, making traditional systems such as Database Management Systems (DBMS) inappropriate for analyzing stream data [62]. Database Management Systems store and index data records before making them available to queries, which makes them unsuitable for real-time applications requiring responses in the sub-second order [63]. Static databases were not designed for fast and continuous data loading and, therefore, do not directly support the continuous processing that is typical of data stream applications. Moreover, traditional databases assume that the process is strictly stationary, which differs from almost all real-world applications, in which the output may gradually change over time. Security threats in TCP/IP networks are a typical example of moving data, in which the output changes over time.
Data processing is divided into three main approaches: batch, micro-batch, and stream. The analysis of large sets of static data, collected over previous periods, is done with batch processing. A famous batch processing technique is MapReduce [12], with its popular open-source implementation Hadoop [13]. In this scheme, data is collected, stored in files, and then processed, ignoring the timely nature of data production. This technique, however, presents large latency, with responses taking longer than 30 seconds, while several applications require real-time processing, with responses in the order of microseconds [64]. Nevertheless, batch systems can approach real-time processing through micro-batch processing, which treats the stream as a sequence of smaller data blocks: at short intervals, the input is grouped into data blocks and delivered to the batch system to be processed. The third approach, stream processing, analyzes massive sequences of unlimited data that are continuously generated [65].
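To make the micro-batch idea concrete, the sketch below groups an unbounded input into small time-based blocks and hands each block to a batch-style function, which is conceptually what micro-batch engines such as Spark Streaming do; the window size, the toy source, and the processing function are illustrative assumptions.

```python
import time

def micro_batch(source, interval_s, process_block):
    """Group incoming records into blocks of `interval_s` seconds and
    deliver each block to a batch-style processing function."""
    block, deadline = [], time.time() + interval_s
    for record in source:
        block.append(record)
        if time.time() >= deadline:
            process_block(block)             # batch processing of one block
            block, deadline = [], time.time() + interval_s
    if block:
        process_block(block)                 # flush the last partial block

# Hypothetical usage: a slow counter as the stream, summarizing each block.
def slow_counter(n, delay=0.01):
    for i in range(n):
        time.sleep(delay)
        yield i

micro_batch(slow_counter(200), interval_s=0.1,
            process_block=lambda b: print(len(b), "records, sum =", sum(b)))
```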
Stream processing differs from the conventional batch model in that: i) the data elements in the stream arrive online; ii) the system has no control over the order in which the data elements arrive to be processed; iii) stream data are potentially unbounded in size; iv) once an element of a data stream has been processed, it is discarded or archived and cannot be retrieved easily, unless it is explicitly stored in memory, which is usually small relative to the size of the data stream. Furthermore, the latency of stream processing is lower than that of micro-batch, since messages are processed immediately upon arrival. Stream processing performs better for real time; however, fault tolerance is costlier, since it must be provided for each processed message.
Table 4.1 summarizes the main differences between batch processing of static data and stream processing of moving data.
Both paradigms, batch and stream processing, are combined in the lambda architecture to analyze big data in a real-time manner [66]. The lambda architecture uses a fast stream path for timely approximate results and an offline batch path for late accurate results. In this architecture, stream data can be used to update the parameters of an offline batch training for real-time threat detection. The lambda architecture thus combines stream processing, batch processing, and service layers.
As shown in Figure 4.1, the lambda architecture has three layers: the stream processing layer, the batch processing layer, and the service layer. The stream processing layer deals with incoming data in real time. The batch processing layer analyzes a huge amount of stored data in a distributed way, through techniques such as MapReduce. Finally, the service layer combines the information obtained from the two previous layers to provide the user with an output composed of analytic data. Therefore, the goal of the lambda architecture is to analyze stream data accurately and in real time, even under an ever-changing incoming rate, obtaining results that also rely on historical data.
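The following minimal Python sketch illustrates the three layers described above (an illustrative skeleton, not the implementation of any particular system): a batch layer periodically recomputes an accurate view over all stored data, a speed (stream) layer keeps an approximate view of recent data, and a serving layer merges the two to answer queries.

```python
class LambdaSkeleton:
    """Toy sketch of the lambda architecture: counts events per key."""

    def __init__(self):
        self.master_data = []     # immutable store of all events (batch input)
        self.batch_view = {}      # accurate view, periodically recomputed
        self.realtime_view = {}   # approximate view, updated per event

    # Speed (stream) layer: update the real-time view as each event arrives.
    def on_event(self, key):
        self.master_data.append(key)
        self.realtime_view[key] = self.realtime_view.get(key, 0) + 1

    # Batch layer: recompute the accurate view over all stored data.
    def run_batch(self):
        view = {}
        for key in self.master_data:
            view[key] = view.get(key, 0) + 1
        self.batch_view = view
        self.realtime_view = {}   # recent data is now covered by the batch view

    # Serving layer: merge batch and real-time views to answer queries.
    def query(self, key):
        return self.batch_view.get(key, 0) + self.realtime_view.get(key, 0)

# Hypothetical usage with alert keys as events.
arch = LambdaSkeleton()
for k in ["scan", "scan", "dos"]:
    arch.on_event(k)
arch.run_batch()
arch.on_event("dos")
print(arch.query("dos"))  # 2: one from the batch view, one from the speed layer
```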

The Stream Processing

Data stream processing is modeled through a Directed Acyclic Graph (DAG). The graph is composed of source nodes, which continuously emit samples, and interconnected processing nodes. A data stream ψ is an unbounded set of data, ψ = {D_t | t > 0}, where each point D_t is a set of attributes with a time stamp. Formally, a data point is D_t = (V, τ_t), where V is an n-tuple, in which each value corresponds to an attribute, and τ_t is the time stamp of the t-th sample. Source nodes emit tuples, or messages, that are received by Processing Elements (PEs). Each PE receives data on its input queues, performs computation using local state, and produces an output to its output queue. Figure 4.2 shows the conceptual architecture of a stream processing system.
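A minimal sketch of this model follows (assumed class and function names, not the API of any real platform): each processing element consumes tuples (V, τ_t) from its input queue, updates local state, and forwards results to its output queue, and two PEs are wired into a tiny DAG.

```python
from collections import deque
import time

class ProcessingElement:
    """One node of the DAG: reads tuples from an input queue, applies a
    function with local state, and writes results to an output queue."""

    def __init__(self, func):
        self.func = func
        self.state = {}
        self.inbox = deque()
        self.outbox = deque()

    def step(self):
        while self.inbox:
            values, timestamp = self.inbox.popleft()  # data point D_t = (V, τ_t)
            result = self.func(values, timestamp, self.state)
            if result is not None:
                self.outbox.append((result, timestamp))

# Two PEs chained into a tiny DAG: filter large flows, then count them.
def keep_large(values, ts, state):
    return values if values["bytes"] > 1000 else None

def count(values, ts, state):
    state["n"] = state.get("n", 0) + 1
    return state["n"]

pe1, pe2 = ProcessingElement(keep_large), ProcessingElement(count)
for size in (200, 4000, 9000):                 # source node emitting samples
    pe1.inbox.append(({"bytes": size}, time.time()))
pe1.step()
pe2.inbox.extend(pe1.outbox)                   # edge of the DAG: pe1 -> pe2
pe2.step()
print(list(pe2.outbox))                        # running counts with time stamps
```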

Table of contents:

1 Introduction 
1.1 Objectives
1.2 Text Organization
2 Conclusion 
2.1 Future Work
3 Related Work 
3.1 Stream Processing Platforms Comparison
3.2 Real-Time Threat Detection
3.3 Virtual Network Function
3.4 Service Chaining
4 Threat Detection using Stream Processing 
4.1 Methods of Data Processing
4.2 The Stream Processing
4.3 Stream Processing Platforms
4.3.1 Apache Storm
4.3.2 Apache Flink
4.3.3 Apache Spark Streaming
4.4 Performance Evaluation of the Platforms
4.4.1 Experiments Results
4.5 The CATRACA Tool
4.5.1 CATRACA Architecture
5 Dataset and Feature Selection 
5.1 Security Dataset Creation
5.2 Feature Selection and Dimensionality Reduction
5.2.1 Feature Selection
5.2.2 Correlation Based Feature Selection
5.2.3 Case of Use: Traffic Classification
5.2.4 Classification Results
5.2.5 Preprocessing Stream Data
6 The Virtual Network Function 
6.1 The Network Function Virtualization
6.1.1 The Open source Platform for Network Function Virtualization (OPNFV)
6.1.2 Threat-Detection Prototype Performance Evaluation
6.2 Modeling and Optimization Strategy for VNF Sensor Location
6.2.1 Optimal VNF Sensor Placement
6.3 The Virtual Network Function Chaining Problem
6.3.1 The Proposed VNF Chaining Scheme
6.4 The Evaluation of the Proposal
7 Conclusion 
7.1 Future Work
Bibliography
