Enhancing the Data Plane with ICN
Information Centric Networking (ICN) is a communication paradigm that has recently been proposed in the research community. Its main goal is to improve the Network Layer so that it better matches current Internet usage, which is achieved by basing mechanisms such as forwarding and routing on names rather than locations. The proposed network model offers several advantages, which can be summarized as: network traffic reduction and native support for advanced functionalities such as loop detection and multipath forwarding [AD12].
The ICN model has inspired several research activities: among them, the Named Data Networking (NDN) architecture proposed by Van Jacobson [JSB+09] and Zhang [ZEB+10] is the most popular. In the following we focus on NDN, presenting its architecture and implementation, as well as its main features and the challenges that arise from such an approach.
Architecture of NDN
In NDN, the focus is on content, including videos, images and data, but also services and resources, which are represented by some information in the network. In particular, a content item is any piece of information that a user may request. It is split into multiple segments called content chunks. Each content chunk is marked with a unique identifier called a content name. The content name is used to request all chunks of a given content item and to perform data forwarding in the NDN network. The forwarding mechanism that takes into account a content name instead of the location of a node is called name-based forwarding.
There are two main naming schemes used to map content chunks to their corresponding names [Bar12]. The choice of the naming scheme may affect several properties of the NDN network, such as scalability, self-certification and forwarding simplicity.
In a hierarchical scheme, a content name consists of a sequence of components followed by a chunk identifier, with the components separated by a delimiter character; the components can be human-readable strings, similarly to what happens with web URLs. A possible example of such a name is <fr/inria/students/AliceBob/2>, which represents the chunk with identifier 2 of the content fr/inria/students/AliceBob. Any leading sequence of contiguous components is called a prefix. This scheme allows aggregation at many levels (e.g. all content names starting with the same prefix), and it binds different prefixes to different scopes: that is, the same local content name may be reused within a different domain (a different prefix). Conversely, it makes it hard to perform operations such as a lookup on a table using the content name as key. This naming scheme is the one adopted by NDN.
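The component structure of a hierarchical name can be illustrated with a short sketch. The function below is illustrative (not part of the NDN codebase) and assumes `/` as the delimiter, with the chunk identifier as the last component:

```python
def prefixes(name: str):
    """Enumerate the prefixes of a hierarchical content name.

    A name such as /fr/inria/students/AliceBob/2 is split into
    components; every leading run of components is a prefix.
    """
    components = name.strip("/").split("/")
    return ["/" + "/".join(components[:i]) for i in range(1, len(components) + 1)]

name = "/fr/inria/students/AliceBob/2"
print(prefixes(name))
# ['/fr', '/fr/inria', '/fr/inria/students',
#  '/fr/inria/students/AliceBob', '/fr/inria/students/AliceBob/2']
```

Every entry in this list is a valid aggregation point for FIB entries sharing that prefix.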
Forwarding module: design and implementation
In this section, we detail the two-stage name-based LPM algorithm used in the forwarding module of Caesar. Sections 3.3.1 and 3.3.2 describe the prefix Bloom filter (PBF) and the concept of block expansion, respectively; both are used in the first stage to find the length of the longest prefix match. Section 3.3.3 shows our technique to reduce the number of hash computations. Section 3.3.4 describes the hash table used in the second stage, along with the optimizations introduced to speed up lookups.
Prefix Bloom Filter
The first stage of our name-based LPM algorithm consists in finding the greatest length of a prefix matching an incoming content name. To perform this lookup, we introduce a novel data structure called the prefix Bloom filter (PBF). The PBF exploits the hierarchical semantics of content prefixes to find the longest prefix match using (most of the time) a single memory access.
Similarly to a Bloom filter, the PBF is a space-efficient data structure made of several blocks. Each block is a small Bloom filter: its size is chosen to fit one or multiple cache lines of the memory in which it is stored (e.g. DRAM caches, not to be confused with CPU caches, are usually 64-128 MBytes, while CPU caches are typically smaller than 2 MB). During the population of the FIB, a content prefix is stored both in the FIB and in a block chosen from the hash value of its first component. During the lookup of a content name, its first component identifies the unique block that must be loaded from memory to cache before conducting the membership query. The possible load unbalancing issues due to this choice are investigated in Section 3.3.2.
A PBF comprises b blocks of m bits each, with k hash functions; the overall size of the PBF is M = b · m. A scheme of the insertion of a content prefix is shown in Figure 3.1. Let p = /c1/c2/.../cd be a d-component prefix to be inserted into the PBF. The hash value resulting from a hash function g() (chosen on purpose to be as uniform as possible), with output in the range {0, 1, ..., b − 1}, determines the block where p should be inserted. Let now pi = /c1/c2/.../ci be the subprefix with the first i components of p, such that p1 = /c1, p2 = /c1/c2, p3 = /c1/c2/c3, and so on. We choose the block in the PBF identified by the value g(p1), which is the result of the hash computation performed on the subprefix p1 defined by the first component. This implies that all prefixes starting with the same component are stored in the same block, which enables fast lookups (the possible load balancing issues are discussed in Section 3.3.2). Once the block is selected, the values h1(p), h2(p), ..., hk(p) are computed over the entire prefix p, resulting in k indexes within the range {0, 1, ..., m − 1}. Finally, the bits at positions h1(p), h2(p), ..., hk(p) in the selected block are set to 1.
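The insertion and membership test described above can be sketched in a few lines of Python. This is a minimal illustration, not the Caesar implementation: salted BLAKE2 digests stand in for the uniform hash functions g() and h_i(), Python integers serve as per-block bitmaps, and the class and parameter names are ours.

```python
import hashlib

class PrefixBloomFilter:
    """Sketch of a prefix Bloom filter: b blocks of m bits, k hash functions.
    The block is chosen by g(p1), i.e. from the first component only."""

    def __init__(self, b=1024, m=512, k=4):
        self.b, self.m, self.k = b, m, k
        self.blocks = [0] * b                     # each int is an m-bit block

    def _hash(self, data: str, salt: int, mod: int) -> int:
        # Salted BLAKE2 stands in for the k independent hash functions.
        h = hashlib.blake2b(data.encode(), salt=salt.to_bytes(8, "big"))
        return int.from_bytes(h.digest()[:8], "big") % mod

    def _block_index(self, prefix: str) -> int:
        # g(p1): all prefixes sharing a first component land in one block,
        # so a lookup touches a single block (one memory access).
        first = "/" + prefix.strip("/").split("/")[0]
        return self._hash(first, 0, self.b)

    def insert(self, prefix: str):
        blk = self._block_index(prefix)
        for i in range(1, self.k + 1):            # h_1(p) ... h_k(p) over all of p
            self.blocks[blk] |= 1 << self._hash(prefix, i, self.m)

    def query(self, prefix: str) -> bool:
        blk = self.blocks[self._block_index(prefix)]
        return all((blk >> self._hash(prefix, i, self.m)) & 1
                   for i in range(1, self.k + 1))
```

As with any Bloom filter, `query` may return false positives but never false negatives, which is why the second stage verifies the match in a hash table.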
Our Caesar extension to accelerate packet forwarding makes use of a GPU. First, a brief background on the architecture and operation of the NVIDIA GTX 580 [nG] used in our implementation is provided. Then, a discussion of our name-based LPM solution using this GPU is presented. The NVIDIA GTX 580 GPU is composed of 16 streaming multiprocessors (SMs), each with 32 stream processors (SPs) running at 1,544 MHz. This GPU has two memory types: a large, but slow, device memory and a small, but fast, shared memory. The device memory is an off-chip 1.5 GB GDDR5 DRAM, which is accelerated by an L2 cache shared by all SMs. The shared memory is an individual on-chip 48 KB SRAM per SM. Each SM also has several registers and an L1 cache to accelerate device memory accesses.
All threads in the GPU execute the same function, called a kernel. The level of parallelism of a kernel is specified by two parameters, namely, the number of blocks and the number of threads per block. A block is a set of concurrently executing threads that collaborate using shared memory and barrier synchronization primitives. At run-time, each block is assigned to an SM and divided into warps, or sets of 32 threads, which are independently scheduled for execution. Each thread in a warp executes the same instruction in lockstep.
An application that aims to offload processing to a GPU works as follows. First, the application copies the data to be processed from CPU to device memory. Then, the application launches a kernel; the kernel reads the input data from device memory, performs the desired operation and then writes results back to device memory. Finally, the application copies the results back from device memory to CPU memory. GPUs suit compute-intensive applications because of their extreme thread-level parallelism and latency-hiding capabilities.
Name-based LPM: We introduce a few modifications made to the LPM algorithm to achieve an efficient GPU implementation. Due to the serial nature of Caesar's processing units, the original algorithm uses a PBF to test for several prefix lengths in the same filter. However, to take advantage of the high level of parallelism in GPUs, an LPM approach that uses one Bloom filter and one hash table per prefix length is more efficient. Since large FIBs are expected, both the Bloom filters and the hash tables are stored in device memory.
For high GPU utilization, multiple warps must be assigned to each SM such that, when a warp stalls on a memory read, other warps are waiting to be scheduled. The GTX 580 can have up to 8 blocks concurrently allocated and executing per SM, for a total of 128 blocks. Content prefixes are assumed to have 128 components or fewer, and thus we have one block per prefix length in the worst case. Since such a large number of components is rare, we allow a higher degree of parallelism with multiple blocks working on the same prefix length. In this case, each block operates on a different subset of the content names received from a line card.
To keep track of the prefix lengths available in the FIB, we use a prefix length mask (PLM), a bit array where the i-th bit indicates the presence of at least one prefix with i components in the FIB. We have size(PLM) = 128, matching the highest number of components that we can handle at the maximum level of parallelism. This size is chosen after finding the maximum prefix length stored in the FIB, at compile-time. When matching a content name with d components, the PLM is first checked from position d down to 1. We preload the PLM into shared memory to speed up the masking of a content name. Our algorithm receives as input arrays B, H, and C that contain the Bloom filters, hash tables, and content names to be looked up, respectively. The kernel identifies the length of the longest prefix in the FIB that matches each content name c ∈ C and stores it in the array L, which is then returned to the line card. All these arrays are located in device memory.
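The per-length lookup logic that each GPU thread applies can be sketched sequentially in Python. This is a CPU-side illustration under our own naming, not the kernel itself: plain sets stand in for the per-length Bloom filters and hash tables, and lengths are scanned from d down to 1 as described above (the real kernel probes lengths in parallel and keeps the longest verified hit).

```python
def lpm_per_length(name_components, blooms, tables, plm):
    """Return the length of the longest FIB prefix matching a content name.

    blooms[i] / tables[i] hold the prefixes with exactly i components;
    plm[i] (the prefix length mask) says whether any such prefix exists.
    """
    d = len(name_components)
    for i in range(min(d, len(plm) - 1), 0, -1):
        if not plm[i]:
            continue                      # no prefix of this length in the FIB
        prefix = "/" + "/".join(name_components[:i])
        if prefix in blooms[i]:           # Bloom filter: may be a false positive
            if prefix in tables[i]:       # hash table confirms the match
                return i
    return 0

# Toy FIB with prefixes /fr (1 component) and /fr/inria (2 components).
blooms = {1: {"/fr"}, 2: {"/fr/inria"}, 3: set()}
tables = {1: {"/fr"}, 2: {"/fr/inria"}, 3: set()}
plm = [False, True, True, False]          # bits 1 and 2 are set

print(lpm_per_length(["fr", "inria", "students"], blooms, tables, plm))  # 2
```

Checking the PLM before probing a Bloom filter avoids device memory reads for lengths that cannot match.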
Hash table dimensioning and analysis
We dimension our hash table to contain all the prefixes of the reference workload described in Section 2.4.3. The choice of the hash table size takes into account the load factor α, defined as the ratio between the number of elements stored in the hash table and the number of buckets it contains. Given that b denotes the number of buckets in our table, and n is the number of stored elements, we have α = n/b. We recall that in our design, as described in Section 3.3.4, every bucket contains s = 8 slots.
The load factor is related to the access speed and memory usage of the hash table: for α ≪ 1, the hash table is almost empty and the access speed may increase, at the cost of wasted memory. For α ≫ 1, the table is overloaded and many slots per bucket are used: this increases the average lookup time, because many slots might have to be checked in order to find a match. Our design choice is to set b = n, that is, a load factor α = 1. Since the size of a bucket is one cache line, it follows that the hash table requires 1.28 GB to store the buckets. Next-hop storage, as well as the content prefixes and the MAC address information, requires 640 MB of memory, for a total of 1.92 GB.
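The figures above can be checked with quick arithmetic, assuming n = 10M prefixes from the reference workload and 128-byte buckets (one cache line per bucket in this design, as implied by the 1.28 GB total):

```python
n_prefixes = 10_000_000      # reference workload; b = n, so load factor is 1
bucket_bytes = 128           # one cache line per bucket (s = 8 slots)

buckets_gb = n_prefixes * bucket_bytes / 1e9
aux_gb = 0.640               # next-hops, content prefixes, MAC addresses
total_gb = buckets_gb + aux_gb

print(buckets_gb, total_gb)  # 1.28 GB for the buckets, 1.92 GB in total
```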
When the FIB is populated, every prefix is stored in a slot of a bucket, selected as described in Section 3.3.4. For each prefix p, we calculate the value h(p) using the CRC32 function, and select the bucket at position h(p) mod b. Then, we choose the first available slot, where the prefix is eventually stored. If the bucket overflows, the last slot is used as a pointer to an appended linked list of entries. Figure 3.4 shows the distribution of the number of prefixes per bucket, calculated over the reference workload with b = 10M and s = 8. We note that no bucket has more than 8 elements stored: this is important, since it shows that we do not need to manage bucket overflows by means of the extra linked list. Finally, 99% of the buckets have fewer than 4 occupied slots, and about 37% of the buckets are empty.
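The bucket selection and overflow handling can be sketched as follows. This is an illustrative model with our own names, using `zlib.crc32` for h(p); for simplicity the chained entries live in a separate list, standing in for the pointer kept in the last slot of an overflowing bucket.

```python
import zlib

class FixedSlotHashTable:
    """Sketch: b buckets of s slots each; CRC32 picks the bucket, the first
    free slot stores the prefix, and excess entries go to a chained list."""

    def __init__(self, b=1024, s=8):
        self.b, self.s = b, s
        self.buckets = [[] for _ in range(b)]
        self.overflow = [[] for _ in range(b)]    # stands in for linked lists

    def _bucket(self, prefix: str) -> int:
        return zlib.crc32(prefix.encode()) % self.b   # h(p) mod b

    def insert(self, prefix: str):
        i = self._bucket(prefix)
        if len(self.buckets[i]) < self.s:
            self.buckets[i].append(prefix)        # first available slot
        else:
            self.overflow[i].append(prefix)       # bucket overflow

    def lookup(self, prefix: str) -> bool:
        i = self._bucket(prefix)
        return prefix in self.buckets[i] or prefix in self.overflow[i]
```

With the measured distribution (no bucket exceeding 8 elements), the overflow path is never exercised on the reference workload.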
Table of contents:
1.1 Telecommunication Networks
1.1.1 Circuit switching and packet switching
1.1.2 Computer Networks
1.2 The Internet
1.2.1 The TCP/IP Protocol Stack
1.2.2 Data Plane and Control Plane
1.3 The evolution of the Internet
1.3.1 The behavioral issue
1.3.2 The architectural issue
1.4 Enhancing the Data Plane with ICN
1.4.1 Architecture of NDN
1.4.2 Features of ICN and NDN
1.4.3 Implementation of ICN
1.5 The SDN approach for an evolvable Network
1.5.2 Implementation of SDN
1.5.3 Features of SDN
I Data Plane Enhancement
2 Introduction to the First Part
2.1 Design principles
2.2 Design space
2.4 Methodology and testbed
2.4.2 Test equipment
3 Forwarding module
3.2 Design space
3.2.1 Related work
3.2.3 Data structure
3.3 Forwarding module: design and implementation
3.3.1 Prefix Bloom Filter
3.3.2 Block Expansion
3.3.3 Reducing the number of hashing operations
3.3.4 Hash table design
3.3.5 Caesar extensions
3.4.1 Experimental setting
3.4.2 Performance evaluation
3.4.3 Distributed Processing
3.4.4 GPU Off-load
4 PIT module
4.2 Design space
4.2.1 Related work
4.2.3 Data structure
4.2.4 Timer support
4.2.5 Loop detection
4.2.6 Parallel access
4.3 PIT: design and implementation
4.3.1 PIT placement and packet walktrough
4.3.2 Data structure
4.3.3 PIT operations
4.3.4 Timer support
4.3.5 Loop detection with Bloom filter
4.4.1 Experimental setting
4.4.2 Memory footprint
4.4.3 Throughput without timer
4.4.4 Throughput with timer
II Network Verification
5 Introduction to the Second Part
5.1 Network Verification
5.2 State of the art
6 Forwarding rule verification through atom computation
6.1.2 Header Classes
6.1.3 Set representation
6.1.4 Representation of a collection of sets
6.2 Atoms generated by a collection of sets
6.2.1 Representing atoms by uncovered combinations
6.2.2 Overlapping degree of a collection
6.3 Incremental computation of atoms
6.3.1 Computation of atoms generated by a collection of sets
6.3.2 Application to forwarding loop detection
6.4 Theoretical comparison with related work
6.4.1 Related notion of weak completeness
6.4.2 Lower bound for HSA / NetPlumber
6.4.3 Lower bound for VeriFlow
6.4.4 Linear fragmentation versus overlapping degree
Table of symbols