Packet Demultiplexing and Delivery
When delivering packets to tenants’ domains in multi-tenant networks, the last few processing steps are often performed at end hosts, on hardware shared with the tenants’ workloads. Their efficiency is therefore critical. In this section, we describe these processing steps and survey recent works that improve their efficiency in multi-tenant setups. We focus on the receive path; the transmit path involves similar processing steps, in reverse order.
To enforce isolation between virtual machines, virtualization platforms must intercept all I/O operations. Networking is no exception: on the receive path, packets are demultiplexed to virtual machines based on their headers; on the transmit path, the hypervisor validates packets to prevent malicious behaviors (e.g., spoofing attacks or floods).
In the Xen virtualization platform, a privileged virtual machine, the host domain, is responsible for virtualizing I/O accesses. When packets arrive at the Network Interface Card (NIC), an interrupt is first routed to the hypervisor, which notifies the host domain. The NIC then DMAs the packet to the host domain’s memory. At that point, the host domain can inspect headers to determine the destination guest domain (virtual machine) for that packet.
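The header-inspection step can be illustrated with a minimal sketch, assuming a simple table that maps each guest vNIC’s MAC address to a domain identifier. The names and table layout below are illustrative, not Xen’s actual data structures.

```c
#include <stdint.h>
#include <string.h>

#define MAX_GUESTS 16

struct demux_entry {
    uint8_t mac[6];   /* guest vNIC MAC address */
    int     domid;    /* guest domain identifier */
};

static struct demux_entry demux_table[MAX_GUESTS];
static int demux_entries;

/* Register a guest vNIC so its frames can be demultiplexed. */
int demux_register(const uint8_t mac[6], int domid)
{
    if (demux_entries == MAX_GUESTS)
        return -1;
    memcpy(demux_table[demux_entries].mac, mac, 6);
    demux_table[demux_entries].domid = domid;
    demux_entries++;
    return 0;
}

/* Inspect the Ethernet header of a received frame (destination MAC
 * comes first) and return the destination domain, or -1 if no guest
 * matches. */
int demux_lookup(const uint8_t *frame)
{
    for (int i = 0; i < demux_entries; i++)
        if (memcmp(frame, demux_table[i].mac, 6) == 0)
            return demux_table[i].domid;
    return -1;
}
```

In practice the lookup is a hash table rather than a linear scan, but the principle, matching link-layer headers against per-guest entries, is the same.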
In the original version of Xen, packets were delivered to virtual machines by exchanging memory pages between the host domain and the guest domain, a technique often referred to as page flipping or page remapping. A. Menon et al. showed that the cost of mapping and unmapping memory pages for the exchange equals the cost of copying a large (1500 B) packet. Page flipping was therefore abandoned in subsequent versions of Xen in favor of packet copies. In later work, J. Santos et al. proposed a number of implementation and architectural improvements to Xen’s networking path and performed an in-depth analysis of its CPU cost. Two architectural changes played a decisive role in the performance improvements:
The guest’s CPU becomes responsible for copying packets from the host memory to the guest’s memory. Before this change, the host’s CPU copied packets to the guest domain. Since the guest’s CPU is likely to read packets again afterward, if only to copy them to the guest’s userspace, this change improves cache hit rates.
In Xen, the guest grants the host domain the right to write to a few of its memory pages, in order to receive packets. J. Santos et al. removed the need to perform a new grant request per packet; the guest can now recycle pages previously granted to the host domain.
These two design changes stood the test of time and were retained in the more recent paravirtualized virtio driver. In virtio, the host requests a few of the guest’s memory pages into which it writes incoming packets. The guest’s driver then copies packets to its own memory when it allocates its internal packet data structure (e.g., sk_buff in the Linux kernel).
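The combination of the two changes can be sketched as follows. This is an illustrative model, not the actual virtio or Xen code: the guest posts a fixed pool of pages to the host once, the host writes packets into posted pages, and the guest’s CPU copies each packet into its own freshly allocated buffer (the analogue of an sk_buff) before recycling the page back to the pool, so no per-packet grant or page flip is needed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE 4
#define PAGE_SIZE 4096

struct rx_page { uint8_t data[PAGE_SIZE]; size_t len; };

static struct rx_page pool[POOL_SIZE];   /* pages granted to the host once */
static int avail[POOL_SIZE];             /* indices currently posted */
static int navail;

/* Guest side, at setup time: post all pages to the host. */
void post_all_pages(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        avail[navail++] = i;
}

/* Host side: write a received packet into the next posted page.
 * Returns the page index, or -1 if no page is available. */
int host_deliver(const void *pkt, size_t len)
{
    if (navail == 0 || len > PAGE_SIZE)
        return -1;
    int idx = avail[--navail];
    memcpy(pool[idx].data, pkt, len);
    pool[idx].len = len;
    return idx;
}

/* Guest side: copy the packet into a guest-owned buffer, then recycle
 * the same page (no new grant) instead of flipping it. */
uint8_t *guest_receive(int idx, size_t *len)
{
    uint8_t *buf = malloc(pool[idx].len);   /* guest-owned "sk_buff" */
    memcpy(buf, pool[idx].data, pool[idx].len);
    *len = pool[idx].len;
    avail[navail++] = idx;                  /* repost the same page */
    return buf;
}
```

Because `guest_receive` runs on the guest’s CPU, the copied packet data lands in caches that the guest will hit again when it processes the packet.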
Network Function Virtualization. With the advent of network function virtualization, researchers focused on improving virtualization performance for network I/O-bound workloads. ClickOS and NetVM are two virtualization platforms for network-intensive workloads that come with revamped vNICs, software switches, and guest operating system designs. ClickOS is based on Xen and includes many of J. Santos et al.’s improvements. On the host domain side, ClickOS relies on the VALE software switch, with two threads to poll packets from the NIC to the virtual machines and vice versa. On the guest side, it uses a version of the MiniOS unikernel tailored for packet processing. In particular, a single thread in MiniOS polls packets from the VALE vNIC. Since unikernel OSes have a single address space, using the MiniOS unikernel removes the need for an additional kernel-to-userspace copy and significantly boosts performance. To implement network functions, ClickOS relies on the Click packet processing framework.
They report a 14.2 Mpps forwarding speed through a virtual machine using 3 dedicated threads: one polling from the NIC on the host, the second polling from the vNIC in the virtual machine, and the last polling from the vNIC on the host and sending packets back through the NIC. The setup used in ClickOS’s evaluation is illustrated in Figure 2.1, along with NetVM’s. Published the same year, NetVM took a fairly different approach. NetVM is based on the KVM hypervisor and uses a userspace packet processing library, DPDK, to poll packets from the physical and virtual NICs. In addition, where ClickOS uses VALE, NetVM comes with its own demultiplexing logic. More importantly, NetVM has a zero-copy design in which packets are DMAed to hugepages on the host, and virtual machines can read packets from these hugepages without copying them. This zero-copy design comes at the cost of isolation, as any virtual machine can access all packets received from the NIC. In the design of NetVM, the authors mention, but neither implement nor evaluate, the possibility of isolating several trust groups.
They report a throughput of 14.88 Mpps with four dedicated threads: two on the host, polling from the NIC on the receive path and from the vNICs on the transmit path, and two in the virtual machine to poll packets from the vNIC, process them, and send them back to the host. The authors doubled the number of polling threads to dedicate one to each of the two NUMA nodes of their system. Their evaluation is, however, limited by the 10 Gbps NIC. In addition, their evaluation doesn’t lend itself easily to comparison with ClickOS: they use one more polling thread and a significantly different CPU. Taking the hardware differences into account, and given that NetVM requires one less copy per packet than ClickOS, it would likely still be able to saturate the 10 Gbps NIC with a single NUMA node and two threads.
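The 14.88 Mpps figure is not arbitrary: it is exactly the line rate of a 10 Gbps Ethernet link for minimum-size (64 B) frames, once the 8 B preamble/SFD and 12 B inter-frame gap that each frame occupies on the wire are accounted for. The small helper below makes the arithmetic explicit.

```c
/* Packets per second at line rate: each frame occupies
 * frame_bytes + 8 B preamble/SFD + 12 B inter-frame gap on the wire.
 * For 64 B frames, that is 84 B = 672 bits per packet, so a 10 Gbps
 * link carries at most 10e9 / 672 ≈ 14.88 Mpps. */
double line_rate_mpps(double gbps, unsigned frame_bytes)
{
    const unsigned overhead = 8 + 12;            /* preamble + IFG */
    double bits_per_pkt = (frame_bytes + overhead) * 8.0;
    return gbps * 1e3 / bits_per_pkt;            /* Gbps -> Mpps */
}
```

Saturating this rate is thus the strongest result a 10 Gbps testbed can show; any further headroom in NetVM’s design is invisible behind the NIC’s limit.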
Besides delivering packets to tenants’ domains, end hosts must also decide whether and where to forward packets, on the receive path as well as the transmit path. In practice, because of the large number of virtual machines or containers per host and the complexity of network policies, executing the logic to decide whether and where to forward packets can be expensive. In this section, we survey works on the software switch, the end-host component in charge of executing that logic. We begin with a brief discussion of how the role of the software switch in multi-tenant networks has evolved over the last decade. We then review the literature on packet classification algorithms, the algorithms that execute the aforementioned logic. Finally, we discuss the challenge of extending software switches, a problem which we address in Chapter 3.
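The per-packet decision logic amounts to packet classification: matching header fields against a prioritized rule list with wildcards. The toy classifier below, a first-matching-rule linear scan over 5-tuple rules, illustrates the problem; field names are illustrative, and real software switches replace the linear scan with tuple-space search, tries, or flow caches precisely because this naive form is too slow for large rule sets.

```c
#include <stdint.h>

struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

enum action { DROP = 0, FORWARD = 1 };

struct rule {
    struct flow match;   /* field values to compare */
    struct flow mask;    /* 0 bits are wildcarded */
    enum action act;
};

/* First-matching-rule-wins classification; default deny. */
enum action classify(const struct rule *rules, int n, const struct flow *f)
{
    for (int i = 0; i < n; i++) {
        const struct rule *r = &rules[i];
        if ((f->src_ip   & r->mask.src_ip)   == r->match.src_ip   &&
            (f->dst_ip   & r->mask.dst_ip)   == r->match.dst_ip   &&
            (f->src_port & r->mask.src_port) == r->match.src_port &&
            (f->dst_port & r->mask.dst_port) == r->match.dst_port &&
            (f->proto    & r->mask.proto)    == r->match.proto)
            return r->act;
    }
    return DROP;
}
```

With thousands of rules per host, the cost of this lookup on every packet is what motivates the classification algorithms surveyed next.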
In the first virtualization platforms, networking between virtual machines and the physical network was managed at the link layer (and, less frequently, at the network layer). A software component of the hypervisor would therefore demultiplex packets to virtual machines based on their Ethernet addresses and VLAN tags. This component, generally referred to as the virtual switch, could also enforce policies, such as rate-limiting traffic or filtering outbound packets to prevent spoofing. Other approaches were proposed to process packets in hardware with higher performance.
The NIC or the upstream top-of-rack (ToR) switch can, for example, enforce policies at much higher speeds than the host’s CPU. These approaches, however, are limited by the PCIe bandwidth and, in the case of the ToR switch, require additional tagging of packets and demultiplexing at the hypervisor.
Among hardware offloads, we identify and discuss three generations of offloading features. This evolution is motivated by the continuous demand for high performance and the emergence of new processing workloads.
Protocol Offloads. The first offloads from the CPU to the NIC focused on specific network and transport-layer protocol computations, often targeting the widespread TCP/IP suite and encapsulations thereof.
After discussing the challenges associated with checksum offloading for TCP/IP, J. S. Chase et al. evaluate its performance benefits. Although largely taken for granted today, TCP checksum offloading is not straightforward for NICs to support. Computing TCP checksums involves parsing the IP header, handling the computation of a checksum over several fragments, and ensuring checksums are computed before sending packets on the wire (thereby preventing cut-through switching).
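The computation being offloaded is the ones’-complement Internet checksum of RFC 1071, sketched below. TCP additionally covers a pseudo-header built from the IP source and destination addresses, which is why the NIC must parse the IP header to offload it.

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 Internet checksum: sum the data as 16-bit big-endian
 * words in ones'-complement arithmetic, then return the complement. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {                  /* sum 16-bit words */
        sum += (uint32_t)data[0] << 8 | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)                      /* pad an odd trailing byte */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)                  /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;             /* ones' complement */
}
```

A useful property for verification: recomputing the checksum over the data with the transmitted checksum appended yields zero, which is how receivers (or receive-side checksum offload) validate packets.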
Several researchers and hardware vendors proposed to fully offload TCP/IP processing to the NIC, using TCP Offload Engines (TOEs). Full support for TCP/IP in the NIC was never widely adopted and was notably rejected from the Linux kernel. Although J. Mogul argues that TOE should be reconsidered with the advent of storage protocols over IP, he also makes a strong case against it by detailing the counterarguments. These arguments mostly pertain to the limited performance benefit compared to partial offloads (checksumming and TCP segmentation offload), the complexity of the implementation, and, more generally, the inflexibility of hardware implementations.
However, even partial TCP/IP offloads are fairly fragile in the presence of encapsulation. For example, T. Koponen et al. explain that IP encapsulation prevents protocol offloads because the NIC is unable to parse the inner headers. To overcome this limitation, they propose a new encapsulation scheme, STT, with a fake TCP header after the outer IP header to enable TCP offloads.