

SLIDE 1

2010 Sep.

PacketShader:

A GPU-Accelerated Software Router

Sangjin Han†

In collaboration with:

Keon Jang†, KyoungSoo Park‡, Sue Moon†

† Advanced Networking Lab, CS, KAIST ‡ Networked and Distributed Computing Systems Lab, EE, KAIST

SLIDE 2

PacketShader:

A GPU-Accelerated Software Router

High-performance

Our prototype: 40 Gbps on a single box

SLIDE 3

Software Router

  • Despite its name, not limited to IP routing
  • You can implement whatever you want on it.
  • Driven by software
  • Flexible
  • Friendly development environments
  • Based on commodity hardware
  • Cheap
  • Fast evolution

SLIDE 4

Now 10 Gigabit NIC is a commodity

  • From $200 – $300 per port
  • Great opportunity for software routers

SLIDE 5

Achilles’ Heel of Software Routers

  • Low performance
  • Due to CPU bottleneck

Year  Reference                        H/W                            IPv4 Throughput
2008  Egi et al.                       Two quad-core CPUs             3.5 Gbps
2008  "Enhanced SR", Bolla et al.      Two quad-core CPUs             4.2 Gbps
2009  "RouteBricks", Dobrescu et al.   Two quad-core CPUs (2.8 GHz)   8.7 Gbps

  • Not capable of supporting even a single 10G port
SLIDE 6

CPU BOTTLENECK

SLIDE 7

Per-Packet CPU Cycles for 10G

Cycles needed per packet (on x86; cycle numbers are from RouteBricks [Dobrescu09] and ours):

  • IPv4:  Packet I/O (1,200) + IPv4 lookup (600)                  = 1,800 cycles
  • IPv6:  Packet I/O (1,200) + IPv6 lookup (1,600)                = 2,800 cycles
  • IPsec: Packet I/O (1,200) + Encryption and hashing (5,400) + … = 6,600 cycles

Your budget: 1,400 cycles per packet (10G, min-sized packets, dual quad-core 2.66 GHz CPUs)
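The 1,400-cycle budget above follows from minimum-frame arithmetic on 10 GbE. A quick sanity check (standard Ethernet wire overheads assumed; hardware numbers are the slide's, not measurements of mine):

```python
# Back-of-the-envelope check of the slide's per-packet cycle budget.
LINE_RATE_BPS = 10e9                    # 10 GbE
WIRE_BYTES = 64 + 20                    # min frame + preamble/SFD/inter-frame gap
pps = LINE_RATE_BPS / (WIRE_BYTES * 8)  # ~14.88M packets/s at line rate
cycles_per_sec = 2 * 4 * 2.66e9         # dual quad-core 2.66 GHz CPUs
budget = cycles_per_sec / pps           # cycles available per packet
print(round(budget))                    # 1430 -- the slide's "1,400 cycles"
```

Even the IPv4 stack (1,800 cycles) overshoots this budget, which is why a CPU alone cannot keep a single 10G port busy with small packets.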

SLIDE 8

Our Approach 1: I/O Optimization


  • 1,200 reduced to 200 cycles per packet

  • Main ideas
  • Huge packet buffer
  • Batch processing

(Figure: the cycle stacks from slide 7, with the packet-I/O portion shrunk)
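The two I/O ideas on this slide can be sketched as follows. The class and function names here are illustrative only, not PacketShader's actual driver API; the point is that one up-front allocation plus one driver call per batch replaces a per-packet allocation and a per-packet call:

```python
# Sketch of "huge packet buffer" + "batch processing" (illustrative names).
class HugePacketBuffer:
    def __init__(self, slots, slot_size=2048):
        # one big allocation up front instead of per-packet buffer allocation
        self.buf = bytearray(slots * slot_size)
        self.slot_size = slot_size

    def slot(self, i):
        # zero-copy view into the preallocated region
        s = self.slot_size
        return memoryview(self.buf)[i * s:(i + 1) * s]

def rx_batch(buf, count):
    # fetch `count` packets with ONE call into the driver, amortizing
    # the fixed per-call cost over the whole batch
    return [buf.slot(i) for i in range(count)]

buf = HugePacketBuffer(slots=64)
batch = rx_batch(buf, 32)
```

Amortizing fixed costs over a batch is what takes the 1,200 per-packet I/O cycles down toward 200.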

SLIDE 9

Our Approach 2: GPU Offloading


  • GPU offloading for memory-intensive or compute-intensive operations
  • Main topic of this talk

(Figure: the remaining cycle stacks — IPv4/IPv6 lookup, encryption and hashing — now targeted for GPU offloading)

SLIDE 10

WHAT IS GPU?

SLIDE 11

GPU = Graphics Processing Unit

  • The heart of graphics cards
  • Mainly used for real-time 3D game rendering
  • Massively-parallel processing capacity

(Ubisoft's AVATAR, from http://ubi.com)

SLIDE 12

CPU vs. GPU


CPU:

Small # of super-fast cores

GPU:

Large # of small cores

SLIDE 13

“Silicon Budget” in CPU and GPU

Xeon X5550: 4 cores, 731M transistors

GTX480: 480 cores, 3,200M transistors

SLIDE 14

GPU FOR PACKET PROCESSING

SLIDE 15

Advantages of GPU for Packet Processing

  • 1. Raw computation power
  • 2. Memory access latency
  • 3. Memory bandwidth
  • Comparison between
  • Intel X5550 CPU
  • NVIDIA GTX480 GPU

SLIDE 16

(1/3) Raw Computation Power

  • Compute-intensive operations in software routers
  • Hashing, encryption, pattern matching, network coding, compression, etc.

  • GPU can help!

Instructions per second:

CPU: 43×10⁹ = 2.66 (GHz) × 4 (# of cores) × 4 (4-way superscalar)

GPU: 672×10⁹ = 1.4 (GHz) × 480 (# of cores)

CPU < GPU
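Reproducing the slide's arithmetic shows the gap is roughly 16× in peak instruction throughput (peak rates, not sustained performance on real code):

```python
# The slide's instructions-per-second comparison, reproduced.
cpu_ips = 2.66e9 * 4 * 4   # 2.66 GHz x 4 cores x 4-way superscalar = ~43e9
gpu_ips = 1.4e9 * 480      # 1.4 GHz x 480 cores                   = 672e9
ratio = gpu_ips / cpu_ips  # ~15.8x in favor of the GPU
```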

SLIDE 17

(2/3) Memory Access Latency

  • Software routers → lots of cache misses
  • GPU can effectively hide memory latency

(Figure: a GPU core hides memory latency by switching to another thread on each cache miss)
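The figure's idea — keep switching to another ready thread whenever the current one stalls on memory — can be modeled with a toy round-robin scheduler. This is an illustrative sketch of the principle only, not how a GPU warp scheduler is actually implemented:

```python
# Toy model of latency hiding: each yield marks a long memory access,
# and the "scheduler" switches to another thread while it is in flight.
def thread(tid, accesses):
    for i in range(accesses):
        yield f"thread {tid} issues memory access {i}"   # stall point

def round_robin(threads):
    log = []
    while threads:
        t = threads.pop(0)
        try:
            log.append(next(t))      # run until the next memory stall
            threads.append(t)        # switch away while the access completes
        except StopIteration:
            pass                     # thread finished
    return log

log = round_robin([thread(t, 2) for t in range(3)])
```

With enough threads in flight, there is always useful work to run, so memory latency stops dominating — exactly what cache-missing router workloads need.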

SLIDE 18

(3/3) Memory Bandwidth

CPU's memory bandwidth (theoretical): 32 GB/s

SLIDE 19

(3/3) Memory Bandwidth

CPU's memory bandwidth (empirical): < 25 GB/s

Each forwarded packet crosses the memory bus four times:

  • 1. RX: NIC → RAM
  • 2. RX: RAM → CPU
  • 3. TX: CPU → RAM
  • 4. TX: RAM → NIC
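The four crossings eat most of the bus. One illustrative way to get to the "< 10 GB/s" claim on the next slide (the 25 GB/s and the traffic rate are the talk's assumed numbers, not measurements of mine):

```python
# Illustrative arithmetic behind the "< 10 GB/s" processing budget.
empirical_bw = 25.0           # GB/s, measured CPU memory bandwidth
line_rate = 40e9 / 8 / 1e9    # 40 Gbps of forwarded traffic = 5 GB/s
crossings = 4                 # NIC->RAM, RAM->CPU, CPU->RAM, RAM->NIC
left_for_processing = empirical_bw - crossings * line_rate
print(left_for_processing)    # 5.0 GB/s left -- well under 10 GB/s
```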

SLIDE 20

(3/3) Memory Bandwidth

Your budget for packet processing can be less than 10 GB/s

SLIDE 21

(3/3) Memory Bandwidth

Your budget for packet processing can be less than 10 GB/s

GPU's memory bandwidth: 174 GB/s

SLIDE 22

HOW TO USE GPU

SLIDE 23

Basic Idea


Offload core operations to GPU

(e.g., forwarding table lookup)

SLIDE 24

Recap


GTX480: 480 cores

  • For GPU, more parallelism, more throughput
SLIDE 25

Parallelism in Packet Processing

  • The key insight
  • Stateless packet processing = parallelizable

RX queue

  • 1. Batching
  • 2. Parallel processing in GPU

SLIDE 26

Batching → Long Latency?

  • Fast link = enough # of packets in a small time window
  • 10 GbE link: up to 1,000 packets in only 67 μs
  • Much less time with 40 or 100 GbE
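The 67 μs figure is just the batch size divided by the line rate in packets (assuming minimum-sized 64B frames plus 20B of per-frame wire overhead):

```python
# Where the 67-microsecond batching window comes from.
pps = 10e9 / ((64 + 20) * 8)   # ~14.88M min-sized packets/s on 10 GbE
window_s = 1000 / pps          # time to collect a 1,000-packet batch
print(round(window_s * 1e6))   # 67 microseconds
```

So batching adds only tens of microseconds of latency on a fast link, which is negligible for a router.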

SLIDE 27

PACKETSHADER DESIGN

SLIDE 28

Basic Design

  • Three stages in a pipeline

Pre-shader → Shader → Post-shader

SLIDE 29

Packet’s Journey (1/3)

  • IPv4 forwarding example

Pre-shader → Shader → Post-shader

Pre-shader:
  • Checksum, TTL
  • Format check
  • Collected: dst. IP addrs
  • Some packets go to slow-path

SLIDE 30

Packet’s Journey (2/3)

  • IPv4 forwarding example

Pre-shader → Shader → Post-shader

Shader:

  • 1. IP addresses
  • 2. Forwarding table lookup
  • 3. Next hops
SLIDE 31

Packet’s Journey (3/3)

  • IPv4 forwarding example

Pre-shader → Shader → Post-shader

Update packets and transmit
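The three-stage journey of slides 29–31 can be sketched in plain Python. Stage names follow the slides, but the function signatures and packet representation are illustrative only — this is not PacketShader's actual API, and the shader stage here runs on the CPU rather than the GPU:

```python
# Minimal sketch of the pre-shader / shader / post-shader flow
# for the IPv4 forwarding example.
def pre_shader(packets):
    # sanity checks, then collect the fields the shader needs
    good = [p for p in packets if p["ttl"] > 1]   # expired etc. -> slow-path
    return good, [p["dst_ip"] for p in good]

def shader(dst_ips, table):
    # forwarding-table lookup for the whole batch at once
    # (done on the GPU in PacketShader; a dict lookup here)
    return [table.get(ip, "default") for ip in dst_ips]

def post_shader(packets, next_hops):
    # update each packet with the lookup result and hand it to TX
    for p, hop in zip(packets, next_hops):
        p["ttl"] -= 1
        p["next_hop"] = hop
    return packets

table = {"10.0.0.1": "eth1"}
pkts = [{"dst_ip": "10.0.0.1", "ttl": 64}, {"dst_ip": "10.9.9.9", "ttl": 1}]
good, ips = pre_shader(pkts)
out = post_shader(good, shader(ips, table))
```

Splitting the work this way keeps the GPU stage a pure batch function over collected fields, which is what makes it easy to offload.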

SLIDE 32

Interfacing with NICs

Device driver (packet RX) → Pre-shader → Shader → Post-shader → Device driver (packet TX)

SLIDE 33

Device driver → Pre-shader → Shader → Post-shader → Device driver

Scaling with a Multi-Core CPU

Master core Worker cores


SLIDE 34

Device driver → Pre-shader → Shader → Post-shader → Device driver (replicated per CPU)

Scaling with Multiple Multi-Core CPUs

SLIDE 35

EVALUATION

SLIDE 36

Hardware Setup

CPU: quad-core, 2.66 GHz (× 2 → total 8 CPU cores)

GPU: 480 cores, 1.4 GHz (× 2 → total 960 cores)

NIC: dual-port 10 GbE (× 4 → total 80 Gbps)

SLIDE 37

Experimental Setup

Packet generator ⇄ PacketShader, connected by 8 × 10 GbE links

Input traffic of up to 80 Gbps; processed packets sent back

SLIDE 38

Results (w/ 64B packets)

Throughput (Gbps):

              IPv4    IPv6    OpenFlow   IPsec
CPU-only      28.2     8.0      15.6       3.0
CPU+GPU       39.2    38.2      32.0      10.2
GPU speedup   1.4×    4.8×      2.1×      3.5×

SLIDE 39

Example 1: IPv6 forwarding

  • Longest prefix matching on 128-bit IPv6 addresses
  • Algorithm: binary search on hash tables [Waldvogel97]
  • 7 hashings + 7 memory accesses

(Figure: hash tables for prefix lengths 1–128; the binary search probes length 64 first, then 96, 80, …)
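A hedged sketch of binary search on prefix lengths [Waldvogel97]: keep one hash table per prefix length, probe the median length, and move to longer lengths on a hit, shorter on a miss. A real implementation inserts "marker" entries so that a miss never hides a longer match; this toy table includes the one marker it needs by hand, and the addresses/next hops are invented for illustration:

```python
# Toy longest-prefix match via binary search on hash tables.
# For IPv6's 128 possible lengths this needs ceil(log2(128)) = 7 probes,
# matching the slide's "7 hashings + 7 memory accesses".
def lookup(addr_bits, tables, lengths):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        L = lengths[mid]
        entry = tables[L].get(addr_bits[:L])
        if entry is not None:
            if entry != "marker":
                best = entry          # real prefix: remember it
            lo = mid + 1              # hit (or marker): try longer prefixes
        else:
            hi = mid - 1              # miss: try shorter prefixes
    return best

lengths = [8, 16, 24]                 # toy IPv4-sized example
tables = {
    8:  {"00001010": "marker"},       # marker steering the search past /8
    16: {"0000101000000000": "hopA"},
    24: {"000010100000000000000001": "hopB"},
}
addr = "000010100000000000000001" + "0" * 8   # a 32-bit address
```

Each probe is one hash lookup, independent of the others within a batch of packets — which is why this algorithm parallelizes well on a GPU.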

SLIDE 40

Example 1: IPv6 forwarding

(Routing table was randomly generated with 200K entries)

(Figure: throughput vs. packet size, 64–1514 bytes, CPU-only vs. CPU+GPU; CPU+GPU tops out around 40 Gbps, bounded by motherboard I/O capacity)

SLIDE 41

Example 2: IPsec tunneling

  • ESP (Encapsulating Security Payload) tunnel mode
  • with AES-CTR (encryption) and SHA1 (authentication)

Original IP packet: IP header + IP payload

IPsec packet: new IP header + ESP header + IP header + IP payload + ESP trailer + ESP Auth.

  • 1. AES encrypts the original packet plus the ESP trailer
  • 2. SHA1 authenticates the ESP header plus the encrypted portion
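The framing above can be sketched with the standard library. PacketShader uses real AES-CTR; to stay dependency-free this sketch substitutes a hash-derived keystream XOR as a stand-in for the cipher (only the SHA1-HMAC step is the real algorithm), and the outer IP header is a placeholder. Field sizes follow ESP's layout: 4-byte SPI + 4-byte sequence number, pad/pad-length/next-header trailer, 96-bit truncated HMAC:

```python
import hmac, hashlib

# Sketch of ESP tunnel-mode framing (encryption is a stand-in, not AES).
def esp_tunnel(ip_packet, enc_key, auth_key, spi=1, seq=1):
    # ESP trailer: pad to a 4-byte boundary, then pad length + next header
    pad_len = (-(len(ip_packet) + 2)) % 4
    trailer = bytes(pad_len) + bytes([pad_len, 4])     # next header 4 = IP-in-IP
    plaintext = ip_packet + trailer
    # stand-in "encryption": XOR with a hash-derived keystream
    keystream = hashlib.sha256(enc_key).digest() * (len(plaintext) // 32 + 1)
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, keystream))
    esp_header = spi.to_bytes(4, "big") + seq.to_bytes(4, "big")
    # real step: SHA1-HMAC over ESP header + ciphertext, truncated to 96 bits
    auth = hmac.new(auth_key, esp_header + ciphertext, hashlib.sha1).digest()[:12]
    new_ip_header = b"\x45" + bytes(19)                # placeholder outer IPv4 header
    return new_ip_header + esp_header + ciphertext + auth

pkt = esp_tunnel(b"x" * 40, b"enc-key", b"auth-key")
```

Both per-packet passes (cipher and HMAC) are independent across packets, which is what makes IPsec a good fit for batch GPU offloading.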
SLIDE 42

Example 2: IPsec tunneling

  • 3.5x speedup

(Figure: throughput and speedup vs. packet size, 64–1514 bytes, CPU-only vs. CPU+GPU; speedup reaches ~3.5×)

SLIDE 43

Year  Reference                        H/W                             IPv4 Throughput
2008  Egi et al.                       Two quad-core CPUs              3.5 Gbps
2008  "Enhanced SR", Bolla et al.      Two quad-core CPUs              4.2 Gbps
2009  "RouteBricks", Dobrescu et al.   Two quad-core CPUs (2.8 GHz)    8.7 Gbps
2010  PacketShader (CPU-only)          Two quad-core CPUs (2.66 GHz)   28.2 Gbps
2010  PacketShader (CPU+GPU)           Two quad-core CPUs + two GPUs   39.2 Gbps


SLIDE 44

Conclusions

  • GPU: a great opportunity for fast packet processing
  • PacketShader
  • Optimized packet I/O + GPU acceleration
  • Scalable with # of multi-core CPUs, GPUs, and high-speed NICs
  • Current prototype
  • Supports IPv4, IPv6, OpenFlow, and IPsec
  • 40 Gbps performance on a single PC

SLIDE 45

Future Work

  • Control plane integration
  • Dynamic routing protocols with Quagga or Xorp
  • Multi-functional, modular programming environment
  • Integration with Click? [Kohler99]
  • Opportunistic offloading
  • CPU at low load
  • GPU at high load
  • Stateful packet processing
