

SLIDE 1

2010 Sep.

PacketShader:

A GPU-Accelerated Software Router

Sangjin Han†

In collaboration with:

Keon Jang†, KyoungSoo Park‡, Sue Moon†

† Advanced Networking Lab, CS, KAIST ‡ Networked and Distributed Computing Systems Lab, EE, KAIST

SLIDE 2

PacketShader:

A GPU-Accelerated Software Router

High-performance

Our prototype: 40 Gbps on a single box

SLIDE 3

Software Router

  • Despite its name, not limited to IP routing
  • You can implement whatever you want on it.
  • Driven by software
  • Flexible
  • Friendly development environments
  • Based on commodity hardware
  • Cheap
  • Fast evolution

SLIDE 4

Now 10 Gigabit NIC is a commodity

  • From $200 – $300 per port
  • Great opportunity for software routers

SLIDE 5

Achilles’ Heel of Software Routers

  • Low performance
  • Due to CPU bottleneck

Year  Reference                        H/W                            IPv4 Throughput
2008  Egi et al.                       Two quad-core CPUs             3.5 Gbps
2008  "Enhanced SR", Bolla et al.      Two quad-core CPUs             4.2 Gbps
2009  "RouteBricks", Dobrescu et al.   Two quad-core CPUs (2.8 GHz)   8.7 Gbps

  • Not capable of supporting even a single 10G port
SLIDE 6

CPU BOTTLENECK

SLIDE 7

Per-Packet CPU Cycles for 10G

Cycles needed per packet (on x86; cycle numbers are from RouteBricks [Dobrescu09] and ours):

  • IPv4:  Packet I/O (1,200) + IPv4 lookup (600)                  = 1,800 cycles
  • IPv6:  Packet I/O (1,200) + IPv6 lookup (1,600)                = 2,800 cycles
  • IPsec: Packet I/O (1,200) + Encryption and hashing (5,400) + … = 6,600 cycles

Your budget: 1,400 cycles per packet (10G, min-sized packets, dual quad-core 2.66 GHz CPUs)
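The 1,400-cycle budget above follows from minimum-frame arithmetic on 10 GbE. A quick sanity check (standard Ethernet wire overheads assumed; hardware numbers are the slide's, not measurements of mine):

```python
# Back-of-the-envelope check of the slide's per-packet cycle budget.
LINE_RATE_BPS = 10e9                    # 10 GbE
WIRE_BYTES = 64 + 20                    # min frame + preamble/SFD/inter-frame gap
pps = LINE_RATE_BPS / (WIRE_BYTES * 8)  # ~14.88M packets/s at line rate
cycles_per_sec = 2 * 4 * 2.66e9         # dual quad-core 2.66 GHz CPUs
budget = cycles_per_sec / pps           # cycles available per packet
print(round(budget))                    # 1430 -- the slide's "1,400 cycles"
```

Even the IPv4 stack (1,800 cycles) overshoots this budget, which is why a CPU alone cannot keep a single 10G port busy with small packets.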

SLIDE 8

Our Approach 1: I/O Optimization


  • 1,200 reduced to 200 cycles per packet

  • Main ideas
  • Huge packet buffer
  • Batch processing

(Figure: the cycle stacks from slide 7, with the packet-I/O portion shrunk)
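The two I/O ideas on this slide can be sketched as follows. The class and function names here are illustrative only, not PacketShader's actual driver API; the point is that one up-front allocation plus one driver call per batch replaces a per-packet allocation and a per-packet call:

```python
# Sketch of "huge packet buffer" + "batch processing" (illustrative names).
class HugePacketBuffer:
    def __init__(self, slots, slot_size=2048):
        # one big allocation up front instead of per-packet buffer allocation
        self.buf = bytearray(slots * slot_size)
        self.slot_size = slot_size

    def slot(self, i):
        # zero-copy view into the preallocated region
        s = self.slot_size
        return memoryview(self.buf)[i * s:(i + 1) * s]

def rx_batch(buf, count):
    # fetch `count` packets with ONE call into the driver, amortizing
    # the fixed per-call cost over the whole batch
    return [buf.slot(i) for i in range(count)]

buf = HugePacketBuffer(slots=64)
batch = rx_batch(buf, 32)
```

Amortizing fixed costs over a batch is what takes the 1,200 per-packet I/O cycles down toward 200.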

SLIDE 9

Our Approach 2: GPU Offloading


  • GPU offloading for memory-intensive or compute-intensive operations
  • Main topic of this talk

(Figure: the remaining cycle stacks — IPv4/IPv6 lookup, encryption and hashing — now targeted for GPU offloading)

SLIDE 10

WHAT IS GPU?

SLIDE 11

GPU = Graphics Processing Unit

  • The heart of graphics cards
  • Mainly used for real-time 3D game rendering
  • Massively-parallel processing capacity

(Ubisoft's AVATAR, from http://ubi.com)

SLIDE 12

CPU vs. GPU


CPU:

Small # of super-fast cores

GPU:

Large # of small cores

SLIDE 13

“Silicon Budget” in CPU and GPU

Xeon X5550: 4 cores, 731M transistors

GTX480: 480 cores, 3,200M transistors

SLIDE 14

GPU FOR PACKET PROCESSING

SLIDE 15

Advantages of GPU for Packet Processing

  • 1. Raw computation power
  • 2. Memory access latency
  • 3. Memory bandwidth
  • Comparison between
  • Intel X5550 CPU
  • NVIDIA GTX480 GPU

SLIDE 16

(1/3) Raw Computation Power

  • Compute-intensive operations in software routers
  • Hashing, encryption, pattern matching, network coding, compression, etc.

  • GPU can help!

Instructions per second:

CPU: 43×10⁹ = 2.66 (GHz) × 4 (# of cores) × 4 (4-way superscalar)

GPU: 672×10⁹ = 1.4 (GHz) × 480 (# of cores)

CPU < GPU
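Reproducing the slide's arithmetic shows the gap is roughly 16× in peak instruction throughput (peak rates, not sustained performance on real code):

```python
# The slide's instructions-per-second comparison, reproduced.
cpu_ips = 2.66e9 * 4 * 4   # 2.66 GHz x 4 cores x 4-way superscalar = ~43e9
gpu_ips = 1.4e9 * 480      # 1.4 GHz x 480 cores                   = 672e9
ratio = gpu_ips / cpu_ips  # ~15.8x in favor of the GPU
```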

SLIDE 17

(2/3) Memory Access Latency

  • Software routers → lots of cache misses
  • GPU can effectively hide memory latency

(Figure: a GPU core hides memory latency by switching to another thread on each cache miss)
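The figure's idea — keep switching to another ready thread whenever the current one stalls on memory — can be modeled with a toy round-robin scheduler. This is an illustrative sketch of the principle only, not how a GPU warp scheduler is actually implemented:

```python
# Toy model of latency hiding: each yield marks a long memory access,
# and the "scheduler" switches to another thread while it is in flight.
def thread(tid, accesses):
    for i in range(accesses):
        yield f"thread {tid} issues memory access {i}"   # stall point

def round_robin(threads):
    log = []
    while threads:
        t = threads.pop(0)
        try:
            log.append(next(t))      # run until the next memory stall
            threads.append(t)        # switch away while the access completes
        except StopIteration:
            pass                     # thread finished
    return log

log = round_robin([thread(t, 2) for t in range(3)])
```

With enough threads in flight, there is always useful work to run, so memory latency stops dominating — exactly what cache-missing router workloads need.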

SLIDE 18

(3/3) Memory Bandwidth

CPU's memory bandwidth (theoretical): 32 GB/s

SLIDE 19

(3/3) Memory Bandwidth

CPU's memory bandwidth (empirical): < 25 GB/s

Each forwarded packet crosses the memory bus four times:

  • 1. RX: NIC → RAM
  • 2. RX: RAM → CPU
  • 3. TX: CPU → RAM
  • 4. TX: RAM → NIC
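The four crossings eat most of the bus. One illustrative way to get to the "< 10 GB/s" claim on the next slide (the 25 GB/s and the traffic rate are the talk's assumed numbers, not measurements of mine):

```python
# Illustrative arithmetic behind the "< 10 GB/s" processing budget.
empirical_bw = 25.0           # GB/s, measured CPU memory bandwidth
line_rate = 40e9 / 8 / 1e9    # 40 Gbps of forwarded traffic = 5 GB/s
crossings = 4                 # NIC->RAM, RAM->CPU, CPU->RAM, RAM->NIC
left_for_processing = empirical_bw - crossings * line_rate
print(left_for_processing)    # 5.0 GB/s left -- well under 10 GB/s
```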

SLIDE 20

(3/3) Memory Bandwidth

Your budget for packet processing can be less than 10 GB/s

SLIDE 21

(3/3) Memory Bandwidth

Your budget for packet processing can be less than 10 GB/s

GPU's memory bandwidth: 174 GB/s

SLIDE 22

HOW TO USE GPU

SLIDE 23

Basic Idea


Offload core operations to GPU

(e.g., forwarding table lookup)

SLIDE 24

Recap


GTX480: 480 cores

  • For GPU, more parallelism, more throughput
SLIDE 25

Parallelism in Packet Processing

  • The key insight
  • Stateless packet processing = parallelizable

RX queue

  • 1. Batching
  • 2. Parallel processing in GPU

SLIDE 26

Batching → Long Latency?

  • Fast link = enough # of packets in a small time window
  • 10 GbE link: up to 1,000 packets in only 67 μs
  • Much less time with 40 or 100 GbE
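The 67 μs figure is just the batch size divided by the line rate in packets (assuming minimum-sized 64B frames plus 20B of per-frame wire overhead):

```python
# Where the 67-microsecond batching window comes from.
pps = 10e9 / ((64 + 20) * 8)   # ~14.88M min-sized packets/s on 10 GbE
window_s = 1000 / pps          # time to collect a 1,000-packet batch
print(round(window_s * 1e6))   # 67 microseconds
```

So batching adds only tens of microseconds of latency on a fast link, which is negligible for a router.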

SLIDE 27

PACKETSHADER DESIGN

SLIDE 28

Basic Design

  • Three stages in a pipeline

Pre-shader → Shader → Post-shader

SLIDE 29

Packet’s Journey (1/3)

  • IPv4 forwarding example

Pre-shader → Shader → Post-shader

Pre-shader:
  • Checksum, TTL
  • Format check
  • Collected: dst. IP addrs
  • Some packets go to slow-path

SLIDE 30

Packet’s Journey (2/3)

  • IPv4 forwarding example

Pre-shader → Shader → Post-shader

Shader:

  • 1. IP addresses
  • 2. Forwarding table lookup
  • 3. Next hops
SLIDE 31

Packet’s Journey (3/3)

  • IPv4 forwarding example

Pre-shader → Shader → Post-shader

Update packets and transmit
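The three-stage journey of slides 29–31 can be sketched in plain Python. Stage names follow the slides, but the function signatures and packet representation are illustrative only — this is not PacketShader's actual API, and the shader stage here runs on the CPU rather than the GPU:

```python
# Minimal sketch of the pre-shader / shader / post-shader flow
# for the IPv4 forwarding example.
def pre_shader(packets):
    # sanity checks, then collect the fields the shader needs
    good = [p for p in packets if p["ttl"] > 1]   # expired etc. -> slow-path
    return good, [p["dst_ip"] for p in good]

def shader(dst_ips, table):
    # forwarding-table lookup for the whole batch at once
    # (done on the GPU in PacketShader; a dict lookup here)
    return [table.get(ip, "default") for ip in dst_ips]

def post_shader(packets, next_hops):
    # update each packet with the lookup result and hand it to TX
    for p, hop in zip(packets, next_hops):
        p["ttl"] -= 1
        p["next_hop"] = hop
    return packets

table = {"10.0.0.1": "eth1"}
pkts = [{"dst_ip": "10.0.0.1", "ttl": 64}, {"dst_ip": "10.9.9.9", "ttl": 1}]
good, ips = pre_shader(pkts)
out = post_shader(good, shader(ips, table))
```

Splitting the work this way keeps the GPU stage a pure batch function over collected fields, which is what makes it easy to offload.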

SLIDE 32

Interfacing with NICs

Device driver (packet RX) → Pre-shader → Shader → Post-shader → Device driver (packet TX)

SLIDE 33

Device driver → Pre-shader → Shader → Post-shader → Device driver

Scaling with a Multi-Core CPU

Master core Worker cores


SLIDE 34

Device driver → Pre-shader → Shader → Post-shader → Device driver (replicated per CPU)

Scaling with Multiple Multi-Core CPUs

SLIDE 35

EVALUATION

SLIDE 36

Hardware Setup

CPU: quad-core, 2.66 GHz (× 2 → total 8 CPU cores)

GPU: 480 cores, 1.4 GHz (× 2 → total 960 cores)

NIC: dual-port 10 GbE (× 4 → total 80 Gbps)

SLIDE 37

Experimental Setup

Packet generator ⇄ PacketShader, connected by 8 × 10 GbE links

Input traffic of up to 80 Gbps; processed packets sent back

SLIDE 38

Results (w/ 64B packets)

Throughput (Gbps):

              IPv4    IPv6    OpenFlow   IPsec
CPU-only      28.2     8.0      15.6       3.0
CPU+GPU       39.2    38.2      32.0      10.2
GPU speedup   1.4×    4.8×      2.1×      3.5×

SLIDE 39

Example 1: IPv6 forwarding

  • Longest prefix matching on 128-bit IPv6 addresses
  • Algorithm: binary search on hash tables [Waldvogel97]
  • 7 hashings + 7 memory accesses

(Figure: hash tables for prefix lengths 1–128; the binary search probes length 64 first, then 96, 80, …)
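A hedged sketch of binary search on prefix lengths [Waldvogel97]: keep one hash table per prefix length, probe the median length, and move to longer lengths on a hit, shorter on a miss. A real implementation inserts "marker" entries so that a miss never hides a longer match; this toy table includes the one marker it needs by hand, and the addresses/next hops are invented for illustration:

```python
# Toy longest-prefix match via binary search on hash tables.
# For IPv6's 128 possible lengths this needs ceil(log2(128)) = 7 probes,
# matching the slide's "7 hashings + 7 memory accesses".
def lookup(addr_bits, tables, lengths):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        L = lengths[mid]
        entry = tables[L].get(addr_bits[:L])
        if entry is not None:
            if entry != "marker":
                best = entry          # real prefix: remember it
            lo = mid + 1              # hit (or marker): try longer prefixes
        else:
            hi = mid - 1              # miss: try shorter prefixes
    return best

lengths = [8, 16, 24]                 # toy IPv4-sized example
tables = {
    8:  {"00001010": "marker"},       # marker steering the search past /8
    16: {"0000101000000000": "hopA"},
    24: {"000010100000000000000001": "hopB"},
}
addr = "000010100000000000000001" + "0" * 8   # a 32-bit address
```

Each probe is one hash lookup, independent of the others within a batch of packets — which is why this algorithm parallelizes well on a GPU.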

SLIDE 40

Example 1: IPv6 forwarding

(Routing table was randomly generated with 200K entries)

(Figure: throughput vs. packet size, 64–1514 bytes, CPU-only vs. CPU+GPU; CPU+GPU tops out around 40 Gbps, bounded by motherboard I/O capacity)

SLIDE 41

Example 2: IPsec tunneling

  • ESP (Encapsulating Security Payload) tunnel mode
  • with AES-CTR (encryption) and SHA1 (authentication)

Original IP packet: IP header + IP payload

IPsec packet: new IP header + ESP header + IP header + IP payload + ESP trailer + ESP Auth.

  • 1. AES encrypts the original packet plus the ESP trailer
  • 2. SHA1 authenticates the ESP header plus the encrypted portion
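The framing above can be sketched with the standard library. PacketShader uses real AES-CTR; to stay dependency-free this sketch substitutes a hash-derived keystream XOR as a stand-in for the cipher (only the SHA1-HMAC step is the real algorithm), and the outer IP header is a placeholder. Field sizes follow ESP's layout: 4-byte SPI + 4-byte sequence number, pad/pad-length/next-header trailer, 96-bit truncated HMAC:

```python
import hmac, hashlib

# Sketch of ESP tunnel-mode framing (encryption is a stand-in, not AES).
def esp_tunnel(ip_packet, enc_key, auth_key, spi=1, seq=1):
    # ESP trailer: pad to a 4-byte boundary, then pad length + next header
    pad_len = (-(len(ip_packet) + 2)) % 4
    trailer = bytes(pad_len) + bytes([pad_len, 4])     # next header 4 = IP-in-IP
    plaintext = ip_packet + trailer
    # stand-in "encryption": XOR with a hash-derived keystream
    keystream = hashlib.sha256(enc_key).digest() * (len(plaintext) // 32 + 1)
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, keystream))
    esp_header = spi.to_bytes(4, "big") + seq.to_bytes(4, "big")
    # real step: SHA1-HMAC over ESP header + ciphertext, truncated to 96 bits
    auth = hmac.new(auth_key, esp_header + ciphertext, hashlib.sha1).digest()[:12]
    new_ip_header = b"\x45" + bytes(19)                # placeholder outer IPv4 header
    return new_ip_header + esp_header + ciphertext + auth

pkt = esp_tunnel(b"x" * 40, b"enc-key", b"auth-key")
```

Both per-packet passes (cipher and HMAC) are independent across packets, which is what makes IPsec a good fit for batch GPU offloading.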
SLIDE 42

Example 2: IPsec tunneling

  • 3.5x speedup

(Figure: throughput and speedup vs. packet size, 64–1514 bytes, CPU-only vs. CPU+GPU; speedup reaches ~3.5×)

SLIDE 43

Year  Reference                        H/W                             IPv4 Throughput
2008  Egi et al.                       Two quad-core CPUs              3.5 Gbps
2008  "Enhanced SR", Bolla et al.      Two quad-core CPUs              4.2 Gbps
2009  "RouteBricks", Dobrescu et al.   Two quad-core CPUs (2.8 GHz)    8.7 Gbps
2010  PacketShader (CPU-only)          Two quad-core CPUs (2.66 GHz)   28.2 Gbps
2010  PacketShader (CPU+GPU)           Two quad-core CPUs + two GPUs   39.2 Gbps


SLIDE 44

Conclusions

  • GPU: a great opportunity for fast packet processing
  • PacketShader
  • Optimized packet I/O + GPU acceleration
  • Scalable with # of multi-core CPUs, GPUs, and high-speed NICs
  • Current prototype
  • Supports IPv4, IPv6, OpenFlow, and IPsec
  • 40 Gbps performance on a single PC

SLIDE 45

Future Work

  • Control plane integration
  • Dynamic routing protocols with Quagga or Xorp
  • Multi-functional, modular programming environment
  • Integration with Click? [Kohler99]
  • Opportunistic offloading
  • CPU at low load
  • GPU at high load
  • Stateful packet processing
