The Forwarding Plane: An Old New Frontier of Networking Research


SLIDE 1

The Forwarding Plane: An Old New Frontier of Networking Research

CS244, Spring 2019 Changhoon Kim

chang@barefootnetworks.com

SLIDE 2

2

SLIDE 3

What is SDN in plain English?

  • Ideally at a level college freshmen can understand

– Because, if you can’t, you don’t really understand it! [Feynman’s guiding principle]

3

“Making programming networks as easy as programming computers.”

SLIDE 4

Natural questions that follow

  • Why should we program a network?

– To realize some “beautiful ideas” easily, preferably on our own

  • What are those “beautiful ideas”?

– Any impactful or intriguing apps in particular?

  • Why couldn’t we do this easily in the pre-SDN era?

– Any fundamental shifts happening?

4

“Making programming networks as easy as programming computers.”

SLIDE 5

Pre-SDN state of the network industry

5

[Diagram: the network owner requests features from the network equipment vendor; inside the vendor’s engineering division, the ASIC team and the software team each take years per feature]

SLIDE 6

Compared to other industries, this is very unnatural

  • Because we all know how to realize our own ideas by

programming CPUs, GPUs, TPUs, etc.

– Programs used in every phase (implement, verify, test, deploy, and maintain)
– Extremely fast iteration and differentiation
– We own our own ideas
– A sustainable ecosystem where all participants benefit

6

Can we replicate this healthy, sustainable ecosystem for networking?

SLIDE 7

What SDN pioneers had realized …

7

[Diagram: the network equipment vendor’s ASIC and software teams still deliver each feature to the network owner in years]

SLIDE 8

And, SDN started to unfold …

8

[Diagram: the network forwarding-plane vendor’s ASIC team delivers features in years, while a separate network control-plane vendor’s software team delivers features to the network owner in weeks to months]

SLIDE 9

And, SDN started to unfold …

9

[Diagram: the network forwarding-plane vendor’s ASIC team delivers features in years (an innovation-deprived, ossified layer); various control-plane projects deliver features in weeks to months; and the network owner’s own software team delivers features in days to weeks (an innovation-rich, programmable layer)]

SLIDE 10

Reality: Packet forwarding speeds

10

[Chart: per-chip forwarding speed in Gb/s (log scale), 1990–2020, for switch chips vs. CPUs; switch chips reach 6.4 Tb/s]

SLIDE 11

Reality: Packet forwarding speeds

11

[Chart: the same data, highlighting an 80x per-chip gap between switch chips (6.4 Tb/s) and CPUs]

Unaccommodating, performance-dominated zone?

SLIDE 12

“Programmable switches are 10-100x slower than fixed-function switches. They cost more and consume more power.”

– Conventional wisdom in networking

SLIDE 13

Evidence: Tofino 6.5Tb/s switch (arrived Dec 2016)

The world’s fastest and most programmable switch.

No power or cost penalty compared to fixed-function switches. An incarnation of PISA (Protocol Independent Switch Architecture)

SLIDE 14

Domain-specific processors, each programmed through a language and a compiler:

  • CPU: computers (Java)
  • GPU: graphics (OpenCL)
  • DSP: signal processing (Matlab)
  • TPU: machine learning (TensorFlow)
  • ?: networking (language? compiler?)

SLIDE 15

Domain-specific processors, with the networking row filled in:

  • CPU: computers (Java)
  • GPU: graphics (OpenCL)
  • DSP: signal processing (Matlab)
  • TPU: machine learning (TensorFlow)
  • PISA (Protocol-Independent Switch Architecture): networking (P4)

SLIDE 16

PISA: An architecture for high-speed programmable packet forwarding

16

SLIDE 17

PISA: Protocol Independent Switch Architecture

17

[Diagram: a programmable parser followed by stages of match logic (memory) and action logic (ALUs)]

SLIDE 18

PISA: Protocol Independent Switch Architecture

18

[Diagram: programmable parser, ingress stages, buffer, egress stages]

SLIDE 19

PISA: Protocol Independent Switch Architecture

19

  • Programmable parser
  • Match logic: a mix of SRAM and TCAM for lookup tables, counters, meters, and generic hash tables
  • Action logic: ALUs for standard boolean and arithmetic operations, header modification operations, hashing operations, etc.
  • Ingress match-action stages (pre-switching) and egress match-action stages (post-switching), separated by a buffer
  • Plus recirculation, a programmable packet generator, and a CPU (control plane)

A generalization of RMT [SIGCOMM’13]

SLIDE 20

Why we call it protocol-independent packet processing

20

SLIDE 21

Device does not understand any protocols until it gets programmed.

[Diagram: the logical data-plane view (your P4 program): a programmable parser feeding L2, IPv4, IPv6, and ACL match tables, each with its action ALUs, then fixed actions and queues; packets traverse the switch pipeline on a common clock]

21

SLIDE 22

Mapping logical data-plane design to physical resources

[Diagram: the logical L2, IPv4, IPv6, and ACL tables and their action macros from the P4 program are placed onto the switch pipeline’s physical match-action stages, programmable parser, and queues]

22

SLIDE 23

Re-program in the field

[Diagram: a new “MyEncap” table and action are added to the P4 program, and the same physical pipeline is re-mapped to fit it alongside the L2, IPv4, IPv6, and ACL tables]

23

SLIDE 24

P4 language components

  • Parser program: a state machine; field extraction
  • Match tables + actions: table lookup and update; field manipulation; control flow
  • Deparser program: field assembly
  • No memory (pointers), loops, recursion, or floating point

24
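The restrictions above (no pointers, loops, recursion, or floating point) still leave room for the full parse, match-action, deparse flow. A minimal Python model of that flow, purely illustrative: the 6-byte header layout, the table contents, and the field names are invented for this sketch and are not part of P4.

```python
# Toy model of a P4-style pipeline: parse -> match-action -> deparse.

def parse(raw: bytes) -> dict:
    """Parser: a state machine extracting fixed-width fields.
    Assumed toy header: 4-byte key, 1-byte op, 1-byte ttl."""
    return {"key": raw[0:4], "op": raw[4], "ttl": raw[5]}

# Match table: exact-match key -> (action name, action parameters).
table = {
    b"\x0a\x00\x00\x01": ("forward", {"port": 1}),
    b"\x0a\x00\x00\x02": ("forward", {"port": 2}),
}

def match_action(hdr: dict):
    """Look the key up; apply the chosen action to the headers."""
    action, params = table.get(hdr["key"], ("drop", {}))
    if action == "drop":
        return None
    hdr["ttl"] -= 1              # field manipulation
    hdr["port"] = params["port"]
    return hdr

def deparse(hdr: dict) -> bytes:
    """Deparser: reassemble the fields into a byte string."""
    return hdr["key"] + bytes([hdr["op"], hdr["ttl"]])

pkt = parse(b"\x0a\x00\x00\x01" + bytes([0, 64]))
out = match_action(pkt)
```

Note that the per-packet logic is straight-line code over fixed-width fields, which is what lets a compiler lay it out over a fixed number of hardware stages.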

SLIDE 25

Questions and critiques …?

  • What does a compiler do?
  • What’s the latest on P4? Have you heard of P4_16?
  • How do you update tables at runtime?
  • Why is it important to derive a runtime API from a P4 program?
  • What about queueing, scheduling, and congestion control?

25

SLIDE 26

What exactly does a compiler do?

[Diagram: inside one pipeline stage, a crossbar selects fields from the PHV (Packet Header Vector) to form the lookup key and hash; the match table (SRAM or TCAM) returns action parameters and instructions; ALUs execute them against the PHV to produce the next PHV’]

26

SLIDE 27

P4_16: Why and how?

  • Embrace target heterogeneity without language churn
– Architectural heterogeneity via architecture-language separation
– Functional heterogeneity via extern types
  • Help reuse code more easily: portability and composability
– Standard architecture and standard library
– Local name spaces, local variables, lexical scoping, parameterization, and sub-procedure-like constructs
  • Make P4 programs more intuitive and explicit
– Expressions, sequential execution semantics for actions, strong typing, and explicit deparsing

27

SLIDE 28

To recap: Why data-plane programming?

  • 1. New features: Realize your beautiful ideas very quickly
  • 2. Reduce complexity: Remove unnecessary features and tables
  • 3. Efficient use of H/W resources: Achieve biggest bang for buck
  • 4. Greater visibility: New diagnostics, telemetry, OAM, etc.
  • 5. Modularity: Compose forwarding behavior from libraries
  • 6. Portability: Specify forwarding behavior once; compile to many devices
  • 7. Own your own ideas: No need to share your ideas with others

“Protocols are being lifted off chips and into software”

– Ben Horowitz

28

SLIDE 29

What kind of “stunt” can you do by programming data planes?

29

SLIDE 30

  • Advanced network measurement, analysis, and diagnostics
– In-band Network Telemetry [SIGCOMM’15], Packet History [NSDI’14], FlowRadar [NSDI’16], Marple [SIGCOMM’17]
  • Advanced congestion control
– RCP, XCP, TeXCP, DCQCN++, Timely++
  • Novel DC network fabrics
– Flowlet switching, CONGA [SIGCOMM’15], HULA [SOSR’16], NDP [SIGCOMM’17]
  • World’s fastest middleboxes
– L4 connection load balancing [SIGCOMM’17], TCP SYN authentication, etc.
  • Offloading parts of distributed apps
– NetCache [SOSP’17], NetChain [NSDI’18], SwitchPaxos [SOSR’15, ACM CCR‘16]
  • Jointly optimizing the network and the apps running on it
– Mostly-ordered Multicast [NSDI’15, SOSP’15]
  • And many more … we’re just starting to scratch the surface!

30

30

SLIDE 31

PISA: An architecture for high-speed programmable I/O event processing (not just packet forwarding)

31

SLIDE 32

What we have seen so far: Accelerating part of computing with PISA

  • 1. DNS cache
  • 2. Key-value cache [NetCache - SOSP’17]
  • 3. Key-value replication [NetChain - NSDI’18]
  • 4. Consensus acceleration [P4xos - CCR’16, Eris - SOSP’17]
  • 5. Parameter service for distributed deep learning
  • 6. Pub-sub service
  • 7. String searching [PPS – SOSR’19]
  • 8. Pre-processing DB queries and streams

32

SLIDE 33

NetCache: Accelerating KV caching

33

SLIDE 34

Suppose a KV cluster coping with a highly-skewed & rapidly-changing workload

[Diagram: gets and puts flow through a ToR switch to KV servers with uneven per-server load]

SLIDE 35

Q: How can you ensure high throughput and bounded tail latency?

SLIDE 36

Here comes the problem

[Chart: throughput (BQPS) vs. workload distribution (uniform, zipf-0.9, zipf-0.95, zipf-0.99) for NoCache, NetCache (servers), and NetCache (cache)]

SLIDE 37

What if we had a very fast front-end server?

37

[Diagram: a front-end server sits between the ToR switch and the KV servers, absorbing gets and puts]

A read-only cache handling hot keys directly!

Q: How big and fast should the front-end cache be?

SLIDE 38

For a front-end cache to be effective

  • How big should it be?
– Keep O(N log N) hot keys, where N is the number of KV servers
– Theory proves that such a front-end cache bounds the variance of KV-server utilization irrespective of the total number of keys
  • How fast should it be?
– At least as fast as the aggregate throughput of all KV servers (N*C)

38
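Plugging numbers into the sizing rule above gives a worked example. N and the per-server throughput C are assumed values, and the O(N log N) constant factor is taken as 1:

```python
import math

N = 128                  # number of KV servers in the rack (assumed)
C = 10_000_000           # per-server throughput, 10 MQPS (assumed)

# Size: O(N log N) hot keys; with constant 1, 128 * log2(128) = 896 keys.
cache_entries = int(N * math.log2(N))

# Speed: at least the aggregate backend throughput, N * C = 1.28 BQPS.
required_qps = N * C
```

The shape of the answer is striking: a tiny cache (hundreds of entries) that must serve an enormous request rate, which is exactly the profile of a PISA switch: small on-chip memory, line-rate processing.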

SLIDE 39

Why is this relevant now?

39

The cache needs to provide the aggregate throughput of the storage layer:

  • Storage layer on flash/disk: each server O(100) KQPS, total O(10) MQPS; an in-memory cache serving O(10) MQPS suffices
  • Storage layer in-memory: each server O(10) MQPS, total O(1) BQPS; the cache must serve O(1) BQPS

SLIDE 40

Why is this relevant now?

40

With an in-memory storage layer totaling O(1) BQPS, the cache itself must serve O(1) BQPS, and a PISA chip (a real-time I/O machine) can. Small on-chip memory? We only need to cache O(N log N) small items.

SLIDE 41

A conventional switch built with PISA

[Diagram: data plane (ASIC): programmable parser plus a programmable match-action pipeline (memory and ALUs); control plane (CPU): network functions and network management over a run-time API; the two connected via PCIe]

SLIDE 42

A front-end KV cache built with PISA

  • Data plane
– Key-value store to serve queries for cached keys
– Query statistics to enable efficient cache updates
  • Control plane
– Insert hot items into the cache and evict less popular items
– Manage memory allocation for the on-chip key-value store

[Diagram: clients, the front-end KV cache (key-value cache and query statistics in the data plane; cache management over a run-time API via PCIe), and the KV servers]

SLIDE 43

Line-rate query handling in the data plane

  • Read query (cache hit): client → switch (cache hit, stats update) → client
  • Read query (cache miss): client → switch (miss, stats update) → server → switch → client
  • Write query: client → switch (invalidate cached entry, update stats) → server → client

SLIDE 44

Packet format

44

  • Application-layer protocol; compatible with existing L2-L4 layers
  • Only the front-end cache needs to parse NetCache fields

ETH | IP | TCP/UDP | OP | SEQ | KEY | VALUE
– ETH / IP / TCP/UDP: existing protocols; L2/L3 routing, with a reserved port # marking NetCache packets
– OP / SEQ / KEY / VALUE: the NetCache protocol (OP: read, write, delete, etc.)
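A sketch of packing and parsing the NetCache fields with Python's `struct`. The field widths (1-byte OP, 4-byte SEQ, 16-byte KEY), the opcodes, and the port number are assumptions for illustration; the slide does not fix them.

```python
import struct

NETCACHE_PORT = 50000               # assumed reserved UDP port
OP_READ, OP_WRITE, OP_DELETE = 1, 2, 3

def netcache_payload(op: int, seq: int, key: bytes, value: bytes = b"") -> bytes:
    """Build the application-layer fields: OP | SEQ | KEY | VALUE.
    The ETH/IP/UDP headers in front are left to the normal network stack,
    which is what keeps the protocol compatible with existing L2-L4 layers."""
    key = key.ljust(16, b"\x00")[:16]      # fixed-width 16-byte key (assumed)
    return struct.pack("!BI16s", op, seq, key) + value

def parse_payload(data: bytes):
    """Inverse of netcache_payload: split the fixed header from the value."""
    op, seq, key = struct.unpack("!BI16s", data[:21])
    return op, seq, key.rstrip(b"\x00"), data[21:]

pkt = netcache_payload(OP_READ, 7, b"user:42")
```

A fixed-width layout like this is deliberate: a P4 parser extracts fixed-width fields, so variable-length encodings would complicate the switch program.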

SLIDE 45

Key-value store using register array in network ASIC

action process_array(idx):
    if pkt.op == read:
        pkt.value ← array[idx]
    elif pkt.op == cache_update:
        array[idx] ← pkt.value

[Figure: a register array with indexed slots]

SLIDE 46

Key-value store using register array in network ASIC

Match: pkt.key == A → Action: process_array(0)
Match: pkt.key == B → Action: process_array(1)

action process_array(idx):
    if pkt.op == read:
        pkt.value ← array[idx]
    elif pkt.op == cache_update:
        array[idx] ← pkt.value

[Figure: the register array holding A’s and B’s values]
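In software terms, the design above is an exact-match table yielding an index into a register array, with the action deciding between read and update. A Python sketch mirroring the slide's pseudocode, with a dict and a list standing in for the TCAM/SRAM table and the stateful register array (not actual P4):

```python
# Register array: the ASIC's stateful per-stage memory.
array = [b"", b""]

# Exact-match table: key -> index into the register array.
lookup = {b"A": 0, b"B": 1}

def process_array(pkt: dict, idx: int) -> None:
    """The slide's action: read fills pkt.value, cache_update stores it."""
    if pkt["op"] == "read":
        pkt["value"] = array[idx]
    elif pkt["op"] == "cache_update":
        array[idx] = pkt["value"]

def handle(pkt: dict) -> None:
    idx = lookup.get(pkt["key"])
    if idx is not None:          # table hit -> run the action
        process_array(pkt, idx)

handle({"key": b"B", "op": "cache_update", "value": b"hello"})
pkt = {"key": b"B", "op": "read", "value": b""}
handle(pkt)
```

The split matters on hardware: the match table is populated by the control plane, while the register array is read and written per-packet at line rate by the data plane.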

SLIDE 47

Variable-length key-value store in network ASIC?

Key challenges:

  • No loops or strings due to strict timing requirements
  • Need to minimize hardware resource consumption
– Number of table entries
– Size of action data for each table entry
– Size of intermediate metadata across tables

SLIDE 48

Combine outputs from multiple arrays

  • A lookup table maps each key to a bitmap and an index: the bitmap indicates which register arrays store the key’s value, and the index indicates the slot within those arrays
  • Per-array value tables: if bitmap[i] == 1, run process_array_i(index)
  • Example: pkt.key == A → bitmap = 111, index = 0, so pkt.value = A0 A1 A2
  • Minimal hardware resource overhead
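The bitmap/index scheme can be modeled directly. A Python sketch with three arrays of four slots; the chunk strings "A0", "A1", … are purely illustrative stand-ins for fixed-size value fragments:

```python
# Three register arrays; each slot stores one fixed-size chunk of a value.
arrays = [[None] * 4 for _ in range(3)]
arrays[0][0], arrays[1][0], arrays[2][0] = "A0", "A1", "A2"  # key A, index 0
arrays[0][1], arrays[1][1] = "B0", "B1"                      # key B, index 1

# Lookup table: key -> (bitmap over the arrays, shared slot index).
lookup = {"A": ("111", 0), "B": ("110", 1)}

def read_value(key):
    """Assemble the value by concatenating the flagged arrays' slots."""
    entry = lookup.get(key)
    if entry is None:
        return None                    # cache miss
    bitmap, index = entry
    return " ".join(arrays[i][index] for i in range(3) if bitmap[i] == "1")
```

Because every chunk of one value lives at the same slot index, the per-key action data stays tiny (one bitmap plus one index), which is the slide's point about minimal hardware resource overhead.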

SLIDE 49

Combine outputs from multiple arrays

Adding key B with bitmap = 110, index = 1: B’s value occupies slot 1 of arrays 0 and 1, so pkt.value = B0 B1.

SLIDE 50

Combine outputs from multiple arrays

Adding key C with bitmap = 010, index = 2: C’s value occupies slot 2 of array 1 only, so pkt.value = C0.

SLIDE 51

Combine outputs from multiple arrays

Adding key D with bitmap = 101, index = 2: D’s value occupies slot 2 of arrays 0 and 2, so pkt.value = D0 D1.

SLIDE 52

Cache insertion and eviction

  • Challenge: keeping the hottest O(N log N) items in the cache
  • Goal: react quickly and effectively to workload changes with minimal updates

1. The data plane reports hot keys
2. The control plane compares the loads of new hot keys and sampled cached keys
3. The control plane fetches values for keys to be inserted into the cache
4. The control plane inserts and evicts keys (over the run-time API, via PCIe)
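The four-step loop above can be sketched as ordinary control-plane software. A toy Python model: the capacity, the load numbers, and the dict-based "data plane" are invented for illustration; in the real system the hot-key reports and counters come from the switch ASIC.

```python
cache = {"A": "va"}                            # on-switch key-value cache
backend = {"A": "va", "B": "vb", "C": "vc"}    # the KV servers
CACHE_CAPACITY = 2                             # assumed tiny capacity

def management_step(hot_reports: dict, cached_counters: dict) -> None:
    """One round of the control loop over data-plane statistics."""
    # 1. hot_reports: hot uncached keys reported by the data plane.
    for key, load in sorted(hot_reports.items(), key=lambda kv: -kv[1]):
        if key in cache:
            continue
        # 2. Compare against the load of the coldest sampled cached key.
        victim = min(cached_counters, key=cached_counters.get, default=None)
        if len(cache) < CACHE_CAPACITY:
            cache[key] = backend[key]          # 3+4. fetch value and insert
        elif victim is not None and load > cached_counters[victim]:
            del cache[victim]                  # 4. evict the colder key
            del cached_counters[victim]
            cache[key] = backend[key]          # 3. fetch value, then insert
        cached_counters[key] = load

counters = {"A": 10}
management_step({"B": 90, "C": 50}, counters)
```

The eviction policy here (evict the minimum sampled counter) is a simplification chosen for the sketch, not a claim about NetCache's exact policy.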

SLIDE 53

Query statistics in the data plane

  • Cached keys: a per-key counter array
  • Uncached keys
– Count-Min sketch: report new hot keys
– Bloom filter: remove duplicate hot-key reports

[Diagram: a cache lookup splits traffic; cached keys update per-key counters, uncached keys update the Count-Min sketch, and keys that become hot pass through the Bloom filter before being reported]
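Both structures are just hash-indexed counter and bit arrays, which is why they fit in match-action stages. A Python sketch of the uncached-key path; the sizes and hot threshold are made up, and SHA-256 stands in for the ASIC's hardware hash units:

```python
import hashlib

ROWS, WIDTH, HOT_THRESHOLD = 3, 64, 3          # illustrative sizes

cms = [[0] * WIDTH for _ in range(ROWS)]       # Count-Min sketch counters
bloom = [[False] * WIDTH for _ in range(ROWS)] # Bloom filter for dedup
reports = []                                   # hot keys sent to the CPU

def _h(key: str, row: int) -> int:
    """Row-specific hash into a column index."""
    digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % WIDTH

def count_uncached(key: str) -> None:
    """Count-Min update + min-estimate; report each key at most once."""
    for r in range(ROWS):
        cms[r][_h(key, r)] += 1
    estimate = min(cms[r][_h(key, r)] for r in range(ROWS))
    if estimate >= HOT_THRESHOLD:
        # Bloom filter: skip keys already reported as hot.
        if not all(bloom[r][_h(key, r)] for r in range(ROWS)):
            reports.append(key)
            for r in range(ROWS):
                bloom[r][_h(key, r)] = True

for _ in range(5):
    count_uncached("hot-key")
count_uncached("cold-key")
```

The Count-Min estimate can only over-count (hash collisions inflate it), so a hot key is never missed, at the cost of occasional false reports, an acceptable trade since the control plane re-checks loads before inserting.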

SLIDE 54

The “boring life” of a NetCache switch

[Charts: (a) throughput (BQPS) vs. value size (32-128 bytes); (b) throughput vs. cache size]

One can further increase the value sizes with more stages, recirculation, or mirroring.

Yes, it’s Billion Queries Per Second, not a typo :-)

SLIDE 55

And its “not so boring” benefits

NetCache provides 3-10x throughput improvements: throughput of a key-value storage rack with one Tofino switch and 128 storage servers.

[Chart: throughput (BQPS) vs. workload distribution (uniform, zipf-0.9, zipf-0.95, zipf-0.99) for NoCache, NetCache (servers), and NetCache (cache)]

SLIDE 56

Questions and critiques …?

  • Conflating “in-network computing” with “accelerating apps using PISA”
  • What are the common and unique strengths of PISA for those offload-able apps?
  • Will the “stunt” become a mainstream approach? What would be the triggers for that?
  • Is P4 the right way of implementing PISA-server apps?

56

SLIDE 57

Some observations

  • PISA and P4: the first attempt to define a machine architecture and programming models for networking in a disciplined way
  • Inherently multi-disciplinary; we need more expertise across various fields in computer science
  • It’s super fun to figure out the best workloads for this new machine architecture

57

SLIDE 58

Want to find more resources or follow up?

  • Visit http://p4.org and http://github.com/p4lang

– P4 language spec
– P4 dev tools and sample programs
– P4 tutorials
– List of papers regarding PISA, PISA apps, and P4

  • Join P4 workshops and P4 developers’ days
  • Participate in P4 working group activities

– Language, target architecture, runtime API, applications

  • Need more expertise across various fields in computer science

– To enhance PISA, P4, dev tools (e.g., for formal verification, equivalence check, automated test generation, and many more …)

58