The Forwarding Plane: An Old New Frontier of Networking Research
CS244, Spring 2019 Changhoon Kim
chang@barefootnetworks.com
What is SDN in plain English?
– Ideally at the level for college freshmen
– Because, if you can't, you are …
– To realize some “beautiful ideas” easily, preferably on our own
– Any impactful or intriguing apps in particular?
– Any fundamental shifts happening?
Network Equipment Vendor
Network Owner
Software Team
Engineering Division Feature Years
ASIC Team
Feature Years
– Programs used in every phase (implement, verify, test, deploy, and maintain)
– Extremely fast iteration and differentiation
– We own our own ideas
– A sustainable ecosystem where all participants benefit
Network Equipment Vendor
Network Owner
ASIC Team Software Team Feature Years
Network Forwarding-plane Vendor
Network Owner
ASIC Team Software Team
Weeks to Months
Years
Feature
Network Control-plane Vendor
Network Forwarding-plane Vendor
Network Owner
ASIC Team
Years
Feature
Various Control-plane Projects
Feature
Weeks to Months
Software Team
Days to Weeks
Innovation-deprived,
Innovation-rich, programmable layer
[Figure: per-chip throughput in Gb/s (log scale, 0.1 to 100000) by year, 1990 to 2020, for switch chips vs. CPUs]
Unaccommodating, performance-dominated zone?
Conventional wisdom in networking
No power or cost penalty compared to fixed-function switches. An incarnation of PISA (Protocol Independent Switch Architecture)
Domain              Language     Toolchain   Target
Computers           Java         Compiler    CPU
Graphics            OpenCL       Compiler    GPU
Signal Processing   Matlab       Compiler    DSP
Machine Learning    TensorFlow   Compiler    TPU
Networking          P4           Compiler    PISA (Protocol-Independent Switch Architecture)
Programmable Parser
Match
Memory
Action
ALU
Programmable Parser
Ingress Egress Buffer
Buffer M M
Programmable Parser
Match Logic (mix of SRAM and TCAM for lookup tables, counters, meters, generic hash tables)
Action Logic (ALUs for standard boolean and arithmetic operations, header modification operations, hashing operations, etc.)
Recirculation
Programmable Packet Generator
CPU (Control plane)
Ingress match-action stages (pre-switching)
Egress match-action stages (post-switching)
Generalization of RMT [SIGCOMM’13]
[Figure: Logical data-plane view (your P4 program) mapped onto the switch pipeline. The programmable parser feeds four match tables (L2, IPv4, IPv6, ACL), each with its action ALUs, followed by a fixed action and the queues. Labels: one packet per stage, CLK.]
[Figure: Logical data-plane view (your P4 program) mapped onto the switch pipeline, with the tables named: L2 Table, IPv4 Table, IPv6 Table, and ACL Table. In the pipeline, the action ALUs hold the L2, v4, v6, and ACL action macros. Labels: programmable parser, queues, CLK.]
[Figure: Logical data-plane view (your P4 program) mapped onto the switch pipeline, with a custom MyEncap table and action added alongside the L2, IPv4, IPv6, and ACL tables and their action macros. Labels: programmable parser, queues, CLK.]
Parser Program: state machine; field extraction
Control Flow (Match Tables + Actions): table lookup and update; field manipulation; control flow
Deparser Program: field assembly
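To make the three program parts above concrete, here is a minimal Python sketch of a packet passing through a parser, one match-action table, and a deparser. It is only a toy model, not P4: the two-byte header layout (`dst`, `ttl`), the table contents, and every function name are assumptions invented for illustration.

```python
# Toy model of the three parts of a P4 program: parser, match-action, deparser.
# The header layout (byte 0 = dst, byte 1 = ttl) and the table are hypothetical.

def parse(raw: bytes) -> dict:
    """Parser: extract header fields; the rest of the packet is payload."""
    headers = {"dst": raw[0], "ttl": raw[1]}
    return {"headers": headers, "payload": raw[2:], "egress_port": None}

def set_port(pkt: dict, port: int) -> None:
    """Action: field manipulation using action data from the matched entry."""
    pkt["egress_port"] = port
    pkt["headers"]["ttl"] -= 1

def drop(pkt: dict) -> None:
    """Default action: mark the packet as dropped (-1 is an arbitrary marker)."""
    pkt["egress_port"] = -1

# Match table: exact match on dst -> (action, action parameters).
FORWARD_TABLE = {
    10: (set_port, (1,)),
    20: (set_port, (2,)),
}

def match_action(pkt: dict) -> None:
    """Control flow: one table lookup, then run the selected action."""
    action, params = FORWARD_TABLE.get(pkt["headers"]["dst"], (drop, ()))
    action(pkt, *params)

def deparse(pkt: dict) -> bytes:
    """Deparser: reassemble the (possibly modified) fields into bytes."""
    h = pkt["headers"]
    return bytes([h["dst"], h["ttl"]]) + pkt["payload"]

def process(raw: bytes):
    pkt = parse(raw)
    match_action(pkt)
    return pkt["egress_port"], deparse(pkt)
```

Calling `process(bytes([10, 64, 0xAA]))` matches the first table entry, so the packet leaves on port 1 with its TTL decremented; an unknown destination falls through to the default `drop` action.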
§ What does a compiler do?
§ What’s the latest on P4? Have you heard of P4_16?
§ How do you update tables at runtime?
§ Why is it important to derive a runtime API from a P4 program?
§ What about queueing, scheduling, and congestion control?
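[Figure: Inside one match-action stage, between the programmable parser and the queues. Components: cross bar and hash generator fed by the PHV (Packet Header Vector); match table (SRAM or TCAM) looked up by key; action and instruction memory supplying params, action, and constant; ALUs producing the updated vector PHV’. Label: CLK.]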
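The stage structure above can be approximated in software as a function from one header vector to the next. The sketch below is a simplified model under stated assumptions: the PHV is a plain dict, the crossbar is a tuple of selected field names, the match table is an exact-match dict, and the field names and actions are hypothetical.

```python
# Simplified model of one match-action stage: PHV in, PHV' out.
# Field names, actions, and table entries are invented for illustration.

def set_field(phv, field, value):
    phv[field] = value

def add_to_field(phv, field, delta):
    phv[field] += delta

def no_op(phv):
    pass

def run_stage(phv, key_fields, table, default=(no_op, ())):
    # "Crossbar": select the fields that form the match key.
    key = tuple(phv[f] for f in key_fields)
    # "Match table": exact-match lookup yields an action and its parameters.
    action, params = table.get(key, default)
    # "ALUs": apply the action to a copy, producing the next vector PHV'.
    new_phv = dict(phv)
    action(new_phv, *params)
    return new_phv

# Each stage is (key fields, table); a pipeline is a list of stages.
ipv4_stage = (("dst_ip",), {("10.0.0.1",): (set_field, ("egress_port", 3))})
ttl_stage = (("is_ipv4",), {(True,): (add_to_field, ("ttl", -1))})

def run_pipeline(phv, stages):
    for key_fields, table in stages:
        phv = run_stage(phv, key_fields, table)
    return phv
```

Each stage does a bounded amount of work (one lookup, one action), which mirrors why a hardware pipeline can accept a new packet every clock cycle.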
§ Embrace target heterogeneity without language churn
  – Architectural heterogeneity via architecture-language separation
  – Functional heterogeneity via extern types
§ Help reuse code more easily: portability and composability
  – Standard architecture and standard library
  – Local name space, local variables, lexical scoping, parameterization, and sub-procedure-like constructs
§ Make P4 programs more intuitive and explicit
  – Expressions, sequential execution semantics for actions, strong typing, and explicit de-parsing
“Protocols are being lifted off chips and into software”
– Ben Horowitz
§ In-band Network Telemetry [SIGCOMM’15], Packet History [NSDI’14], FlowRadar [NSDI’16], Marple [SIGCOMM’17]
§ RCP, XCP, TeXCP, DCQCN++, Timely++
§ Flowlet switching, CONGA [SIGCOMM’15], HULA [SOSR’16], NDP [SIGCOMM’17]
§ L4 connection load balancing [SIGCOMM’17], TCP SYN authentication, etc.
§ NetCache [SOSP’17], NetChain [NSDI’18], SwitchPaxos [SOSR’15, ACM CCR‘16]
§ Mostly-ordered Multicast [NSDI’15, SOSP’15]
[Figure: per-server load for servers behind a ToR switch handling gets and puts]
Q: How can you ensure high throughput and bounded tail latency?
[Figure: throughput (BQPS, 0.0 to 2.0) by workload distribution (uniform, zipf-0.9, zipf-0.95, zipf-0.99) for NoCache, NetCache (servers), and NetCache (cache)]
KV Servers Load
ToR
gets and puts
Front-end Server
A read-only cache handling hot keys directly!
Q: How big and how fast should the front-end cache be?
– Size: keep O(N*logN) hot keys, where N is the number of KV servers
– Theory proves that such a front-end cache bounds the variance of KV server utilization irrespective of the total number of keys
– Speed: throughput at least as large as the aggregate throughput of all KV servers (N*C)
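As a rough worked example of the sizing rule above, the sketch below computes both requirements for a hypothetical rack. The constant factor `c`, the use of the natural logarithm, and the example numbers (128 servers at 10 MQPS each) are assumptions for illustration; the slides give only the asymptotic form O(N*logN) and the bound N*C.

```python
import math

def front_end_cache_requirements(num_servers: int, per_server_qps: int, c: float = 1.0):
    """Return (number of hot keys to cache, minimum cache throughput in QPS).

    hot keys   ~ c * N * log(N)   (c and the log base are assumed constants)
    throughput >= N * C           (aggregate throughput of all KV servers)
    """
    hot_keys = math.ceil(c * num_servers * math.log(num_servers))
    min_throughput = num_servers * per_server_qps
    return hot_keys, min_throughput

# Hypothetical rack: 128 in-memory KV servers, 10 million queries/s each.
hot_keys, min_qps = front_end_cache_requirements(128, 10_000_000)
print(f"cache about {hot_keys} hot keys, sustain at least {min_qps:,} queries/s")
```

The point of the arithmetic is the asymmetry: the number of items to cache is tiny (hundreds of keys for a rack), while the required throughput is the sum over all servers.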
Cache needs to provide the aggregate throughput of the storage layer:
– Storage layer on flash/disk (each: O(100) KQPS, total: O(10) MQPS) needs an in-memory cache layer at O(10) MQPS
– Storage layer in-memory (each: O(10) MQPS, total: O(1) BQPS) needs a cache layer at O(1) BQPS
Small on-chip memory? Only cache O(N log N) small items
PISA (real-time I/O machine)
Data plane (ASIC) Control plane (CPU)
Network Functions Network Management Run-time API
Match + Action
Programmable Parser Programmable Match-Action Pipeline
Memory ALU
… … …
PCIe
KV Servers Front-end KV Cache
Clients
– Key-value store to serve queries for cached keys
– Query statistics to enable efficient cache updates
– Insert hot items into the cache and evict less popular items
– Manage memory allocation for on-chip key-value store
Key-Value Cache Query Statistics Cache Management Run-time API
PCIe
Cache
Read Query (cache hit), steps 1-2: Client → Cache (Hit; Stats Update) → Client
Read Query (cache miss), steps 1-4: Client → Cache (Miss; Stats Update) → Server → Cache → Client
Write Query, steps 1-4: Client → Cache (Invalidate) → Server → Cache → Client
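The three query paths above can be traced with a small simulation. This is a sketch under assumptions, not the NetCache implementation: the class and method names are hypothetical, the cache is a Python dict rather than switch memory, and re-validating a key after a write is modeled as an explicit `cache_update` call.

```python
# Toy simulation of read-hit, read-miss, and write paths through a switch cache.
# All names are invented for illustration.

class Server:
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def put(self, key, value):
        self.store[key] = value


class SwitchCache:
    def __init__(self, server):
        self.server = server
        self.values = {}   # cached key -> value
        self.valid = {}    # cached key -> is the cached value usable?
        self.hits = {}     # per-key hit counters for cached keys
        self.misses = {}   # counts for keys that are not served from the cache

    def read(self, key):
        if self.valid.get(key, False):
            self.hits[key] += 1                      # hit: stats update
            return self.values[key], "cache"
        self.misses[key] = self.misses.get(key, 0) + 1  # miss: stats update
        return self.server.get(key), "server"        # forward to the server

    def write(self, key, value):
        if key in self.valid:
            self.valid[key] = False                  # invalidate cached copy
        self.server.put(key, value)                  # server holds the truth

    def cache_update(self, key, value):
        """Install or refresh a cached value (assumed separate step)."""
        self.values[key] = value
        self.valid[key] = True
        self.hits.setdefault(key, 0)
```

Invalidating on write keeps the cache from ever returning a stale value: between the write and the next `cache_update`, reads for that key simply take the miss path to the server.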
Packet format: ETH | IP | TCP/UDP | OP | SEQ | KEY | VALUE
– ETH, IP, TCP/UDP: existing protocols (L2/L3 routing; reserved port #)
– OP, SEQ, KEY, VALUE: NetCache protocol (OP: read, write, delete, etc.)
action process_array(idx):
    if pkt.op == read:
        pkt.value = array[idx]
    elif pkt.op == cache_update:
        array[idx] = pkt.value

Match            Action
pkt.key == A     process_array(0)
pkt.key == B     process_array(1)

Register Array (slots 0 to 3): slot 0 holds A's value, slot 1 holds B's value
pkt.value: filled from the slot selected by the matching entry (A or B)
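The match table and register array above translate almost directly into runnable Python. In this sketch the packet is a dict, the register array is a four-slot list, and the `apply` helper is a hypothetical name for the table lookup; only `process_array`, the two ops, and the key-to-index mapping come from the slide.

```python
# Runnable version of the slide's pseudocode: a match table maps a key to an
# index, and the action reads or writes one slot of a register array.

READ, CACHE_UPDATE = "read", "cache_update"

array = [None] * 4                 # register array with slots 0..3
match_table = {"A": 0, "B": 1}     # pkt.key == A -> process_array(0), etc.

def process_array(pkt: dict, idx: int) -> None:
    if pkt["op"] == READ:
        pkt["value"] = array[idx]      # serve the value from the switch
    elif pkt["op"] == CACHE_UPDATE:
        array[idx] = pkt["value"]      # install or refresh the cached value

def apply(pkt: dict) -> bool:
    """Look up pkt.key; return True on a match (hypothetical helper)."""
    idx = match_table.get(pkt["key"])
    if idx is None:
        return False                   # not cached: packet goes to the server
    process_array(pkt, idx)
    return True

apply({"op": CACHE_UPDATE, "key": "A", "value": b"value-of-A"})
reply = {"op": READ, "key": "A", "value": None}
apply(reply)
```

After the two calls, `reply["value"]` holds the bytes installed by the earlier cache update, without any server involvement.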
Key Challenges:
– No loops or strings due to strict timing requirements
– Need to minimize hardware resource consumption
  § Number of table entries
  § Size of action data for each table entry
  § Size of intermediate metadata across tables
Lookup Table
Match            Action
pkt.key == A     bitmap = 111, index = 0
pkt.key == B     bitmap = 110, index = 1
pkt.key == C     bitmap = 010, index = 2
pkt.key == D     bitmap = 101, index = 2

Value Tables
Table            Match             Action
Value Table 0    bitmap[0] == 1    process_array_0(index)
Value Table 1    bitmap[1] == 1    process_array_1(index)
Value Table 2    bitmap[2] == 1    process_array_2(index)

Register Arrays
Slot             0     1     2     3
Register Array 0 A0    B0    D0
Register Array 1 A1    B1    C0
Register Array 2 A2          D1

pkt.value for A: A0 A1 A2; for B: B0 B1; for C: C0; for D: D0 D1

– Bitmap indicates arrays that store the key’s value
– Index indicates slots in the arrays to get the value
– Minimal hardware resource overhead
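The bitmap-plus-index layout above can be checked with a few lines of Python. The lookup entries and the chunk names (A0, B1, and so on) are taken from the slide; the function names and the use of Python lists for register arrays are assumptions for illustration.

```python
# Variable-length values spread over several register arrays.
# bitmap[i] == "1" means register array i holds a chunk of the value;
# index is the slot used in every array selected by the bitmap.

NUM_ARRAYS = 3
SLOTS = 4
register_arrays = [[None] * SLOTS for _ in range(NUM_ARRAYS)]

lookup_table = {
    "A": ("111", 0),
    "B": ("110", 1),
    "C": ("010", 2),
    "D": ("101", 2),
}

def cache_update(key, chunks):
    """Write one chunk into each array selected by the key's bitmap."""
    bitmap, index = lookup_table[key]
    chunk_iter = iter(chunks)
    for i in range(NUM_ARRAYS):
        if bitmap[i] == "1":
            register_arrays[i][index] = next(chunk_iter)

def read(key):
    """Return the value as a list of chunks, or None on a cache miss."""
    entry = lookup_table.get(key)
    if entry is None:
        return None
    bitmap, index = entry
    return [register_arrays[i][index] for i in range(NUM_ARRAYS) if bitmap[i] == "1"]

cache_update("A", ["A0", "A1", "A2"])
cache_update("B", ["B0", "B1"])
cache_update("C", ["C0"])
cache_update("D", ["D0", "D1"])
```

Note how C and D share index 2 without colliding, because their bitmaps select disjoint arrays; this is what lets short values pack into slots that longer values leave unused.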
– Challenge: Keeping the hottest O(N logN) items in the cache
– Goal: React quickly and effectively to workload changes with minimal updates
Key-Value Cache Query Statistics Cache Management
PCIe
1. Data plane reports hot keys
2. Control plane compares loads of new hot and sampled cached keys
3. Control plane fetches values for keys to be inserted to the cache
4. Control plane inserts and evicts keys
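The four-step update loop above might look like the following on the control-plane CPU. This is a hedged sketch: the function name, the data structures, the sample size, and the choice to evict the least-loaded sampled key are all assumptions, since the slide states only that loads of new hot keys are compared against sampled cached keys.

```python
import random

def update_cache(cache, cached_counters, hot_reports, fetch_value,
                 sample_size=2, rng=random):
    """One round of control-plane cache management.

    cache           : dict key -> value (stands in for the on-chip store)
    cached_counters : dict key -> hit count for each cached key
    hot_reports     : dict key -> estimated load, reported by the data plane
    fetch_value     : callable that fetches a key's value from its server
    Returns a list of (evicted_key, inserted_key) pairs.
    """
    changes = []
    # Step 1 happened in the data plane; consider the hottest reports first.
    for new_key, new_load in sorted(hot_reports.items(), key=lambda kv: -kv[1]):
        if new_key in cache or not cached_counters:
            continue
        # Step 2: compare against a small sample of cached keys.
        k = min(sample_size, len(cached_counters))
        sample = rng.sample(sorted(cached_counters), k)
        victim = min(sample, key=lambda key: cached_counters[key])
        if new_load <= cached_counters[victim]:
            continue
        # Step 3: fetch the value for the key to be inserted.
        value = fetch_value(new_key)
        # Step 4: evict the victim and insert the new key.
        del cache[victim]
        del cached_counters[victim]
        cache[new_key] = value
        cached_counters[new_key] = new_load  # seed so it is not evicted at once
        changes.append((victim, new_key))
    return changes
```

Sampling instead of scanning every counter keeps each round cheap, at the cost of sometimes evicting a key that is not the globally coldest one.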
– Count-Min sketch: report new hot keys
– Bloom filter: remove duplicated hot key reports
Query statistics flow: Cache Lookup on pkt.key
– cached: per-key counters for each cached item
– not cached: Count-Min sketch; if hot, Bloom filter; then report
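The statistics path for uncached keys can be sketched as a Count-Min sketch followed by a Bloom filter. The sizes, the hot threshold, and the SHA-256-based hash functions below are assumptions chosen for readability; a switch would use hardware hash units and register arrays instead.

```python
import hashlib

def _hash(key: str, seed: int, width: int) -> int:
    """Seeded hash into [0, width); SHA-256 is an illustrative stand-in."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % width

class HotKeyDetector:
    def __init__(self, rows=4, width=1024, bloom_bits=4096,
                 bloom_hashes=3, threshold=3):
        self.width = width
        self.cm = [[0] * width for _ in range(rows)]   # Count-Min sketch
        self.bloom_bits = bloom_bits
        self.bloom_hashes = bloom_hashes
        self.bloom = [False] * bloom_bits              # Bloom filter
        self.threshold = threshold

    def observe(self, key: str) -> bool:
        """Record a query for an uncached key; True means report it now."""
        # Count-Min: increment one counter per row, estimate = minimum.
        estimate = None
        for row_id, row in enumerate(self.cm):
            i = _hash(key, row_id, self.width)
            row[i] += 1
            estimate = row[i] if estimate is None else min(estimate, row[i])
        if estimate < self.threshold:
            return False                               # not hot yet
        # Bloom filter: suppress keys that were already reported.
        positions = [_hash(key, 1000 + j, self.bloom_bits)
                     for j in range(self.bloom_hashes)]
        if all(self.bloom[p] for p in positions):
            return False                               # duplicate report
        for p in positions:
            self.bloom[p] = True
        return True

detector = HotKeyDetector(threshold=3)
reports = [detector.observe("hot-key") for _ in range(5)]
```

With a threshold of 3, only the third query for the same key triggers a report; later queries are filtered as duplicates. Both structures use fixed memory and a fixed number of operations per packet, which is what makes them a fit for a pipeline without loops.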
[Figure: throughput (BQPS, 0.0 to 2.5) vs. value size (32, 64, 96, 128 bytes)]
[Figure: (b) Throughput vs. cache size: throughput (BQPS, 0.0 to 2.5) for cache sizes 16K, 32K, 48K, 64K]
One can further increase the value sizes with more stages, recirculation, or mirroring.
Yes, it’s Billion Queries Per Sec, not a typo :)
NetCache provides 3-10x throughput improvements. Throughput of a key-value storage rack with
[Figure: throughput (BQPS, 0.0 to 2.0) by workload distribution (uniform, zipf-0.9, zipf-0.95, zipf-0.99) for NoCache, NetCache (servers), and NetCache (cache)]
– P4 language spec
– P4 dev tools and sample programs
– P4 tutorials
– List of papers regarding PISA, PISA Apps, and P4
– Language, target architecture, runtime API, applications
– To enhance PISA, P4, dev tools (e.g., for formal verification, equivalence check, automated test generation, and many more …)