Elastic RSS
Co-Scheduling Packets and Cores Using Programmable NICs
Alexander Rucker, Tushar Swamy, Muhammad Shahbaz, and Kunle Olukotun, Stanford University, August 17, 2019
How do we meet tail latency constraints?
Existing systems have several limitations: random hashing (as in RSS) steers packets in the NIC without regard to load, while centralized scheduling dedicates host resources to a scheduler sitting behind the NIC.

[Figure: RSS-style random hashing in the NIC vs. a centralized software scheduler ("Sched") behind the NIC.]
eRSS uses all cores for useful work and runs at line rate.
Design
eRSS’s packet processing maps to a PISA NIC with map-reduce extensions.
[Figure: Programmable NIC pipeline — Parser → Match-Action Pipeline → Map-Reduce Block → Match-Action Pipeline → Deparser, with packet header vectors (PHVs) carrying headers (H) and a timestamp (t) toward the host CPUs; an on-chip core (ARM or PowerPC) sits beside the pipeline.]

The pipeline implements five stages:

- Workload estimation, per application.
- Core allocation, per application.
- Consistent hashing with weights, per application's virtual core.
- Queue-depth estimation, per application's virtual core; weights are updated within 10µs.
- Virtual-to-physical (V2P) core mapping, per application.
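The consistent-hashing-with-weights stage can be sketched in software with weighted rendezvous hashing. This is a minimal sketch under stated assumptions — the function name, weight values, and use of SHA-256 are illustrative, not the paper's implementation:

```python
import hashlib
import math

def pick_virtual_core(flow_key: bytes, weights: dict) -> int:
    """Weighted rendezvous hashing: map a flow to a virtual core.

    Each core scores the flow with -w / ln(h), where h is a
    per-(flow, core) uniform draw in (0, 1); the highest score wins.
    Flows land on cores in proportion to their weights, and changing
    one core's weight only moves flows to or from that core.
    """
    best_core, best_score = -1, float("-inf")
    for core, w in weights.items():
        if w <= 0:
            continue  # a zero-weight core receives no new flows
        # Derive a per-(flow, core) value in the open interval (0, 1).
        digest = hashlib.sha256(flow_key + core.to_bytes(4, "big")).digest()
        h = (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)
        score = -w / math.log(h)
        if score > best_score:
            best_core, best_score = core, score
    return best_core
```

Doubling a core's weight roughly doubles its share of new flows, while flows that hash to unchanged cores stay put — the property the NIC needs to reshuffle load without breaking most flows.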
On the host, an eRSS manager coordinates cores with the NIC.

[Figure: timeline of host CPUs running App1. Cores run batch work (Run: Batch) until the manager, on a Linux scheduler tick, polls the NIC and allocates a core in software (SW Alloc. Core). The allocated core sleeps as a server (Sleep: Server) until a NIC interrupt wakes it (Run: Server); when load drops, the NIC deallocates the core (NIC Dealloc.) and it returns to batch work.]
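The core lifecycle above can be sketched as a small state machine; the class and method names below are illustrative assumptions, not the paper's API:

```python
from enum import Enum, auto

class CoreState(Enum):
    RUN_BATCH = auto()     # executing best-effort batch work
    SLEEP_SERVER = auto()  # allocated to the server, parked until traffic
    RUN_SERVER = auto()    # processing packets

class HostCore:
    """Toy model of one host CPU under the eRSS manager."""

    def __init__(self):
        self.state = CoreState.RUN_BATCH

    def sw_alloc(self):
        """Manager (on a Linux scheduler tick, after polling the NIC)
        reassigns a batch core to the server; it sleeps until needed."""
        if self.state is CoreState.RUN_BATCH:
            self.state = CoreState.SLEEP_SERVER

    def nic_interrupt(self):
        """A NIC interrupt wakes a parked server core to run packets."""
        if self.state is CoreState.SLEEP_SERVER:
            self.state = CoreState.RUN_SERVER

    def nic_dealloc(self):
        """The NIC stops steering flows here; the core goes back to batch."""
        if self.state is not CoreState.RUN_BATCH:
            self.state = CoreState.RUN_BATCH
```

Parking allocated cores in a sleep state (rather than spin-polling) is what lets batch work reclaim cycles the instant a server core is deallocated.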
Preliminary Evaluation
We simulate eRSS’s performance on a synthetic model.
eRSS responds quickly to load variations, and deallocates slowly to ensure queues are drained.

[Figure: cores allocated vs. time (ms) for RSS, eRSS-a (90% load), and eRSS-c (75% load), across five load steps.]
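The asymmetric policy — react immediately to overload, but release cores only after queues drain — can be sketched as one control step. The function, target utilization, and drain condition here are illustrative assumptions, not the paper's algorithm:

```python
import math

def update_core_count(cores: int, offered_load: float, queue_bytes: int,
                      target_load: float = 0.75) -> int:
    """One control-loop step: size the allocation for `offered_load`
    (in core-equivalents of work) at a target per-core utilization.

    Scale-up happens in one step; scale-down happens one core at a
    time, and only once the deepest queue has drained, so in-flight
    packets are never stranded on a deallocated core.
    """
    needed = max(1, math.ceil(offered_load / target_load))
    if needed > cores:
        return needed      # allocate quickly under overload
    if needed < cores and queue_bytes == 0:
        return cores - 1   # deallocate slowly, once queues are empty
    return cores
```

Running at a target load below 100% (e.g. the 75% point of eRSS-c) trades a few extra cores for headroom that keeps tail latency under the SLO.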
eRSS adds controllable tail latency.

[Figure: latency CDF (µs, log scale) for RSS, eRSS-a (90% load), and eRSS-c (75% load), with the SLO marked.]
Future Work & Summary
eRSS will be extended with ML.
eRSS meets tail latency constraints while saving cores.
Questions?
eRSS adds a controllable amount of additional queue depth.
[Figure: deepest queue (kiB) vs. time (ms) for RSS, eRSS-a (90% load), and eRSS-c (75% load).]
eRSS minimizes breaking flows.
[Figure: CDF of per-flow break counts for eRSS-a (90% load) and eRSS-c (75% load).]