Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs


SLIDE 1

Elastic RSS

Co-Scheduling Packets and Cores Using Programmable NICs

Alexander Rucker, Tushar Swamy, Muhammad Shahbaz, and Kunle Olukotun
Stanford University
August 17, 2019

SLIDE 2

How do we meet tail latency constraints?

SLIDE 4

Existing systems have several limitations.

Random Hashing

  • Load imbalance
  • Over-provisioned


Centralized Scheduling

  • Dedicated core
  • Limited throughput


SLIDE 5

How do we scalably & CPU-efficiently meet tail latency constraints?

SLIDE 6

eRSS uses all cores for useful work and runs at line rate.

SLIDE 7

Design

SLIDE 8

eRSS’s packet processing maps to a PISA NIC with map-reduce extensions.

[Figure: programmable NIC datapath: Parser → Match-Action Pipeline → Map-Reduce Block → Match-Action Pipeline → Deparser, operating on PHVs, with an on-chip core (ARM or PowerPC) and the host CPUs.]

SLIDE 9
  • 1. Assign each packet to an application.

[Figure: NIC pipeline, as on the previous slide.]

  • For example, use IP address or port number.
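
In software terms this is just an exact-match lookup on a few header fields. A minimal Python sketch (the table entries and the choice of destination IP/port as the key are illustrative assumptions, not from the talk):

```python
# Illustrative only: classify a packet to an application by destination IP/port.
# On the NIC this would be a match-action table; the field choice is an assumption.
from typing import NamedTuple, Optional

class Packet(NamedTuple):
    dst_ip: str
    dst_port: int
    size: int  # bytes

# <header fields> -> application ID, analogous to a match-action table entry.
APP_TABLE = {
    ("10.0.0.1", 6379): 0,   # e.g., a key-value store
    ("10.0.0.1", 80): 1,     # e.g., a web server
}

def classify(pkt: Packet) -> Optional[int]:
    """Return the application ID for a packet, or None if unmatched."""
    return APP_TABLE.get((pkt.dst_ip, pkt.dst_port))

print(classify(Packet("10.0.0.1", 6379, 512)))  # -> 0
```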

SLIDE 10
  • 2. Estimate the per-packet workload.

[Figure: NIC pipeline, with a Workload Estimation stage (per application).]

  • Can use any set of packet header fields (currently, only packet size).
  • Model is periodically trained by the CPU.
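
A hedged sketch of what such an estimator could look like: a linear model in packet size whose coefficients the host CPU refits periodically from measured service times. The linear form and the retraining interface are assumptions; the talk only states that the model currently uses packet size and is trained by the CPU.

```python
# Hypothetical per-application workload estimator: cost ~= a * size + b.
# The NIC only evaluates the model; the host CPU periodically refits a and b.
class WorkloadModel:
    def __init__(self, a: float = 0.01, b: float = 1.0):
        self.a, self.b = a, b  # cost per byte, fixed per-packet overhead

    def estimate(self, pkt_size: int) -> float:
        return self.a * pkt_size + self.b

    def retrain(self, samples: list[tuple[int, float]]) -> None:
        """Least-squares refit from (size, measured_cost) samples, on the CPU."""
        n = len(samples)
        if n < 2:
            return
        sx = sum(s for s, _ in samples); sy = sum(c for _, c in samples)
        sxx = sum(s * s for s, _ in samples); sxy = sum(s * c for s, c in samples)
        denom = n * sxx - sx * sx
        if denom:
            self.a = (n * sxy - sx * sy) / denom
            self.b = (sy - self.a * sx) / n

model = WorkloadModel()
model.retrain([(64, 1.5), (512, 6.0), (1500, 16.0)])
print(round(model.estimate(1024), 2))
```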

SLIDE 11
  • 3. Determine core count for the application.

[Figure: NIC pipeline, with Workload Estimation and Core Allocation stages (per application).]

  • Compare allocated cores to exponential moving average of workload.
  • Use heuristics and hysteresis to avoid ringing.
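
One plausible reading of this step, sketched below: keep an exponential moving average of offered work, convert it to a core target at a configured utilization, and only change the allocation when the target leaves a hysteresis band. The constants and the exact update rule are assumptions.

```python
# Hypothetical core-allocation loop: EMA of offered load plus a hysteresis band.
class CoreAllocator:
    def __init__(self, per_core_capacity: float, target_load: float = 0.9,
                 alpha: float = 0.2, hysteresis: float = 0.5):
        self.capacity = per_core_capacity   # work units one core absorbs per interval
        self.target_load = target_load      # e.g., eRSS-a runs cores near 90% load
        self.alpha = alpha                  # EMA smoothing factor
        self.hysteresis = hysteresis        # dead band, in cores, to avoid ringing
        self.ema = 0.0
        self.cores = 1

    def update(self, work_this_interval: float) -> int:
        self.ema = self.alpha * work_this_interval + (1 - self.alpha) * self.ema
        target = self.ema / (self.capacity * self.target_load)
        # Only move when the target leaves the hysteresis band around the
        # current allocation, so short bursts do not cause thrashing.
        if target > self.cores + self.hysteresis:
            self.cores += 1
        elif target < self.cores - 1 + self.hysteresis and self.cores > 1:
            self.cores -= 1
        return self.cores

alloc = CoreAllocator(per_core_capacity=10.0)
for work in [5, 12, 25, 30, 28, 9, 8]:
    print(alloc.update(work))
```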

SLIDE 12
  • 4. Select a virtual core.

[Figure: NIC pipeline, adding Consistent Hashing with Weights (per application's virtual core).]

  • Virtual cores within each application are allocated densely, starting at 0.
  • Packets are hashed & the best allocated core is chosen.
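
The talk does not spell out the hashing scheme, so the sketch below uses weighted rendezvous (highest-random-weight) hashing as one concrete way to get consistent, weight-aware placement over densely numbered virtual cores:

```python
# Hypothetical weighted rendezvous hashing: pick the allocated virtual core
# with the highest weighted hash score for this packet's flow key.
import hashlib
import math

def _hash01(flow_key: bytes, vcore: int) -> float:
    """Deterministic hash of (flow, vcore) mapped into (0, 1)."""
    h = hashlib.sha256(flow_key + vcore.to_bytes(2, "big")).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 1)

def pick_vcore(flow_key: bytes, weights: list[float]) -> int:
    """weights[i] is the (queue-depth-adjusted) weight of virtual core i;
    virtual cores are allocated densely starting at 0."""
    best_vcore, best_score = -1, float("-inf")
    for vcore, w in enumerate(weights):
        if w <= 0:
            continue
        score = w / -math.log(_hash01(flow_key, vcore))
        if score > best_score:
            best_vcore, best_score = vcore, score
    return best_vcore

print(pick_vcore(b"10.0.0.5:5201->10.0.0.1:6379", [1.0, 1.0, 0.5]))
```

Because each virtual core's score is computed independently, changing one core's weight moves only a proportional share of flows, which is the consistency property needed here.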

SLIDE 13
  • 5. Estimate queue depths.

[Figure: NIC pipeline, adding Queue-Depth Estimation (per application's virtual core); weights are updated within 10 µs.]

  • Queue depths are estimated per virtual core.
  • Estimates are used to adjust consistent hashing weights.
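
A hedged sketch of the feedback loop: track an estimated queue depth per virtual core (bytes dispatched minus an assumed drain rate) and shrink the hashing weight of cores whose queues look deep. The drain model and weight formula are assumptions; the talk only says the estimates update the weights within roughly 10 µs.

```python
# Hypothetical queue-depth-to-weight feedback for one application.
class QueueFeedback:
    def __init__(self, n_vcores: int, drain_per_tick: float):
        self.depth = [0.0] * n_vcores   # estimated bytes queued per virtual core
        self.drain = drain_per_tick     # bytes a core drains per ~10 µs tick

    def on_packet(self, vcore: int, size: int) -> None:
        self.depth[vcore] += size

    def tick(self) -> list[float]:
        """Run every ~10 µs: decay queue estimates and recompute weights."""
        weights = []
        for i, d in enumerate(self.depth):
            self.depth[i] = max(0.0, d - self.drain)
            # Deeper estimated queues get proportionally less new traffic.
            weights.append(1.0 / (1.0 + self.depth[i] / self.drain))
        return weights

fb = QueueFeedback(n_vcores=3, drain_per_tick=4096)
fb.on_packet(0, 9000)          # virtual core 0 falls behind
print([round(w, 2) for w in fb.tick()])
```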

SLIDE 14
  • 6. Map the virtual core to a physical core.

[Figure: NIC pipeline, adding V2P Core Mapping (per application).]

  • CPU assigns each physical core to an application as an active/slack core.
  • Look up ⟨Application, Virtual Core⟩ → Physical Core in match-action table.
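
This step is an exact-match lookup keyed on ⟨application, virtual core⟩; a small dictionary stands in for the match-action table below (the entries are illustrative):

```python
# Illustrative (application, virtual core) -> physical core table.
# The host CPU installs these entries; the NIC only performs the lookup.
V2P = {
    (0, 0): 4,   # app 0, virtual core 0 -> physical core 4
    (0, 1): 5,
    (1, 0): 6,
}

def to_physical(app: int, vcore: int) -> int:
    try:
        return V2P[(app, vcore)]
    except KeyError:
        # Miss: a real system would fall back to a slack core or raise to software.
        raise LookupError(f"no physical core installed for app {app}, vcore {vcore}")

print(to_physical(0, 1))  # -> 5
```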

SLIDE 15
  • 1. An application needs additional headroom.

[Figure: host CPU timeline for App1 showing Run: Batch, Sleep: Server, and Run: Server phases, with events Linux Sched. Tick, Poll NIC, SW Alloc. Core, NIC Interrupt, and NIC Dealloc.; the eRSS manager coordinates with the NIC.]

SLIDE 16
  • 2. The core is initially running a batch job.

[Figure: core-lifecycle timeline, as above.]

SLIDE 17
  • 3. The software manager starts and pins a sleeping thread to the core.

[Figure: core-lifecycle timeline, as above.]

SLIDE 18
  • 4. When the NIC allocates a core, it wakes up the resident thread.

[Figure: core-lifecycle timeline, as above; the NIC interrupt wakes the resident thread.]

SLIDE 19
  • 5. Cores can run any server software, including distributed work stealing or preemption.

[Figure: core-lifecycle timeline, as above.]

SLIDE 20
  • 6. Upon deallocation, the packet thread sleeps and the OS schedules a batch job.

[Figure: core-lifecycle timeline, as above.]
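
The host-side behavior in steps 3 through 6 can be mimicked in user space with a thread pinned to its core that blocks until it is allocated and parks again on deallocation. The sketch below is an assumption-laden stand-in: a threading.Event models the NIC interrupt, and a no-op loop models the server's packet-processing loop; it is not the authors' manager code.

```python
# Hypothetical stand-in for the per-core packet thread managed by eRSS.
import os
import threading

class CoreThread:
    def __init__(self, core_id: int):
        self.core_id = core_id
        self.allocated = threading.Event()   # set by the "NIC interrupt", cleared on dealloc
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self.thread.start()

    def _run(self) -> None:
        # Pin to the core so the OS never migrates the packet thread (Linux only).
        if hasattr(os, "sched_setaffinity"):
            os.sched_setaffinity(0, {self.core_id})
        while not self.stop.is_set():
            # Sleep: Server -- batch jobs run on this core while we block here.
            if not self.allocated.wait(timeout=0.1):
                continue
            # Run: Server -- poll the NIC queue until deallocated.
            while self.allocated.is_set() and not self.stop.is_set():
                pass  # placeholder for the real packet-processing loop

    def nic_allocate(self) -> None:    # models the NIC interrupt (step 4)
        self.allocated.set()

    def nic_deallocate(self) -> None:  # thread parks, OS schedules batch work again (step 6)
        self.allocated.clear()

worker = CoreThread(core_id=0)
worker.start()
worker.nic_allocate()
worker.nic_deallocate()
worker.stop.set()
```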

SLIDE 21

Preliminary Evaluation

SLIDE 22

We simulate eRSS’s performance using a synthetic workload model.

  • Packets have Poisson-distributed inter-arrival times.
  • Packet sizes are representative of Internet traffic.
  • Packet processing time is correlated with packet size, plus added noise.
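
A minimal version of such a generator is sketched below; the distributions and constants are placeholders rather than the parameters used in the actual simulation.

```python
# Hypothetical synthetic packet trace: Poisson arrivals, internet-like sizes,
# and a service time that tracks packet size plus noise.
import random

def generate_trace(n_packets: int, rate_pps: float, seed: int = 0):
    rng = random.Random(seed)
    t = 0.0
    trace = []
    for _ in range(n_packets):
        t += rng.expovariate(rate_pps)            # Poisson process: exponential gaps
        # Crude bimodal size mix (many ACK-sized packets, some MTU-sized packets).
        size = 64 if rng.random() < 0.5 else 1500
        service_us = 0.01 * size * rng.lognormvariate(0.0, 0.3)  # size-correlated + noise
        trace.append((t, size, service_us))
    return trace

for arrival, size, service in generate_trace(5, rate_pps=1_000_000):
    print(f"t={arrival * 1e6:8.2f}us  size={size:4d}B  service={service:5.2f}us")
```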

SLIDE 25

eRSS responds quickly to load variations.

[Plot: requested traffic (Gbps) and cores allocated vs. time (ms); series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 26

eRSS deallocates slowly to ensure queues are drained.

[Plot: requested traffic (Gbps) and cores allocated vs. time (ms); series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 27

eRSS adds controllable tail latency.

[Plot: CDF of latency (µs, log scale) with the SLO marked; series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 28

Future Work & Summary

SLIDE 29

eRSS will be extended with ML.

  • Workload estimation
    • Efficient core scheduling requires accurate workload estimates.
    • Use packet header fields and deep packet inspection to gather statistics.
  • Core scheduling with Reinforcement Learning (RL)
    • Replace the heuristics for adding and removing an application's cores.
    • Replace consistent hashing for distributing packets between cores.

SLIDE 31

eRSS meets tail latency constraints while saving cores.

  • Parameters control the trade-off between core use and tail latency.
  • eRSS runs at line rate using slight extensions to existing NICs.
  • eRSS is compatible with a variety of software solutions.
  • eRSS can be extended with ML for automatic operation.

SLIDE 32

eRSS scalably & CPU-efficiently meets tail latency constraints.

Questions?

SLIDE 33

eRSS adds a controllable amount of additional queue depth.

[Plot: requested traffic (Gbps) and deepest queue (kiB) vs. time (ms); series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 34

eRSS minimizes breaking flows.

[Plot: CDF of flow break counts; series: eRSS-a (90% load), eRSS-c (75% load).]