Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs


SLIDE 1

Elastic RSS

Co-Scheduling Packets and Cores Using Programmable NICs

Alexander Rucker, Tushar Swamy, Muhammad Shahbaz, and Kunle Olukotun
Stanford University
August 17, 2019

SLIDE 2

How do we meet tail latency constraints?

SLIDE 4

Existing systems have several limitations.

Random Hashing

  • Load imbalance
  • Over-provisioned


Centralized Scheduling

  • Dedicated core
  • Limited throughput


SLIDE 5

How do we scalably & CPU-efficiently meet tail latency constraints?

SLIDE 6

eRSS uses all cores for useful work and runs at line rate.

SLIDE 7

Design

SLIDE 8

eRSS’s packet processing maps to a PISA NIC with map-reduce extensions.

[Figure: programmable NIC datapath: Parser → Match-Action Pipeline → Map-Reduce Block → Match-Action Pipeline → Deparser, operating on PHVs, with an on-chip core (ARM or PowerPC) and the host CPUs.]

SLIDE 9
  • 1. Assign each packet to an application.

[Figure: NIC pipeline, as on the previous slide.]

  • For example, use IP address or port number.
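
In software terms this is just an exact-match lookup on a few header fields. A minimal Python sketch (the table entries and the choice of destination IP/port as the key are illustrative assumptions, not from the talk):

```python
# Illustrative only: classify a packet to an application by destination IP/port.
# On the NIC this would be a match-action table; the field choice is an assumption.
from typing import NamedTuple, Optional

class Packet(NamedTuple):
    dst_ip: str
    dst_port: int
    size: int  # bytes

# <header fields> -> application ID, analogous to a match-action table entry.
APP_TABLE = {
    ("10.0.0.1", 6379): 0,   # e.g., a key-value store
    ("10.0.0.1", 80): 1,     # e.g., a web server
}

def classify(pkt: Packet) -> Optional[int]:
    """Return the application ID for a packet, or None if unmatched."""
    return APP_TABLE.get((pkt.dst_ip, pkt.dst_port))

print(classify(Packet("10.0.0.1", 6379, 512)))  # -> 0
```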

SLIDE 10
  • 2. Estimate the per-packet workload.

[Figure: NIC pipeline, with a Workload Estimation stage (per application).]

  • Can use any set of packet header fields (currently, only packet size).
  • Model is periodically trained by the CPU.
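
A hedged sketch of what such an estimator could look like: a linear model in packet size whose coefficients the host CPU refits periodically from measured service times. The linear form and the retraining interface are assumptions; the talk only states that the model currently uses packet size and is trained by the CPU.

```python
# Hypothetical per-application workload estimator: cost ~= a * size + b.
# The NIC only evaluates the model; the host CPU periodically refits a and b.
class WorkloadModel:
    def __init__(self, a: float = 0.01, b: float = 1.0):
        self.a, self.b = a, b  # cost per byte, fixed per-packet overhead

    def estimate(self, pkt_size: int) -> float:
        return self.a * pkt_size + self.b

    def retrain(self, samples: list[tuple[int, float]]) -> None:
        """Least-squares refit from (size, measured_cost) samples, on the CPU."""
        n = len(samples)
        if n < 2:
            return
        sx = sum(s for s, _ in samples); sy = sum(c for _, c in samples)
        sxx = sum(s * s for s, _ in samples); sxy = sum(s * c for s, c in samples)
        denom = n * sxx - sx * sx
        if denom:
            self.a = (n * sxy - sx * sy) / denom
            self.b = (sy - self.a * sx) / n

model = WorkloadModel()
model.retrain([(64, 1.5), (512, 6.0), (1500, 16.0)])
print(round(model.estimate(1024), 2))
```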

SLIDE 11
  • 3. Determine core count for the application.

[Figure: NIC pipeline, with Workload Estimation and Core Allocation stages (per application).]

  • Compare allocated cores to exponential moving average of workload.
  • Use heuristics and hysteresis to avoid ringing.
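
One plausible reading of this step, sketched below: keep an exponential moving average of offered work, convert it to a core target at a configured utilization, and only change the allocation when the target leaves a hysteresis band. The constants and the exact update rule are assumptions.

```python
# Hypothetical core-allocation loop: EMA of offered load plus a hysteresis band.
class CoreAllocator:
    def __init__(self, per_core_capacity: float, target_load: float = 0.9,
                 alpha: float = 0.2, hysteresis: float = 0.5):
        self.capacity = per_core_capacity   # work units one core absorbs per interval
        self.target_load = target_load      # e.g., eRSS-a runs cores near 90% load
        self.alpha = alpha                  # EMA smoothing factor
        self.hysteresis = hysteresis        # dead band, in cores, to avoid ringing
        self.ema = 0.0
        self.cores = 1

    def update(self, work_this_interval: float) -> int:
        self.ema = self.alpha * work_this_interval + (1 - self.alpha) * self.ema
        target = self.ema / (self.capacity * self.target_load)
        # Only move when the target leaves the hysteresis band around the
        # current allocation, so short bursts do not cause thrashing.
        if target > self.cores + self.hysteresis:
            self.cores += 1
        elif target < self.cores - 1 + self.hysteresis and self.cores > 1:
            self.cores -= 1
        return self.cores

alloc = CoreAllocator(per_core_capacity=10.0)
for work in [5, 12, 25, 30, 28, 9, 8]:
    print(alloc.update(work))
```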

SLIDE 12
  • 4. Select a virtual core.

[Figure: NIC pipeline, adding Consistent Hashing with Weights (per application's virtual core).]

  • Virtual cores within each application are allocated densely, starting at 0.
  • Packets are hashed & the best allocated core is chosen.
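
The talk does not spell out the hashing scheme, so the sketch below uses weighted rendezvous (highest-random-weight) hashing as one concrete way to get consistent, weight-aware placement over densely numbered virtual cores:

```python
# Hypothetical weighted rendezvous hashing: pick the allocated virtual core
# with the highest weighted hash score for this packet's flow key.
import hashlib
import math

def _hash01(flow_key: bytes, vcore: int) -> float:
    """Deterministic hash of (flow, vcore) mapped into (0, 1)."""
    h = hashlib.sha256(flow_key + vcore.to_bytes(2, "big")).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 1)

def pick_vcore(flow_key: bytes, weights: list[float]) -> int:
    """weights[i] is the (queue-depth-adjusted) weight of virtual core i;
    virtual cores are allocated densely starting at 0."""
    best_vcore, best_score = -1, float("-inf")
    for vcore, w in enumerate(weights):
        if w <= 0:
            continue
        score = w / -math.log(_hash01(flow_key, vcore))
        if score > best_score:
            best_vcore, best_score = vcore, score
    return best_vcore

print(pick_vcore(b"10.0.0.5:5201->10.0.0.1:6379", [1.0, 1.0, 0.5]))
```

Because each virtual core's score is computed independently, changing one core's weight moves only a proportional share of flows, which is the consistency property needed here.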

SLIDE 13
  • 5. Estimate queue depths.

[Figure: NIC pipeline, adding Queue-Depth Estimation (per application's virtual core); weights are updated within 10 µs.]

  • Queue depths are estimated per virtual core.
  • Estimates are used to adjust consistent hashing weights.
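
A hedged sketch of the feedback loop: track an estimated queue depth per virtual core (bytes dispatched minus an assumed drain rate) and shrink the hashing weight of cores whose queues look deep. The drain model and weight formula are assumptions; the talk only says the estimates update the weights within roughly 10 µs.

```python
# Hypothetical queue-depth-to-weight feedback for one application.
class QueueFeedback:
    def __init__(self, n_vcores: int, drain_per_tick: float):
        self.depth = [0.0] * n_vcores   # estimated bytes queued per virtual core
        self.drain = drain_per_tick     # bytes a core drains per ~10 µs tick

    def on_packet(self, vcore: int, size: int) -> None:
        self.depth[vcore] += size

    def tick(self) -> list[float]:
        """Run every ~10 µs: decay queue estimates and recompute weights."""
        weights = []
        for i, d in enumerate(self.depth):
            self.depth[i] = max(0.0, d - self.drain)
            # Deeper estimated queues get proportionally less new traffic.
            weights.append(1.0 / (1.0 + self.depth[i] / self.drain))
        return weights

fb = QueueFeedback(n_vcores=3, drain_per_tick=4096)
fb.on_packet(0, 9000)          # virtual core 0 falls behind
print([round(w, 2) for w in fb.tick()])
```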

SLIDE 14
  • 6. Map the virtual core to a physical core.

[Figure: NIC pipeline, adding V2P Core Mapping (per application).]

  • CPU assigns each physical core to an application as an active/slack core.
  • Look up ⟨Application, Virtual Core⟩ → Physical Core in match-action table.
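
This step is an exact-match lookup keyed on ⟨application, virtual core⟩; a small dictionary stands in for the match-action table below (the entries are illustrative):

```python
# Illustrative (application, virtual core) -> physical core table.
# The host CPU installs these entries; the NIC only performs the lookup.
V2P = {
    (0, 0): 4,   # app 0, virtual core 0 -> physical core 4
    (0, 1): 5,
    (1, 0): 6,
}

def to_physical(app: int, vcore: int) -> int:
    try:
        return V2P[(app, vcore)]
    except KeyError:
        # Miss: a real system would fall back to a slack core or raise to software.
        raise LookupError(f"no physical core installed for app {app}, vcore {vcore}")

print(to_physical(0, 1))  # -> 5
```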

SLIDE 15
  • 1. An application needs additional headroom.

[Figure: host CPU timeline for App1 showing Run: Batch, Sleep: Server, and Run: Server phases, with events Linux Sched. Tick, Poll NIC, SW Alloc. Core, NIC Interrupt, and NIC Dealloc.; the eRSS manager coordinates with the NIC.]

SLIDE 16
  • 2. The core is initially running a batch job.

[Figure: core-lifecycle timeline, as above.]

SLIDE 17
  • 3. The software manager starts and pins a sleeping thread to the core.

[Figure: core-lifecycle timeline, as above.]

SLIDE 18
  • 4. When the NIC allocates a core, it wakes up the resident thread.

[Figure: core-lifecycle timeline, as above; the NIC interrupt wakes the resident thread.]

SLIDE 19
  • 5. Cores can run any server software, including distributed work stealing or preemption.

[Figure: core-lifecycle timeline, as above.]

SLIDE 20
  • 6. Upon deallocation, the packet thread sleeps and the OS schedules a batch job.

[Figure: core-lifecycle timeline, as above.]
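
The host-side behavior in steps 3 through 6 can be mimicked in user space with a thread pinned to its core that blocks until it is allocated and parks again on deallocation. The sketch below is an assumption-laden stand-in: a threading.Event models the NIC interrupt, and a no-op loop models the server's packet-processing loop; it is not the authors' manager code.

```python
# Hypothetical stand-in for the per-core packet thread managed by eRSS.
import os
import threading

class CoreThread:
    def __init__(self, core_id: int):
        self.core_id = core_id
        self.allocated = threading.Event()   # set by the "NIC interrupt", cleared on dealloc
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self.thread.start()

    def _run(self) -> None:
        # Pin to the core so the OS never migrates the packet thread (Linux only).
        if hasattr(os, "sched_setaffinity"):
            os.sched_setaffinity(0, {self.core_id})
        while not self.stop.is_set():
            # Sleep: Server -- batch jobs run on this core while we block here.
            if not self.allocated.wait(timeout=0.1):
                continue
            # Run: Server -- poll the NIC queue until deallocated.
            while self.allocated.is_set() and not self.stop.is_set():
                pass  # placeholder for the real packet-processing loop

    def nic_allocate(self) -> None:    # models the NIC interrupt (step 4)
        self.allocated.set()

    def nic_deallocate(self) -> None:  # thread parks, OS schedules batch work again (step 6)
        self.allocated.clear()

worker = CoreThread(core_id=0)
worker.start()
worker.nic_allocate()
worker.nic_deallocate()
worker.stop.set()
```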

SLIDE 21

Preliminary Evaluation

SLIDE 22

We simulate eRSS’s performance using a synthetic workload model.

  • Packets have Poisson-distributed inter-arrival times.
  • Packet sizes are representative of Internet traffic.
  • Packet processing time is correlated with packet size, plus added noise.
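
A minimal version of such a generator is sketched below; the distributions and constants are placeholders rather than the parameters used in the actual simulation.

```python
# Hypothetical synthetic packet trace: Poisson arrivals, internet-like sizes,
# and a service time that tracks packet size plus noise.
import random

def generate_trace(n_packets: int, rate_pps: float, seed: int = 0):
    rng = random.Random(seed)
    t = 0.0
    trace = []
    for _ in range(n_packets):
        t += rng.expovariate(rate_pps)            # Poisson process: exponential gaps
        # Crude bimodal size mix (many ACK-sized packets, some MTU-sized packets).
        size = 64 if rng.random() < 0.5 else 1500
        service_us = 0.01 * size * rng.lognormvariate(0.0, 0.3)  # size-correlated + noise
        trace.append((t, size, service_us))
    return trace

for arrival, size, service in generate_trace(5, rate_pps=1_000_000):
    print(f"t={arrival * 1e6:8.2f}us  size={size:4d}B  service={service:5.2f}us")
```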

SLIDE 25

eRSS responds quickly to load variations.

[Plot: requested traffic (Gbps) and cores allocated vs. time (ms); series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 26

eRSS deallocates slowly to ensure queues are drained.

[Plot: requested traffic (Gbps) and cores allocated vs. time (ms); series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 27

eRSS adds controllable tail latency.

[Plot: CDF of latency (µs, log scale) with the SLO marked; series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 28

Future Work & Summary

SLIDE 29

eRSS will be extended with ML.

  • Workload estimation
    • Efficient core scheduling requires accurate workload estimates.
    • Use packet header fields and deep packet inspection to gather statistics.
  • Core scheduling with Reinforcement Learning (RL)
    • Replace the heuristics for adding and removing an application's cores.
    • Replace consistent hashing for distributing packets between cores.

SLIDE 31

eRSS meets tail latency constraints while saving cores.

  • Parameters control the trade-off between core use and tail latency.
  • eRSS runs at line rate using slight extensions to existing NICs.
  • eRSS is compatible with a variety of software solutions.
  • eRSS can be extended with ML for automatic operation.

SLIDE 32

eRSS scalably & CPU-efficiently meets tail latency constraints.

Questions?

SLIDE 33

eRSS adds a controllable amount of additional queue depth.

[Plot: requested traffic (Gbps) and deepest queue (kiB) vs. time (ms); series: RSS, eRSS-a (90% load), eRSS-c (75% load).]

SLIDE 34

eRSS minimizes breaking flows.

[Plot: CDF of flow break counts; series: eRSS-a (90% load), eRSS-c (75% load).]