Swayam: Distributed Autoscaling for Machine Learning as a Service

Arpan Gujarati, Björn B. Brandenburg, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley
Machine Learning as a Service (MLaaS)

Examples: Amazon Machine Learning, Google Cloud AI, and other data science and machine learning platforms.

Training: a dataset and an untrained model produce a trained model.
Prediction: a query against the trained model produces an answer.

This work: everything needed for prediction serving inside the MLaaS infrastructure. The models are already trained and available for prediction.
Example: an application or end user submits an image to the MLaaS provider's image classifier and receives the answer "cat".
Inside the MLaaS provider: lots of trained models, finite compute resources ("backends" for prediction), and multiple request dispatchers ("frontends").

Life of a request:
(1) A new prediction request arrives for the pink model.
(2) A frontend receives the request.
(3) The request is dispatched to an idle backend.
(4) The backend fetches the pink model.
(5) The request is executed.
(6) The response is sent back through the frontend.
Two competing goals: resource efficiency for the MLaaS provider, and low latency with SLA guarantees for the applications and end users.
What if the trained models were statically partitioned among the finite backends? Then there would be no need to fetch and install the pink model on demand.

Problem: not all models are used at all times.
Problem: there are many more models than backends, and each model has a high memory footprint.

Static partitioning is infeasible: it cannot deliver both resource efficiency and low latency with SLAs.
Autoscaling: the number of active backends for the pink model is automatically scaled up or down based on the request load for that model.

With ideal autoscaling, there are always enough backends to guarantee low latency, while the number of active backends over time is minimized for resource efficiency.
Consider steps (4) and (5) above: provisioning (the backend fetching the model) takes a few seconds, while executing a request takes only about 10 ms to 500 ms.

Challenge: provisioning times are much higher than execution times.
Requirement: predictive autoscaling that hides the provisioning latency.
The MLaaS architecture is large-scale and multi-tiered: a hardware broker, multiple frontends, and many backends (VMs or containers).

Challenge: multiple frontends, each with only partial information about the workload.
Requirement: fast, coordination-free, globally-consistent autoscaling decisions on the frontends.
"99% of requests must complete under 500ms"
Strict, model-specific SLAs
"99.9% of requests must complete under 1s" "[B] Tolerate up to 25% increase in request rates without violating [A]" "[A] 95% of requests must complete under 850ms"
12
"99% of requests must complete under 500ms"
Strict, model-specific SLAs
"99.9% of requests must complete under 1s"
No closed-form solutions to get response-time distributions for SLA-aware autoscaling Challenge Accurate waiting-time and execution-time distributions Requirement
"[B] Tolerate up to 25% increase in request rates without violating [A]" "[A] 95% of requests must complete under 850ms"
12
To summarize the challenges: provisioning times (a few seconds) far exceed execution times (about 10 ms to 500 ms); multiple frontends each have only partial information about the workload; and there are no closed-form solutions to get response-time distributions for SLA-aware autoscaling.

We address these challenges by leveraging specific ML workload characteristics and by designing an analytical model for resource estimation that enables distributed and predictive autoscaling.

Swayam: model-driven distributed autoscaling.
Architecture: applications and end users send requests via a hardware broker to the frontends. Backends come from a global pool, with a dedicated set of backends per trained model (pink, blue, green, ...).

Objective: each model's dedicated set of backends should dynamically scale. In what follows, we focus on the pink model.
Key idea 1: a hierarchy of backend states.

A backend is either cold (in the global pool) or warm (dedicated to a trained model).
A warm backend is either in-use (possibly executing requests) or not-in-use (it has not executed a request for a while).
An in-use backend is either busy (executing a request) or idle (waiting for a request).

Not-in-use backends are dedicated but unused due to reduced load. They can be safely garbage-collected (scale-in), or easily transitioned back to an in-use state (scale-out).

How do frontends know which dedicated backends to use, and which not to use?
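To make the hierarchy concrete, here is a minimal sketch (ours, not from the paper) of the states and the transitions described above, as a frontend might track them; all names are illustrative:

```python
from enum import Enum, auto

class BackendState(Enum):
    COLD = auto()             # in the global pool, no model installed
    WARM_NOT_IN_USE = auto()  # dedicated, but hasn't executed a request for a while
    WARM_IDLE = auto()        # in-use, waiting for a request
    WARM_BUSY = auto()        # in-use, executing a request

# Transitions implied by the slides (illustrative, not exhaustive):
TRANSITIONS = {
    BackendState.COLD:            {BackendState.WARM_IDLE},        # provision: fetch and install model
    BackendState.WARM_NOT_IN_USE: {BackendState.WARM_IDLE,         # scale-out: easy reactivation
                                   BackendState.COLD},             # scale-in: garbage collection
    BackendState.WARM_IDLE:       {BackendState.WARM_BUSY,         # a request is dispatched to it
                                   BackendState.WARM_NOT_IN_USE},  # load dropped
    BackendState.WARM_BUSY:       {BackendState.WARM_IDLE},        # request completed
}

def transition(cur: BackendState, nxt: BackendState) -> BackendState:
    assert nxt in TRANSITIONS[cur], f"illegal transition {cur} -> {nxt}"
    return nxt
```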
Key idea 2: order the dedicated set of backends.

The backends dedicated to the pink model are numbered 1 through 12; each is either warm in-use (busy or idle) or warm not-in-use. If 9 backends are sufficient for SLA compliance, the frontends use only backends 1-9, and backends 10-12 transition to the not-in-use state.

How do frontends know how many backends are sufficient?
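As a sketch of how frontends exploit this ordering (our illustration, assuming every frontend holds the same backend order and the same value of n; dispatching randomly within the in-use prefix anticipates the random-dispatch LB policy discussed below):

```python
import random

def dispatch_target(ordered_backends: list[str], n: int) -> str:
    """All frontends share the same order and the same n, so they all draw
    from the identical in-use prefix; backends beyond position n receive no
    requests, drift to not-in-use, and can eventually be reclaimed."""
    return random.choice(ordered_backends[:n])

# Example: with n = 9, backends 10-12 never receive requests.
backends = [f"backend-{i}" for i in range(1, 13)]
print(dispatch_target(backends, 9))
```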
Key idea 3: a Swayam instance on every frontend.

Each frontend runs a Swayam instance that, from its share of the incoming requests, computes the globally consistent minimum number of backends necessary for SLA compliance.
At each frontend, the Swayam instance must answer: what is the minimum number of backends required for SLA compliance? Swayam answers this by leveraging ML workload characteristics.
Determining expected request execution times. We studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform. Variation in service times is low, with occasional events being the main sources of variability, and the service times are well modeled by log-normal distributions.

[Figure: histogram of service times (ms) vs. normalized frequency (%) for Trace 1, bin width = 10 ms, with the fitted log-normal distribution overlaid.]
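To illustrate this modeling step, here is a sketch that fits a log-normal to measured service times; the data below is synthetic, standing in for a real trace:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for service times (ms) extracted from a trace.
rng = np.random.default_rng(42)
service_times = rng.lognormal(mean=np.log(60), sigma=0.35, size=10_000)

# Fit a log-normal distribution with the location fixed at zero.
shape, loc, scale = stats.lognorm.fit(service_times, floc=0)

# The fitted distribution yields any percentile in closed form, e.g. the
# 99th-percentile execution time that later feeds the SLA analysis.
p99 = stats.lognorm.ppf(0.99, shape, loc=loc, scale=scale)
print(f"fitted p99 execution time: {p99:.1f} ms")
```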
Load balancing (LB). We compared candidate LB policies by their tail waiting times as a function of the number of backends.

[Figure: waiting time (ms) vs. #backends, with a 350 ms threshold, for Global Scheduling, Partitioned Scheduling, Join-Idle-Queue, and Random Dispatch.]

Global and partitioned scheduling perform well, but involve implementation tradeoffs. Join-Idle-Queue (JIQ) does not result in good tail waiting times, whereas random dispatch gives much better tail waiting times.
We use an LB policy based on random dispatch, and we estimate the load in the near future to account for the high provisioning times.

Let L be the total request rate and F the total number of frontends. Since the hardware broker spreads requests uniformly among the frontends, each frontend observes a local rate L' = L/F. Each Swayam instance therefore measures L' locally and computes the global load as L = F x L'. F is determined from the broker or through a gossip protocol. How far ahead the load must be predicted depends on the time to set up a new backend.
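In code, this coordination-free load estimate is a one-liner (our sketch; names are illustrative):

```python
def estimate_global_load(local_rate: float, num_frontends: int) -> float:
    """The broker spreads requests uniformly, so each frontend sees
    L' = L/F, and the global rate is recovered as L = F * L'."""
    return num_frontends * local_rate

# Example: 8 frontends, each observing 50 requests/s, imply L = 400 requests/s.
assert estimate_global_load(50.0, 8) == 400.0
```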
Back to the central question at each frontend (Swayam instance): what is the minimum number of backends required for SLA compliance?
For each trained model, the SLA specifies a response-time threshold RTmax, a service level SLmin, and a burst threshold U. Each Swayam instance computes n, the minimum number of backends, as follows (a sketch follows this list):

(1) Initialize n = 1.
(2) Amplify the estimated load by the burst threshold U.
(3) For the amplified load and the current n, derive the waiting-time distribution and convolve it with the execution-time distribution to obtain the response-time distribution; a closed-form expression for the percentile response time is given in the appendix of the paper.
(4) If the SLmin-percentile response time is below RTmax, return n; otherwise increment n and retry until SLA compliant.
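The iteration can be sketched as follows. This is our illustration, not the paper's method: where Swayam evaluates a closed-form expression for the SLmin-percentile response time (see the appendix of the paper), we substitute a toy Monte-Carlo estimate that assumes exponential execution times and M/M/1 waiting at each backend under random dispatch. All names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_waiting_times(n, load, mean_exec, size=100_000):
    """Toy stand-in for Swayam's waiting-time analysis: random dispatch sends
    each of the n backends Poisson arrivals at rate load/n; for an M/M/1-FCFS
    queue the wait is 0 with probability 1 - rho and Exp(mu - lambda) otherwise."""
    lam, mu = load / n, 1.0 / mean_exec
    rho = lam / mu
    if rho >= 1.0:                        # unstable: waits grow without bound
        return np.full(size, np.inf)
    w = rng.exponential(1.0 / (mu - lam), size)
    return np.where(rng.random(size) < rho, w, 0.0)

def min_backends(load, mean_exec, rt_max, sl_min=0.99, burst=2.0):
    """Smallest n whose SLmin-percentile response time stays under RTmax."""
    amplified = load * burst              # (2) amplify load by burst threshold U
    for n in range(1, 1000):              # (1) start from n = 1
        w = sample_waiting_times(n, amplified, mean_exec)
        x = rng.exponential(mean_exec, w.size)  # toy execution-time samples
        r = w + x                         # (3) response = waiting (+) execution
        if np.percentile(r, sl_min * 100) < rt_max:
            return n                      # (4) SLA met: stop
    raise RuntimeError("SLA unattainable within 1000 backends")

# Example: 400 req/s, 50 ms mean execution, RTmax = 500 ms, SLmin = 99%, U = 2.
print(min_backends(load=400.0, mean_exec=0.05, rt_max=0.5))
```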
Recap: every frontend runs a Swayam instance that processes its share of the incoming requests, computes the percentile response time for a candidate n, and thereby derives the globally consistent minimum number of backends (e.g., backends 1-9 in-use, 10-12 not-in-use) necessary for SLA compliance.
Experimental setup:
➡ 100 backends per service
➡ 8 frontends
➡ 1 broker
➡ 1 server (for simulating the clients)

Workload:
➡ 15 production service traces (Microsoft Azure MLaaS)
➡ Three-hour traces (request arrival times and computation times)
➡ Query computation and model setup times emulated by spinning
SLAs used in the evaluation:
➡ C denotes the mean computation time for the model
➡ 99% of the requests must have response times under RTmax
➡ Tolerate an increase in request rate of up to 100% (i.e., burst threshold U = 2)
Baseline: ClairA, a clairvoyant autoscaler:
➡ It knows the processing time of each request beforehand
➡ It can travel back in time to provision a backend, so provisioning latency never hurts it
➡ It uses a "deadline-driven" approach to minimize resource waste

Two variants are compared:
➡ ClairA1, whose resource usage reflects the size of the workload
➡ ClairA2, which is Swayam-like
Results: resource usage (normalized) for ClairA1, ClairA2, and Swayam across the 15 traces.

[Figure: resource usage (normalized) vs. trace IDs 1 to 15, comparing ClairA1, ClairA2, and Swayam, annotated with Swayam's frequency of SLA compliance.]

Swayam's frequency of SLA compliance per trace: 97%, 98%, 64%, 95%, 100%, 100%, 100%, 100%, 100%, 100%, 100%, 87%, 91%, 89%, 97%.

Swayam performs much better than ClairA2 in terms of resource efficiency. Overall, Swayam is resource efficient, but at the cost of some SLA compliance: on trace 3, compliance drops to 64% because that trace is very bursty.
Takeaways:
➡ Swayam is resource efficient, close to a clairvoyant autoscaler in terms of resource usage (as modeled by ClairA)
➡ Bursty workloads require trading off some SLA compliance while managing client expectations
➡ Swayam realizes significant resource savings at the cost of occasional SLA violations