Swayam: Distributed Autoscaling for Machine Learning as a Service (PowerPoint presentation)




SLIDE 1

Arpan Gujarati, Björn B. Brandenburg Sameh Elnikety, Yuxiong He Kathryn S. McKinley

Swayam

Distributed Autoscaling for Machine Learning as a Service

1

SLIDE 2

Machine Learning as a Service (MLaaS)

Data Science & Machine Learning Amazon Machine Learning Machine Learning Google Cloud AI

2

SLIDE 3

Machine Learning as a Service (MLaaS)

Data Science & Machine Learning Amazon Machine Learning Machine Learning Google Cloud AI

  • 1. Training: Untrained model + Dataset = Trained Model

  • 2. Prediction: Trained Model + Query = Answer

2

SLIDE 4

Machine Learning as a Service (MLaaS)

  • 2. Prediction: Trained Model + Query = Answer

Models are already trained and available for prediction

This work

2

SLIDE 5

Swayam

inside the MLaaS infrastructure

Distributed autoscaling of the compute resources needed for prediction serving

  • 2. Prediction: Trained Model + Query = Answer

3

SLIDE 6

Prediction serving (application perspective)

MLaaS Provider

Image classifier "cat" image Application / End User

4

SLIDE 7

Lots of trained models!

MLaaS Provider

Finite compute resources "Backends" for prediction

Prediction serving (provider perspective)

5

SLIDE 8

Lots of trained models!

MLaaS Provider

Finite compute resources "Backends" for prediction

Application / End User

(1) New prediction request for the pink model (2) A frontend receives the request

Multiple request dispatchers "Frontends"

Prediction serving (provider perspective)

5

SLIDE 9

Lots of trained models!

MLaaS Provider

Finite compute resources "Backends" for prediction

Application / End User

(1) New prediction request for the pink model (2) A frontend receives the request (3) The request is dispatched to an idle backend (4) The backend fetches the pink model

Multiple request dispatchers "Frontends"

Prediction serving (provider perspective)

5

SLIDE 10

Lots of trained models!

MLaaS Provider

Finite compute resources "Backends" for prediction

Application / End User

(1) New prediction request for the pink model (2) A frontend receives the request (3) The request is dispatched to an idle backend (4) The backend fetches the pink model (5) The request outcome is predicted (6) The response is sent back through the frontend

Multiple request dispatchers "Frontends"

Prediction serving (provider perspective)

5

SLIDE 11

Lots of trained models!

MLaaS Provider

Application / End User Application / End User Multiple request dispatchers "Frontends"

Finite compute resources "Backends" for prediction

Prediction serving (objectives)

6

SLIDE 12

Lots of trained models!

MLaaS Provider MLaaS Provider

Application / End User Resource efficiency Low latency, SLAs Application / End User Application / End User Multiple request dispatchers "Frontends"

Finite compute resources "Backends" for prediction

Prediction serving (objectives)

6

SLIDE 13

Static partitioning of trained models

7

SLIDE 14

MLaaS Provider MLaaS Provider

Static partitioning of trained models

The trained models are partitioned among the finite backends

7

SLIDE 15

MLaaS Provider MLaaS Provider

Application / End User Multiple request dispatchers "Frontends"

Static partitioning of trained models

No need to fetch and install the pink model

The trained models are partitioned among the finite backends

7

SLIDE 16

MLaaS Provider MLaaS Provider

Application / End User Multiple request dispatchers "Frontends"

Problem: Not all models are used at all times

Static partitioning of trained models

No need to fetch and install the pink model

The trained models are partitioned among the finite backends

7

SLIDE 17

MLaaS Provider MLaaS Provider

Application / End User Multiple request dispatchers "Frontends"

Problem: Not all models are used at all times Problem: Many more models than backends, high memory footprint per model

Static partitioning of trained models

No need to fetch and install the pink model

The trained models are partitioned among the finite backends

7

SLIDE 18

MLaaS Provider

Static partitioning of trained models

MLaaS Provider

Multiple request dispatchers "Frontends"

Problem: Not all models are used at all times Problem: Many more models than backends, high memory footprint per model No need to fetch and install the pink model

The trained models are partitioned among the finite backends

Application / End User Resource efficiency Low latency, SLAs

Static partitioning is infeasible

8

SLIDE 19

Classical approach: autoscaling

The number of active backends is automatically scaled up or down based on load

[Plot: request load and # active backends for the pink model over time]

9

SLIDE 20

Classical approach: autoscaling

The number of active backends is automatically scaled up or down based on load

[Plot: request load and # active backends for the pink model over time]

Enough backends to guarantee low latency; # active backends over time is minimized for resource efficiency

With ideal autoscaling ...

9

SLIDE 21

Autoscaling for MLaaS is challenging [1/3]

10

SLIDE 22

Autoscaling for MLaaS is challenging [1/3]

Lots of trained models!

MLaaS Provider

Finite compute resources "Backends" for prediction

(4) The backend fetches the pink model (5) The request outcome is predicted

Multiple request dispatchers "Frontends"

10

SLIDE 23

Autoscaling for MLaaS is challenging [1/3]

Lots of trained models!

MLaaS Provider

Finite compute resources "Backends" for prediction

(4) The backend fetches the pink model (5) The request outcome is predicted

Multiple request dispatchers "Frontends"

Provisioning Time (4) (~ a few seconds) >> Execution Time (5) (~ 10 ms to 500 ms)

Challenge: provisioning time is much larger than execution time. Requirement: predictive autoscaling to hide the provisioning latency

10

SLIDE 24

MLaaS architecture is large-scale, multi-tiered

Frontends Backends [ VMs, containers ] Hardware broker

Autoscaling for MLaaS is challenging [2/3]

11

SLIDE 25

MLaaS architecture is large-scale, multi-tiered

Frontends Backends [ VMs, containers ] Hardware broker

Autoscaling for MLaaS is challenging [2/3]

Challenge: multiple frontends with partial information about the workload. Requirement: fast, coordination-free, globally-consistent autoscaling decisions on the frontends

11

SLIDE 26

"99% of requests must complete under 500ms"

Strict, model-specific SLAs on response times

"99.9% of requests must complete under 1s" "[B] Tolerate up to 25% increase in request rates without violating [A]" "[A] 95% of requests must complete under 850ms"

Autoscaling for MLaaS is challenging [3/3]

12

SLIDE 27

"99% of requests must complete under 500ms"

Strict, model-specific SLAs on response times

"99.9% of requests must complete under 1s"

Challenge: no closed-form solutions to get response-time distributions for SLA-aware autoscaling. Requirement: accurate waiting-time and execution-time distributions

"[B] Tolerate up to 25% increase in request rates without violating [A]" "[A] 95% of requests must complete under 850ms"

Autoscaling for MLaaS is challenging [3/3]

12

SLIDE 28


Challenges: (1) provisioning time (~ a few seconds) >> execution time (~ 10 ms to 500 ms); (2) multiple frontends with partial information about the workload; (3) no closed-form solutions to get response-time distributions for SLA-aware autoscaling

We address these challenges

by leveraging specific ML workload characteristics and designing an analytical model for resource estimation that enables distributed and predictive autoscaling

Swayam: model-driven distributed autoscaling

13

SLIDE 29

Outline

  • 1. System architecture, key ideas
  • 2. Analytical model for resource estimation
  • 3. Evaluation results

14

SLIDE 30

System architecture

15

SLIDE 31

Application / End User Application / End User Application / End User Application / End User Hardware broker Frontends Global pool of backends

Backends dedicated for the pink model Backends dedicated for the blue model Backends dedicated for the green model

System architecture

15

SLIDE 32

Application / End User Application / End User Application / End User Application / End User Hardware broker Frontends Global pool of backends

Backends dedicated for the pink model Backends dedicated for the blue model Backends dedicated for the green model

System architecture

  • 1. If load decreases, extra backends go back to the global pool (for resource efficiency)
  • 2. If load increases, new backends are set up in advance (for SLA compliance)

Objective: dedicated set of backends should dynamically scale

15

SLIDE 33

Application / End User Application / End User Application / End User Application / End User Hardware broker Frontends

Backends dedicated for the pink model

System architecture

Let's focus on the pink model

  • 1. If load decreases, extra backends go back to the global pool (for resource efficiency)
  • 2. If load increases, new backends are set up in advance (for SLA compliance)

Objective: dedicated set of backends should dynamically scale

15

SLIDE 34

Key idea 1: Assign states to each backend

16

SLIDE 35

cold (in the global pool) | warm (dedicated to a trained model)

Key idea 1: Assign states to each backend

16

SLIDE 36

cold (in the global pool) | warm (dedicated to a trained model)

warm sub-states: in-use (maybe executing a request) | not-in-use (hasn't executed a request for a while)

Key idea 1: Assign states to each backend

16

SLIDE 37

cold (in the global pool) | warm (dedicated to a trained model)

warm sub-states: in-use (busy: executing a request; idle: waiting for a request) | not-in-use (hasn't executed a request for a while)

Key idea 1: Assign states to each backend

16

SLIDE 38

cold (in the global pool) | warm (dedicated to a trained model)

warm sub-states: in-use (busy: executing a request; idle: waiting for a request) | not-in-use (hasn't executed a request for a while)

Dedicated, but not used due to reduced load Can be safely garbage collected (scale-in)

Key idea 1: Assign states to each backend

... or easily transitioned to an in-use state (scale-out)

16
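The state hierarchy above can be sketched as a small state machine. This is a minimal illustration: the state names follow the slides, but the exact transition table is an assumption, not taken from the paper.

```python
from enum import Enum, auto

class BackendState(Enum):
    COLD = auto()        # in the global pool
    NOT_IN_USE = auto()  # warm, dedicated, hasn't executed a request for a while
    IDLE = auto()        # warm, in-use, waiting for a request
    BUSY = auto()        # warm, in-use, executing a request

# Illustrative transition table (an assumption, not from the paper)
ALLOWED = {
    BackendState.COLD:       {BackendState.IDLE},                           # scale-out from the pool
    BackendState.NOT_IN_USE: {BackendState.IDLE, BackendState.COLD},        # reuse, or garbage-collect
    BackendState.IDLE:       {BackendState.BUSY, BackendState.NOT_IN_USE},  # dispatch, or stop using
    BackendState.BUSY:       {BackendState.IDLE},                           # request completed
}

def transition(current: BackendState, new: BackendState) -> BackendState:
    """Apply a state change, rejecting transitions the hierarchy forbids."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

The point of the hierarchy is that not-in-use backends sit between the extremes: cheap to reclaim (to cold) and cheap to reactivate (to in-use).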

SLIDE 39

cold (in the global pool) | warm (dedicated to a trained model)

warm sub-states: in-use (busy: executing a request; idle: waiting for a request) | not-in-use (hasn't executed a request for a while)

Dedicated, but not used due to reduced load Can be safely garbage collected (scale-in)

Key idea 1: Assign states to each backend

... or easily transitioned to an in-use state (scale-out). How do frontends know which dedicated backends to use, and which not to use?

16

SLIDE 40

Backends dedicated for the pink model

Key idea 2: Order the dedicated set of backends

1 2 3 4 6 5 11 10 9 8 7 12

17

SLIDE 41

Backends dedicated for the pink model

Key idea 2: Order the dedicated set of backends

1 2 3 4 6 5 11 10 9 8 7 12

If 9 backends are sufficient for SLA compliance ...

17

SLIDE 42

Backends dedicated for the pink model

Key idea 2: Order the dedicated set of backends

1 2 3 4 6 5 11 10 9 8 7 12

= warm in-use busy/idle = warm not-in-use

Backends dedicated for the pink model

1 2 3 4 6 5 11 10 9 8 7 12

If 9 backends are sufficient for SLA compliance ... frontends use backends 1-9, and backends 10-12 transition to the not-in-use state

17
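Because the dedicated backends are totally ordered, the scaling decision reduces to picking a prefix of the order; a minimal sketch (the function name is illustrative):

```python
def split_by_prefix(ordered_backends, n):
    """Frontends use the first n backends (warm in-use); the remainder
    transition to the warm not-in-use state and become scale-in candidates."""
    return ordered_backends[:n], ordered_backends[n:]

# The slide's example: 12 dedicated backends, 9 sufficient for the SLA
in_use, not_in_use = split_by_prefix(list(range(1, 13)), 9)
```

Since every frontend computes the same n from (approximately) the same global load estimate, all frontends agree on the same prefix without coordinating with one another.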

SLIDE 43

Backends dedicated for the pink model

Key idea 2: Order the dedicated set of backends

1 2 3 4 6 5 11 10 9 8 7 12

= warm in-use busy/idle = warm not-in-use

Backends dedicated for the pink model

1 2 3 4 6 5 11 10 9 8 7 12

If 9 backends are sufficient for SLA compliance ... frontends use backends 1-9, and backends 10-12 transition to the not-in-use state. How do frontends know how many backends are sufficient?

17

SLIDE 44

Frontends

Key idea 3: Swayam instance on every frontend

Incoming requests Swayam instance

computes globally consistent minimum # backends necessary for SLA compliance

Backends dedicated for the pink model

1 2 3 4 6 5 11 10 9 8 7 12

= warm in-use busy/idle = warm not-in-use

18

SLIDE 45

Outline

  • 1. System architecture, key ideas
  • 2. Analytical model for resource estimation
  • 3. Evaluation results

19

SLIDE 46

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance?

at each frontend (Swayam instance)

20

SLIDE 47

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance?

at each frontend (Swayam instance)

  • 1. Expected request execution time
  • 2. Expected request waiting time
  • 3. Total request load

20

SLIDE 48

Making globally-consistent decisions


leverage ML workload characteristics What is the minimum # backends required for SLA compliance?

at each frontend (Swayam instance)

  • 1. Expected request execution time
  • 2. Expected request waiting time
  • 3. Total request load

20

SLIDE 49

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform

[Histogram: service times (ms, x-axis) vs. normalized frequency (%, y-axis) for Trace 1; data from trace, bin width = 10]

21

SLIDE 50

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform

  • Fixed-sized feature vectors
  • Input-independent control flow
  • Non-deterministic machine & OS events are the main sources of variability

Variation is low

[Histogram: service times (ms, x-axis) vs. normalized frequency (%, y-axis) for Trace 1; data from trace, bin width = 10]

21

SLIDE 51

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform

  • Fixed-sized feature vectors
  • Input-independent control flow
  • Non-deterministic machine & OS events are the main sources of variability

Variation is low; modeled using log-normal distributions

[Histogram: service times (ms, x-axis) vs. normalized frequency (%, y-axis) for Trace 1, bin width = 10; overlaid: fitted lognormal distribution]

21
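Fitting a log-normal to observed service times is straightforward; a sketch on synthetic data (the 60 ms mean and 0.25 shape parameter are made-up stand-ins for a trace, not values from the paper):

```python
import math
import random
import statistics

random.seed(42)
# Synthetic service times (ms), standing in for one of the Azure traces
samples = [random.lognormvariate(math.log(60), 0.25) for _ in range(10_000)]

# Maximum-likelihood lognormal fit: mu/sigma are the mean/stdev of the log-samples
logs = [math.log(s) for s in samples]
mu = statistics.fmean(logs)
sigma = statistics.stdev(logs)

# Expected execution time implied by the fitted distribution
expected_exec_ms = math.exp(mu + sigma ** 2 / 2)
```

Because the variation is low and input-independent, a single fitted distribution per model is enough to feed the response-time model later in the deck.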

SLIDE 52

Determining expected request waiting times

load balancing (LB)

22

SLIDE 53

Determining expected request waiting times

load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with 350 ms threshold line]

22

SLIDE 54

Determining expected request waiting times

load balancing (LB)

[Plot: waiting time (ms) vs. # backends, 350 ms threshold; curves: Global Scheduling, Partitioned Scheduling]

Global and partitioned perform well, but there are implementation tradeoffs

22

SLIDE 55

Determining expected request waiting times

load balancing (LB)

[Plot: waiting time (ms) vs. # backends, 350 ms threshold; curves: Global Scheduling, Partitioned Scheduling, Join-Idle-Queue]

JIQ doesn't result in good tail waiting times Global and partitioned perform well, but there are implementation tradeoffs

22

SLIDE 56

Determining expected request waiting times

load balancing (LB)

[Plot: waiting time (ms) vs. # backends, 350 ms threshold; curves: Global Scheduling, Partitioned Scheduling, Join-Idle-Queue, Random Dispatch]

JIQ doesn't result in good tail waiting times Global and partitioned perform well, but there are implementation tradeoffs Random dispatch gives much better tail waiting times

22

SLIDE 57

Determining expected request waiting times

load balancing (LB)

[Plot: waiting time (ms) vs. # backends, 350 ms threshold; curves: Global Scheduling, Partitioned Scheduling, Join-Idle-Queue, Random Dispatch]

JIQ doesn't result in good tail waiting times Global and partitioned perform well, but there are implementation tradeoffs Random dispatch gives much better tail waiting times

22

We use a LB policy based on random dispatch!
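The tail behavior of random dispatch can be checked with a tiny simulation. All parameters here (the ~60 ms lognormal service times, the 0.5 req/ms arrival rate, the backend counts) are illustrative assumptions, not the deck's experimental settings.

```python
import math
import random

random.seed(1)

def p99_wait_random_dispatch(n_backends, arrival_rate_per_ms, n_requests=50_000):
    """Poisson arrivals; each request is dispatched uniformly at random to one
    backend's FIFO queue; service times are lognormal with low variance (as in
    the traces). Returns the 99th-percentile waiting time in ms."""
    free_at = [0.0] * n_backends  # time when each backend next becomes free
    now, waits = 0.0, []
    for _ in range(n_requests):
        now += random.expovariate(arrival_rate_per_ms)
        b = random.randrange(n_backends)         # random dispatch
        start = max(now, free_at[b])
        waits.append(start - now)                # time spent queued
        free_at[b] = start + random.lognormvariate(math.log(60), 0.25)
    waits.sort()
    return waits[int(0.99 * len(waits))]

# More backends => shorter tail waits, at the cost of more resources
p99_50 = p99_wait_random_dispatch(50, 0.5)
```

The appeal of random dispatch for Swayam is that its waiting-time distribution can be characterized analytically per backend, which is what the resource-estimation model needs.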
SLIDE 58

in the near future, to account for high provisioning times

Determining the total request load

23

SLIDE 59


in the near future, to account for high provisioning times

Determining the total request load

Frontends Hardware broker

L = total request rate; F = total # frontends

Since the broker spreads requests uniformly among the frontends, each frontend observes L' = L/F

23

SLIDE 60


in the near future, to account for high provisioning times

Determining the total request load

Frontends Hardware broker

L' = L/F

  • Predicts L' for the near future

Each Swayam instance

Depends on the time to setup a new backend

23

SLIDE 61


in the near future, to account for high provisioning times

Determining the total request load

Frontends Hardware broker

L' = L/F

  • Predicts L' for the near future

Determined from the broker, or through a gossip protocol

Each Swayam instance

  • Given F, computes L = F × L'

23
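The load reconstruction on this slide is simple arithmetic, worth making concrete. The deck only says each Swayam instance predicts L' for the near future; the exponential smoothing below is an assumed, illustrative prediction method.

```python
def total_load(local_rate: float, n_frontends: int) -> float:
    """The broker spreads requests uniformly, so L = F * L'."""
    return n_frontends * local_rate

def predict_local_rate(history, alpha=0.3):
    """Near-future estimate of L' via exponential smoothing (assumed method;
    the horizon should match the time needed to set up a new backend)."""
    est = history[0]
    for x in history[1:]:
        est = alpha * x + (1 - alpha) * est
    return est

# Example: 8 frontends, each recently observing ~125 req/s
l_prime = predict_local_rate([120.0, 124.0, 126.0, 125.0])
l_total = total_load(l_prime, 8)
```

Because every frontend observes a uniform 1/F slice of the same arrival process, each one can reconstruct (approximately) the same global load L without talking to the other frontends.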

SLIDE 62

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance?

at each frontend (Swayam instance)

  • 1. Expected request execution time
  • 2. Expected request waiting time
  • 3. Total request load

24

SLIDE 63

SLA-aware resource estimation

For each trained model

RTmax = response-time threshold; SLmin = service level; U = burst threshold; n = min # backends

25

SLIDE 64

SLA-aware resource estimation

For each trained model

RTmax = response-time threshold; SLmin = service level; U = burst threshold; n = min # backends

[Flowchart: the waiting-time and execution-time distributions, together with the load amplified by U, feed the response-time model; starting from n = 1, n is incremented until the SLmin-percentile response time is below RTmax]

25

SLIDE 65

SLA-aware resource estimation

For each trained model

RTmax = response-time threshold; SLmin = service level; U = burst threshold; n = min # backends

[Flowchart: the waiting-time and execution-time distributions, together with the load amplified by U, feed the response-time model; starting from n = 1, n is incremented until the SLmin-percentile response time is below RTmax]

Convolution of the waiting-time and execution-time distributions; closed-form expression for the percentile response time (see the appendix)

25
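The "convolution" step combines the waiting-time and execution-time distributions into a response-time distribution (response time = waiting + execution). On discretized distributions this is a plain convolution; the sketch below is generic, not the paper's closed-form expression.

```python
def convolve(wait_pmf, exec_pmf):
    """Convolution of two discretized PMFs over a common bin width:
    response time = waiting time + execution time."""
    out = [0.0] * (len(wait_pmf) + len(exec_pmf) - 1)
    for i, a in enumerate(wait_pmf):
        for j, b in enumerate(exec_pmf):
            out[i + j] += a * b
    return out

def percentile_bin(pmf, p):
    """Smallest bin index whose cumulative mass reaches p."""
    acc = 0.0
    for i, mass in enumerate(pmf):
        acc += mass
        if acc >= p:
            return i
    return len(pmf) - 1

# Toy example with 10 ms bins (made-up masses)
wait_pmf = [0.7, 0.2, 0.1]        # mass at 0 / 10 / 20 ms of waiting
exec_pmf = [0.0, 0.5, 0.4, 0.1]   # mass at 0 / 10 / 20 / 30 ms of execution
resp = convolve(wait_pmf, exec_pmf)
```

In the actual system the closed form replaces this numeric step, which is what makes the per-request estimation cheap enough to run on every frontend.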

SLIDE 66

SLA-aware resource estimation

For each trained model

RTmax = response-time threshold; SLmin = service level; U = burst threshold; n = min # backends

[Flowchart: the waiting-time and execution-time distributions, together with the load amplified by U, feed the response-time model; starting from n = 1, n is incremented until the SLmin-percentile response time is below RTmax]

Amplified based on the burst threshold

25

SLIDE 67

SLA-aware resource estimation

For each trained model

RTmax = response-time threshold; SLmin = service level; U = burst threshold; n = min # backends

[Flowchart: the waiting-time and execution-time distributions, together with the load amplified by U, feed the response-time model; starting from n = 1, n is incremented until the SLmin-percentile response time is below RTmax]

Initialization: n = 1. Retry (n++) as long as not SLA-compliant

25

Compute percentile response time for n
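The estimation loop above (start at n = 1, amplify the load by the burst threshold U, increment n until the SLmin-percentile response time drops below RTmax) can be sketched directly. The paper's closed-form response-time model is in its appendix; here a simple M/M/1-per-backend quantile stands in for it purely for illustration.

```python
import math

def mm1_percentile_rt(n, load, p, mu=10.0):
    """Stand-in response-time model (NOT the paper's): treat each of n backends
    as an M/M/1 queue with service rate mu req/s and arrival rate load/n.
    The M/M/1 response time is Exp(mu - lambda); its p-quantile is
    -ln(1 - p) / (mu - lambda)."""
    lam = load / n
    if lam >= mu:
        return math.inf  # overloaded: the SLA cannot be met at this n
    return -math.log(1 - p) / (mu - lam)

def min_backends(rt_max, sl_min, burst_u, load, percentile_rt=mm1_percentile_rt):
    """Smallest n whose SLmin-percentile response time, under load amplified
    by the burst threshold U, is below RTmax."""
    n = 1
    while percentile_rt(n, load * burst_u, sl_min) >= rt_max:
        n += 1
    return n

# RTmax = 1 s, SLmin = 99%, U = 2x, offered load 40 req/s, mu = 10 req/s each
n = min_backends(1.0, 0.99, 2.0, 40.0)
```

With these toy numbers the loop stops at n = 15: at n = 14 the 99th-percentile response time still exceeds 1 s, at n = 15 it drops just below. Any frontend running the same loop on the same inputs lands on the same n.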

SLIDE 68

Frontends

Incoming requests Swayam instance

computes globally consistent minimum # backends necessary for SLA compliance

Backends dedicated for the pink model

1 2 3 4 6 5 11 10 9 8 7 12

= warm in-use busy/idle = warm not-in-use

26

Swayam Framework

SLIDE 69

Outline

  • 1. System architecture, key ideas
  • 2. Analytical model for resource estimation
  • 3. Evaluation results

27

SLIDE 70

Evaluation setup

  • Prototype in C++ on top of Apache Thrift

➡ 100 backends per service ➡ 8 frontends ➡ 1 broker ➡ 1 server (for simulating the clients)

28

SLIDE 71

Evaluation setup

  • Prototype in C++ on top of Apache Thrift

➡ 100 backends per service ➡ 8 frontends ➡ 1 broker ➡ 1 server (for simulating the clients)

  • Workload

➡ 15 production service traces (Microsoft Azure MLaaS) ➡ Three-hour traces (request arrival times and computation times) ➡ Query computation & model setup times emulated by spinning

28

SLIDE 72

SLA configuration for each model

  • Response-time threshold RTmax = 5C

➡ C denotes the mean computation time for the model

  • Desired service level SLmin = 99%

➡ 99% of the requests must have response times under RTmax

  • Burst threshold U = 2x

➡ Tolerate increase in request rate by up to 100%

  • Initially, 5 pre-provisioned backends

29

SLIDE 73

Baseline: Clairvoyant Autoscaler (ClairA)

➡ It knows the processing time of each request beforehand ➡ It can travel back in time to provision a backend ➡ "Deadline-driven" approach to minimize resource waste

30

SLIDE 74

Baseline: Clairvoyant Autoscaler (ClairA)

  • ClairA1 assumes zero setup times, immediate scale-ins

➡ Reflects the size of the workload

➡ It knows the processing time of each request beforehand ➡ It can travel back in time to provision a backend ➡ "Deadline-driven" approach to minimize resource waste

30

SLIDE 75

Baseline: Clairvoyant Autoscaler (ClairA)

  • ClairA1 assumes zero setup times, immediate scale-ins

➡ Reflects the size of the workload

  • ClairA2 assumes non-zero setup times, lazy scale-ins

➡ Swayam-like

➡ It knows the processing time of each request beforehand ➡ It can travel back in time to provision a backend ➡ "Deadline-driven" approach to minimize resource waste

30

SLIDE 76

Baseline: Clairvoyant Autoscaler (ClairA)

  • ClairA1 assumes zero setup times, immediate scale-ins

➡ Reflects the size of the workload

  • ClairA2 assumes non-zero setup times, lazy scale-ins

➡ Swayam-like

  • Both ClairA1 and ClairA2 depend on RTmax, but not on SLmin and U

➡ It knows the processing time of each request beforehand ➡ It can travel back in time to provision a backend ➡ "Deadline-driven" approach to minimize resource waste

30

SLIDE 77

Resource usage vs. SLA compliance

31

SLIDE 78

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

31

SLIDE 79

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

31

SLIDE 80

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

31

SLIDE 81

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

Frequency of SLA compliance (traces 1-15): 97% 98% 64% 95% 100% 100% 100% 100% 100% 100% 100% 87% 91% 89% 97%

31

SLIDE 82

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

Frequency of SLA compliance (traces 1-15): 97% 98% 64% 95% 100% 100% 100% 100% 100% 100% 100% 87% 91% 89% 97%

Swayam performs much better than ClairA2 in terms of resource efficiency

31

SLIDE 83

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

Frequency of SLA compliance (traces 1-15): 97% 98% 64% 95% 100% 100% 100% 100% 100% 100% 100% 87% 91% 89% 97%

Swayam is resource efficient but at the cost of SLA compliance

31

SLIDE 84

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

Frequency of SLA compliance (traces 1-15): 97% 98% 64% 95% 100% 100% 100% 100% 100% 100% 100% 87% 91% 89% 97%

Swayam is resource efficient but at the cost of SLA compliance

31

SLIDE 85

Resource usage vs. SLA compliance

[Plot: normalized resource usage per trace ID (1-15) for ClairA1, ClairA2, and Swayam (annotated with frequency of SLA compliance)]

Frequency of SLA compliance (traces 1-15): 97% 98% 64% 95% 100% 100% 100% 100% 100% 100% 100% 87% 91% 89% 97%

Swayam performs poorly on one trace because that trace is very bursty

31

SLIDE 86

Summary

  • Perfect SLA, irrespective of the input workload, is too expensive

➡ in terms of resource usage (as modeled by ClairA)

32

SLIDE 87

Summary

  • Perfect SLA, irrespective of the input workload, is too expensive

➡ in terms of resource usage (as modeled by ClairA)

  • To ensure resource efficiency, practical systems

➡ need to trade off some SLA compliance ➡ while managing client expectations

32

SLIDE 88

Summary

  • Perfect SLA, irrespective of the input workload, is too expensive

➡ in terms of resource usage (as modeled by ClairA)

  • To ensure resource efficiency, practical systems

➡ need to trade off some SLA compliance ➡ while managing client expectations

  • Swayam strikes a good balance, for MLaaS prediction serving

➡ by realizing significant resource savings ➡ at the cost of occasional SLA violations

32

SLIDE 89

Summary

  • Perfect SLA, irrespective of the input workload, is too expensive

➡ in terms of resource usage (as modeled by ClairA)

  • To ensure resource efficiency, practical systems

➡ need to trade off some SLA compliance ➡ while managing client expectations

  • Swayam strikes a good balance, for MLaaS prediction serving

➡ by realizing significant resource savings ➡ at the cost of occasional SLA violations

  • Easy integration into any existing request-response architecture

32

SLIDE 90

Thank you. Questions?

33