Distributed Machine Learning with a Serverless Architecture - Hao Wang - PowerPoint PPT Presentation




SLIDE 1

Distributed Machine Learning with a Serverless Architecture

Hao Wang1, Di Niu2, Baochun Li1

1University of Toronto, 2University of Alberta

INFOCOM’19, Paris, FR

SLIDE 2

SLIDE 3

What is machine learning?

SLIDE 4

Deep Learning

SLIDE 5

Machine Learning


Gradients Numerical optimization

SLIDE 6

ML Workflow

(Workflow diagram: stages Model Design, Model Tuning, and Training & Evaluation; inputs include the Objective, Data, and Datasets; considerations include convergence, loss rate, Resource Reservation, and Budget.)

SLIDE 7

Our Key Insights

  • Most current ML training jobs are data parallel
  • Model quality and resource investment have a nonlinear relation
  • ML training is inevitably a trial-and-error process


SLIDE 8

Distributed ML Infrastructure

             IaaS                               PaaS
Pricing      Per hour                           Per hour
Maintenance  By users                           By providers
Examples     AWS EC2, Google Cloud Compute …    Azure ML Studio, Google Cloud ML Engine …

SLIDE 9

Serverless?

             IaaS        PaaS               Serverless
Pricing      Per hour    Per hour           Per call
Maintenance  By users    By providers       By providers
Examples     AWS EC2     Azure ML Studio    AWS Lambda

SLIDE 10

Serverless Computing?

Stateless function

  • Only input and output, no intermediate states
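A stateless function can be sketched as a minimal AWS Lambda-style handler in Python (the `handler(event, context)` signature is Lambda's standard Python entry point; the body is an illustrative example of mine): all input arrives in `event`, all output leaves in the return value, and nothing survives between calls.

```python
def handler(event, context):
    """Stateless: no module-level or instance state survives between calls.
    All input arrives in `event`; all output is the return value."""
    x = event["x"]
    return {"y": x * x}
```

In practice, any intermediate state a training job needs (e.g. model parameters) must therefore live in external storage rather than in the function itself.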


SLIDE 11

Go Serverless?

Pro:

  • 1. Flexible concurrency
  • 2. Instant response
  • 3. Easy to deploy
  • 4. Cheap? (billed by Runtime × MemSize)

Con:

  • 1. Execution model is too simple
  • 2. Runtime limitations (~15min)
  • 3. Communication overhead
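The "Runtime × MemSize" billing above can be made concrete with a small cost estimate (a sketch; the per-GB-second rate below is illustrative and should be checked against current AWS Lambda pricing):

```python
def lambda_cost(n_functions, runtime_s, mem_mb, price_per_gb_s=0.0000166667):
    """Approximate serverless bill: functions x runtime x memory x rate.
    `price_per_gb_s` is an illustrative per-GB-second rate, not a quote."""
    gb_seconds = n_functions * runtime_s * (mem_mb / 1024.0)
    return gb_seconds * price_per_gb_s

# e.g. 150 functions running 100 s each at 1024 MB
cost = lambda_cost(150, 100, 1024)
```

Because the bill scales with concurrency, runtime, and memory together, "cheap" depends entirely on how many functions the job launches and for how long.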


SLIDE 12

λ

ML Training on Serverless?

  • MapReduce on Serverless Cloud (PyWren, [SoCC’17])
  • Video processing on Serverless Cloud (Sprocket [SoCC’18])
SLIDE 13

Stochastic Gradient Descent (SGD)

Input Samples

$\theta_j = \theta_j + \alpha\,(y_i - h_\theta(x_i))\,x_{i,j}$


SLIDE 14

Mini-batch SGD

Input Samples

$\theta_j = \theta_j + \frac{\alpha}{b} \sum_{k=i}^{i+b-1} (y_k - h_\theta(x_k))\,x_{k,j}$
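For a logistic-regression hypothesis $h_\theta$ (the workload used later in the toy example), the mini-batch update above can be sketched in NumPy (function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd_step(theta, X, y, alpha):
    """theta_j <- theta_j + (alpha/b) * sum_k (y_k - h_theta(x_k)) * x_{k,j},
    where X holds one mini-batch of b samples (rows) and y their labels."""
    b = X.shape[0]
    h = sigmoid(X @ theta)   # h_theta(x_k) for every sample in the batch
    return theta + (alpha / b) * (X.T @ (y - h))
```

The whole mini-batch is vectorized: `X.T @ (y - h)` computes the sum over $k$ for every coordinate $j$ at once.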


SLIDE 15

Parameter Server

  • Model replicas on workers
  • Servers update parameters

Li, Mu, et al. Scaling distributed machine learning with the parameter server. OSDI'14

SLIDE 16

SGD on Lambda

$\theta_j = \theta_j + \frac{\alpha}{b} \sum_{k=i}^{i+b-1} (y_k - h_\theta(x_k))\,x_{k,j}$ (computed independently on each function)

(Diagram: input samples are partitioned across parallel stateless functions, which read and write parameters through the KV storage.)
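The per-function loop sketched in the diagram can be written as below; a plain dict stands in for the KV storage (S3 in the talk), and all names and interfaces are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def function_worker(kv, X_shard, y_shard, alpha, n_steps, batch_size):
    """One stateless function: pull theta from KV storage, run mini-batch
    SGD on its own shard of the input samples, push theta back."""
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        theta = kv["theta"]                       # fetch current parameters
        idx = rng.choice(len(X_shard), batch_size, replace=False)
        Xb, yb = X_shard[idx], y_shard[idx]
        h = sigmoid(Xb @ theta)
        kv["theta"] = theta + (alpha / batch_size) * (Xb.T @ (yb - h))
    return kv["theta"]
```

The key design point is that the function itself holds no state across invocations; every read and write of $\theta$ goes through the shared store.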

SLIDE 17

ML Training on Lambda

(Diagram: input samples and KV storage connected to parallel functions.)

SLIDE 18

Toy Example

  • AWS Lambda
    • 20 functions
    • 150 functions
    • X functions (dynamic # of func.)
    • S3 storage
  • EC2 c5.2xlarge
    • 8 CPUs, 16GB mem
    • Local storage
  • Workload
    • A logistic regression model
SLIDE 19

Toy Example

  • Loss value vs. training time
  • Loss value vs. monetary cost

(Plots: loss value vs. time (s) and loss value vs. cost ($) for 20 functions, 150 functions, X functions, and 8-core EC2.)

SLIDE 20

Toy Example

Slowest, not cheap. Fastest, expensive. Fast, cheap.

X functions:

  • The first epoch: 120 functions
  • The last epoch: 10 functions
  • Intermediate epochs: 20 functions

SLIDE 21

Challenges

  • Functions on Serverless
    • Limitations on performance and deployment
  • Dynamic Resource Provisioning
    • Speed vs. cost (given a budget, how fast can training be?)
SLIDE 22

Siren

  • Hybrid Synchronous Parallel (HSP)
  • Experience-Driven Resource Scheduler
SLIDE 23

Architecture

(Architecture diagram: the local client contains the user-defined model, API libs, DRL agent, scheduler, and function manager; stateless functions run in the cloud; the resource scheme (Step 1), function status (Step 3), actions, states, and the code package flow between them.)

SLIDE 24

Enforce Parallelism on Siren

SLIDE 25

Synchronous or Asynchronous

Synchronous training Asynchronous training

SLIDE 26

Hybrid Synchronous Parallel (HSP)

(Timeline: within epoch t, each function f_{t,i} fetches its input mini-batch, computes, and updates parameters; epoch t+1 begins only after the epoch boundary.)
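One way to read HSP: functions proceed through their mini-batches asynchronously within an epoch, and coordination happens only at the epoch boundary. A minimal threading sketch of that barrier structure (the class name and interface are my own, not the paper's API):

```python
import threading

class HSP:
    """Hybrid Synchronous Parallel sketch: no coordination within an epoch,
    a synchronization barrier between epochs."""
    def __init__(self, n_workers):
        self._barrier = threading.Barrier(n_workers)

    def mini_batch_done(self):
        pass  # within an epoch: asynchronous, no waiting

    def epoch_done(self):
        self._barrier.wait()  # epoch boundary: all workers synchronize
```

This sits between fully synchronous training (a barrier per mini-batch) and fully asynchronous training (no barrier at all).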

SLIDE 27

Experience-Driven Scheduler

SLIDE 28

Toy Example - Find the X

Slowest, not cheap. Fastest, expensive. Fast, cheap.

SLIDE 29

Deep Reinforcement Learning

(Diagram: the agent observes state s_{t−1} and features from the environment of stateless functions, samples action a_t from the policy π(a_t | s_{t−1}, θ) with policy parameters θ, and receives reward r_t.)
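The agent-environment loop in the diagram can be written as a generic sketch (the `env`/`agent` interfaces here are hypothetical stand-ins, not Siren's API):

```python
def run_episode(env, agent, T):
    """Agent-environment loop: observe state s_{t-1}, sample
    a_t ~ pi(a_t | s_{t-1}, theta), receive reward r_t, repeat."""
    s = env.reset()
    trajectory = []
    for _ in range(T):
        a = agent.act(s)           # sample from the policy pi(a | s, theta)
        s_next, r = env.step(a)    # environment: the stateless functions
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```

Here the "environment" is the serverless training job itself: an action provisions functions for the next epoch, and the reward reflects that epoch's cost and progress.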

SLIDE 30

State

$s_t = (t, \ell_t, P_t, P^F_t, P^C_t, P^U_t, u_t, w_t, b_t)$

SLIDE 31

Action

Approximating with a Gaussian distribution:

$a_t = (n_t, m_t), \quad n_t, m_t \in \mathbb{Z}^+$

$\pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2}\right)$

$n_t \times m_t$ choices: ~138,000 actions on AWS

Policy

$\pi(a_t \mid s_{t-1}, \theta)$
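Since $n_t$ and $m_t$ must be positive integers while the Gaussian is continuous, a sampled action has to be rounded and clipped before it can be enacted; a sketch of that step (function name mine):

```python
import numpy as np

def sample_action(mu, sigma, rng):
    """Sample a_t = (n_t, m_t) from N(mu, sigma^2) per dimension, then
    round to the nearest positive integers (n_t, m_t in Z+)."""
    a = rng.normal(mu, sigma)
    return np.maximum(1, np.rint(a)).astype(int)
```

Parameterizing a continuous distribution and discretizing the sample avoids a softmax over the ~138,000 discrete actions.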
SLIDE 32

Reward

At each epoch $t$:

$r_t = -\beta P_t, \quad t = 1, \ldots, T-1$

At the final epoch $T$: a constant serves as the final reward/penalty (a regularizer): a reward if the expected loss value is reached, a penalty if all the budget is used up.
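This reward shaping can be sketched as follows (the final-epoch constant is an illustrative placeholder, not the paper's value):

```python
def reward(t, T, P_t, beta, reached_loss=False, bonus=100.0):
    """r_t = -beta * P_t for intermediate epochs; at the final epoch T a
    constant `bonus` is paid if the expected loss was reached, and its
    negative charged if the budget ran out first (illustrative constant)."""
    if t < T:
        return -beta * P_t
    return bonus if reached_loss else -bonus
```

Penalizing the per-epoch cost $P_t$ pushes the agent toward cheaper provisioning, while the terminal term ties the episode to the training objective.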
SLIDE 33

Training

Maximize the cumulative discounted reward:

$\max \sum_{t=1}^{T} \gamma^t r_t, \quad \gamma \in (0, 1]$

Policy gradient:

$\nabla_\theta \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \gamma^t r_t\right] = \mathbb{E}_\pi\!\left[\nabla_\theta \ln \pi(a \mid s, \theta)\, q_\pi(s, a)\right]$

where $\pi(a \mid s, \theta)$ is the policy function, $q_\pi(s, a)$ is the expected reward with $s$ and $a$, and $\gamma$ is the discount factor.
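With a Gaussian policy (as on Slide 31) simplified to a linear mean $\mu(s, \theta) = s^\top \theta$ and a fixed $\sigma$, the score function is $\nabla_\theta \ln \pi = (a - \mu)\,s / \sigma^2$, giving a minimal REINFORCE-style sketch (my simplification, not Siren's exact update):

```python
import numpy as np

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=1e-3, sigma=1.0):
    """One policy-gradient update: theta <- theta + lr * sum_t G_t * grad ln pi.
    Gaussian policy with linear mean mu = s @ theta and fixed sigma."""
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(list(zip(states, actions, rewards))):
        G = r + gamma * G                            # discounted return from step t
        grad += G * (a - s @ theta) / sigma**2 * s   # score-function gradient
    return theta + lr * grad
```

Walking the trajectory in reverse lets the discounted return $G_t$ be accumulated in one pass.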

SLIDE 34

DRL

(Recap of the diagram from Slide 29: agent, environment of stateless functions, state s_{t−1}, action a_t, reward r_t, policy π(a_t | s_{t−1}, θ), policy parameters θ.)

SLIDE 35

Workflow

(Workflow diagram: the local client holds the user-defined model, API libs, DRL agent, scheduler, and function manager; stateless functions run in the cloud; Steps 1-5 exchange the resource scheme, code package, function status, states, and action between client and cloud.)

SLIDE 36

Evaluation

  • Simulation: OpenAI Gym
  • Testbed: AWS Lambda + AWS EC2
  • 44.3% ⬇ on job completion time
SLIDE 37

Simulation - overview

  • Workload: mini-batched SGD algorithms
  • Goal: DRL agent vs. grid search (# of functions)
SLIDE 38

Simulation - grid search

(Plots: total rewards vs. number of functions for grid-search variants GS-50/100/200/300; completion time vs. budget for GS and Siren, with annotated savings of 10.03%, 12.87%, and 36%.)

SLIDE 39

Simulation

SLIDE 40

Simulation - DRL training

(Plots: number of functions per ML-training epoch for Siren-100/200/300; total rewards vs. DRL training iteration for Siren-300.)

SLIDE 41

Testbed

  • Siren on AWS Lambda vs. MXNet on EC2
    • m4.large: 2 vCPUs, 8GB memory, $0.1/hr
    • m4.xlarge: 4 vCPUs, 16GB memory, $0.2/hr
    • m4.2xlarge: 8 vCPUs, 32GB memory, $0.4/hr
  • Workload
    • LeNet on MNIST
    • CNN on movie reviews
    • Linear classification on a click-through prediction dataset
SLIDE 42

Testbed - Siren and EC2 on LeNet

(Plots: cost ($) vs. time (s) for Siren and EC2 types m4.large/m4.xlarge/m4.2xlarge on LeNet; number of EC2 instances compared with Siren.)

SLIDE 43

Testbed - DRL training

(Plots: number of functions and memory (MB) per LeNet training epoch; total rewards vs. DRL training iteration.)

SLIDE 44

Testbed - time vs. cost

(Plot: time (s) vs. cost ($) for EC2 and Siren.)

SLIDE 45

Testbed - given the same cost

(Plots: completion time (s) of m4.2xlarge vs. Siren at the same cost, for LeNet, CNN, and linear classification.)

SLIDE 46

Conclusion

  • Siren: Distributed Machine Learning with a Serverless Architecture
  • Hybrid Synchronous Parallel (HSP)
  • Experience-Driven Resource Scheduler
  • Evaluation
  • Simulation & Testbed
  • 44.3% ⬇ on job completion time
SLIDE 47

Q&A Thank You