Distributed Machine Learning with a Serverless Architecture - Hao Wang - PowerPoint PPT Presentation




SLIDE 1

Distributed Machine Learning with a Serverless Architecture

Hao Wang1, Di Niu2, Baochun Li1

1University of Toronto, 2University of Alberta

INFOCOM’19, Paris, FR

SLIDE 2

SLIDE 3

What is machine learning?

SLIDE 4

Deep Learning

SLIDE 5

Machine Learning


Gradients Numerical optimization

SLIDE 6

ML Workflow

(Workflow diagram: stages Model Design, Model Tuning, and Training & Evaluation; inputs include the Objective, Data, and Datasets; considerations include convergence, loss rate, Resource Reservation, and Budget.)

SLIDE 7

Our Key Insights

  • Most current ML training jobs are data parallel
  • Model quality and resource investment have a nonlinear relation
  • ML training is inevitably a trial-and-error process


SLIDE 8

Distributed ML Infrastructure

             IaaS                               PaaS
Pricing      Per hour                           Per hour
Maintenance  By users                           By providers
Examples     AWS EC2, Google Cloud Compute …    Azure ML Studio, Google Cloud ML Engine …

SLIDE 9

Serverless?

             IaaS        PaaS               Serverless
Pricing      Per hour    Per hour           Per call
Maintenance  By users    By providers       By providers
Examples     AWS EC2     Azure ML Studio    AWS Lambda

SLIDE 10

Serverless Computing?

Stateless function

  • Only input and output, no intermediate states
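A stateless function can be sketched as a minimal AWS Lambda-style handler in Python (the `handler(event, context)` signature is Lambda's standard Python entry point; the body is an illustrative example of mine): all input arrives in `event`, all output leaves in the return value, and nothing survives between calls.

```python
def handler(event, context):
    """Stateless: no module-level or instance state survives between calls.
    All input arrives in `event`; all output is the return value."""
    x = event["x"]
    return {"y": x * x}
```

In practice, any intermediate state a training job needs (e.g. model parameters) must therefore live in external storage rather than in the function itself.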


SLIDE 11

Go Serverless?

Pro:

  • 1. Flexible concurrency
  • 2. Instant response
  • 3. Easy to deploy
  • 4. Cheap? (billed by Runtime × MemSize)

Con:

  • 1. Execution model is too simple
  • 2. Runtime limitations (~15min)
  • 3. Communication overhead
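The "Runtime × MemSize" billing above can be made concrete with a small cost estimate (a sketch; the per-GB-second rate below is illustrative and should be checked against current AWS Lambda pricing):

```python
def lambda_cost(n_functions, runtime_s, mem_mb, price_per_gb_s=0.0000166667):
    """Approximate serverless bill: functions x runtime x memory x rate.
    `price_per_gb_s` is an illustrative per-GB-second rate, not a quote."""
    gb_seconds = n_functions * runtime_s * (mem_mb / 1024.0)
    return gb_seconds * price_per_gb_s

# e.g. 150 functions running 100 s each at 1024 MB
cost = lambda_cost(150, 100, 1024)
```

Because the bill scales with concurrency, runtime, and memory together, "cheap" depends entirely on how many functions the job launches and for how long.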


SLIDE 12

λ

ML Training on Serverless?

  • MapReduce on Serverless Cloud (PyWren, [SoCC’17])
  • Video processing on Serverless Cloud (Sprocket [SoCC’18])
SLIDE 13

Stochastic Gradient Descent (SGD)

Input Samples

$\theta_j = \theta_j + \alpha\,(y_i - h_\theta(x_i))\,x_{i,j}$


SLIDE 14

Mini-batch SGD

Input Samples

$\theta_j = \theta_j + \frac{\alpha}{b} \sum_{k=i}^{i+b-1} (y_k - h_\theta(x_k))\,x_{k,j}$
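For a logistic-regression hypothesis $h_\theta$ (the workload used later in the toy example), the mini-batch update above can be sketched in NumPy (function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd_step(theta, X, y, alpha):
    """theta_j <- theta_j + (alpha/b) * sum_k (y_k - h_theta(x_k)) * x_{k,j},
    where X holds one mini-batch of b samples (rows) and y their labels."""
    b = X.shape[0]
    h = sigmoid(X @ theta)   # h_theta(x_k) for every sample in the batch
    return theta + (alpha / b) * (X.T @ (y - h))
```

The whole mini-batch is vectorized: `X.T @ (y - h)` computes the sum over $k$ for every coordinate $j$ at once.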


SLIDE 15

Parameter Server

  • Model replicas on workers
  • Servers update parameters

Li, Mu, et al. Scaling distributed machine learning with the parameter server. OSDI'14

SLIDE 16

SGD on Lambda

$\theta_j = \theta_j + \frac{\alpha}{b} \sum_{k=i}^{i+b-1} (y_k - h_\theta(x_k))\,x_{k,j}$ (computed independently on each function)

(Diagram: input samples are partitioned across parallel stateless functions, which read and write parameters through the KV storage.)
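The per-function loop sketched in the diagram can be written as below; a plain dict stands in for the KV storage (S3 in the talk), and all names and interfaces are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def function_worker(kv, X_shard, y_shard, alpha, n_steps, batch_size):
    """One stateless function: pull theta from KV storage, run mini-batch
    SGD on its own shard of the input samples, push theta back."""
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        theta = kv["theta"]                       # fetch current parameters
        idx = rng.choice(len(X_shard), batch_size, replace=False)
        Xb, yb = X_shard[idx], y_shard[idx]
        h = sigmoid(Xb @ theta)
        kv["theta"] = theta + (alpha / batch_size) * (Xb.T @ (yb - h))
    return kv["theta"]
```

The key design point is that the function itself holds no state across invocations; every read and write of $\theta$ goes through the shared store.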

SLIDE 17

ML Training on Lambda

(Diagram: input samples and KV storage connected to parallel functions.)

SLIDE 18

Toy Example

  • AWS Lambda
    • 20 functions
    • 150 functions
    • X functions (dynamic # of func.)
    • S3 storage
  • EC2 c5.2xlarge
    • 8 CPUs, 16GB mem
    • Local storage
  • Workload
    • A logistic regression model
SLIDE 19

Toy Example

  • Loss value vs. training time
  • Loss value vs. monetary cost

(Plots: loss value vs. time (s) and loss value vs. cost ($) for 20 functions, 150 functions, X functions, and 8-core EC2.)

SLIDE 20

Toy Example

Slowest, not cheap. Fastest, expensive. Fast, cheap.

X functions:

  • The first epoch: 120 functions
  • The last epoch: 10 functions
  • Intermediate epochs: 20 functions

SLIDE 21

Challenges

  • Functions on Serverless
    • Limitations on performance and deployment
  • Dynamic Resource Provisioning
    • Speed vs. cost (given a budget, how fast can training be?)
SLIDE 22

Siren

  • Hybrid Synchronous Parallel (HSP)
  • Experience-Driven Resource Scheduler
SLIDE 23

Architecture

(Architecture diagram: the local client contains the user-defined model, API libs, DRL agent, scheduler, and function manager; stateless functions run in the cloud; the resource scheme (Step 1), function status (Step 3), actions, states, and the code package flow between them.)

SLIDE 24

Enforce Parallelism on Siren

SLIDE 25

Synchronous or Asynchronous

Synchronous training Asynchronous training

SLIDE 26

Hybrid Synchronous Parallel (HSP)

(Timeline: within epoch t, each function f_{t,i} fetches its input mini-batch, computes, and updates parameters; epoch t+1 begins only after the epoch boundary.)
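One way to read HSP: functions proceed through their mini-batches asynchronously within an epoch, and coordination happens only at the epoch boundary. A minimal threading sketch of that barrier structure (the class name and interface are my own, not the paper's API):

```python
import threading

class HSP:
    """Hybrid Synchronous Parallel sketch: no coordination within an epoch,
    a synchronization barrier between epochs."""
    def __init__(self, n_workers):
        self._barrier = threading.Barrier(n_workers)

    def mini_batch_done(self):
        pass  # within an epoch: asynchronous, no waiting

    def epoch_done(self):
        self._barrier.wait()  # epoch boundary: all workers synchronize
```

This sits between fully synchronous training (a barrier per mini-batch) and fully asynchronous training (no barrier at all).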

SLIDE 27

Experience-Driven Scheduler

SLIDE 28

Toy Example - Find the X

Slowest, not cheap. Fastest, expensive. Fast, cheap.

SLIDE 29

Deep Reinforcement Learning

(Diagram: the agent observes state s_{t−1} and features from the environment of stateless functions, samples action a_t from the policy π(a_t | s_{t−1}, θ) with policy parameters θ, and receives reward r_t.)
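The agent-environment loop in the diagram can be written as a generic sketch (the `env`/`agent` interfaces here are hypothetical stand-ins, not Siren's API):

```python
def run_episode(env, agent, T):
    """Agent-environment loop: observe state s_{t-1}, sample
    a_t ~ pi(a_t | s_{t-1}, theta), receive reward r_t, repeat."""
    s = env.reset()
    trajectory = []
    for _ in range(T):
        a = agent.act(s)           # sample from the policy pi(a | s, theta)
        s_next, r = env.step(a)    # environment: the stateless functions
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```

Here the "environment" is the serverless training job itself: an action provisions functions for the next epoch, and the reward reflects that epoch's cost and progress.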

SLIDE 30

State

$s_t = (t, \ell_t, P_t, P^F_t, P^C_t, P^U_t, u_t, w_t, b_t)$

SLIDE 31

Action

Approximating with a Gaussian distribution:

$a_t = (n_t, m_t), \quad n_t, m_t \in \mathbb{Z}^+$

$\pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2}\right)$

$n_t \times m_t$ choices: ~138,000 actions on AWS

Policy

$\pi(a_t \mid s_{t-1}, \theta)$
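Since $n_t$ and $m_t$ must be positive integers while the Gaussian is continuous, a sampled action has to be rounded and clipped before it can be enacted; a sketch of that step (function name mine):

```python
import numpy as np

def sample_action(mu, sigma, rng):
    """Sample a_t = (n_t, m_t) from N(mu, sigma^2) per dimension, then
    round to the nearest positive integers (n_t, m_t in Z+)."""
    a = rng.normal(mu, sigma)
    return np.maximum(1, np.rint(a)).astype(int)
```

Parameterizing a continuous distribution and discretizing the sample avoids a softmax over the ~138,000 discrete actions.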
SLIDE 32

Reward

At each epoch $t$:

$r_t = -\beta P_t, \quad t = 1, \ldots, T-1$

At the final epoch $T$: a constant serves as the final reward/penalty (a regularizer): a reward if the expected loss value is reached, a penalty if all the budget is used up.
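This reward shaping can be sketched as follows (the final-epoch constant is an illustrative placeholder, not the paper's value):

```python
def reward(t, T, P_t, beta, reached_loss=False, bonus=100.0):
    """r_t = -beta * P_t for intermediate epochs; at the final epoch T a
    constant `bonus` is paid if the expected loss was reached, and its
    negative charged if the budget ran out first (illustrative constant)."""
    if t < T:
        return -beta * P_t
    return bonus if reached_loss else -bonus
```

Penalizing the per-epoch cost $P_t$ pushes the agent toward cheaper provisioning, while the terminal term ties the episode to the training objective.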
SLIDE 33

Training

Maximize the cumulative discounted reward:

$\max \sum_{t=1}^{T} \gamma^t r_t, \quad \gamma \in (0, 1]$

Policy gradient:

$\nabla_\theta \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \gamma^t r_t\right] = \mathbb{E}_\pi\!\left[\nabla_\theta \ln \pi(a \mid s, \theta)\, q_\pi(s, a)\right]$

where $\pi(a \mid s, \theta)$ is the policy function, $q_\pi(s, a)$ is the expected reward with $s$ and $a$, and $\gamma$ is the discount factor.
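With a Gaussian policy (as on Slide 31) simplified to a linear mean $\mu(s, \theta) = s^\top \theta$ and a fixed $\sigma$, the score function is $\nabla_\theta \ln \pi = (a - \mu)\,s / \sigma^2$, giving a minimal REINFORCE-style sketch (my simplification, not Siren's exact update):

```python
import numpy as np

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=1e-3, sigma=1.0):
    """One policy-gradient update: theta <- theta + lr * sum_t G_t * grad ln pi.
    Gaussian policy with linear mean mu = s @ theta and fixed sigma."""
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(list(zip(states, actions, rewards))):
        G = r + gamma * G                            # discounted return from step t
        grad += G * (a - s @ theta) / sigma**2 * s   # score-function gradient
    return theta + lr * grad
```

Walking the trajectory in reverse lets the discounted return $G_t$ be accumulated in one pass.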

SLIDE 34

DRL

(Recap of the diagram from Slide 29: agent, environment of stateless functions, state s_{t−1}, action a_t, reward r_t, policy π(a_t | s_{t−1}, θ), policy parameters θ.)

SLIDE 35

Workflow

(Workflow diagram: the local client holds the user-defined model, API libs, DRL agent, scheduler, and function manager; stateless functions run in the cloud; Steps 1-5 exchange the resource scheme, code package, function status, states, and action between client and cloud.)

SLIDE 36

Evaluation

  • Simulation: OpenAI Gym
  • Testbed: AWS Lambda + AWS EC2
  • 44.3% ⬇ on job completion time
SLIDE 37

Simulation - overview

  • Workload: mini-batched SGD algorithms
  • Goal: DRL agent vs. grid search (# of functions)
SLIDE 38

Simulation - grid search

(Plots: total rewards vs. number of functions for grid-search variants GS-50/100/200/300; completion time vs. budget for GS and Siren, with annotated savings of 10.03%, 12.87%, and 36%.)

SLIDE 39

Simulation

SLIDE 40

Simulation - DRL training

(Plots: number of functions per ML-training epoch for Siren-100/200/300; total rewards vs. DRL training iteration for Siren-300.)

SLIDE 41

Testbed

  • Siren on AWS Lambda vs. MXNet on EC2
    • m4.large: 2 vCPUs, 8GB memory, $0.1/hr
    • m4.xlarge: 4 vCPUs, 16GB memory, $0.2/hr
    • m4.2xlarge: 8 vCPUs, 32GB memory, $0.4/hr
  • Workload
    • LeNet on MNIST
    • CNN on movie reviews
    • Linear classification on a click-through prediction dataset
SLIDE 42

Testbed - Siren and EC2 on LeNet

(Plots: cost ($) vs. time (s) for Siren and EC2 types m4.large/m4.xlarge/m4.2xlarge on LeNet; number of EC2 instances compared with Siren.)

SLIDE 43

Testbed - DRL training

(Plots: number of functions and memory (MB) per LeNet training epoch; total rewards vs. DRL training iteration.)

SLIDE 44

Testbed - time vs. cost

(Plot: time (s) vs. cost ($) for EC2 and Siren.)

SLIDE 45

Testbed - given the same cost

(Plots: completion time (s) of m4.2xlarge vs. Siren at the same cost, for LeNet, CNN, and linear classification.)

SLIDE 46

Conclusion

  • Siren: Distributed Machine Learning with a Serverless Architecture
  • Hybrid Synchronous Parallel (HSP)
  • Experience-Driven Resource Scheduler
  • Evaluation
  • Simulation & Testbed
  • 44.3% ⬇ on job completion time
SLIDE 47

Q&A Thank You