SLIDE 1

Towards GPU Utilization Prediction for Cloud Deep Learning

Gingfung Yeung, Damian Borowiec, Adrian Friday, Richard Harper, Peter Garraghan

Evolving Distributed System Lab School of Computing & Communications Lancaster University UK

2020 USENIX HotCloud

SLIDE 2

Deep Learning (DL) Systems

  • More Deep Learning (DL) workloads
  • Growing number of expensive GPUs
  • More machine-learning engineers, researchers, and users

Require efficient resource usage & high DL performance

SLIDE 3

DL System Challenges

  • Avg. GPU utilization: ~52% in production systems [Jeon et al. ’19]
  • Long job completion + queue times: up to hours [Jeon et al. ’19; Gu et al. ’19]

Addressed via understanding and exploiting workload patterns

SLIDE 4

Online profiling approach

[Diagram: a workload deployed on a node with GPU-1 and GPU-2; the Resource Monitor queries the node and returns a response profile.]

Response profile: GPU-1 {Utilization = 20, Memory = 4GiB, Bytes…}; GPU-2 {Utilization = 40, Memory = 6GiB, Bytes…}

Workloads are deployed onto isolated machines and GPUs to obtain their workload patterns. Per-workload profiling usually ranges from minutes to hours.

SLIDE 5

DL Metrics

  • Iteration time: useful for scale-out workers, migration, SLA-aware inference [Peng et al. ’18; Xiao et al. ’18; Shen et al. ’19]
  • Network I/O: useful for efficient distributed training [Gu et al. ’19]
  • GPU utilization: for packing and calculating interference [Thinakaran et al. ’19; Xu et al. ’19]

SLIDE 6

Case: Scheduling

[Diagram: a Resource Management Framework in which the Scheduler runs a scheduling loop against the Resource Monitor: 1. query, 2. issue, 3. migrate.]

Decisions are made based on workload patterns from profiling.
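The query/issue/migrate loop can be sketched as follows; this is a minimal illustration, and the function name, job fields, and the 90% utilization threshold are assumptions, not from the talk:

```python
def schedule_step(profiles, queue, placements, threshold=90):
    """One pass of a toy query/issue/migrate scheduling loop.

    profiles:   {gpu: utilization %} as returned by the monitor (1. Query)
    queue:      pending workloads, each {"name": ..., "est_util": ...}
    placements: {gpu: [workload dicts]}, mutated in place
    """
    # 2. Issue: put the next workload on the least-utilized GPU with headroom.
    if queue:
        gpu = min(profiles, key=profiles.get)
        job = queue[0]
        if profiles[gpu] + job["est_util"] <= threshold:
            queue.pop(0)
            placements.setdefault(gpu, []).append(job)
            profiles[gpu] += job["est_util"]
    # 3. Migrate: move workloads off any GPU that exceeds the threshold.
    for gpu in list(profiles):
        while profiles[gpu] > threshold and placements.get(gpu):
            victim = placements[gpu].pop()
            profiles[gpu] -= victim["est_util"]
            target = min(profiles, key=profiles.get)
            if target == gpu:  # nowhere better to go; put it back
                placements[gpu].append(victim)
                profiles[gpu] += victim["est_util"]
                break
            placements.setdefault(target, []).append(victim)
            profiles[target] += victim["est_util"]
    return placements
```

The point of the sketch is the structure of the loop, not the placement policy: every decision consumes a profile, which is why profiling latency sits on the critical path.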

SLIDE 7

Time is Money

[Diagram: the Workload Queue feeds a Profiling Stage (minutes per workload, so N workloads × minutes) ahead of the Scheduling Stage.]

If the system has many heterogeneous workloads, per-workload profiling leads to head-of-line blocking.

SLIDE 8

Online Profiling

  • Pros
  • Accurate, near real-time workload patterns
  • Provides insights to the system
  • Cons
  • Heterogeneous workloads require different profiles
  • Time-consuming (~minutes to ~hours)
  • Requires modifying underlying frameworks
SLIDE 9

Online Profiling

  • Pros
  • Accurate, near real-time workload patterns
  • Provides insights to the system
  • Cons
  • Heterogeneous workloads require different profiles
  • Time-consuming (~minutes to ~hours)
  • Requires actual execution on an isolated machine
  • Requires modifying underlying frameworks

Can we obtain workload patterns prior to execution?

SLIDE 10

Prediction

[Diagram: the Workload Queue feeds a Prediction Stage (sub-second to seconds per workload, so N workloads × seconds) ahead of the Scheduling Stage.]

Prediction reduces head-of-line blocking.
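The gap between the two stages is easy to quantify; the per-workload costs below are illustrative assumptions (5 minutes per online profile vs. half a second per prediction), not measurements from the paper:

```python
def stage_delay(n_workloads, seconds_per_workload):
    """Total serial time spent characterizing a queue of workloads."""
    return n_workloads * seconds_per_workload

# Assumed costs for a queue of 100 workloads:
profiling_s = stage_delay(100, 5 * 60)   # online profiling at 5 min each
prediction_s = stage_delay(100, 0.5)     # prediction at 0.5 s each
```

Under these assumptions the profiling stage costs hours while the prediction stage costs under a minute, which is the head-of-line-blocking argument in miniature.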


SLIDE 13

Objective

GPU utilization prediction engine for Cloud DL Systems: analysis, prediction model, case study.

Benefits

  • Estimates GPU utilization of unseen workloads, prior to execution
  • No modification of existing DL frameworks (e.g. PyTorch, TensorFlow, MXNet…)

SLIDE 14

Going deeper with convolutions [Szegedy et al. 2014]

Leverage graph information to predict workload usage.

g(y) → z. Features y: num. convs, FLOPs, layers, etc. (see paper for the full feature list)

DL computation graph
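As a sketch of the idea, g maps a static feature vector of the computation graph to a utilization estimate. The feature names, the linear form, and the weights below are illustrative placeholders; the paper uses its own feature set and a learned predictor:

```python
def graph_features(model_desc):
    """Build a feature vector from a static description of a DL graph.
    The chosen features (conv count, GFLOPs, depth, batch size) are
    illustrative stand-ins for the paper's full feature list."""
    return {
        "num_convs": model_desc["num_convs"],
        "gflops": model_desc["gflops"],
        "num_layers": model_desc["num_layers"],
        "batch_size": model_desc["batch_size"],
    }

def predict_utilization(feats, weights, bias=5.0):
    """Toy linear g(y) -> z, clamped to a valid utilization percentage."""
    z = bias + sum(weights[k] * v for k, v in feats.items())
    return max(0.0, min(100.0, z))
```

The key property is that everything here is available before execution: no GPU, no isolated machine, no framework modification.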

SLIDE 15

Analysis

  • Profile DL workload utilization
  • Determine important model features
  • Set-up
  • Nvidia 1080, Nvidia 2080, Intel i7-6850K
  • 13 DNN model architectures, 81 workloads
  • Tools
  • nvidia-smi
  • Nvidia Nsight Systems

See paper for full list of models and permutations.
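Utilization samples of this kind can be collected with `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv`; a small parser for that CSV output (the sample string in the test is illustrative, and the exact field list is up to the caller):

```python
def parse_nvidia_smi_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into a list
    of {column: numeric value} dicts, one per GPU."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split(",")]
    rows = []
    for line in lines[1:]:
        values = [v.strip() for v in line.split(",")]
        row = {}
        for key, val in zip(header, values):
            # Drop trailing units such as "%" or "MiB", keep the number.
            row[key] = float(val.split()[0])
        rows.append(row)
    return rows
```

In practice one would run the command in a sampling loop and feed the parsed rows to the resource monitor.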

SLIDE 16

Analysis

[Figure: GPU utilization (%) as a function of GFLOPs, for CNN and RNN workloads.]

SLIDE 17

Analysis

[Figure: normalized JCT increase (1x-5x) vs. summative GPU utilization (%) under co-location, on Nvidia 1080 and 2080 with batch sizes 16, 64, and 128.]

1.5x – 4x slowdown from co-location
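One crude way to read this finding into a scheduler is a piecewise interference model; the 100% capacity threshold and the linear interpolation into the observed 1.5x-4x range are assumptions for illustration, not the paper's model:

```python
def colocation_slowdown(util_a, util_b):
    """Toy interference model for two co-located workloads: no slowdown
    while summed utilization fits in the GPU, then a linear ramp through
    the observed 1.5x-4x range as overcommitment grows."""
    total = util_a + util_b
    if total <= 100:
        return 1.0                       # enough headroom, little contention
    over = min(total - 100, 100) / 100   # 0..1 degree of overcommitment
    return 1.5 + over * 2.5              # 1.5x at slight, 4x at full overlap
```

A predictor that estimates each workload's utilization before execution lets a scheduler evaluate this kind of model before co-locating two jobs.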

SLIDE 18

GPU Utilization Prediction

The model is trained with the mean squared logarithmic error over $o$ samples, where $q_j$ and $z_j$ are the predicted and measured GPU utilization:

$\frac{1}{o} \sum_{j=1}^{o} \left( \log(q_j + 1) - \log(z_j + 1) \right)^2$
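A direct implementation of this loss, with `q` and `z` following the slide's notation:

```python
import math

def msle(q, z):
    """Mean squared logarithmic error between predicted (q) and
    measured (z) utilization series of equal length o."""
    assert len(q) == len(z) and len(q) > 0
    return sum((math.log(qj + 1) - math.log(zj + 1)) ** 2
               for qj, zj in zip(q, z)) / len(q)
```

The log transform keeps the loss well-behaved across the wide 0-100% utilization range, so errors on lightly loaded workloads are not drowned out by errors on heavily loaded ones.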

SLIDE 19

Evaluation

[Figure: average cluster GPU utilization (%) over time (minutes) for Slot-based, Reactive, and Proactive scheduling.]

33.5% makespan reduction; 61.5% utilization improvement

SLIDE 20

Open Challenges

  • Hardware
  • Number of processing elements, memory bandwidth, and cache sizes.
  • DL Compilers
  • Extract lower-level IR to determine optimization decisions for more accurate prediction (e.g. op fusion: ConvBatchNorm).
  • Distributed Workloads
  • Network I/O, parallelism strategy, and system configuration (e.g. ring topology).
  • Co-location Scheduling
  • Incorporate prediction and system constraints; derive an optimization algorithm (e.g. Mixed Integer Programming).
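For the last point, a greedy first-fit heuristic gives a feel for what such an optimization would do; this is a stand-in for an actual MIP formulation, which the talk does not specify, and the 100% capacity constraint is an assumption:

```python
def pack_jobs(predicted_utils, num_gpus, capacity=100.0):
    """Greedy packing of jobs onto GPUs by predicted utilization,
    largest first; a simple stand-in for a MIP solver."""
    gpus = [[] for _ in range(num_gpus)]
    loads = [0.0] * num_gpus
    for job, util in sorted(predicted_utils.items(), key=lambda kv: -kv[1]):
        # Place each job on the least-loaded GPU that still has headroom.
        best = min(range(num_gpus), key=loads.__getitem__)
        if loads[best] + util <= capacity:
            gpus[best].append(job)
            loads[best] += util
        # Jobs that fit nowhere stay unscheduled this round.
    return gpus, loads
```

A real formulation would add the co-location interference constraints from the analysis slides rather than treating utilization as purely additive.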