

SLIDE 1

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

SLIDE 2

Deep Learning at a Large Enterprise

Cortana, Speech, Image, Ads, NLP, Web Search, …

  • DL training jobs require large GPU clusters
  • Philly: cluster manager for DL workloads on large shared GPU clusters

Recent cluster managers:

                Optimus [EuroSys '18]   Gandiva [OSDI '18]   Tiresias [NSDI '19]
  Objective     Average JCT             Consolidation        Average JCT
  Scheduler     SRTF                    Time-sharing         Gittins Index

Motivated by observations in Philly

SLIDE 3

Microsoft Philly

Significant increase in scale during 2017

  • 10.5× in DL training jobs
  • 5× in GPU cluster size

Philly cluster manager:
  • Resource scheduling (GPU, network)
  • Storage for data & model ckpt
  • Failure handling
  • Multi-tenancy
  • …

SLIDE 4

Job Lifecycle in Philly

[Diagram: incoming N-GPU DL jobs wait in a job queue; the Philly scheduler and job placement component assigns them to free GPUs on 4-GPU machines in the GPU cluster.]

SLIDE 5

Contributions


  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

SLIDE 6

Contributions


  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

  • 75-day period from Oct. 2017 to Dec. 2017
  • Total of 96,260 jobs across thousands of users

SLIDE 7

Study Details

[Diagram: same scheduler and job-placement figure as Slide 4.]

  • Scheduler logs
    – Job arrival, GPU allocation, finish status
  • HW perf counters
    – GPU, CPU, memory utilization
  • AI engine logs
    – stderr/stdout for executed jobs

Track scheduling decisions and utilization info over each job's lifecycle.
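As a hedged illustration of how these sources could be tied together (the file names, column names, and CSV format below are assumptions, not Philly's actual schema), a minimal pandas sketch that joins scheduler decisions with per-machine GPU counters over each job's lifetime:

```python
import pandas as pd

# Assumed exports of two of the data sources above.
# scheduler_log.csv: one row per (job_id, machine) GPU allocation with start/finish times.
jobs = pd.read_csv("scheduler_log.csv", parse_dates=["started", "finished"])
# gpu_perf_counters.csv: per-minute GPU utilization samples per machine.
util = pd.read_csv("gpu_perf_counters.csv", parse_dates=["timestamp"])

# Attach each utilization sample to the job running on that machine at that time.
samples = util.merge(jobs, on="machine")
samples = samples[(samples.timestamp >= samples.started) &
                  (samples.timestamp <= samples.finished)]

# Utilization observed over each job's lifetime, alongside its scheduling info.
per_job = samples.groupby("job_id").agg(
    gpu_util=("gpu_util", "mean"),
    num_gpus=("num_gpus", "first"),
    status=("finish_status", "first"),
)
print(per_job.describe(include="all"))
```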

SLIDE 8

Contributions


  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

SLIDE 9


Most GPUs in the cluster are allocated

How effectively are the GPUs utilized for DNN training?

SLIDE 10

GPU Utilization for Job Sizes


[Chart: GPU utilization (%) for 1-, 4-, 8-, and 16-GPU jobs; reported utilization is 64.7, 59.2, 51.6, and 44.8 respectively.]

GPU utilization is low, and lower still in distributed training. Two reasons:

  • Distribution across servers
  • Intra-server interference

SLIDE 11

Effect of Distribution on Dedicated Servers


Dedicated servers → no other jobs on the server. Distributed training by itself causes utilization to drop!

SLIDE 12

Scheduling Distributed Training


[Diagram: same scheduler and job-placement figure as Slide 4.]

  • High intra-server locality
    – High communication efficiency
    – Long queueing time
  • Low intra-server locality
    – Low queueing time
    – Contention in the use of the network
    – Risk of intra-server interference (across jobs)

Relaxing locality constraints trades communication efficiency for shorter queueing (see the sketch below).
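A minimal sketch of this locality trade-off, not Philly's actual algorithm (server names, GPU counts, and the 2-hour relaxation threshold are illustrative): the policy tries to keep a job on one machine and spreads it across machines only after it has queued past a threshold.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    free_gpus: int

def place(job_gpus: int, waited_hours: float, servers: list[Server],
          relax_after_hours: float = 2.0) -> list[tuple[str, int]] | None:
    """Return a list of (server, gpus) assignments, or None to keep the job queued."""
    # Preferred: high intra-server locality -> the whole job on one machine.
    for s in servers:
        if s.free_gpus >= job_gpus:
            return [(s.name, job_gpus)]
    # Only after the job has waited long enough, relax locality and spread it
    # across machines (shorter queueing, but network contention / interference risk).
    if waited_hours < relax_after_hours:
        return None
    placement, remaining = [], job_gpus
    for s in sorted(servers, key=lambda s: s.free_gpus, reverse=True):
        if remaining == 0:
            break
        take = min(s.free_gpus, remaining)
        if take > 0:
            placement.append((s.name, take))
            remaining -= take
    return placement if remaining == 0 else None

# Example: an 8-GPU job that has already queued for 3 hours on a cluster of 4-GPU machines.
cluster = [Server("m1", 4), Server("m2", 2), Server("m3", 4)]
print(place(8, waited_hours=3.0, servers=cluster))   # -> [('m1', 4), ('m3', 4)]
```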

SLIDE 13


Failures occur during training

How do job failures affect cluster utilization?

SLIDE 14

Failures Can Reduce Cluster Utilization


A job is unsuccessful if it repeatedly fails between training start and completion, wasting resources.

[Chart: average number of failures per job by job size — 1 GPU: 0.33, 2-4 GPU: 0.98, 5-8 GPU: 1.09, >8 GPU: 1.11.]

That is an average of about one failure per distributed training job.
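A hedged sketch (the per-job table, column names, and bucket edges are assumptions) of how failure counts like the ones in the chart above could be bucketed by job size:

```python
import pandas as pd

# Assumed per-job table: number of GPUs requested and number of observed failures/retries.
jobs = pd.DataFrame({
    "num_gpus":     [1, 1, 2, 4, 8, 16, 32],
    "num_failures": [0, 1, 1, 2, 1,  2,  1],
})

# Bucket jobs into the size ranges used on the slide: 1, 2-4, 5-8, >8 GPUs.
buckets = pd.cut(jobs["num_gpus"], bins=[0, 1, 4, 8, float("inf")],
                 labels=["1 GPU", "2-4 GPU", "5-8 GPU", ">8 GPU"])

# Average number of failures per job within each bucket.
print(jobs.groupby(buckets, observed=True)["num_failures"].mean())
```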

SLIDE 15

Challenge: Failures across Stack


Failures come from multiple layers: the user program, the AI engine, and the infrastructure (e.g., the resource scheduler).

Our study: classify failures into types and quantify their impact on utilization, in order to improve failure handling.

SLIDE 16

Failure Classifier


The classifier matches stderr/stdout against >230 signatures, mapping each failed job to a (signature, failure category) pair and attributing:
  • Who – job & user ID
  • Where – infrastructure? AI engine? user?
  • GPU hours – # of GPUs × time to failure

[Chart: GPU utilization over time for a failed job.]
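To make the classification step concrete, here is a minimal sketch (the regex signatures and field names below are illustrative placeholders; the real classifier uses more than 230 signatures):

```python
import re

# A few illustrative (signature, failure category) pairs.
SIGNATURES = [
    (re.compile(r"CUDA out of memory|CUBLAS_STATUS_ALLOC_FAILED"), "GPU OOM"),
    (re.compile(r"MemoryError|Cannot allocate memory"),            "CPU OOM"),
    (re.compile(r"SyntaxError"),                                   "Syntax error"),
    (re.compile(r"No such file or directory"),                     "Incorrect inputs"),
    (re.compile(r"MPI_ABORT|mpirun .* exited"),                    "MPI runtime failure"),
]

def classify_failure(log_text: str, job_id: str, user_id: str,
                     num_gpus: int, hours_to_failure: float) -> dict:
    """Map a job's stderr/stdout to a failure category and charge it GPU hours."""
    category = "Unknown"
    for pattern, name in SIGNATURES:
        if pattern.search(log_text):
            category = name
            break
    return {
        "who": (job_id, user_id),                  # Who: job & user ID
        "category": category,                      # Which kind of failure
        "gpu_hours": num_gpus * hours_to_failure,  # GPU hours = # of GPUs x time to failure
    }

print(classify_failure("RuntimeError: CUDA out of memory", "job42", "alice", 8, 1.5))
```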

SLIDE 17

Failures in High Frequency


Reason: user errors in code or configuration. These failures are repetitive and appear early.

[Chart: % of total failure occurrences — CPU OOM 31.5, Incorrect inputs 25.4, Semantic error 7.7, Invalid mem access 6.8, Syntax error 2.9, GPU OOM 1.3.]

SLIDE 18

Failures in High Resource Use


Reason: infrastructure failures and semantic errors.

[Chart: % of total GPU hours spent until failure — Incorrect inputs 24.2, Semantic error 17.6, Model ckpt error 16.3, MPI runtime failure 15.3.]

These failures are spread across many layers of the system stack.

SLIDE 19

Contributions


  1. First characterization study of large-scale GPU clusters for DNN training
  2. Study cluster utilization and how effectively GPUs are used
  3. Present lessons for better cluster manager designs

SLIDE 20

Locality vs. Waiting Time

  • Users prefer lower queueing delays
  • But for long-running jobs, the run-time cost of giving up locality can outweigh the initial queueing delay, so waiting can pay off (see the example below)

The scheduler needs to consider: 1) the trade-off between queueing delay and locality-aware scheduling, and 2) incorporating job migration.


Example:

                 Low locality   High locality
  Queueing       0 hours        1 hour
  Run time       24 hours       16 hours
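With these illustrative numbers, waiting for locality wins: the high-locality placement completes in 1 h + 16 h = 17 h of total time, while the low-locality placement takes 0 h + 24 h = 24 h, so one hour of queueing saves seven hours overall.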

SLIDE 21

Job Pre-Run before Scheduling


Recall that these failures stem from user errors in code or configuration. Simple validation before scheduling (e.g., a pre-run) avoids a majority of them.

[Chart: same failure-occurrence breakdown as Slide 17 — CPU OOM 31.5%, Incorrect inputs 25.4%, Semantic error 7.7%, Invalid mem access 6.8%, Syntax error 2.9%, GPU OOM 1.3%.]
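One way such a pre-run could look (a hedged sketch; the command, script name, and two-minute trial window are assumptions, not Philly's implementation): run the submitted program briefly before admitting it to the GPU scheduler, so syntax errors, bad inputs, and early OOMs fail cheaply.

```python
import subprocess

def prerun_ok(train_cmd: list[str], trial_seconds: int = 120) -> bool:
    """Run the job briefly before real scheduling; most of the user errors in the
    chart above (syntax errors, incorrect inputs, OOMs) surface within minutes."""
    try:
        result = subprocess.run(train_cmd, timeout=trial_seconds,
                                capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        return True                    # Survived the trial window -> admit to the cluster.
    return result.returncode == 0      # Exited early: only admit if it exited cleanly.

# Hypothetical usage with a user-submitted training script.
if prerun_ok(["python", "train.py", "--epochs", "1", "--gpus", "1"]):
    print("submit job to the GPU cluster")
else:
    print("reject: fix code/config before consuming cluster GPUs")
```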

SLIDE 22

More in the Paper

  • Job queueing
    – Fair-share delay vs. fragmentation delay
    – Impact of out-of-order scheduling on job queueing
  • Job failures
    – Full classification of failures and detailed statistics
    – How to mitigate failures by proactively analyzing them at runtime
  • Effectiveness of the last epochs
    – Opportunity to skip the final epochs


SLIDE 23

Conclusion

  1. First characterization study of large-scale GPU clusters for DNN training
  2. Inefficiencies come from multiple factors
  3. Lessons on locality-awareness and failure handling

Traces available!  https://github.com/msr-fiddle/philly-traces
