FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks (PowerPoint PPT Presentation)



SLIDE 1

FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks

Hui Guan, Laxmikant Kishor Mokadam, Xipeng Shen, Seung-Hwan Lim, Robert Patton

SLIDE 2

Build an image classifier?

The Deep Neural Network (DNN) training pipeline: storage, pre-processing on the CPU (decoding, rotation, cropping, ...), and training on the GPU.

Hyperparameter tuning covers:

  • # layers
  • # parameters in each layer
  • learning rate scheduling

SLIDE 3

Ensemble Training

  • concurrently train a set of DNNs on a cluster of nodes.

Each model (train model 1, train model 2, ..., train model N) runs its own pipeline of storage, pre-processing, and training.

SLIDE 4

Each pipeline (train model 1, ..., train model N) reads from storage and performs its own pre-processing before training. Pre-processing is redundant across the pipelines.

SLIDE 5

Pittman, Randall, et al. "Exploring flexible communications for streamlining DNN ensemble training pipelines." SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018.

Pittman et al., 2018 eliminate pipeline redundancies in preprocessing through data sharing:

  • Reduce CPU usage by 2-11X
  • Achieve up to 10X speedups with 15% energy consumption

SLIDE 6

Ensemble training with data sharing: a single shared pre-processing stage on the CPU reads from storage and feeds the training of model 1, model 2, ..., model N on the GPUs.

SLIDE 7

With data sharing, the training can go even slower! (Same pipeline as before: shared pre-processing on the CPU feeding the training of model 1 through model N on the GPUs.)

SLIDE 8

Heterogeneous Ensemble: a set of DNNs with different architectures and configurations, exhibiting:

  • varying training rate
  • varying convergence speed

SLIDE 9

Varying training rate. Training rate: the compute throughput of the processing units used for training the DNN. In the example, the three DNNs of the heterogeneous ensemble train at 40, 100, and 100 images/sec.

SLIDE 10

Varying training rate. If a DNN consumes data more slowly (40 images/sec vs. 100 images/sec for the others), the other DNNs have to wait for it before evicting the current set of cached batches; the slow DNN becomes the bottleneck.

SLIDE 11

Varying convergence speed. Due to differences in architectures and hyperparameters, some DNNs converge more slowly than others (e.g., 50 epochs vs. 40 epochs).

SLIDE 12

Varying convergence speed. A subset of DNNs may have already converged while the shared preprocessing has to keep working for the remaining ones, leaving resources under-utilized.

SLIDE 13

Our solution: FLEET

A flexible ensemble training framework for efficiently training a heterogeneous set of DNNs, handling both varying training rate and varying convergence speed, with a 1.12-1.92X speedup.

SLIDE 14

Our solution: FLEET (1.12-1.92X speedup)

Contributions:

  1. Optimal resource allocation

SLIDE 15

Our solution: FLEET (with data-parallel distributed training and checkpointing)

Contributions:

  1. Optimal resource allocation
  2. Greedy allocation algorithm
SLIDE 16

Our solution: FLEET

Contributions:

  1. Optimal resource allocation
  2. Greedy allocation algorithm
  3. A set of techniques to solve challenges in implementing FLEET
SLIDE 17

Focus of This Talk

Contributions:

  1. Optimal resource allocation
  2. Greedy allocation algorithm
  3. A set of techniques to solve challenges in implementing FLEET
SLIDE 18

Resource Allocation Problem

With DNN 1, DNN 2, ..., DNN N sharing a pre-processing stage on the CPU and training on GPUs:

  • Optimal CPU allocation: set the number of pre-processing processes so that it just meets the computing requirements of training the DNNs.
  • What is an optimal GPU allocation?
SLIDE 19

GPU Allocation

Four DNNs (DNN 1 to DNN 4) and four GPUs spread over two nodes (Node 1, Node 2).

SLIDE 20

GPU Allocation: 1 GPU to 1 DNN

With one GPU per DNN, the training rates are 100, 80, 80, and 40 images/sec. With data sharing, the slowest DNN determines the training rate of the ensemble training pipeline.
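The rate math on this slide can be sketched in a few lines (a toy illustration, not FLEET's code):

```python
# Toy illustration (not FLEET's code): with data sharing, all DNNs consume
# the same preprocessed batches, so the pipeline advances at the rate of
# its slowest member.
def pipeline_rate(rates):
    """Effective training rate (images/sec) of a data-sharing pipeline."""
    return min(rates)

# The four single-GPU rates from this slide:
print(pipeline_rate([100, 80, 80, 40]))  # -> 40
```

Even though three of the four DNNs could run at 80-100 images/sec, the shared pipeline crawls at 40.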

SLIDE 21

GPU Allocation: Different GPUs to Different DNNs

Another way to allocate GPUs: only DNN 1 and DNN 4 are trained together with data sharing. DNN 1 uses one GPU (100 images/sec) and DNN 4 uses three GPUs (105 images/sec), which reduces waiting time and increases utilization.

SLIDE 22

GPU Allocation: Different GPUs to Different DNNs

Flotilla: a set of DNNs trained together with data sharing (e.g., DNN 1 and DNN 4).

SLIDE 23

GPU Allocation: Different GPUs to Different DNNs

We need to create a list of flotillas to train all DNNs to convergence, e.g., Flotilla 1 = {DNN 1 (1 GPU), DNN 4 (3 GPUs)}, Flotilla 2 = {DNN 2, DNN 3}, ...

SLIDE 24

Optimal Resource Allocation

Given a set of DNNs to train and a cluster of nodes, find (1) the list of flotillas and (2) the GPU assignments within each flotilla such that the end-to-end ensemble training time is minimized. This problem is NP-hard.

SLIDE 25

Greedy Allocation Algorithm

Dynamically determine the list of flotillas based on (1) whether each DNN has converged and (2) the training rate of each DNN. Once a flotilla is created, derive an optimal GPU assignment for it.

SLIDE 26

Greedy Allocation Algorithm

Profile the training rate of each DNN on m GPUs; then, until the whole DNN ensemble has converged, repeat: create a new flotilla, assign GPUs to the DNNs in the flotilla, and train the DNNs in the flotilla with data sharing, setting converged DNNs aside.
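The loop on this slide can be sketched as follows; the function names and signatures are our own hypothetical placeholders, not FLEET's actual API:

```python
# Hypothetical skeleton of the greedy loop (names are ours, not FLEET's):
# profile each DNN's rate on m GPUs once, then repeatedly create a flotilla,
# assign GPUs, train it with data sharing, and set converged DNNs aside.
def greedy_ensemble_training(dnns, num_gpus, profile,
                             create_flotilla, assign_gpus, train_flotilla):
    rates = profile(dnns, num_gpus)      # rates[dnn][m] = images/sec on m GPUs
    remaining = list(dnns)
    flotillas = []                       # record of the flotillas trained
    while remaining:
        flotilla, gpus_per_dnn = create_flotilla(remaining, rates, num_gpus)
        placement = assign_gpus(flotilla, gpus_per_dnn)
        converged = train_flotilla(flotilla, placement)  # returns converged DNNs
        flotillas.append(flotilla)
        remaining = [d for d in remaining if d not in converged]
    return flotillas
```

With stub callbacks (e.g., a `create_flotilla` that takes the first few un-converged DNNs and a `train_flotilla` that converges everything it trains), the loop terminates once every DNN has been trained.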

SLIDE 27

Greedy Allocation Algorithm: profiling

Profile the training rate of each DNN on m GPUs. Training rates (images/sec) of DNNs on GPUs:

  #GPUs   1    2    3    4
  DNN 1  100  190  270  350
  DNN 2   80  150  220  280
  DNN 3   80  150  200  240
  DNN 4   40   75  105  120

SLIDE 28

Greedy Allocation Algorithm

The repeated phase consists of Step 1: Flotilla Creation, Step 2: GPU Assignment, and Step 3: Model Training.

SLIDE 29

Step 1: Flotilla Creation

  • #1: DNNs in the same flotilla should be able to reach a similar training rate if a proper number of GPUs is assigned to each DNN. This reduces GPU waiting time and avoids inefficiency due to sublinear scaling.
  • #2: Pack into one flotilla as many DNNs as possible, allowing more DNNs to share preprocessing.
SLIDE 30

Step 1: Flotilla Creation

Example with the profiled rates: DNN 1 gets 1 GPU (100 images/sec) and DNN 4 gets 3 GPUs (105 images/sec); the number of available GPUs goes 4 - 1 → 3, then 3 - 3 → 0.
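One plausible way to code this step (a hedged sketch of our reading of the slides, not the paper's exact algorithm): fix a target rate, then give each candidate DNN the smallest GPU count whose profiled rate reaches that target, packing DNNs until the budget runs out.

```python
# Hedged sketch of Step 1 (our heuristic reading of the slides, not FLEET's
# exact algorithm): pick a target rate, then give each candidate the smallest
# number of GPUs whose profiled rate reaches that target, while GPUs remain.
def create_flotilla(candidates, rates, gpu_budget):
    """rates[dnn][m] = profiled images/sec of `dnn` on m GPUs."""
    target = max(rates[d][1] for d in candidates)   # fastest single-GPU rate
    flotilla, gpus_per_dnn, free = [], {}, gpu_budget
    for d in candidates:
        # smallest GPU count whose profiled rate reaches the target, if any
        need = next((m for m in sorted(rates[d]) if rates[d][m] >= target), None)
        if need is not None and need <= free:
            flotilla.append(d)
            gpus_per_dnn[d] = need
            free -= need
    return flotilla, gpus_per_dnn

# The profiled table from the earlier slide; visiting DNN 1 then DNN 4
# reproduces the slide's example: DNN 1 gets 1 GPU, DNN 4 gets 3.
rates = {"DNN1": {1: 100, 2: 190, 3: 270, 4: 350},
         "DNN2": {1: 80, 2: 150, 3: 220, 4: 280},
         "DNN3": {1: 80, 2: 150, 3: 200, 4: 240},
         "DNN4": {1: 40, 2: 75, 3: 105, 4: 120}}
print(create_flotilla(["DNN1", "DNN4", "DNN2", "DNN3"], rates, 4))
# -> (['DNN1', 'DNN4'], {'DNN1': 1, 'DNN4': 3})
```

The visiting order of candidates is our assumption; the slides do not specify how DNN 4 is chosen before DNN 2 and DNN 3.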

SLIDE 31

Step 2: GPU Assignment

  • #1: When assigning multiple GPUs to a DNN, try to use GPUs in the same node.
  • #2: Try to assign DNNs that need a smaller number of GPUs to the same node.

Both principles reduce the variation in communication latency (in the example, DNN 1's single GPU sits on Node 1 while DNN 4's three GPUs occupy the rest of Node 1 and Node 2).
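These two principles resemble first-fit-decreasing bin packing; the sketch below is our hypothetical reading of them, not FLEET's implementation:

```python
# Hedged sketch of Step 2 (our first-fit-decreasing reading of the two
# principles, not FLEET's code): place the largest GPU requests first,
# preferring a single node, and spill across nodes only when unavoidable.
def assign_gpus(gpus_per_dnn, nodes):
    """nodes: {node_id: [gpu_id, ...]}. Returns {dnn: [(node_id, gpu_id), ...]}."""
    free = {n: list(g) for n, g in nodes.items()}
    placement = {}
    for dnn, need in sorted(gpus_per_dnn.items(), key=lambda kv: -kv[1]):
        node = next((n for n, g in free.items() if len(g) >= need), None)
        if node is not None:        # the whole request fits on one node
            placement[dnn] = [(node, free[node].pop()) for _ in range(need)]
        else:                       # fallback: spill across nodes
            placement[dnn] = []
            for n in free:
                while free[n] and len(placement[dnn]) < need:
                    placement[dnn].append((n, free[n].pop()))
    return placement

# Two 2-GPU nodes as on the slide: DNN 4 (3 GPUs) must span nodes, DNN 1 fits.
placement = assign_gpus({"DNN1": 1, "DNN4": 3},
                        {"node1": [0, 1], "node2": [0, 1]})
```

Placing the larger request first keeps as many of its GPUs as possible co-located, which is what matters for data-parallel gradient exchange.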

SLIDE 32

Step 1 (Flotilla Creation) and Step 2 (GPU Assignment) address the varying training rate through data-parallel distributed training.

SLIDE 33

Step 3 (Model Training) addresses the varying convergence speed through checkpointing.

SLIDE 34

Step 3: Model Training

Once a DNN converges, mark it as complete and release its GPUs. Stop training the flotilla once less than 80% of the GPUs remain active for training.
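The early-stop rule on this slide is a one-liner (the function name is ours):

```python
# Sketch of the slide's early-stop rule (function name is ours): stop
# training the flotilla once fewer than 80% of its GPUs are still busy
# training un-converged DNNs.
def should_stop_flotilla(active_gpus, total_gpus, threshold=0.8):
    return active_gpus / total_gpus < threshold

print(should_stop_flotilla(3, 4))  # 3 of 4 GPUs active (75%) -> True
print(should_stop_flotilla(4, 4))  # all GPUs active -> False
```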
SLIDE 35

Step 3: Model Training

Consider only un-converged DNNs when creating the next flotilla.

SLIDE 36

Experiment Settings

  • Heterogeneous ensemble: 100 DNNs derived from DenseNets and ResNets; training rate on a single GPU: 21~176 images/sec.
  • Summit-Dev@ORNL: per node, 2 IBM POWER8 CPUs with 256GB DRAM and 4 NVIDIA Tesla P100 GPUs.
  • Dataset: Caltech256, 30K training images (240-minute limit).
SLIDE 37

Counterparts for Comparisons

  • Baseline: train each DNN on one GPU independently; randomly pick one yet-to-be-trained DNN whenever a GPU is free.
  • Homogeneous Training (Pittman et al., 2018): train each DNN on one GPU with data sharing; when #GPUs < #DNNs, randomly pick a subset of DNNs to train after the previous subset is done.
  • FLEET-G (global paradigm).
  • FLEET-L (local paradigm): train remaining DNNs once some GPUs are released; pick the DNNs to train by the greedy algorithm in FLEET-G.
SLIDE 38

End-to-End Speedups

[Chart: speedup over the baseline (y-axis 0.5-2.5) across (#GPUs, #DNNs) configurations from (20,100) to (160,100), comparing Homogeneous [24], FLEET-L, and FLEET-G.]

Homogeneous: slowdowns are due to other GPUs waiting for the slowest DNN to finish.

SLIDE 39

End-to-End Speedups

[Chart: same configurations as the previous slide.]

FLEET-G achieves the best overall performance, with 1.12-1.92X speedups over the baseline. FLEET-L shows notable but smaller speedups due to less favorable allocation decisions. The overhead of scheduling and checkpointing is at most 0.1% and 6.3%, respectively, of the end-to-end training time in all settings.
SLIDE 40

Conclusions and Future Work

  • Systematically explore strategies for flexible ensemble training of a heterogeneous set of DNNs: optimal resource allocation → greedy allocation algorithm.
  • Software implementation: data-parallel distributed training, dynamic GPU-DNN mappings, checkpointing, data sharing.
  • Future work: apply FLEET to real hyperparameter tuning and neural architecture search workloads.
