FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks (PowerPoint PPT Presentation)



SLIDE 1

FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks

Hui Guan, Laxmikant Kishor Mokadam, Xipeng Shen, Seung-Hwan Lim, Robert Patton

SLIDE 2

Build an image classifier?

The Deep Neural Network (DNN) training pipeline: storage, pre-processing on the CPU (decoding, rotation, cropping, ...), and training on the GPU.

Hyperparameter tuning covers:

  • # layers
  • # parameters in each layer
  • learning rate scheduling

SLIDE 3

Ensemble Training

  • concurrently train a set of DNNs on a cluster of nodes.

Each model (train model 1, train model 2, ..., train model N) runs its own pipeline of storage, pre-processing, and training.

SLIDE 4

Each pipeline (train model 1, ..., train model N) reads from storage and performs its own pre-processing before training. Pre-processing is redundant across the pipelines.

SLIDE 5

Pittman, Randall, et al. "Exploring flexible communications for streamlining DNN ensemble training pipelines." SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018.

Pittman et al., 2018 eliminate pipeline redundancies in preprocessing through data sharing:

  • Reduce CPU usage by 2-11X
  • Achieve up to 10X speedups with 15% energy consumption

SLIDE 6

Ensemble training with data sharing: a single shared pre-processing stage on the CPU reads from storage and feeds the training of model 1, model 2, ..., model N on the GPUs.

SLIDE 7

With data sharing, the training can go even slower! (Same pipeline as before: shared pre-processing on the CPU feeding the training of model 1 through model N on the GPUs.)

SLIDE 8

Heterogeneous Ensemble: a set of DNNs with different architectures and configurations, exhibiting:

  • varying training rate
  • varying convergence speed

SLIDE 9

Varying training rate. Training rate: the compute throughput of the processing units used for training the DNN. In the example, the three DNNs of the heterogeneous ensemble train at 40, 100, and 100 images/sec.

SLIDE 10

Varying training rate. If a DNN consumes data more slowly (40 images/sec vs. 100 images/sec for the others), the other DNNs have to wait for it before evicting the current set of cached batches; the slow DNN becomes the bottleneck.

SLIDE 11

Varying convergence speed. Due to differences in architectures and hyperparameters, some DNNs converge more slowly than others (e.g., 50 epochs vs. 40 epochs).

SLIDE 12

Varying convergence speed. A subset of DNNs may have already converged while the shared preprocessing has to keep working for the remaining ones, leaving resources under-utilized.

SLIDE 13

Our solution: FLEET

A flexible ensemble training framework for efficiently training a heterogeneous set of DNNs, handling both varying training rate and varying convergence speed, with a 1.12-1.92X speedup.

SLIDE 14

Our solution: FLEET (1.12-1.92X speedup)

Contributions:

  1. Optimal resource allocation

SLIDE 15

Our solution: FLEET (with data-parallel distributed training and checkpointing)

Contributions:

  1. Optimal resource allocation
  2. Greedy allocation algorithm
SLIDE 16

Our solution: FLEET

Contributions:

  1. Optimal resource allocation
  2. Greedy allocation algorithm
  3. A set of techniques to solve challenges in implementing FLEET
SLIDE 17

Focus of This Talk

Contributions:

  1. Optimal resource allocation
  2. Greedy allocation algorithm
  3. A set of techniques to solve challenges in implementing FLEET
SLIDE 18

Resource Allocation Problem

With DNN 1, DNN 2, ..., DNN N sharing a pre-processing stage on the CPU and training on GPUs:

  • Optimal CPU allocation: set the number of pre-processing processes so that it just meets the computing requirements of training the DNNs.
  • What is an optimal GPU allocation?
SLIDE 19

GPU Allocation

Four DNNs (DNN 1 to DNN 4) and four GPUs spread over two nodes (Node 1, Node 2).

SLIDE 20

GPU Allocation: 1 GPU to 1 DNN

With one GPU per DNN, the training rates are 100, 80, 80, and 40 images/sec. With data sharing, the slowest DNN determines the training rate of the ensemble training pipeline.
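The rate math on this slide can be sketched in a few lines (a toy illustration, not FLEET's code):

```python
# Toy illustration (not FLEET's code): with data sharing, all DNNs consume
# the same preprocessed batches, so the pipeline advances at the rate of
# its slowest member.
def pipeline_rate(rates):
    """Effective training rate (images/sec) of a data-sharing pipeline."""
    return min(rates)

# The four single-GPU rates from this slide:
print(pipeline_rate([100, 80, 80, 40]))  # -> 40
```

Even though three of the four DNNs could run at 80-100 images/sec, the shared pipeline crawls at 40.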

SLIDE 21

GPU Allocation: Different GPUs to Different DNNs

Another way to allocate GPUs: only DNN 1 and DNN 4 are trained together with data sharing. DNN 1 uses one GPU (100 images/sec) and DNN 4 uses three GPUs (105 images/sec), which reduces waiting time and increases utilization.

SLIDE 22

GPU Allocation: Different GPUs to Different DNNs

Flotilla: a set of DNNs trained together with data sharing (e.g., DNN 1 and DNN 4).

SLIDE 23

GPU Allocation: Different GPUs to Different DNNs

We need to create a list of flotillas to train all DNNs to convergence, e.g., Flotilla 1 = {DNN 1 (1 GPU), DNN 4 (3 GPUs)}, Flotilla 2 = {DNN 2, DNN 3}, ...

SLIDE 24

Optimal Resource Allocation

Given a set of DNNs to train and a cluster of nodes, find (1) the list of flotillas and (2) the GPU assignments within each flotilla such that the end-to-end ensemble training time is minimized. This problem is NP-hard.

SLIDE 25

Greedy Allocation Algorithm

Dynamically determine the list of flotillas based on (1) whether each DNN has converged and (2) the training rate of each DNN. Once a flotilla is created, derive an optimal GPU assignment for it.

SLIDE 26

Greedy Allocation Algorithm

Profile the training rate of each DNN on m GPUs; then, until the whole DNN ensemble has converged, repeat: create a new flotilla, assign GPUs to the DNNs in the flotilla, and train the DNNs in the flotilla with data sharing, setting converged DNNs aside.
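The loop on this slide can be sketched as follows; the function names and signatures are our own hypothetical placeholders, not FLEET's actual API:

```python
# Hypothetical skeleton of the greedy loop (names are ours, not FLEET's):
# profile each DNN's rate on m GPUs once, then repeatedly create a flotilla,
# assign GPUs, train it with data sharing, and set converged DNNs aside.
def greedy_ensemble_training(dnns, num_gpus, profile,
                             create_flotilla, assign_gpus, train_flotilla):
    rates = profile(dnns, num_gpus)      # rates[dnn][m] = images/sec on m GPUs
    remaining = list(dnns)
    flotillas = []                       # record of the flotillas trained
    while remaining:
        flotilla, gpus_per_dnn = create_flotilla(remaining, rates, num_gpus)
        placement = assign_gpus(flotilla, gpus_per_dnn)
        converged = train_flotilla(flotilla, placement)  # returns converged DNNs
        flotillas.append(flotilla)
        remaining = [d for d in remaining if d not in converged]
    return flotillas
```

With stub callbacks (e.g., a `create_flotilla` that takes the first few un-converged DNNs and a `train_flotilla` that converges everything it trains), the loop terminates once every DNN has been trained.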

SLIDE 27

Greedy Allocation Algorithm: profiling

Profile the training rate of each DNN on m GPUs. Training rates (images/sec) of DNNs on GPUs:

  #GPUs   1    2    3    4
  DNN 1  100  190  270  350
  DNN 2   80  150  220  280
  DNN 3   80  150  200  240
  DNN 4   40   75  105  120

SLIDE 28

Greedy Allocation Algorithm

The repeated phase consists of Step 1: Flotilla Creation, Step 2: GPU Assignment, and Step 3: Model Training.

SLIDE 29

Step 1: Flotilla Creation

  • #1: DNNs in the same flotilla should be able to reach a similar training rate if a proper number of GPUs is assigned to each DNN. This reduces GPU waiting time and avoids inefficiency due to sublinear scaling.
  • #2: Pack into one flotilla as many DNNs as possible, allowing more DNNs to share preprocessing.
SLIDE 30

Step 1: Flotilla Creation

Example with the profiled rates: DNN 1 gets 1 GPU (100 images/sec) and DNN 4 gets 3 GPUs (105 images/sec); the number of available GPUs goes 4 - 1 → 3, then 3 - 3 → 0.
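One plausible way to code this step (a hedged sketch of our reading of the slides, not the paper's exact algorithm): fix a target rate, then give each candidate DNN the smallest GPU count whose profiled rate reaches that target, packing DNNs until the budget runs out.

```python
# Hedged sketch of Step 1 (our heuristic reading of the slides, not FLEET's
# exact algorithm): pick a target rate, then give each candidate the smallest
# number of GPUs whose profiled rate reaches that target, while GPUs remain.
def create_flotilla(candidates, rates, gpu_budget):
    """rates[dnn][m] = profiled images/sec of `dnn` on m GPUs."""
    target = max(rates[d][1] for d in candidates)   # fastest single-GPU rate
    flotilla, gpus_per_dnn, free = [], {}, gpu_budget
    for d in candidates:
        # smallest GPU count whose profiled rate reaches the target, if any
        need = next((m for m in sorted(rates[d]) if rates[d][m] >= target), None)
        if need is not None and need <= free:
            flotilla.append(d)
            gpus_per_dnn[d] = need
            free -= need
    return flotilla, gpus_per_dnn

# The profiled table from the earlier slide; visiting DNN 1 then DNN 4
# reproduces the slide's example: DNN 1 gets 1 GPU, DNN 4 gets 3.
rates = {"DNN1": {1: 100, 2: 190, 3: 270, 4: 350},
         "DNN2": {1: 80, 2: 150, 3: 220, 4: 280},
         "DNN3": {1: 80, 2: 150, 3: 200, 4: 240},
         "DNN4": {1: 40, 2: 75, 3: 105, 4: 120}}
print(create_flotilla(["DNN1", "DNN4", "DNN2", "DNN3"], rates, 4))
# -> (['DNN1', 'DNN4'], {'DNN1': 1, 'DNN4': 3})
```

The visiting order of candidates is our assumption; the slides do not specify how DNN 4 is chosen before DNN 2 and DNN 3.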

SLIDE 31

Step 2: GPU Assignment

  • #1: When assigning multiple GPUs to a DNN, try to use GPUs in the same node.
  • #2: Try to assign DNNs that need a smaller number of GPUs to the same node.

Both principles reduce the variation in communication latency (in the example, DNN 1's single GPU sits on Node 1 while DNN 4's three GPUs occupy the rest of Node 1 and Node 2).
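These two principles resemble first-fit-decreasing bin packing; the sketch below is our hypothetical reading of them, not FLEET's implementation:

```python
# Hedged sketch of Step 2 (our first-fit-decreasing reading of the two
# principles, not FLEET's code): place the largest GPU requests first,
# preferring a single node, and spill across nodes only when unavoidable.
def assign_gpus(gpus_per_dnn, nodes):
    """nodes: {node_id: [gpu_id, ...]}. Returns {dnn: [(node_id, gpu_id), ...]}."""
    free = {n: list(g) for n, g in nodes.items()}
    placement = {}
    for dnn, need in sorted(gpus_per_dnn.items(), key=lambda kv: -kv[1]):
        node = next((n for n, g in free.items() if len(g) >= need), None)
        if node is not None:        # the whole request fits on one node
            placement[dnn] = [(node, free[node].pop()) for _ in range(need)]
        else:                       # fallback: spill across nodes
            placement[dnn] = []
            for n in free:
                while free[n] and len(placement[dnn]) < need:
                    placement[dnn].append((n, free[n].pop()))
    return placement

# Two 2-GPU nodes as on the slide: DNN 4 (3 GPUs) must span nodes, DNN 1 fits.
placement = assign_gpus({"DNN1": 1, "DNN4": 3},
                        {"node1": [0, 1], "node2": [0, 1]})
```

Placing the larger request first keeps as many of its GPUs as possible co-located, which is what matters for data-parallel gradient exchange.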

SLIDE 32

Step 1 (Flotilla Creation) and Step 2 (GPU Assignment) address the varying training rate through data-parallel distributed training.

SLIDE 33

Step 3 (Model Training) addresses the varying convergence speed through checkpointing.

SLIDE 34

Step 3: Model Training

Once a DNN converges, mark it as complete and release its GPUs. Stop training the flotilla once less than 80% of the GPUs remain active for training.
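The early-stop rule on this slide is a one-liner (the function name is ours):

```python
# Sketch of the slide's early-stop rule (function name is ours): stop
# training the flotilla once fewer than 80% of its GPUs are still busy
# training un-converged DNNs.
def should_stop_flotilla(active_gpus, total_gpus, threshold=0.8):
    return active_gpus / total_gpus < threshold

print(should_stop_flotilla(3, 4))  # 3 of 4 GPUs active (75%) -> True
print(should_stop_flotilla(4, 4))  # all GPUs active -> False
```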
SLIDE 35

Step 3: Model Training

Consider only un-converged DNNs when creating the next flotilla.

SLIDE 36

Experiment Settings

  • Heterogeneous ensemble: 100 DNNs derived from DenseNets and ResNets; training rate on a single GPU: 21~176 images/sec.
  • Summit-Dev@ORNL: per node, 2 IBM POWER8 CPUs with 256GB DRAM and 4 NVIDIA Tesla P100 GPUs.
  • Dataset: Caltech256, 30K training images (240-minute limit).
SLIDE 37

Counterparts for Comparisons

  • Baseline: train each DNN on one GPU independently; randomly pick one yet-to-be-trained DNN whenever a GPU is free.
  • Homogeneous Training (Pittman et al., 2018): train each DNN on one GPU with data sharing; when #GPUs < #DNNs, randomly pick a subset of DNNs to train after the previous subset is done.
  • FLEET-G (global paradigm).
  • FLEET-L (local paradigm): train remaining DNNs once some GPUs are released; pick the DNNs to train by the greedy algorithm in FLEET-G.
SLIDE 38

End-to-End Speedups

[Chart: speedup over the baseline (y-axis 0.5-2.5) across (#GPUs, #DNNs) configurations from (20,100) to (160,100), comparing Homogeneous [24], FLEET-L, and FLEET-G.]

Homogeneous: slowdowns are due to other GPUs waiting for the slowest DNN to finish.

SLIDE 39

End-to-End Speedups

[Chart: same configurations as the previous slide.]

FLEET-G achieves the best overall performance, with 1.12-1.92X speedups over the baseline. FLEET-L shows notable but smaller speedups due to less favorable allocation decisions. The overhead of scheduling and checkpointing is at most 0.1% and 6.3%, respectively, of the end-to-end training time in all settings.
SLIDE 40

Conclusions and Future Work

  • Systematically explore strategies for flexible ensemble training of a heterogeneous set of DNNs: optimal resource allocation → greedy allocation algorithm.
  • Software implementation: data-parallel distributed training, dynamic GPU-DNN mappings, checkpointing, data sharing.
  • Future work: apply FLEET to real hyperparameter tuning and neural architecture search workloads.
