5194.01: Introduction to High-Performance Deep Learning


SLIDE 1

5194.01: Introduction to High-Performance Deep Learning

Mesh-TensorFlow & SparkNet

Shen Wang

The Ohio State University

10/21/2020


SLIDE 2

SparkNet & Mesh-TensorFlow

SparkNet: Training Deep Networks in Spark
Mesh-TensorFlow: Deep Learning for Supercomputers


SLIDE 3

SLIDE 4

Background

• Training DNNs is time-consuming; computational clusters can be used to speed up training.

• Many attempts to speed up the training of deep networks rely on asynchronous, lock-free optimization, and batch-processing frameworks have become popular. However, state-of-the-art deep learning systems rely on custom implementations to support their asynchronous, communication-intensive workloads.

• SparkNet is designed to integrate a distributed training algorithm with existing batch-processing frameworks such as MapReduce and Spark.


SLIDE 5

Architecture of SparkNet: the parameter server model

• Master node: keeps the latest model parameters and serves them to the worker nodes.
• Worker nodes: compute gradients with respect to the parameters and ship them back to the master node.


SLIDE 6

Advantages

• It is convenient to integrate model training with existing data-processing pipelines.
• Data can be kept in memory from start to finish, so one can train and visualize within a single framework.
• Hardware requirements are minimal.

• Many distributed training approaches require heavy communication; SparkNet does not require communication optimizations within the cluster.


SLIDE 7

Implementation: distributed training

• The master node broadcasts the parameters to the worker nodes.
• Worker nodes train on their batches independently for 50 iterations and ship the resulting parameters back.
• The master node updates the parameters with the average over the workers and broadcasts the new parameters (see the sketch below).
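The loop is short enough to simulate end to end. Below is a minimal numpy sketch of the SparkNet averaging scheme on a toy least-squares model; the real system runs the inner loop in Caffe on Spark executors, so the model, the data, and the local_sgd helper here are illustrative stand-ins.

```python
import numpy as np

def local_sgd(w, X, y, tau=50, lr=0.01):
    """Stand-in for a worker's local solver: tau SGD steps on a
    least-squares objective, starting from the broadcast parameters."""
    for _ in range(tau):
        i = np.random.randint(len(X))
        grad = (X[i] @ w - y[i]) * X[i]
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
w_true = rng.normal(size=10)

# One (X, y) shard per simulated worker node (K = 5 workers).
shards = []
for _ in range(5):
    X = rng.normal(size=(1000, 10))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=1000)))

w = np.zeros(10)  # master's parameters
for _ in range(20):
    # Master broadcasts w; each worker runs tau local steps; master averages.
    w = np.mean([local_sgd(w.copy(), X, y) for X, y in shards], axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```

The only cross-node traffic per round is one broadcast of w and one set of parameter vectors shipped back, which is what makes the scheme tolerant of slow networks.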


SLIDE 8

Theoretical limitations for different parallelization schemes

No parallelization

• Na(b): the number of serial SGD iterations required to obtain an accuracy of a when training with batch size b.
• C(b): the time to compute the gradient over a batch of size b.
• Total time: T0 = Na(b) × C(b)
• Each block corresponds to a single SGD update with batch size b.


SLIDE 9

Theoretical limitations for different parallelization schemes

Naive parallelization

• Distribute the computation by dividing each minibatch across K machines; the time for a single node in a single iteration becomes C(b/K).
• Broadcasting the parameters takes time S.
• Total time: T1 = Na(b) × (C(b/K) + S)
• Speedup: T0 / T1 = (Na(b) × C(b)) / (Na(b) × (C(b/K) + S)) = C(b) / (C(b/K) + S) < C(b) / S. This bound limits the achievable speedup no matter how many machines are used.


SLIDE 10

Theoretical limitations for different parallelization schemes

SparkNet parallelization

• Distribute the computation in rounds across K machines; in each round, every machine runs SGD for τ iterations with batch size b.
• Ma(b, K, τ): the number of rounds required to achieve an accuracy of a.
• The time for a single node in a single iteration is still C(b); broadcasting the parameters takes time S.
• Total time: T2 = Ma(b, K, τ) × (τ × C(b) + S)
• Speedup: T0 / T2 = (Na(b) × C(b)) / (Ma(b, K, τ) × (τ × C(b) + S)) (compared numerically below)
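To make these formulas concrete, here is a small sketch that plugs illustrative numbers into the three schemes; every constant below (C(b), S, Na, Ma) is an assumption for illustration, not a measurement from the paper.

```python
# Illustrative speedup comparison for the three schemes above.
K, tau = 5, 50          # workers, local SGD steps per round
C_b = 1.0               # time for one gradient on batch size b (normalized)
C_b_over_K = C_b / K    # assume perfect scaling of per-iteration compute
S = 5.0                 # parameter-broadcast time (a high-latency cluster)
N_a = 10_000            # serial iterations to reach accuracy a (assumed)
M_a = 400               # SparkNet rounds to reach the same accuracy (assumed)

T0 = N_a * C_b                   # no parallelization
T1 = N_a * (C_b_over_K + S)      # naive parallelization
T2 = M_a * (tau * C_b + S)       # SparkNet parallelization

print(f"naive speedup    T0/T1 = {T0 / T1:.2f} (bounded by C(b)/S = {C_b / S:.2f})")
print(f"SparkNet speedup T0/T2 = {T0 / T2:.2f}")
```

With these numbers the naive scheme is slower than serial SGD (speedup 0.19, capped at 0.20), while SparkNet's infrequent synchronization more than doubles that, which is exactly the regime the next slides explore.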


SLIDE 11

Theoretical limitations of SparkNet parallelization

Speedup: T0 / T2 = (Na(b) × C(b)) / (Ma(b, K, τ) × (τ × C(b) + S))

Disregarding the overhead due to synchronization (S = 0), the speedup becomes Na(b) / (τ × Ma(b, K, τ)).

To measure Ma(b, K, τ), they ran SparkNet using a modified version of AlexNet on a subset of ImageNet (the first 100 classes, each with approximately 1,000 images).


SLIDE 12

Speedup disregarding communication

• K = 1 case: with only one worker, τ has no effect.
• τ = 1 case: equivalent to running serial SGD with a batch size of K × b.
• For a fixed K: surprisingly, the speedup does not increase as τ decreases.


SLIDE 13

Speedup with communication taken into account

• τ = 1, 2, 5, 10, 25, 100, 500, 1000, 2500.
• Naive parallelization gives no speedup when the communication overhead is large.
• SparkNet, however, gives a relatively consistent speedup even when the communication overhead is quite large.


SLIDE 14

Training benchmarks

Figure: Performance with AlexNet (left) and GoogLeNet (right) on the ImageNet dataset

• Train the default Caffe model of AlexNet on the ImageNet dataset; compare the wall-clock time required to reach an accuracy of 45%.
• Train the default Caffe model of GoogLeNet on the ImageNet dataset; compare the wall-clock time required to reach an accuracy of 40%.


SLIDE 15

Dependence of the parallelization scheme on τ

Each experiment was run with K = 5 workers.


SLIDE 16

Conclusion

• SparkNet is an easy-to-use deep learning implementation for Spark that is based on Caffe and enables parallelization of existing Caffe models with minimal modification.
• The approach is effective even in highly bandwidth-limited settings.


SLIDE 17

SLIDE 18

Background

• Batch-splitting (data parallelism) is the dominant training strategy for distributed DNNs, but it suffers from:
  • an inability to train very large models (memory constraints), and
  • high latency and inefficiency at small batch sizes.

• Model parallelism can solve these problems of batch-splitting, but it is:
  • complicated to specify the distribution strategies, and
  • difficult to compile and optimize.

• Mesh-TensorFlow: a language for specifying a general class of distributed tensor computations.

• Users can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors.


SLIDE 19

Hardware assumptions

• Assumes clusters of identical, reliable processors.
• A mesh is defined as an n-dimensional array of processors.
• Take a 512-core TPU cluster as an example; it could be represented by:
  • a 3-dimensional mesh with shape [16, 16, 2],
  • a 2-dimensional mesh with shape [32, 16], or
  • a 1-dimensional mesh with shape [512].

• The physical network topology of the cluster does not affect how the mesh is defined, nor the performance (illustrated below).
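These views really are just different factorizations of the same 512 processors, as a quick numpy sketch shows (mesh dimensions in Mesh-TensorFlow are named; this shows only the geometry):

```python
import numpy as np

processors = np.arange(512)               # one entry per TPU core
mesh_3d = processors.reshape(16, 16, 2)   # mesh shape [16, 16, 2]
mesh_2d = processors.reshape(32, 16)      # mesh shape [32, 16]
mesh_1d = processors.reshape(512)         # mesh shape [512]

# The same processor appears under different logical coordinates:
print(mesh_3d[0, 1, 1], mesh_2d[0, 3], mesh_1d[3])  # all refer to processor 3
```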


SLIDE 20

Single-Program-Multiple-Data (SPMD) Batch-Splitting

Inspired by the synchronous data-parallel scheme (similar to the parameter server model):

• Each processor keeps an identical copy of all parameters.
• The batch is split into sub-batches, one fed to each processor.
• Each processor computes the forward and backward pass on its sub-batch.
• The gradients over the sub-batches are summed, and the parameters are updated identically on all processors.

We can describe this as splitting the computation across the "batch" dimension. Mesh-TensorFlow generalizes this idea to splitting computations across arbitrary dimensions (see the sketch below).
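Below is a minimal numpy sketch of this synchronous batch-splitting scheme for a single linear layer; the all-reduce is simulated by a plain sum over the processor list, and the model and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proc, batch, d_in, d_out = 4, 32, 8, 2
W = rng.normal(size=(d_in, d_out))   # identical copy on every processor
X = rng.normal(size=(batch, d_in))
Y = rng.normal(size=(batch, d_out))

# Split the batch dimension across the processors.
X_shards = np.split(X, n_proc)
Y_shards = np.split(Y, n_proc)

def local_grad(W, Xs, Ys):
    """Each processor's forward/backward pass on its sub-batch."""
    err = Xs @ W - Ys                # forward pass (squared loss residual)
    return Xs.T @ err                # backward pass: dLoss/dW

grads = [local_grad(W, Xs, Ys) for Xs, Ys in zip(X_shards, Y_shards)]

# All-reduce: sum the sub-batch gradients, then update every copy of W.
g = np.sum(grads, axis=0)
W -= 0.01 * g / batch
```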


SLIDE 21

Mesh Shape and Computation Layout

• A global "computation layout" is a partial map from tensor dimensions to mesh dimensions; it specifies which tensor dimensions are split across which dimensions of the processor mesh.
• mesh_shape: how the processors are organized, e.g. [('all', n)] or [('rows', r), ('cols', c)].
• computation_layout: how the data is split, e.g. [('batch', 'all')] or [('batch', 'rows'), ('hidden', 'cols')] (resolved in the sketch below).
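To make the mapping concrete, the sketch below resolves a computation layout against a mesh shape and reports each processor's shard shape; the shard_shape helper is a made-up illustration, not part of the Mesh-TensorFlow API.

```python
mesh_shape = [("rows", 4), ("cols", 2)]       # 8 processors in a 4 x 2 grid
layout = {"batch": "rows", "hidden": "cols"}  # partial map: tensor dim -> mesh dim

def shard_shape(tensor_dims, mesh_shape, layout):
    """Each tensor dimension mapped to a mesh dimension is divided by that
    mesh dimension's size; unmapped dimensions are replicated in full."""
    mesh = dict(mesh_shape)
    return {name: size // mesh[layout[name]] if name in layout else size
            for name, size in tensor_dims.items()}

# A [batch=512, hidden=1024] tensor under this layout:
print(shard_shape({"batch": 512, "hidden": 1024}, mesh_shape, layout))
# -> {'batch': 128, 'hidden': 512}: each processor holds a 128 x 512 shard
```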


SLIDE 22

Example: Two Fully-Connected Layers

• Consider two fully-connected layers in the middle of a neural network.
• The input layer x and the output layer y each have d_io units.
• The hidden layer h has d_h units and is activated by ReLU.
• y = ReLU(xw + bias)v (expressed in Mesh-TensorFlow below)
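This example is what Mesh-TensorFlow's named-dimension API is designed to express. The sketch below follows the style of the code in the paper, but the setup lines (graph, mesh, sizes, and the imported x) are our own assumptions, so treat it as an API sketch rather than a verbatim excerpt.

```python
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

b, d_io, d_h = 64, 512, 2048          # illustrative sizes
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Named dimensions; the layout decides which of these get split.
batch_dim = mtf.Dimension("batch", b)
io_dim = mtf.Dimension("io", d_io)
hidden_dim = mtf.Dimension("hidden", d_h)

# Import the activations x with shape [batch, io].
x = mtf.import_tf_tensor(mesh, tf.random.normal([b, d_io]),
                         shape=mtf.Shape([batch_dim, io_dim]))

w = mtf.get_variable(mesh, "w", mtf.Shape([io_dim, hidden_dim]))
bias = mtf.get_variable(mesh, "bias", mtf.Shape([hidden_dim]))
v = mtf.get_variable(mesh, "v", mtf.Shape([hidden_dim, io_dim]))

# h = ReLU(xw + bias); y = hv. einsum contracts the dimensions that are
# missing from output_shape, whatever the distribution layout is.
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]) + bias)
y = mtf.einsum([h, v], output_shape=[batch_dim, io_dim])
```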


SLIDE 23

Illustration of the Two Fully-Connected Layers


SLIDE 24

Data-Parallel Layout

• mesh_shape = [('all', n)]
• computation_layout = [('batch', 'all')]
• The number of values that must be communicated per processor is 2·d_io·d_h (the number of parameters).


SLIDE 25

Model-Parallel Layout

• mesh_shape = [('all', n)]
• computation_layout = [('hidden', 'all')]
• The number of values that must be communicated per processor is 2·b·d_io (when computing y and the gradient of x).


SLIDE 26

Data-Parallel, Model-Parallel Layout

• mesh_shape = [('rows', r), ('cols', c)]
• computation_layout = [('batch', 'rows'), ('hidden', 'cols')]
• Each row of processors handles a fraction of the batch; each column of processors handles a fraction of the hidden units.
• The number of values that must be communicated per processor is 2·d_io·d_h / c + 2·b·d_io / r.


SLIDE 27

Data-Parallel, Model-Parallel Layout (3-dimensional)

• mesh_shape = [('rows', r), ('cols', c), ('planes', p)]
• computation_layout = [('batch', 'rows'), ('hidden', 'cols'), ('io', 'planes')]
• The first mesh dimension handles a fraction of the batch, the second a fraction of the hidden units, and the third a fraction of the io units.
• The number of values that must be communicated per processor is 2·d_io·d_h / (c·p) + 2·b·d_io / (r·p) + 2·b·d_h / (r·c) (compared numerically in the sketch below).
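The per-processor communication costs of the four layouts are easy to compare numerically; in the sketch below, b, d_io, d_h and the mesh factorizations are illustrative assumptions, not values from the paper.

```python
# Per-processor communication volume for each layout (in values, not bytes).
b, d_io, d_h = 512, 1024, 4096

data_parallel  = 2 * d_io * d_h          # layout [('batch', 'all')]
model_parallel = 2 * b * d_io            # layout [('hidden', 'all')]

r, c = 8, 8                              # 2-D mesh [('rows', r), ('cols', c)]
mixed_2d = 2 * d_io * d_h / c + 2 * b * d_io / r

r, c, p = 4, 4, 4                        # 3-D mesh with an extra 'planes' axis
mixed_3d = (2 * d_io * d_h / (c * p)
            + 2 * b * d_io / (r * p)
            + 2 * b * d_h / (r * c))

for name, v in [("data-parallel", data_parallel),
                ("model-parallel", model_parallel),
                ("2-D mixed", mixed_2d), ("3-D mixed", mixed_3d)]:
    print(f"{name:>14}: {v:,.0f} values per processor")
```

With these sizes the mixed layouts spread the cost: the 2-D mesh cuts communication roughly 7x versus pure data parallelism, and the 3-D mesh roughly 10x.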


SLIDE 28

Performance Comparison

Efficiency is governed by the ratio of communication to computation:

• Communication (numerator): grows only linearly as the batch size and layer sizes increase.
• Computation (denominator): grows quadratically or cubically in those sizes, so linearly increasing the batch size and the hidden-layer size is enough to maintain good efficiency.


SLIDE 29

Experiments and Results: "Transformer"

• Train a model-parallel layout of the Transformer attention-based sequence-to-sequence model.
• Also train a data-parallel, model-parallel layout of the Transformer (TPUv2, 16 × 32 = 512 cores).


SLIDE 30

Billion-word language modeling benchmark

• Trained the models for 10 epochs.
• The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster.
• The batch size for all models was 256 sequences of 256 tokens each.


SLIDE 31

WMT14 En-Fr translation task

• Trained the models for 3 epochs.
• The largest model (2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster.


SLIDE 32

Conclusion

• The Mesh-TensorFlow language facilitates a broad class of SPMD distributed tensor computations.
• It makes it possible to train very large models (5 billion parameters) on large clusters (512 cores).
• Performance is excellent, establishing new state-of-the-art results on the WMT14 En-Fr translation task and the One Billion Word language modeling benchmark.


SLIDE 33

Thank you
