Introduction to Deep Learning: Concepts and Terminologies
CSE 5194.01, Autumn 20
Arpan Jain, The Ohio State University
E-mail: jain.575@osu.edu
- Introduction
- DNN Training
- Essential Concepts
- Parallel and Distributed DNN Training
Outline
Deep Learning
Source: https://thenewstack.io/demystifying-deep-learning-and-artificial-intelligence/
- Deep Learning
– Uses Deep Neural Networks (DNNs) and their variants
– Based on learning data representations
– Can be supervised or unsupervised
– Examples: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Hybrid Networks
- According to Yoshua Bengio:
“Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features”
- Machine Learning – Ability of machines to learn without being explicitly programmed
- Supervised Learning – We provide the machine with the “right answers” (labels)
– Classification – Discrete output values (e.g. email is spam or not-spam)
– Regression – Continuous output values (e.g. house prices)
- Unsupervised Learning – No “right answers” given. Learn yourself; no labels for you!
– Clustering – Group the data points that are “close” to each other (e.g. cocktail party problem)
– Finding structure in the data is the key here! (A tiny sketch contrasting labeled and unlabeled data follows this slide.)
- Features – Input attributes (e.g. tumor size, age, etc. in a cancer detection problem)
– A very important concept in learning, so please remember this!
- Deep Learning – Learning that uses Deep Neural Networks
One Line (Unofficial) Definitions
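A tiny sketch of the supervised vs. unsupervised distinction, with made-up NumPy arrays (the tumor-size/age features and labels are illustrative assumptions, not real data):

```python
import numpy as np

# Supervised learning: features X come with "right answers" (labels) y.
X = np.array([[2.1, 65], [1.3, 42], [3.8, 71]])  # e.g. tumor size, age
y = np.array([1, 0, 1])                          # e.g. malignant (1) or benign (0)

# Unsupervised learning: only X is available; the goal is to find structure,
# e.g. clusters of points that are "close" to each other.
X_unlabeled = np.array([[2.1, 65], [1.3, 42], [3.8, 71]])
```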
- Left Picture: Supervised/Unsupervised?
- Right Picture: Supervised/Unsupervised?
Spot Quiz: Supervised vs. Unsupervised?
[Two scatter plots, each with axes X1 and X2]
- What is X1 and X2?
- What do colors/shapes represent?
- What is the green line?
- To actually train a network, please visit: http://playground.tensorflow.org
TensorFlow playground
- To try handwritten numbers, please visit: https://microsoft.github.io/onnxjs-demo/#/mnist
Handwritten Numbers (Quick Demo)
- Introduction
- DNN Training
- Essential Concepts
- Parallel and Distributed DNN Training
Outline
DNN Training: Forward Pass
[Figure: a network with an Input Layer, two Hidden Layers, and an Output Layer, connected by weights W1–W8. The input X is propagated left to right through the layers to produce the prediction Pred.]
Error = Loss(Pred, Output)
DNN Training: Backward Pass
[Figure: the same network; the error is propagated right to left, producing error terms E8, E7, …, E1 on the weighted edges.]
Error = Loss(Pred, Output)
DNN Training
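Putting the forward and backward passes together, here is a minimal sketch of one training step in a PyTorch-style framework (the two-layer model, shapes, and random data are illustrative assumptions, not the exact network from the figures):

```python
import torch
import torch.nn as nn

# Illustrative two-layer network (Input -> Hidden -> Output).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(16, 4)         # a small batch of inputs
target = torch.randn(16, 1)    # the "right answers"

pred = model(X)                # forward pass: X -> Pred
error = loss_fn(pred, target)  # Error = Loss(Pred, Output)
error.backward()               # backward pass: compute gradients (the E terms)
optimizer.step()               # update the weights W using the gradients
optimizer.zero_grad()          # reset gradients for the next iteration
```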
- Introduction
- DNN Training
- Essential Concepts
- Parallel and Distributed DNN Training
Outline
Essential Concepts: Activation Functions and Back-propagation
- What are Activation functions?
– ReLU (a max fn.) is the most common activation fn.
– Sigmoid, tanh, etc. are also used
- Back-propagation involves complicated mathematics
– Luckily, most DL frameworks give you a one-line implementation -- model.backward()
– I encourage everyone to take CSE 5526!
Courtesy: https://www.jeremyjordan.me/neural-networks-training/
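A minimal NumPy sketch of the activation functions named above (the function names here are chosen for clarity, not taken from any framework):

```python
import numpy as np

def relu(x):
    """ReLU: a max function, max(0, x); the most common activation."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Tanh: squashes values into (-1, 1)."""
    return np.tanh(x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), sigmoid(z), tanh(z))
```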
- Goal of SGD:
– Minimize a cost fn. J(θ) as a function of the parameters θ
- SGD is iterative
- Only two equations to remember:
– θi := θi + Δθi
– Δθi = −α * (∂J(θ) / ∂θi)
- α = learning rate
Essential Concepts: Stochastic Gradient Descent (SGD)
Courtesy: https://www.jeremyjordan.me/gradient-descent/
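A minimal NumPy sketch of the two SGD equations above (the quadratic cost J and its gradient are made-up placeholders for illustration):

```python
import numpy as np

def grad_J(theta):
    """Gradient of an illustrative cost J(theta) = ||theta - 3||^2 / 2."""
    return theta - 3.0

theta = np.zeros(2)   # initial parameters
alpha = 0.1           # learning rate

for step in range(100):
    delta = -alpha * grad_J(theta)   # Δθ = −α * ∂J/∂θ
    theta = theta + delta            # θ := θ + Δθ

print(theta)  # approaches the minimizer [3, 3]
```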
Essential Concepts: Learning Rate (α)
Courtesy: https://www.jeremyjordan.me/nn-learning-rate/
- Batched Gradient Descent
– Batch Size = N
- Stochastic Gradient Descent
– Batch Size = 1
- Mini-batch Gradient Descent
– Somewhere in the middle
– Common: Batch Size = 64, 128, 256, etc.
- Finding the optimal batch size will yield the fastest learning
Essential Concepts: Batch Size
Courtesy: https://www.jeremyjordan.me/gradient-descent/
One full pass over all N samples is called an epoch of training
Mini-batch Gradient Descent (Example)
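A minimal NumPy sketch of one epoch of mini-batch gradient descent (the dataset, the linear model, and the mean-squared-error gradient are illustrative assumptions):

```python
import numpy as np

N, batch_size, alpha = 1024, 64, 0.01
X = np.random.randn(N, 4)                  # illustrative dataset
y = X @ np.array([1.0, -2.0, 0.5, 3.0])    # illustrative targets
theta = np.zeros(4)                        # model parameters

# One epoch = one full pass over all N samples, in mini-batches.
for start in range(0, N, batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    grad = Xb.T @ (Xb @ theta - yb) / len(Xb)  # gradient of mean squared error
    theta -= alpha * grad                      # SGD update on this mini-batch
```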
- How to define the “size” of a model? (model is also called a DNN or a network)
- Size means several things and context is important
– Model Size: # of parameters (weights on edges)
– Model Size: # of layers (model depth)
Essential Concepts: Model Size
[Figure: model depth (no. of layers) and weights on edges]
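A hedged sketch of the first notion of size, counting parameters in a PyTorch-style model (the three-layer network here is just an example, not a model from the course):

```python
import torch.nn as nn

# Illustrative model: depth = 3 layers of weights.
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(),
                      nn.Linear(50, 20), nn.ReLU(),
                      nn.Linear(20, 10))

num_params = sum(p.numel() for p in model.parameters())
num_layers = sum(1 for m in model.modules() if isinstance(m, nn.Linear))
print(f"{num_params} parameters across {num_layers} weight layers")
```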
- What is the end goal of training a model with SGD and Back-propagation?
– Of course, train the machine to predict something useful for you
- How do we measure success?
– Well, accuracy of the trained model on “new” data is the metric of success
- How quickly we can get there is:
– “good to have” for some models
– “practically necessary” for most state-of-the-art models
– In Computer Vision: images/second is the metric of throughput/speed
- Why?
– Let’s hear some opinions from the class
Essential Concepts: Accuracy and Throughput (Speed)
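A minimal sketch of how images/second might be measured (the stand-in model, batch shape, and number of iterations are assumptions for illustration only):

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in model
images = torch.randn(64, 3, 32, 32)      # one batch of 64 small "images"

start = time.time()
iters = 10
for _ in range(iters):                   # repeated forward passes
    _ = model(images)
elapsed = time.time() - start

print(f"Throughput: {iters * images.shape[0] / elapsed:.1f} images/second")
```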
- Introduction
- DNN Training
- Essential Concepts
- Parallel and Distributed DNN Training
Outline
- Larger models → better accuracy
- More data → better accuracy
- Single-node Training is good for:
– Small models and small datasets
- Distributed Training is good for:
– Large models and large datasets
Impact of Model Size and Dataset Size
Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
[Figure: the “model > data” regime vs. the “data > model” regime]
- Overfitting – model > data, so the model is not learning but memorizing your data
- Underfitting – data > model, so the model is not learning because it cannot capture the complexity of your data
Overfitting and Underfitting
Courtesy: https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
- What are the Parallelization Strategies?
– Model Parallelism
– Data Parallelism (has received the most attention)
– Hybrid Parallelism
– Automatic Selection
Parallelization Strategies
[Figure: Model Parallelism, Data Parallelism, and Hybrid (Model and Data) Parallelism]
Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
Drawback: if the dataset has 1 million images, it will take forever to train the model on such a big dataset.
Solution: can we use multiple machines to speed up the training of Deep Learning models? (i.e. utilize supercomputers to parallelize training)
Need for Data Parallelism
Let’s revisit Mini-Batch Gradient Descent
Need for Communication in Data Parallelism
[Figure: the labeled dataset (Y/N labels) is split evenly across Machine 1 through Machine 5]
Problem: we want to train a single model on the whole dataset, not 5 separate models on different subsets of the dataset
Data Parallelism
[Figure: Machines 1–5 each compute local gradients; an MPI AllReduce sums them so that every machine ends up with the same reduced (global) gradients]
Data Parallelism
- Step 1: Data Propagation
– Distribute the data among the GPUs
- Step 2: Forward/Backward Pass
– Perform the forward pass and calculate the prediction
– Calculate the error by comparing the prediction with the actual output
– Perform the backward pass and calculate the gradients
- Step 3: Gradient Aggregation (a minimal sketch follows below)
– Call MPI_Allreduce to reduce the local gradients
– Update parameters locally using the global gradients
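A hedged sketch of Step 3 using mpi4py (the gradient array and its size are illustrative; real DL frameworks typically wire the Allreduce into the backward pass for you):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each rank (machine/GPU) has its own local gradients after the backward pass.
local_grads = np.random.randn(1000).astype(np.float32)  # illustrative size
global_grads = np.empty_like(local_grads)

# Sum the local gradients across all ranks; every rank receives the same result.
comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
global_grads /= comm.Get_size()  # average, so all ranks apply the same update

# Each rank then updates its parameters locally using the global gradients.
```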
Impact of Large Batch Size
Courtesy: https://research.fb.com/publications/imagenet1kin1h/
[Figure: GoogLeNet (ImageNet) on 128 GPUs – training time (seconds) vs. no. of GPUs (8–128) for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048)]
- A large batch size is bad for accuracy, but good for speed and scalability
- A. A. Awan et al., S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. PPoPP '17
- Epochs per second (EPS)?
– A variant of images/second
– Basically, the speed of training the model
- Accuracy per Epoch (APE)?
– E.g. 60% accuracy after one full pass over the dataset
- Async → Higher EPS but lower APE
- Sync → Higher APE but lower EPS
(A toy sketch contrasting the two appears after this slide.)
Synchronous vs. Asynchronous Training
Courtesy: http://engineering.skymind.io/distributed-deep-learning- part-1-an-introduction-to-distributed-training-of-neural-networks
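A toy, sequential simulation contrasting the two update styles (the gradient function and noise are made up; real asynchronous training runs workers concurrently, e.g. against a parameter server):

```python
import numpy as np

def local_gradient(theta, rank):
    """Illustrative per-worker gradient (stand-in for a real backward pass)."""
    rng = np.random.default_rng(rank)
    return (theta - 3.0) + 0.1 * rng.standard_normal(theta.shape)

theta, alpha, workers = np.zeros(2), 0.1, 4

# Synchronous step: wait for ALL workers, average their gradients, one update.
grads = [local_gradient(theta, r) for r in range(workers)]
theta -= alpha * np.mean(grads, axis=0)

# Asynchronous step: each worker applies its own gradient as soon as it is
# ready, possibly computed from a slightly stale copy of theta.
for r in range(workers):
    theta -= alpha * local_gradient(theta, r)
```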
- The concepts and terminologies discussed today will keep coming up in the next lectures
- Please clarify any confusion early on
- Future papers and presentations will use these concepts in more complex ways!
- Questions/Comments?
Review and Conclusion
Thank You!
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
jain.575@osu.edu