Introduction to Deep Learning: Concepts and Terminologies (CSE 5194.01)


SLIDE 1: Introduction to Deep Learning: Concepts and Terminologies

CSE 5194.01, Autumn '20. Arpan Jain, The Ohio State University. E-mail: jain.575@osu.edu

SLIDE 2: Outline

  • Introduction
  • DNN Training
  • Essential Concepts
  • Parallel and Distributed DNN Training


SLIDE 3: Deep Learning

Source: https://thenewstack.io/demystifying-deep-learning-and-artificial-intelligence/

  • Deep Learning
    – Uses Deep Neural Networks and their variants
    – Based on learning data representations
    – Can be supervised or unsupervised
    – Examples: Convolutional Neural Network (CNN), Recurrent Neural Network, Hybrid Networks
  • According to Yoshua Bengio:

"Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features"

SLIDE 4: One Line (Unofficial) Definitions

  • Machine Learning – the ability of machines to learn without being explicitly programmed
  • Supervised Learning – we provide the machine with the "right answers" (labels); see the sketch below
    – Classification – discrete output values (e.g., email is spam or not-spam)
    – Regression – continuous output values (e.g., house prices)
  • Unsupervised Learning – no "right answers" given; learn yourself, no labels for you!
    – Clustering – group the data points that are "close" to each other (e.g., the cocktail party problem)
    – Finding structure in the data is the key here!
  • Features – input attributes (e.g., tumor size, age, etc. in a cancer detection problem)
    – A very important concept in learning, so please remember it!
  • Deep Learning – learning that uses Deep Neural Networks
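To make the supervised/unsupervised split concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available: it fits a classifier on labeled points (supervised) and then clusters the same points without looking at the labels (unsupervised). The dataset and model choices are illustrative, not part of the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Made-up 2-D data: X holds the features (think X1, X2), y holds the labels.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: we hand the model the "right answers" (labels) and it learns to predict them.
clf = LogisticRegression().fit(X, y)
print("supervised accuracy on the labeled data:", clf.score(X, y))

# Unsupervised: no labels given; the algorithm groups points that are "close" to each other.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("unsupervised cluster sizes:", np.bincount(clusters))
```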

SLIDE 5: Spot Quiz: Supervised vs. Unsupervised?

  • Left Picture: Supervised or Unsupervised?
  • Right Picture: Supervised or Unsupervised?
  • What are X1 and X2?
  • What do the colors/shapes represent?
  • What is the green line?

[Figure: two scatter plots, each with axes X1 and X2.]

SLIDE 6: TensorFlow Playground

  • To actually train a network, please visit: http://playground.tensorflow.org

SLIDE 7: Handwritten Numbers (Quick Demo)

  • To try recognizing handwritten numbers, please visit: https://microsoft.github.io/onnxjs-demo/#/mnist

SLIDE 8: Outline

  • Introduction
  • DNN Training
  • Essential Concepts
  • Parallel and Distributed DNN Training


SLIDE 9: DNN Training: Forward Pass

[Figure: a DNN with an Input Layer, two Hidden Layers, and an Output Layer.]

SLIDE 10: DNN Training: Forward Pass

[Figure: the forward pass begins; input X enters the network and flows through weights W1 and W2.]

SLIDE 11: DNN Training: Forward Pass

[Figure: the forward pass continues; input X has flowed through weights W1-W6.]

SLIDE 12: DNN Training: Forward Pass

[Figure: the forward pass continues; input X has flowed through weights W1-W8 and reaches the Output Layer.]

SLIDE 13: DNN Training: Forward Pass

[Figure: the forward pass is complete; input X has flowed through weights W1-W8 to produce the prediction Pred.]

Error = Loss(Pred, Output)
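Below is a minimal NumPy sketch of the forward pass in these figures, under the simplifying assumption that each Wi here is a whole weight matrix rather than a single edge weight as in the diagrams; the layer sizes, the input X, and the target Output are made up.

```python
import numpy as np

# Forward pass through a 2-hidden-layer network (sizes and values are illustrative).
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

X = rng.normal(size=4)              # input vector
W1 = rng.normal(size=(4, 5))        # Input Layer   -> Hidden Layer 1
W2 = rng.normal(size=(5, 3))        # Hidden Layer 1 -> Hidden Layer 2
W3 = rng.normal(size=(3, 1))        # Hidden Layer 2 -> Output Layer
Output = np.array([1.0])            # the "right answer" for this input

h1 = relu(X @ W1)                   # forward pass, hidden layer 1
h2 = relu(h1 @ W2)                  # forward pass, hidden layer 2
Pred = h2 @ W3                      # forward pass, output layer

Error = np.mean((Pred - Output) ** 2)   # Error = Loss(Pred, Output), here a squared error
print("Pred:", Pred, "Error:", Error)
```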

SLIDE 14: DNN Training: Backward Pass

[Figure: the backward pass begins at the Output Layer; error terms E7 and E8 are computed from Error = Loss(Pred, Output).]

SLIDE 15: DNN Training: Backward Pass

[Figure: the backward pass continues; error terms E3-E8 have been propagated back through the second Hidden Layer.]

SLIDE 16: DNN Training: Backward Pass

[Figure: the backward pass is complete; error terms E1-E8 have been propagated back through the whole network, starting from Error = Loss(Pred, Output).]
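Continuing the forward-pass sketch above, here is a hedged illustration of the backward pass for just the last layer: the chain rule gives the gradient of the squared-error loss with respect to W3, and one SGD-style update is applied. Propagating error terms further back to W2 and W1 follows the same pattern; all names and sizes are illustrative.

```python
import numpy as np

# Backward pass for the last layer of the forward-pass sketch above.
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

X = rng.normal(size=4)
W1, W2, W3 = rng.normal(size=(4, 5)), rng.normal(size=(5, 3)), rng.normal(size=(3, 1))
Output = np.array([1.0])

h1 = relu(X @ W1)                    # forward pass (activations are needed for gradients)
h2 = relu(h1 @ W2)
Pred = h2 @ W3
Error = np.mean((Pred - Output) ** 2)

d_Pred = 2.0 * (Pred - Output)       # dError/dPred for the squared-error loss
grad_W3 = np.outer(h2, d_Pred)       # chain rule: dError/dW3, same shape as W3

alpha = 0.1                          # learning rate
W3 = W3 - alpha * grad_W3            # one update step on the last layer
print("Error before the update:", Error)
```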

SLIDE 17: DNN Training

SLIDE 18: Outline

  • Introduction
  • DNN Training
  • Essential Concepts
  • Parallel and Distributed DNN Training


SLIDE 19: Essential Concepts: Activation Functions and Back-propagation

  • Back-propagation involves complicated mathematics
    – Luckily, most DL frameworks give you a one-line implementation: model.backward() (see the sketch below)
    – I encourage everyone to take CSE 5526!
  • What are activation functions?
    – ReLU (a max function) is the most common activation function
    – Sigmoid, tanh, etc. are also used

Courtesy: https://www.jeremyjordan.me/neural-networks-training/
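A minimal PyTorch-style sketch of both ideas, assuming PyTorch is installed: ReLU and Sigmoid as activation functions, and back-propagation done in one line. Note that in PyTorch the one-liner is loss.backward() rather than model.backward(); the model shape and data below are made up.

```python
import torch
import torch.nn as nn

# A tiny network using two common activation functions (ReLU and Sigmoid).
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),        # ReLU, the most common activation function
    nn.Linear(8, 1),
    nn.Sigmoid(),     # squashes the output into (0, 1)
)

x = torch.randn(16, 4)       # a made-up batch of 16 examples, 4 features each
target = torch.rand(16, 1)   # made-up targets

pred = model(x)                               # forward pass
loss = nn.functional.mse_loss(pred, target)   # Error = Loss(Pred, Output)
loss.backward()                               # one-line back-propagation
print(model[0].weight.grad.shape)             # gradients are now populated
```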

SLIDE 20: Essential Concepts: Stochastic Gradient Descent (SGD)

  • Goal of SGD:
    – Minimize a cost function J(θ) as a function of the parameters θ
  • SGD is iterative
  • Only two equations to remember (see the sketch below):
    – θi := θi + Δθi
    – Δθi = −α * (∂J(θ) / ∂θi)
  • α is the learning rate

Courtesy: https://www.jeremyjordan.me/gradient-descent/
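A minimal NumPy sketch of the two update equations on a made-up least-squares cost J(θ) = ||Aθ − b||²; A, b, α, and the step count are illustrative.

```python
import numpy as np

# Gradient descent on the toy cost J(theta) = ||A @ theta - b||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
b = rng.normal(size=100)

theta = np.zeros(3)
alpha = 0.001                            # learning rate

for step in range(1000):
    grad = 2 * A.T @ (A @ theta - b)     # ∂J(θ)/∂θ
    delta = -alpha * grad                # Δθ = −α * ∂J(θ)/∂θ
    theta = theta + delta                # θ := θ + Δθ

print("learned theta:", theta)
```

With a much larger α the same loop overshoots and diverges, which is exactly the learning-rate trade-off the next slide illustrates.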

SLIDE 21: Essential Concepts: Learning Rate (α)

Courtesy: https://www.jeremyjordan.me/nn-learning-rate/
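The referenced figure is not reproduced here, but the effect of α can be sketched in a few lines on the toy cost J(θ) = θ² (gradient 2θ); the three α values are made-up examples of "too small", "about right", and "too large".

```python
# Gradient descent on the toy cost J(theta) = theta**2, whose gradient is 2*theta.
def run_gd(alpha, steps=20, theta0=5.0):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # θ := θ − α * ∂J/∂θ
    return theta

for alpha in (0.01, 0.4, 1.1):              # too small, about right, too large
    print(f"alpha = {alpha:4.2f} -> theta after 20 steps: {run_gd(alpha):.4f}")
```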

SLIDE 22: Essential Concepts: Batch Size

  • Batched Gradient Descent
    – Batch Size = N (the full dataset)
  • Stochastic Gradient Descent
    – Batch Size = 1
  • Mini-batch Gradient Descent
    – Somewhere in the middle
    – Common batch sizes: 64, 128, 256, etc.
  • Finding the optimal batch size will yield the fastest learning (see the sketch below)
  • One full pass over all N samples is called an epoch of training

Courtesy: https://www.jeremyjordan.me/gradient-descent/
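A minimal NumPy sketch of mini-batch gradient descent on made-up data, showing one epoch as a full pass over all N samples split into batches of 64; the dataset, model, and hyper-parameters are illustrative.

```python
import numpy as np

# Mini-batch gradient descent on a made-up linear-regression problem.
rng = np.random.default_rng(0)
N = 1024
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

theta = np.zeros(3)
alpha = 0.05
batch_size = 64

for epoch in range(20):                    # one epoch = one full pass over all N samples
    order = rng.permutation(N)             # reshuffle the data every epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ theta - yb) / len(idx)   # gradient on this mini-batch only
        theta -= alpha * grad

print("estimated weights:", theta)         # should be close to [1.0, -2.0, 0.5]
```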

SLIDE 23: Mini-batch Gradient Descent (Example)

SLIDE 24: Essential Concepts: Model Size

  • How do we define the "size" of a model? (a model is also called a DNN or a network)
  • Size means several things, and context is important:
    – Model size as the # of parameters (the weights on the edges)
    – Model size as the # of layers (the model depth); see the sketch below

[Figure: a DNN annotated with its model depth (number of layers) and the weights on its edges.]
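A short PyTorch-style sketch (assuming PyTorch is installed) of the two notions of size on a small, made-up network: counting parameters versus counting layers.

```python
import torch.nn as nn

# Two notions of "model size" on a small, made-up network.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

num_params = sum(p.numel() for p in model.parameters())                    # size as # of parameters
num_layers = sum(1 for m in model.modules() if isinstance(m, nn.Linear))   # size as # of weight layers

print(f"parameters: {num_params:,}")   # 784*256 + 256 + 256*128 + 128 + 128*10 + 10 = 235,146
print(f"layers:     {num_layers}")     # 3
```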

SLIDE 25: Essential Concepts: Accuracy and Throughput (Speed)

  • What is the end goal of training a model with SGD and back-propagation?
    – Of course, to train the machine to predict something useful for you
  • How do we measure success?
    – The accuracy of the trained model on "new" data is the metric of success
  • How quickly we can get there is:
    – "Good to have" for some models
    – "Practically necessary" for most state-of-the-art models
    – In Computer Vision, images/second is the metric of throughput (speed); see the sketch below
  • Why?
    – Let's hear some opinions from the class
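A hedged sketch of measuring training throughput in images/second, assuming PyTorch is installed; the model, batch size, fake images, and step count are all made up, so the printed number is only meaningful relative to other runs on the same machine.

```python
import time
import torch
import torch.nn as nn

# Measure training throughput (images/second) on a made-up model and fake data.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

batch_size, steps = 64, 50
images = torch.randn(batch_size, 3, 32, 32)     # fake 32x32 RGB images
labels = torch.randint(0, 10, (batch_size,))    # fake labels

start = time.time()
for _ in range(steps):
    opt.zero_grad()
    loss = loss_fn(model(images), labels)       # forward pass
    loss.backward()                             # backward pass
    opt.step()                                  # parameter update
elapsed = time.time() - start

print(f"throughput: {steps * batch_size / elapsed:.1f} images/second")
```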

SLIDE 26: Outline

  • Introduction
  • DNN Training
  • Essential Concepts
  • Parallel and Distributed DNN Training


SLIDE 27: Impact of Model Size and Dataset Size

  • Large models → better accuracy
  • More data → better accuracy
  • Single-node training is good for:
    – A small model and a small dataset
  • Distributed training is good for:
    – Large models and large datasets

[Figure: the two regimes, "model > data" and "data > model".]

Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks

SLIDE 28: Overfitting and Underfitting

  • Overfitting – model > data → the model is not learning but memorizing your data
  • Underfitting – data > model → the model is not learning because it cannot capture the complexity of your data

Courtesy: https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
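A minimal NumPy sketch of the same idea using polynomial regression on made-up, noisy data: a degree-1 fit is too simple and tends to underfit (both errors high), while a high-degree fit tends to chase the noise and overfit (low train error, higher test error); the degrees and noise level are illustrative.

```python
import numpy as np

# Under- vs. over-fitting with polynomial regression on made-up, noisy data.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)                      # fit on the training points
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)  # error on seen data
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)     # error on "new" data
    print(f"degree {degree}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```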

SLIDE 29: Parallelization Strategies

  • What are the parallelization strategies?
    – Model Parallelism
    – Data Parallelism (has received the most attention)
    – Hybrid Parallelism
    – Automatic Selection

[Figure: schematic comparison of Model Parallelism, Data Parallelism, and Hybrid (Model and Data) Parallelism.]

Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks

SLIDE 30: Need for Data Parallelism

Let's revisit Mini-batch Gradient Descent.

  • Drawback: if the dataset has 1 million images, it will take forever to run the model over such a big dataset
  • Solution: can we use multiple machines to speed up the training of deep learning models? (i.e., utilize supercomputers to parallelize)

SLIDE 31: Need for Communication in Data Parallelism

[Figure: a labeled dataset (rows tagged Y or N) split into five shards across Machine 1 through Machine 5.]

Problem: we want to train a single model on the whole dataset, not 5 separate models on different subsets of the dataset.

SLIDE 32: Data Parallelism

[Figure: Machine 1 through Machine 5 each hold their own local gradients; an MPI AllReduce sums them so that every machine ends up with identical reduced (global) gradients.]
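A minimal mpi4py sketch (assuming mpi4py and an MPI runtime are installed) of the AllReduce step in the figure: every rank contributes its local gradients, and every rank receives the same summed gradients. The gradient values are made up; run it with something like `mpirun -np 5 python <script>.py` so that five ranks play the role of the five machines.

```python
from mpi4py import MPI
import numpy as np

# Each rank (machine) contributes local gradients; Allreduce sums them so that
# every rank receives the same global gradients. The values are made up.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_grad = np.random.default_rng(rank).normal(size=8)   # this rank's local gradients
global_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, global_grad, op=MPI.SUM)        # every rank gets the same sum

print(f"rank {rank}: summed gradients (first 3) = {global_grad[:3]}")
```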

SLIDE 33: Data Parallelism

  • Step 1: Data Propagation
    – Distribute the data among the GPUs
  • Step 2: Forward/Backward Pass
    – Perform the forward pass and calculate the prediction
    – Calculate the error by comparing the prediction with the actual output
    – Perform the backward pass and calculate the gradients
  • Step 3: Gradient Aggregation (see the sketch below)
    – Call MPI_Allreduce to reduce the local gradients
    – Update the parameters locally using the global gradients
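Putting the three steps together, here is a hedged sketch of one data-parallel training step using PyTorch for the forward/backward pass and mpi4py for gradient aggregation; the model, local data, and learning rate are illustrative. In practice, frameworks such as Horovod or PyTorch DistributedDataParallel wrap this pattern for you.

```python
from mpi4py import MPI
import torch
import torch.nn as nn

comm = MPI.COMM_WORLD
size, rank = comm.Get_size(), comm.Get_rank()

torch.manual_seed(0)              # all ranks start from identical parameters
model = nn.Linear(10, 1)          # a made-up, tiny "DNN"
loss_fn = nn.MSELoss()
lr = 0.01

# Step 1: Data Propagation - each rank works on its own local shard (here: random data).
x_local = torch.randn(32, 10)
y_local = torch.randn(32, 1)

# Step 2: Forward/Backward Pass - predict, compute the error, compute local gradients.
loss = loss_fn(model(x_local), y_local)
loss.backward()

# Step 3: Gradient Aggregation - Allreduce the local gradients, then update locally.
for p in model.parameters():
    local_grad = p.grad.detach().numpy()
    global_grad = local_grad.copy()
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    p.grad = torch.from_numpy(global_grad / size)   # average across all ranks
    p.data -= lr * p.grad                           # local SGD update with the global gradient

print(f"rank {rank}: one data-parallel step done, local loss = {loss.item():.4f}")
```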

SLIDE 34: Impact of Large Batch Size

Large batch size is bad for accuracy, but good for speed and scalability.

[Figure: GoogLeNet (ImageNet) training time in seconds vs. number of GPUs (8 to 128), comparing Caffe with OSU-Caffe (1024) and OSU-Caffe (2048).]

Courtesy: https://research.fb.com/publications/imagenet1kin1h/

  • A. A. Awan et al., "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," PPoPP '17

SLIDE 35: Synchronous vs. Asynchronous Training

  • Epochs per second (EPS)?
    – A variant of images/second
    – Basically, the speed of training the model
  • Accuracy per Epoch (APE)?
    – E.g., 60% accuracy in one full pass over the dataset
  • Async → higher EPS but lower APE
  • Sync → higher APE but lower EPS

Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks

SLIDE 36: Review and Conclusion

  • The concepts and terminologies discussed today will keep coming up during the next lectures
  • Please clarify any confusion early on
  • Future papers and presentations will use these concepts in more complex ways!
  • Questions/Comments?

SLIDE 37: Thank You!

The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
E-mail: jain.575@osu.edu