

SLIDE 1

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)

CMSC 678 UMBC

SLIDE 2

Recap from last time…

SLIDE 3

Feed-Forward Neural Network: Multilayer Perceptron

[Figure: feed-forward network with inputs x_1 … x_5, hidden units z_1, z_2, …, and output y]

hidden layer: z_k = G(w_kᵀ x + b_k)
output layer: y = F(uᵀ z + c)

G: (non-linear) activation function
F: (non-linear) activation function; classification: softmax, regression: identity

Information/computation flows forward only: no self-loops (no recurrence/reuse of weights).

SLIDE 4

Flavors of Gradient Descent

"Batch" gradient descent:

    Set t = 0; pick a starting value θ_t
    Until converged:
        set g_t = 0
        for each example i in the full data:
            1. Compute loss l on x_i
            2. Accumulate gradient: g_t += l'(x_i)
        Get scaling factor ρ_t
        Set θ_{t+1} = θ_t - ρ_t * g_t
        Set t += 1

"Minibatch" gradient descent:

    Set t = 0; pick a starting value θ_t
    Until converged:
        get batch B ⊂ full data; set g_t = 0
        for each example i in B:
            1. Compute loss l on x_i
            2. Accumulate gradient: g_t += l'(x_i)
        Get scaling factor ρ_t
        Set θ_{t+1} = θ_t - ρ_t * g_t
        Set t += 1

"Online" gradient descent:

    Set t = 0; pick a starting value θ_t
    Until converged:
        for each example i in the full data:
            1. Compute loss l on x_i
            2. Get gradient: g_t = l'(x_i)
            3. Get scaling factor ρ_t
            4. Set θ_{t+1} = θ_t - ρ_t * g_t
            5. Set t += 1

SLIDE 5

Dropout: Regularization in Neural Networks

𝑦 β„Ž 𝑧 𝑧1 𝑧2 𝜸 𝐱𝟐 π±πŸ‘ π±πŸ’ π±πŸ“

randomly ignore β€œneurons” (hi) during training
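A minimal sketch of the idea (one assumption on my part: this uses the common "inverted" dropout scaling, which the slide does not specify):

    import numpy as np

    def dropout(h, p_drop=0.5, train=True):
        """Randomly ignore hidden units h_i during training.

        Inverted dropout: survivors are scaled by 1/(1 - p_drop) so that
        no extra rescaling is needed at test time."""
        if not train:
            return h                                   # test time: use all units
        mask = np.random.random(h.shape) >= p_drop     # keep w.p. 1 - p_drop
        return h * mask / (1.0 - p_drop)

    h = np.array([0.2, -1.3, 0.7, 0.5])
    print(dropout(h))                  # some entries zeroed, the rest scaled up
    print(dropout(h, train=False))     # unchanged at test time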

SLIDE 6

tanh Activation

tanh_s(x) = 2 / (1 + exp(−2sx)) − 1 = 2σ(2sx) − 1

[Figure: tanh_s curves for s = 0.5, 1, 10]

SLIDE 7

Rectifier Activations

relu(x) = max(0, x)
softplus(x) = log(1 + exp(x))
leaky_relu(x) = 0.01x if x < 0; x if x ≥ 0
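The activations above as straightforward NumPy one-liners (the steepness parameter s mirrors the tanh slide; the sample points are only for a quick sanity check):

    import numpy as np

    def tanh_s(x, s=1.0):
        return 2.0 / (1.0 + np.exp(-2.0 * s * x)) - 1.0   # steepened tanh

    def relu(x):
        return np.maximum(0.0, x)

    def softplus(x):
        return np.log1p(np.exp(x))          # log(1 + exp(x)): a smooth relu

    def leaky_relu(x, alpha=0.01):
        return np.where(x < 0, alpha * x, x)

    x = np.linspace(-3.0, 3.0, 7)
    for f in (tanh_s, relu, softplus, leaky_relu):
        print(f.__name__, np.round(f(x), 3))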

SLIDE 8

Outline

Convolutional Neural Networks

What is a convolution?
Multidimensional Convolutions
Typical Convnet Operations
Deep convnets

Recurrent Neural Networks

Types of recurrence
A basic recurrent cell
BPTT: Backpropagation through time

SLIDE 9

Dot Product

βˆ‘

π‘¦π‘ˆπ‘§ = ෍

𝑙

𝑦𝑙𝑧𝑙

SLIDE 10

Convolution: Modified Dot Product Around a Point

βˆ‘

π‘¦π‘ˆπ‘§ 𝑗 = ෍

𝑙<𝐿

𝑦𝑙+𝑗𝑧𝑙

Convolution/cross-correlation

SLIDE 11

Convolution: Modified Dot Product Around a Point

βˆ‘

π‘¦π‘ˆπ‘§ 𝑗 = ෍

𝑙

𝑦𝑙+𝑗𝑧𝑙 𝑦 ⋆ 𝑧 𝑗 =

Convolution/cross-correlation


SLIDE 15

Convolution: Modified Dot Product Around a Point

βˆ‘

𝑦 ⋆ 𝑧 = π‘¦π‘ˆπ‘§ 𝑗 = ෍

𝑙

𝑦𝑙+𝑗𝑧𝑙

Convolution/cross-correlation

feature map kernel input (β€œimage”)

1-D convolution
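A direct NumPy transcription of the formula, sliding the kernel z over the input y and taking the modified dot product at each offset j (only "valid" offsets here; padding comes later in the deck):

    import numpy as np

    def conv1d(y, z):
        """(y * z)[j] = sum_l y[l + j] * z[l]  (cross-correlation)."""
        L = len(z)
        return np.array([np.dot(y[j:j + L], z) for j in range(len(y) - L + 1)])

    signal = np.array([1., 2., 3., 4., 5.])    # input ("image")
    kernel = np.array([1., 0., -1.])           # kernel
    print(conv1d(signal, kernel))              # feature map: [-2. -2. -2.]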

SLIDE 16

Outline

Convolutional Neural Networks

What is a convolution?
Multidimensional Convolutions
Typical Convnet Operations
Deep convnets

Recurrent Neural Networks

Types of recurrence
A basic recurrent cell
BPTT: Backpropagation through time

SLIDE 17

2-D Convolution

[Figure: a square kernel overlaid on the input ("image")]

width: shape of the kernel (often square)

SLIDE 18

2-D Convolution

input (β€œimage”)

stride(s): how many spaces to move the kernel width: shape of the kernel (often square)

SLIDE 19

2-D Convolution

input (β€œimage”)

stride(s): how many spaces to move the kernel stride=1 width: shape of the kernel (often square)


SLIDE 22

2-D Convolution

input (β€œimage”)

stride(s): how many spaces to move the kernel stride=2 width: shape of the kernel (often square)

SLIDE 23

2-D Convolution

input (β€œimage”)

stride(s): how many spaces to move the kernel stride=2 width: shape of the kernel (often square) skip starting here


SLIDE 26

2-D Convolution

input (β€œimage”)

width: shape of the kernel (often square) stride(s): how many spaces to move the kernel padding: how to handle input/kernel shape mismatches β€œsame”: input.shape == output.shape β€œdifferent”: input.shape β‰  output.shape pad with 0s (one option)

SLIDE 27

2-D Convolution

input (β€œimage”)

width: shape of the kernel (often square) stride(s): how many spaces to move the kernel padding: how to handle input/kernel shape mismatches β€œsame”: input.shape == output.shape β€œdifferent”: input.shape β‰  output.shape pad with 0s (another option) pad with 0s (another option)


SLIDE 29

From fully connected to convolutional networks

[Figure: a fully connected layer applied to an image: every pixel connects to every unit]

Slide credit: Svetlana Lazebnik

SLIDE 30

[Figure: a convolutional layer mapping an image to a feature map via learned weights]

From fully connected to convolutional networks

Convolutional layer

Slide credit: Svetlana Lazebnik


SLIDE 32

Convolution as feature extraction

[Figure: filters/kernels applied to the input, each producing its own feature map]

Slide credit: Svetlana Lazebnik


SLIDE 34

From fully connected to convolutional networks

[Figure: a convolutional layer on an image, with non-linearity and/or pooling, feeding the next layer]

Slide adapted from Svetlana Lazebnik

SLIDE 35

Outline

Convolutional Neural Networks

What is a convolution?
Multidimensional Convolutions
Typical Convnet Operations
Deep convnets

Recurrent Neural Networks

Types of recurrence
A basic recurrent cell
BPTT: Backpropagation through time
Solving the vanishing gradients problem

SLIDE 36

Key operations in a CNN

Input Image → Convolution (learned) → Non-linearity → Spatial pooling → Feature maps

[Figure: each filter applied to the input produces one feature map]

Slide credit: Svetlana Lazebnik, R. Fergus, Y. LeCun

SLIDE 37

Key operations

Input Image → Convolution (learned) → Non-linearity → Spatial pooling → Feature maps

Non-linearity example: Rectified Linear Unit (ReLU)

Slide credit: Svetlana Lazebnik, R. Fergus, Y. LeCun

SLIDE 38

Key operations

Input Image → Convolution (learned) → Non-linearity → Spatial pooling → Feature maps

Spatial pooling example: max pooling

Slide credit: Svetlana Lazebnik, R. Fergus, Y. LeCun
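Since the deck later turns to PyTorch, here is a sketch of how the three key operations chain there (the layer sizes are my own illustrative choices, not from the slides):

    import torch
    import torch.nn as nn

    # Convolution (learned) -> non-linearity -> spatial pooling -> feature maps
    features = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 learned 3x3 filters
        nn.ReLU(),                                   # non-linearity
        nn.MaxPool2d(kernel_size=2),                 # spatial (max) pooling
    )

    x = torch.randn(1, 3, 32, 32)     # one 32x32 RGB input image
    print(features(x).shape)          # torch.Size([1, 16, 16, 16]) feature maps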

SLIDE 39

Design principles

Reduce filter sizes (except possibly at the lowest layer); factorize filters aggressively.
Use 1×1 convolutions to reduce and expand the number of feature maps judiciously (see the sketch below).
Use skip connections and/or create multiple paths through the network.

Slide credit: Svetlana Lazebnik
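The second principle in code: a hypothetical bottleneck, where a 1×1 convolution mixes channels at each spatial position without looking at neighbors, so it can cheaply shrink or grow the number of feature maps (channel counts are illustrative):

    import torch
    import torch.nn as nn

    reduce = nn.Conv2d(256, 64, kernel_size=1)   # 256 -> 64 feature maps
    expand = nn.Conv2d(64, 256, kernel_size=1)   # 64 -> 256 feature maps

    x = torch.randn(1, 256, 28, 28)
    print(reduce(x).shape)                       # torch.Size([1, 64, 28, 28])
    print(expand(reduce(x)).shape)               # torch.Size([1, 256, 28, 28])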

SLIDE 40

LeNet-5

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86(11): 2278–2324, 1998.

Slide credit: Svetlana Lazebnik

SLIDE 41

ImageNet

~14 million labeled images, 20k classes. Images gathered from the Internet; human labels via Amazon MTurk. ImageNet Large-Scale Visual Recognition Challenge (ILSVRC): 1.2 million training images, 1000 classes.

www.image-net.org/challenges/LSVRC/

Slide credit: Svetlana Lazebnik

SLIDE 42

http://www.inference.vc/deep-learning-is-easy/

Slide credit: Svetlana Lazebnik

SLIDE 43

Outline

Convolutional Neural Networks

What is a convolution?
Multidimensional Convolutions
Typical Convnet Operations
Deep convnets

Recurrent Neural Networks

Types of recurrence
A basic recurrent cell
BPTT: Backpropagation through time
Solving the vanishing gradients problem

SLIDE 44

AlexNet: ILSVRC 2012 winner

Similar framework to LeNet but:
    max pooling, ReLU nonlinearity
    more data and a bigger model (7 hidden layers, 650K units, 60M params)
    GPU implementation (50x speedup over CPU): two GPUs for a week
    dropout regularization

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

Slide credit: Svetlana Lazebnik

SLIDE 45

GoogLeNet

Slide credit: Svetlana Lazebnik

Szegedy et al., 2015


SLIDE 47

GoogLeNet: Auxiliary Classifier at Sub-levels

Idea: try to make each sub-layer good (in its own way) at the prediction task

Slide credit: Svetlana Lazebnik

Szegedy et al., 2015

SLIDE 48

GoogLeNet

An alternative view:

Slide credit: Svetlana Lazebnik

Szegedy et al., 2015

SLIDE 49

ResNet (Residual Network)

He et al. β€œDeep Residual Learning for Image Recognition” (2016)

Make it easy for network layers to represent the identity mapping. Skipping 2+ layers is intentional and needed.

Slide credit: Svetlana Lazebnik
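A stripped-down residual block in PyTorch (omitting the batch normalization that the actual ResNet uses): because the input is added back in, the block only has to learn a residual on top of the identity.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """y = F(x) + x: the skip connection makes the identity easy to represent."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(               # F(x): the 2 skipped layers
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(self.body(x) + x)       # residual plus identity path

    x = torch.randn(1, 64, 56, 56)
    print(ResidualBlock(64)(x).shape)                # shape preserved: [1, 64, 56, 56]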

SLIDE 50

Summary: ILSVRC 2012-2015

Team                  | Year | Place | Error (top-5) | External data
SuperVision           | 2012 |       | 16.4%         | no
SuperVision           | 2012 | 1st   | 15.3%         | ImageNet 22k
Clarifai (7 layers)   | 2013 |       | 11.7%         | no
Clarifai              | 2013 | 1st   | 11.2%         | ImageNet 22k
VGG (16 layers)       | 2014 | 2nd   | 7.32%         | no
GoogLeNet (19 layers) | 2014 | 1st   | 6.67%         | no
ResNet (152 layers)   | 2015 | 1st   | 3.57%         |
Human expert*         |      |       | 5.1%          |

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

Slide credit: Svetlana Lazebnik

SLIDE 51

Rapid Progress due to CNNs

Classification: ImageNet Challenge top-5 error

Figure source: Kaiming He

Slide credit: Svetlana Lazebnik

SLIDE 52

Outline

Convolutional Neural Networks

What is a convolution?
Multidimensional Convolutions
Typical Convnet Operations
Deep convnets

Recurrent Neural Networks

Types of recurrence
A basic recurrent cell
BPTT: Backpropagation through time

SLIDE 53

Network Types

[Figure: x → h → y]

Feed-forward: linearizable feature input. Bag-of-items classification/regression. The basic non-linear model.

SLIDE 54

Network Types

[Figure: one input x feeding a chain of hidden states h0 → h1 → h2, each emitting an output y0, y1, y2]

Recursive: one input, sequence output. Example: automated caption generation.

SLIDE 55

Network Types

[Figure: inputs x0, x1, x2 feeding hidden states h0 → h1 → h2, with a single output y at the end]

Recursive: sequence input, one output. Examples: document classification; action recognition in video (high-level).

SLIDE 56

Network Types

[Figure: inputs x0, x1, x2 feeding hidden states h0 → h1 → h2, with outputs y0 … y3 emitted after a delay]

Recursive: sequence input, sequence output (time delay). Examples: machine translation; sequential description; summarization.

SLIDE 57

Network Types

[Figure: inputs x0, x1, x2 feeding hidden states h0 → h1 → h2, each step emitting its own output y0, y1, y2]

Recursive: sequence input, sequence output. Examples: part-of-speech tagging; action recognition (fine-grained).

SLIDE 58

RNN Outputs: Image Captions

Show and Tell: A Neural Image Caption Generator, CVPR 15

Slide credit: Arun Mallya

SLIDE 59

RNN Output: Visual Storytelling

[Figure: a sequence of images, each encoded by a CNN, feeds a GRU encoder; a GRU decoder generates the story word by word]

Generated story (Huang et al., 2016): "The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water."

Human reference: "The family has gathered around the dinner table to share a meal together. They all pitched in to help cook the seafood to perfection. Afterwards they took the family dog to the beach to get some exercise. The waves were cool and refreshing! The dog had so much fun in the water. One family member decided to get a better view of the waves!"
SLIDE 60

Recurrent Networks

[Figure: unrolled recurrence: inputs x_{i-3} … x_i, hidden states h_{i-3} … h_i, outputs y_{i-3} … y_i]

Observe these inputs one at a time
SLIDE 61

Recurrent Networks

[Figure: unrolled recurrence: inputs x_{i-3} … x_i, hidden states h_{i-3} … h_i, outputs y_{i-3} … y_i]

Observe these inputs one at a time,

predict the corresponding label
SLIDE 62

Recurrent Networks

[Figure: unrolled recurrence: inputs x_{i-3} … x_i, hidden states h_{i-3} … h_i, outputs y_{i-3} … y_i]

Observe these inputs one at a time,

predict the corresponding label from these hidden states
SLIDE 63

Recurrent Networks

[Figure: the same unrolled recurrence, with one input→hidden→output unit boxed as the repeated "cell"]

Observe these inputs one at a time,

predict the corresponding label from these hidden states
SLIDE 64

Outline

Convolutional Neural Networks

What is a convolution?
Multidimensional Convolutions
Typical Convnet Operations
Deep convnets

Recurrent Neural Networks

Types of recurrence
A basic recurrent cell
BPTT: Backpropagation through time

SLIDE 65

A Simple Recurrent Neural Network Cell

[Figure: two timesteps: inputs x_{i-1}, x_i enter through weights U; hidden states h_{i-1} → h_i connect through W; outputs y_{i-1}, y_i leave through S]

SLIDE 66

A Simple Recurrent Neural Network Cell

[Figure: the same two timesteps, with the hidden-state update highlighted as "encoding"]

encoding: h_j = tanh(W h_{j-1} + U x_j)

SLIDE 67

A Simple Recurrent Neural Network Cell

[Figure: the same two timesteps, with "encoding" and "decoding" both highlighted]

encoding: h_j = tanh(W h_{j-1} + U x_j)
decoding: y_j = softmax(S h_j)

SLIDE 68

A Simple Recurrent Neural Network Cell

[Figure: the cell copied across timesteps]

encoding: h_j = tanh(W h_{j-1} + U x_j)
decoding: y_j = softmax(S h_j)

Weights are shared over time. Unrolling/unfolding: copy the RNN cell across time (inputs).

SLIDE 69

Outline

Convolutional Neural Networks

What is a convolution?
Multidimensional Convolutions
Typical Convnet Operations
Deep convnets

Recurrent Neural Networks

Types of recurrence
A basic recurrent cell
BPTT: Backpropagation through time

SLIDE 70

BackPropagation Through Time (BPTT)

β€œUnfold” the network to create a single, large, feed- forward network

  • 1. Weights are copied (W β†’ W(t))
  • 2. Gradients computed (ð𝑋(𝑒)), and
  • 3. Summed (βˆ‘π‘’ ð𝑋(𝑒))
SLIDE 71

BPTT

β„Žπ‘— = tanh(π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗) 𝑧𝑗 = softmax(π‘‡β„Žπ‘—)

per-step loss: cross entropy πœ–πΉπ‘— πœ–π‘‹ = πœ–πΉπ‘— πœ–π‘§π‘— πœ–π‘§π‘— πœ–π‘‹ = πœ–πΉπ‘— πœ–π‘§π‘— πœ–π‘§π‘— πœ–β„Žπ‘— πœ–β„Žπ‘— πœ–π‘‹ xi-3 xi-2 xi xi-1 hi-3 hi-2 hi-1 hi yi-3 yi-2 yi yi-1 π‘§π‘—βˆ’3

βˆ—

log π‘ž(π‘§π‘—βˆ’3) π‘§π‘—βˆ’2

βˆ—

log π‘ž(π‘§π‘—βˆ’2) 𝐹𝑗 = 𝑧𝑗

βˆ— log π‘ž(𝑧𝑗)

π‘§π‘—βˆ’1

βˆ—

log π‘ž(π‘§π‘—βˆ’1)

SLIDE 72

BPTT

β„Žπ‘— = tanh(π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗) 𝑧𝑗 = softmax(π‘‡β„Žπ‘—)

per-step loss: cross entropy πœ–πΉπ‘— πœ–π‘‹ = πœ–πΉπ‘— πœ–π‘§π‘— πœ–π‘§π‘— πœ–π‘‹ = πœ–πΉπ‘— πœ–π‘§π‘— πœ–π‘§π‘— πœ–β„Žπ‘— πœ–β„Žπ‘— πœ–π‘‹ πœ–β„Žπ‘— πœ–π‘‹ = tanhβ€² π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗 πœ–π‘‹β„Žπ‘—βˆ’1 πœ–π‘‹ xi-3 xi-2 xi xi-1 hi-3 hi-2 hi-1 hi yi-3 yi-2 yi yi-1 π‘§π‘—βˆ’3

βˆ—

log π‘ž(π‘§π‘—βˆ’3) π‘§π‘—βˆ’2

βˆ—

log π‘ž(π‘§π‘—βˆ’2) 𝐹𝑗 = 𝑧𝑗

βˆ— log π‘ž(𝑧𝑗)

π‘§π‘—βˆ’1

βˆ—

log π‘ž(π‘§π‘—βˆ’1)

SLIDE 73

BPTT

β„Žπ‘— = tanh(π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗) 𝑧𝑗 = softmax(π‘‡β„Žπ‘—)

per-step loss: cross entropy πœ–πΉπ‘— πœ–π‘‹ = πœ–πΉπ‘— πœ–π‘§π‘— πœ–π‘§π‘— πœ–π‘‹ = πœ–πΉπ‘— πœ–π‘§π‘— πœ–π‘§π‘— πœ–β„Žπ‘— πœ–β„Žπ‘— πœ–π‘‹ πœ–β„Žπ‘— πœ–π‘‹ = tanhβ€² π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗 πœ–π‘‹β„Žπ‘—βˆ’1 πœ–π‘‹ = tanhβ€² π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗 β„Žπ‘—βˆ’1 + 𝑋 πœ–β„Žπ‘—βˆ’1 πœ–π‘‹ xi-3 xi-2 xi xi-1 hi-3 hi-2 hi-1 hi yi-3 yi-2 yi yi-1 π‘§π‘—βˆ’3

βˆ—

log π‘ž(π‘§π‘—βˆ’3) π‘§π‘—βˆ’2

βˆ—

log π‘ž(π‘§π‘—βˆ’2) 𝐹𝑗 = 𝑧𝑗

βˆ— log π‘ž(𝑧𝑗)

π‘§π‘—βˆ’1

βˆ—

log π‘ž(π‘§π‘—βˆ’1)

SLIDE 74

BPTT

[Figure: the unrolled network with per-step losses]

h_j = tanh(W h_{j-1} + U x_j)
y_j = softmax(S h_j)

per-step loss: cross entropy, F_j = y_j* log q(y_j)

∂h_j/∂W = tanh'(W h_{j-1} + U x_j) · [h_{j-1} + W ∂h_{j-1}/∂W], where ∂h_{j-1}/∂W expands in turn through [h_{j-2} + W ∂h_{j-2}/∂W], and so on.

∂F_j/∂W = (∂F_j/∂y_j)(∂y_j/∂h_j)(∂h_j/∂W) = ∑_m ε_m^(j) ∂h_m/∂W,

where ε_m^(j) = (∂F_j/∂y_j)(∂y_j/∂h_j)(∂h_j/∂h_m).

SLIDE 75

BPTT

πœ–β„Žπ‘— πœ–π‘‹ = tanhβ€² π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗 β„Žπ‘—βˆ’1 + 𝑋 πœ–β„Žπ‘—βˆ’1 πœ–π‘‹ = tanhβ€² π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗 β„Žπ‘—βˆ’1 + tanhβ€² π‘‹β„Žπ‘—βˆ’1 + 𝑉𝑦𝑗 𝑋tanhβ€² π‘‹β„Žπ‘—βˆ’2 + π‘‰π‘¦π‘—βˆ’1 β„Žπ‘—βˆ’2 + 𝑋 πœ–β„Žπ‘—βˆ’2 πœ–π‘‹ = ෍

π‘˜

πœ–πΉπ‘— πœ–π‘§π‘— πœ–π‘§π‘— πœ–β„Žπ‘— πœ–β„Žπ‘— πœ–β„Žπ‘š πœ–β„Žπ‘š πœ–π‘‹(π‘š) = ෍

π‘˜

πœ€

π‘˜ (𝑗) πœ–β„Žπ‘š

πœ–π‘‹(π‘š) πœ€π‘š

(𝑗) = πœ–πΉπ‘—

πœ–π‘§π‘— πœ–π‘§π‘— πœ–β„Žπ‘— πœ–β„Žπ‘— πœ–β„Žπ‘š per-loss, per-step backpropagation error

SLIDE 76

BPTT

[Figure: the unrolled network with per-step losses F_{j-3} … F_j]

h_j = tanh(W h_{j-1} + U x_j)
y_j = softmax(S h_j)

per-step loss: cross entropy, F_j = y_j* log q(y_j)

hidden chain rule, compact form:

∂F_j/∂W = ∑_m (∂F_j/∂y_j)(∂y_j/∂h_j)(∂h_j/∂W^(m))

SLIDE 77

Why Is Training RNNs Hard?

Vanishing gradients: the same weight matrix multiplies in at every timestep, so the gradient contains products of many copies of that matrix.
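A quick numerical illustration of why this is a problem (the 4-dimensional state and the 0.9 scaling are arbitrary choices; the tanh' factors, ignored here, only shrink the product further):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4))
    W *= 0.9 / np.linalg.norm(W, 2)   # force the largest singular value to 0.9

    g = np.ones(4)                    # error signal at the final timestep
    for t in range(1, 21):
        g = W.T @ g                   # one more timestep further back
        if t % 5 == 0:
            print(t, np.linalg.norm(g))   # the norm decays geometrically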

SLIDE 78

The Vanilla RNN Backward

[Figure: three unrolled steps: inputs x_1, x_2, x_3; hidden states h_0 → h_1 → h_2 → h_3; each step emits y_t and a cost C_t]

h_t = tanh(W [x_t; h_{t-1}])
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

Slide credit: Arun Mallya

SLIDE 79

Vanishing Gradient Solution: Motivation

h_t = h_{t-1} + F(x_t)   (identity connection)

⇒ ∂h_t/∂h_{t-1} = 1

The gradient does not decay as the error is propagated all the way back, aka "Constant Error Flow".

(Compare the vanilla cell: h_t = tanh(W [x_t; h_{t-1}]), y_t = F(h_t), C_t = Loss(y_t, GT_t).)

Slide credit: Arun Mallya

SLIDE 80

Vanishing Gradient Solution: Model Implementations

LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
GRU: Gated Recurrent Unit (Cho et al., 2014)

Basic idea: learn to forget

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Figure: the cell-state ("representation") line and the forget line in the LSTM diagram]

SLIDE 81

Long Short-Term Memory (LSTM): Hochreiter & Schmidhuber (1997)

Create a "Constant Error Carousel" (CEC), which ensures that gradients don't decay: a memory cell that acts like an accumulator (it contains the identity relationship) over time.

[Figure: LSTM cell: input gate i_t, output gate o_t, and forget gate f_t, each reading (x_t, h_{t-1}) through its own weights W_i, W_o, W_f, gate a memory cell c_t fed through weights W]

c_t = f_t ⊗ c_{t-1} + i_t ⊗ tanh(W [x_t; h_{t-1}] + b)

g_t = σ(W_g [x_t; h_{t-1}] + b_g)   for each gate g ∈ {i, o, f}

Slide credit: Arun Mallya
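One step of the cell in NumPy, following the equations above (the readout h_t = o_t ⊗ tanh(c_t) is the standard formulation, which the slide's diagram implies but does not spell out; all sizes are illustrative):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, W, Wi, Wo, Wf, b, bi, bo, bf):
        v = np.concatenate([x_t, h_prev])                # [x_t; h_{t-1}]
        i_t = sigmoid(Wi @ v + bi)                       # input gate
        o_t = sigmoid(Wo @ v + bo)                       # output gate
        f_t = sigmoid(Wf @ v + bf)                       # forget gate
        c_t = f_t * c_prev + i_t * np.tanh(W @ v + b)    # accumulator cell
        h_t = o_t * np.tanh(c_t)                         # gated readout
        return h_t, c_t

    d_in, d_h = 3, 4
    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(d_h, d_in + d_h)) * 0.1 for _ in range(4)]
    bs = [np.zeros(d_h) for _ in range(4)]
    h, c = np.zeros(d_h), np.zeros(d_h)
    h, c = lstm_step(rng.normal(size=d_in), h, c, *Ws, *bs)
    print(h.shape, c.shape)                              # (4,) (4,)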

SLIDE 82

I want to use CNNs/RNNs/Deep Learning in my project. I don't want to do this all by hand.
SLIDE 83

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html


SLIDE 85

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

encode

SLIDE 86

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

decode
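The code on these slides is screenshots from the linked tutorial; for the record, the tutorial's cell is essentially the following, one linear "encode" step producing the new hidden state and one linear "decode" step producing log-probabilities (reproduced from memory of that tutorial, so treat the details as approximate):

    import torch
    import torch.nn as nn

    class RNN(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super().__init__()
            self.hidden_size = hidden_size
            # encode: new hidden state from [input; previous hidden]
            self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
            # decode: output scores from the same concatenation
            self.i2o = nn.Linear(input_size + hidden_size, output_size)
            self.softmax = nn.LogSoftmax(dim=1)

        def forward(self, input, hidden):
            combined = torch.cat((input, hidden), 1)
            hidden = self.i2h(combined)
            output = self.softmax(self.i2o(combined))
            return output, hidden

        def init_hidden(self):
            return torch.zeros(1, self.hidden_size)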

SLIDE 87

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

SLIDE 88

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood

SLIDE 89

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions

SLIDE 90

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions; eval predictions

SLIDE 91

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions; eval predictions; compute gradient

SLIDE 92

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions; eval predictions; compute gradient; perform SGD
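And the training step from the same tutorial, again approximately, with the annotated stages called out in comments (the sizes for its name-classification task are assumptions; RNN is the cell defined above):

    import torch.nn as nn

    n_letters, n_hidden, n_categories = 57, 128, 18   # assumed tutorial sizes
    rnn = RNN(n_letters, n_hidden, n_categories)      # the cell defined above
    criterion = nn.NLLLoss()                          # negative log-likelihood
    learning_rate = 0.005

    def train(category_tensor, line_tensor):
        hidden = rnn.init_hidden()
        rnn.zero_grad()
        for i in range(line_tensor.size(0)):          # get predictions
            output, hidden = rnn(line_tensor[i], hidden)
        loss = criterion(output, category_tensor)     # eval predictions
        loss.backward()                               # compute gradient (BPTT)
        for p in rnn.parameters():                    # perform SGD by hand
            p.data.add_(p.grad.data, alpha=-learning_rate)
        return output, loss.item()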

SLIDE 93

Slide Credit

http://slazebni.cs.illinois.edu/spring17/lec01_cnn_architectures.pdf
http://slazebni.cs.illinois.edu/spring17/lec02_rnn.pdf