SLIDE 1

CS109B Data Science 2

Pavlos Protopapas and Mark Glickman

Lecture 10: Recurrent Neural Networks

SLIDE 2

Sequence Modeling: Handwritten Text Translation

  • Input: Image
  • Output: Text

https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5

SLIDE 3

Sequence Modeling: Speech-to-Text

  • Input: Audio
  • Output: Text
SLIDE 4

Sequence Modeling: Machine Translation

  • Input: Text
  • Output: Translated Text
SLIDE 5

Rapping-neural-network

https://github.com/robbiebarrat/rapping-neural-network

SLIDE 6

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 7

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 8

What can my NN do?

[Figure: an NN classifies face images into George, Mary, Tom, Suzie]

Training: present examples to the NN and learn from them.

SLIDE 9

What can my NN do?

[Figure: given one example the NN outputs George; given another, Mary]

Prediction: given an example, the NN outputs the label it has learned.

SLIDE 10

What can my NN NOT do?

[Figure: a face the NN has never seen before: WHO IS IT?]

SLIDE 11

Learn from previous examples

[Figure: a sequence of examples arriving over time]

SLIDE 12

Recurrent Neural Network (RNN)

[Figure: the NN watches a sequence of frames and outputs George]

SLIDE 13

Recurrent Neural Network (RNN)

[Figure: the NN outputs George: "I have seen George moving in this way before."]

RNNs recognize the data's sequential characteristics and use patterns to predict the next likely scenario.

SLIDE 14

Recurrent Neural Network (RNN)

Our model requires context, or contextual information, to understand the subject (he) and the direct object (it) in the sentence.

"He told me I could have it."

WHO IS HE? "I do not know. I need to know who said that and what was said before. Can you tell me more?"

SLIDE 15

RNN – Another Example with Text

After providing sequential information, the model understood the subject (Joe's brother) and the direct object (sweater) in the sentence.

  • Hellen: Nice sweater, Joe.
  • Joe: Thanks, Hellen. It used to belong to my brother and he told me I could have it.

WHO IS HE? "I see what you mean now! The noun 'he' stands for Joe's brother, while 'it' stands for the sweater."

SLIDE 16

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

sample:      1   2   3   4   5   6   7  …
temperature: 35  32  45  48  41  39  36 …

SLIDE 17

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

sample:      1   2   3   4   5   6   7  …
temperature: 35  32  45  48  41  39  36 …

Window 1 (samples 1-4): features [35, 32, 45] → target 48

SLIDE 18

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

sample:      1   2   3   4   5   6   7  …
temperature: 35  32  45  48  41  39  36 …

Window 1 (samples 1-4): features [35, 32, 45] → target 48
Window 2 (samples 2-5): features [32, 45, 48] → target 41

SLIDE 19

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

sample:      1   2   3   4   5   6   7  …
temperature: 35  32  45  48  41  39  36 …

Window 1 (samples 1-4): features [35, 32, 45] → target 48
Window 2 (samples 2-5): features [32, 45, 48] → target 41
Window 3 (samples 3-6): features [45, 48, 41] → target 39

SLIDE 20

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

sample:      1   2   3   4   5   6   7  …
temperature: 35  32  45  48  41  39  36 …

Window 1 (samples 1-4): features [35, 32, 45] → target 48
Window 2 (samples 2-5): features [32, 45, 48] → target 41
Window 3 (samples 3-6): features [45, 48, 41] → target 39

SLIDE 21

Windowed dataset

This is called an overlapping windowed dataset, since we are windowing observations to create new samples. We can easily fit it with an MLP:

[Figure: MLP with 3 inputs, two hidden layers of 10 ReLU units each, and a 1-unit output]

But re-arranging the order of the inputs, e.g. features [45, 35, 32] (samples 3, 1, 2) instead of [35, 32, 45] (samples 1, 2, 3), will produce the same results.
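A minimal numpy sketch of how such an overlapping windowed dataset can be built (the helper name make_windows is an assumption, not something from the deck):

```python
import numpy as np

def make_windows(series, window=3):
    """Each row of X holds `window` consecutive measurements;
    y is the measurement that immediately follows each window."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

temps = [35, 32, 45, 48, 41, 39, 36]
X, y = make_windows(temps, window=3)
# X[0] = [35 32 45], y[0] = 48  (samples 1-3 predict sample 4)
# X[1] = [32 45 48], y[1] = 41  (samples 2-4 predict sample 5)
```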

SLIDE 22

Why not CNNs or MLPs?

  1. MLPs/CNNs require fixed input and output sizes.
  2. MLPs/CNNs can't classify inputs in multiple places.
SLIDE 23

Windowed dataset

What follows after "I got in the car and"? Drove away.
What follows after "In car the and I"? Not obvious that it should be "drove away".

The order of words matters. This is true for most sequential data. A fully connected network will not distinguish the order and will therefore miss some information.

SLIDE 24

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 25

Memory

Somehow the computational unit should remember what it has seen before.

[Figure: a unit with input X_t and output Y_t; it should remember X_0 … X_{t-1}]

SLIDE 26

Memory

Somehow the computational unit should remember what it has seen before.

[Figure: the unit now holds an internal memory; input X_t, output Y_t]

SLIDE 27

Memory

Somehow the computational unit should remember what it has seen before. We'll call this information the unit's state.

[Figure: an RNN with internal memory; input X_t, output Y_t]

SLIDE 28

Memory

In neural networks, once training is over, the weights do not change. This means that the network is done learning and done changing. Then we feed in values, and it simply applies the operations that make up the network, using the values it has learned. But RNN units are able to remember new information after training has completed. That is, they are able to keep changing after training is over.

SLIDE 29

Memory

Question: how can we do this? How can we build a unit that remembers the past?

The memory or state could be written to a file, but in RNNs we keep it inside the recurrent unit, in an array or a vector!

Work with an example:
  • "Anna Sofia said her shoes are too ugly." Her here means Anna Sofia.
  • "Nikolas put his keys on the table." His here means Nikolas.

SLIDE 30

Memory

Question: how can we do this? How can we build a unit that remembers the past?

The memory or state could be written to a file, but in RNNs we keep it inside the recurrent unit, in an array or a vector!

[Figure: an RNN with memory maps the input X_t (e.g. "his") to the output Y_t (e.g. "Nikolas")]

SLIDE 31

Building an RNN

[Figure: a chain of RNN cells: at each step the cell reads the input X_t and the memory, emits Y_t, and passes the updated memory on to the cells at t+1, t+2, t+3]

SLIDE 32

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 33

Structure of an RNN cell

[Figure: an RNN cell with state h_t: the input X_t enters through the input weights, the state feeds back through the update weights, and the output Y_t is produced through the output weights]
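To make the picture concrete, here is a minimal numpy sketch of one step of such a cell (a sketch, not code from the deck), assuming a tanh hidden activation, an identity output activation, and the weight names introduced later in this lecture (input weights V, update weights U, output weights W):

```python
import numpy as np

def rnn_step(x_t, h_prev, V, U, W, b_h, b_o):
    """One step of a simple RNN cell."""
    h_t = np.tanh(V @ x_t + U @ h_prev + b_h)  # new state: input combined with old state
    y_t = W @ h_t + b_o                        # output read off the new state
    return h_t, y_t
```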

SLIDE 34

Image taken from A. Glassner, Deep Learning, Vol. 2: From Basics to Practice

SLIDE 35

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 36

Backprop Through Time

  • For each input, unfold the network for the sequence length T.
  • Back-propagation: apply the forward and backward passes on the unfolded network.
  • Memory cost: O(T).

SLIDE 37

Backprop Through Time

[Figure: the RNN cell with its input, update, and output weights]

SLIDE 38

Backprop Through Time

[Figure: the same cell, highlighting the output and update weights]

SLIDE 39

Backprop Through Time

[Figure: the RNN cell with state h_t, input X_t, and output Y_t]

Input weights: V
Update weights: U
Output weights: W

SLIDE 40

Backprop Through Time

There are two activation functions: g_h, which serves as the activation for the hidden state, and g_o, which is the activation of the output. In the example shown before, g_o was the identity.

[Figure: the network unrolled over time: at each step t the input x_t enters via V, the previous state h_{t-1} feeds h_t via U, and the prediction ŷ_t is read out via W]

SLIDE 41

Backprop Through Time

[Figure: the same unrolled network as on the previous slide]

SLIDE 42

Backprop Through Time

ŷ_t = g_o(W h_t + b_o)
L = Σ_t L_t,   L_t = L_t(ŷ_t)

dL/dW = Σ_t dL_t/dW = Σ_t (∂L_t/∂ŷ_t) (∂ŷ_t/∂W),   with   ∂ŷ_t/∂W = g_o′ h_t

SLIDE 43

[Figure only, no text]

SLIDE 44

Backprop Through Time

L = Σ_t L_t,   L_t = L_t(ŷ_t)
h_t = g_h(V x_t + U h_{t-1} + b_h)
ŷ_t = g_o(W h_t + b_o) = g_o(W g_h(V x_t + U h_{t-1} + b_h) + b_o)

dL/dU = Σ_t (∂L_t/∂ŷ_t) (∂ŷ_t/∂h_t) (dh_t/dU)

dh_t/dU = ∂h_t/∂U + (∂h_t/∂h_{t-1}) ∂h_{t-1}/∂U + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) ∂h_{t-2}/∂U + …
        = Σ_{k=1..t} (∂h_t/∂h_k) (∂h_k/∂U)

∂h_t/∂h_k = (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) … (∂h_{k+1}/∂h_k) = Π_{j=k+1..t} ∂h_j/∂h_{j-1},   with   ∂h_j/∂h_{j-1} = g_h′ U
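The factor ∂h_j/∂h_{j-1} = g_h′ U above is multiplied in once per time step, so over long sequences the product shrinks or blows up exponentially. A minimal numerical sketch of this (sizes and scales are assumptions, and inputs are omitted for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
U = 0.3 * rng.standard_normal((n, n))    # update (recurrent) weights, small scale
h = rng.standard_normal(n)
jac = np.eye(n)                          # running product of the dh_j/dh_{j-1} factors
for t in range(50):                      # 50 unrolled time steps
    h = np.tanh(U @ h)
    jac = np.diag(1.0 - h**2) @ U @ jac  # one more factor: diag(g_h') @ U
print(np.linalg.norm(jac))               # typically vanishingly small: the gradient vanished
```

With a larger scale on U (say 2.0 instead of 0.3) the same product explodes instead, which is what gradient clipping, on the next slide, guards against.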

SLIDE 45

Gradient Clipping

Prevents exploding gradients. Clip the norm of the gradient before the update: for gradient g and some threshold u,

if ‖g‖ > u:   g ⟵ g u / ‖g‖
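A minimal numpy sketch of this rule (the function name is an assumption):

```python
import numpy as np

def clip_gradient(g, u):
    """Rescale g so its norm never exceeds the threshold u."""
    norm = np.linalg.norm(g)
    if norm > u:
        g = g * (u / norm)   # direction is kept, only the length shrinks
    return g
```

In Keras the same effect is available through the clipnorm argument of the optimizers, e.g. tf.keras.optimizers.SGD(clipnorm=1.0).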

SLIDE 46

Gradient Clipping

[Figure only, no text]

SLIDE 47

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 48

Long-term Dependencies

  • Unfolded networks can be very deep.
  • Long-term interactions are given exponentially smaller weights than short-term interactions.
  • Gradients tend to either vanish or explode.

SLIDE 49

Long Short-Term Memory

  • Handles long-term dependencies.
  • Leaky units, where the weight α on the self-loop is context-dependent.
  • Allows the network to decide whether to accumulate or forget past information.

SLIDE 50

Notation

Using conventional and convenient notation.

[Figure: the recurrent cell drawn compactly, with input X_t and output Y_t]

SLIDE 51

Simple RNN again

[Figure: RNN cell with state h_t, input X_t, output Y_t, weight matrices U, V, W, and σ activations]

SLIDE 52

Simple RNN again

[Figure: the same cell as on the previous slide]

SLIDE 53

Simple RNN again: Memories

[Figure: the same cell, highlighting the state h_t as a store of memories of past inputs]

SLIDE 54

Simple RNN again: Memories - Forgetting

[Figure: the same cell, highlighting how part of the stored memories can be forgotten]

SLIDE 55

Simple RNN again: New Events

[Figure: the same cell, highlighting the arrival of a new input X_t]

SLIDE 56

Simple RNN again: New Events Weighted

[Figure: the same cell, highlighting the weighting of the new input]

SLIDE 57

Simple RNN again: Updated memories

[Figure: the same cell, with the state updated from the old memories and the new input]


SLIDE 59

Continue on Wednesday

SLIDE 60

Is it raining?

We build an RNN to estimate the probability that it is raining:

[Figure: observations ("dog barking", "white shirt", "apple pie", "knee hurts", "get dark") are fed to the RNN one at a time, and it outputs a rain probability at each step: 0.3, 0.1, 0.1, 0.4, 0.6]

SLIDE 61

RNN + Memory

[Figure: the same observations, but now each step also remembers what came before ("dog barking"; "dog barking, white shirt"; …). With the accumulated context the later probabilities rise: 0.3, 0.1, 0.1, 0.6, 0.9]

SLIDE 62

RNN + Memory + Output

[Figure: each step carries "dog barking" forward in memory together with the new observation ("dog barking, white shirt"; "dog barking, apple pie"; "dog barking, knee hurts"), producing the probabilities 0.3, 0.1, 0.1, 0.6, 0.9]

SLIDE 63

LSTM: Long Short-Term Memory

SLIDE 64

Before really understanding the LSTM, let's see the big picture:

  • Forget Gate
  • Input Gate
  • Cell State
  • Output Gate

SLIDE 65

Before really understanding the LSTM, let's see the big picture:

  1. LSTMs are recurrent neural networks with a cell state and a hidden state; both are updated at each step and can be thought of as memories.
  2. The cell state works as a long-term memory, and its update depends on the relation between the hidden state at t-1 and the input.
  3. The hidden state of the next step is a transformation of the cell state and the output (which is the part generally used to calculate our loss, i.e. information that we want in short-term memory).

SLIDE 66

Let's think about my cell state.

Let's predict whether I will help you with the homework at time t.

SLIDE 67

Forget Gate: "Erase everything!"

The forget gate tries to estimate which features of the cell state should be forgotten.

SLIDE 68

Input Gate

The input gate layer works in a similar way to the forget gate layer: it estimates the degree of confidence in the candidate cell state, which is a new estimate of the cell state. Let's say that my input gate estimation is: [equation shown in figure]

SLIDE 69

Cell state

After calculating the forget gate and the input gate, we can update our cell state.

SLIDE 70

Output gate

  • The output gate layer is calculated using the information of the input x at time t and the hidden state of the last step.
  • It is important to notice that the hidden state used in the next step is obtained using the output gate layer, which is usually the function that we optimize (a sketch of a full LSTM step follows).
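Putting the four pieces together, a minimal numpy sketch of one LSTM step (this follows the standard LSTM equations, not code from the deck; storing the weights in dicts keyed by gate is an assumption for compactness):

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))  # logistic sigmoid: gate values in (0, 1)

def lstm_step(x_t, h_prev, c_prev, W, b):
    za = np.concatenate([h_prev, x_t])       # gates see last hidden state and new input
    f = sigma(W["f"] @ za + b["f"])          # forget gate: what to erase from the cell
    i = sigma(W["i"] @ za + b["i"])          # input gate: how much new content to admit
    c_tilde = np.tanh(W["c"] @ za + b["c"])  # candidate cell state
    c_t = f * c_prev + i * c_tilde           # updated long-term memory
    o = sigma(W["o"] @ za + b["o"])          # output gate
    h_t = o * np.tanh(c_t)                   # new hidden state (short-term memory)
    return h_t, c_t
```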

SLIDE 71

To optimize my parameters I basically need to calculate all the derivatives at some time t. wcct! = we can calculate this!

[Figure: the LSTM equations with wcct! marked on each term]

So… every derivative is with respect to the cell state or the hidden state.

SLIDE 72

Let's calculate the cell state and the hidden state.

SLIDE 73

RNN Structures

[Figure: one to one: a single cell with input X_t and output Y_t]

One to one:
  • The one-to-one structure is useless.
  • It takes a single input and produces a single output.
  • Not useful, because the RNN cell makes little use of its unique ability to remember things about its input sequence.

SLIDE 74

RNN Structures (cont)

[Figure: many to one: inputs X_{t-2}, X_{t-1}, X_t feed successive cells, and only the last cell emits an output Y_t]

The many-to-one structure reads in a sequence and gives us back a single value. Example: sentiment analysis, where the network is given a piece of text and then reports on some quality inherent in the writing. A common example is to look at a movie review and determine if it was positive or negative; a sketch follows.
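A hedged Keras sketch of a many-to-one model for exactly this movie-review example (the vocabulary size and layer widths are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),  # token ids -> vectors
    tf.keras.layers.SimpleRNN(32),                   # reads the whole review, emits one vector
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs negative
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```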

SLIDE 75

RNN Structures (cont)

[Figure: one to many: a single input X_{t-2} produces a sequence of outputs Y_{t-2}, Y_{t-1}, Y_t]

The one-to-many structure takes in a single piece of data and produces a sequence. For example, we give it the starting note for a song, and the network produces the rest of the melody for us.

SLIDE 76

RNN Structures (cont)

[Figure: many to many: inputs X_{t-2}, X_{t-1}, X_t produce outputs Y_{t-2}, Y_{t-1}, Y_t, one per step]

The many-to-many structures are in some ways the most interesting, and are used for machine translation. Example: predict whether it will rain at each time step, given some inputs; a sketch follows.
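A hedged Keras sketch of this aligned many-to-many case, e.g. a rain probability at every time step (the feature count and layer width are illustrative assumptions); return_sequences=True makes the layer emit one output per input step:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(16, return_sequences=True,  # emit an output at every step
                              input_shape=(None, 4)),     # any sequence length, 4 features
    tf.keras.layers.Dense(1, activation="sigmoid"),       # applied independently per step
])
```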

SLIDE 77

RNN Structures (cont)

[Figure: many to many with a delay: the network reads the whole input sequence before it starts producing outputs]

This form of many to many can be used for machine translation. For example, the English sentence "The black dog jumped over the cat" is rendered in Italian as "Il cane nero saltò sopra il gatto". In Italian, the adjective "nero" (black) follows the noun "cane" (dog), so we need some kind of buffer so we can produce the words in their proper order.

SLIDE 78

Bidirectional

LSTMs and RNNs are designed to analyze sequences of values. For example: "Patrick said he needs a vacation." He here means Patrick, and we know this because Patrick appeared before the word he. However, consider the following sentence: "He needs to work more, Pavlos said about Patrick." Here the referent comes after the pronoun, so the network also needs to read the sequence backwards. Such a network is called a bidirectional RNN (BRNN), or a bidirectional LSTM (BLSTM) when using LSTM units; see the sketch below.
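A hedged Keras sketch of a bidirectional LSTM (the dimensions are illustrative assumptions): one LSTM reads the sequence left-to-right, a second reads it right-to-left, and their outputs are combined, so "he" can be resolved from words on either side:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),  # forward + backward passes
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```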

SLIDE 79

Bidirectional (cont)

[Figure: a bidirectional RNN unrolled: a forward chain and a backward chain of states over inputs X_{t-2}, X_{t-1}, X_t jointly produce outputs Y_{t-2}, Y_{t-1}, Y_t; also shown is the compact symbol for a BRNN]

SLIDE 80

Deep RNN

LSTM units can be arranged in layers, so that the output of each unit is the input to the unit in the layer above. This is called a deep RNN, where the adjective "deep" refers to these multiple layers.

  • Each layer feeds the LSTM on the next layer.
  • The first time step of a feature is fed to the first LSTM, which processes that data and produces an output (and a new state for itself).
  • That output is fed to the next LSTM, which does the same thing, and so on.
  • Then the second time step arrives at the first LSTM, and the process repeats (a sketch follows this list).
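A hedged Keras sketch of such a stack (depths and widths are illustrative assumptions); every layer except the last must return the full sequence so that the layer above receives one input per time step:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 8)),
    tf.keras.layers.LSTM(64, return_sequences=True),  # feeds the next layer step by step
    tf.keras.layers.LSTM(64),                         # top layer: final state only
    tf.keras.layers.Dense(1),
])
```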

SLIDE 81

Deep RNN

[Figure: a deep RNN unrolled over time: inputs X_{t-2} … X_{t+2} enter at the bottom, stacked recurrent layers pass outputs upward, and outputs Y_{t-2} … Y_{t+2} leave at the top]

SLIDE 82

Skip Connections

Add additional connections between units d time steps apart, creating paths through time where gradients neither vanish nor explode.

[Figure: units at t-1, t, t+1 with extra connections that skip across time steps]

SLIDE 83

Leaky Units

Linear self-connections maintain the cell state as a running average of past hidden activations.

SLIDE 84

Standard RNN

C(t) = tanh(W h(t-1) + U x(t))
h(t) = C(t)

colah.github.io

SLIDE 85

Leaky Unit

C(t) = tanh(W h(t-1) + U x(t))
h(t) = α h(t-1) + (1 - α) C(t)

colah.github.io
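A minimal numpy sketch of the leaky update (an illustration, not code from the deck): α close to 1 keeps the old state around for a long time, while α close to 0 recovers the standard RNN above:

```python
import numpy as np

def leaky_step(x_t, h_prev, W, U, alpha=0.9):
    c_t = np.tanh(W @ h_prev + U @ x_t)        # candidate, as in the standard RNN
    return alpha * h_prev + (1 - alpha) * c_t  # running average of past activations
```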