

SLIDE 1

CS109B Data Science 2

Pavlos Protopapas, Mark Glickman, and Chris Tanner

Lecture 14: Recurrent Neural Networks

SLIDE 2

Online lectures guidelines

  • We would prefer you have your video on, but it is OK if you have it off.
  • We would prefer you have your real name.
  • All lectures, labs, and a-sections will be live streamed and also available for viewing later on Canvas/Zoom.
  • We will have course staff in the chat online, and during lecture you can also make use of this spreadsheet to enter your own questions or 'up vote' those of your fellow students.
  • Quizzes will be available for 24 hours.

SLIDE 3

Outline

  • Why Recurrent Neural Networks (RNNs)
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 4

SLIDE 5

Background

Many classification and regression tasks involve data that is assumed to be independent and identically distributed (i.i.d.). For example:

  • Detecting lung cancer
  • Face recognition
  • Risk of heart attack

SLIDE 6

Background

Much of our data is inherently sequential. Examples at different scales, from the world down to individual people:

  • Natural disasters (e.g., earthquakes)
  • Climate change
  • Stock market
  • Virus outbreaks
  • Speech recognition
  • Machine translation (e.g., English -> French)
  • Cancer treatment

SLIDE 7

Background

Much of our data is inherently sequential: predicting earthquakes.

SLIDE 8

Background

Much of our data is inherently sequential: stock market predictions.

SLIDE 9

Background

Much of our data is inherently sequential: speech recognition.

  • "What is the weather today?"
  • "What is the weather two day?"
  • "What is the whether too day?"
  • "What is, the Wrether to Dae?"

SLIDE 10

Sequence Modeling: Handwritten Text

  • Input : Image
  • Output: Text

https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5

SLIDE 11

Sequence Modeling: Text-to-Speech

  • Input : Text
  • Output: Audio

SLIDE 12

Sequence Modeling: Machine Translation

  • Input : Text
  • Output: Translated Text

SLIDE 13

Outline

  • Why RNNs
  • Main Concept of RNNs (part 1)
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 14

SLIDE 15

What can my NN do?

Training: Present examples to the NN and learn from them.

[Figure: the NN is trained on labelled examples: George, Mary, Tom, Suzie]

SLIDE 16

What can my NN do?

Prediction: Given an example, the NN outputs a label.

[Figure: the NN predicts "George" for one example and "Mary" for another]

SLIDE 17

What can my NN NOT do?

WHO IS IT?

[Figure: a new example; the NN cannot tell who it is]

SLIDE 18

Learn from previous examples

[Figure: a sequence of observations of the same subject over time]

SLIDE 19

Recurrent Neural Network (RNN)

[Figure: given the sequence, the NN outputs "George"]

SLIDE 20

Recurrent Neural Network (RNN)

RNNs recognize the data's sequential characteristics and use patterns to predict the next likely scenario.

[Figure: the NN predicts "George" and notes, "I have seen George moving in this way before."]

SLIDE 21

Recurrent Neural Network (RNN)

Our model requires context, or contextual information, to understand the subject (he) and the direct object (it) in the sentence.

WHO IS HE?

  • Sentence: "He told me I could have it."
  • Model: "I do not know. I need to know who said that and what he said before. Can you tell me more?"

SLIDE 22

RNN – Another Example with Text

After providing sequential information, the model recognizes the subject (Joe's brother) and the object (sweater) in the sentence.

WHO IS HE?

  • Hellen: Nice sweater, Joe.
  • Joe: Thanks, Hellen. It used to belong to my brother and he told me I could have it.

Model: "I see what you mean now! The pronoun 'he' stands for Joe's brother, while 'it' stands for the sweater."

SLIDE 23

Batch_size = 2048

SLIDE 24

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

Measurement index:  1   2   3   4   5   6   7  …
Temperature:        35  32  45  48  41  39  36 …

SLIDE 25

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

Sequence: 35 32 45 48 41 39 36 …  (measurements 1 2 3 4 5 6 7 …)

Window 1 (measurements 1-4): features 35 32 45, target 48

SLIDE 26

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

Sequence: 35 32 45 48 41 39 36 …  (measurements 1 2 3 4 5 6 7 …)

Window 1 (measurements 1-4): features 35 32 45, target 48
Window 2 (measurements 2-5): features 32 45 48, target 41

SLIDE 27

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

Sequence: 35 32 45 48 41 39 36 …  (measurements 1 2 3 4 5 6 7 …)

Window 1 (measurements 1-4): features 35 32 45, target 48
Window 2 (measurements 2-5): features 32 45 48, target 41
Window 3 (measurements 3-6): features 45 48 41, target 39

SLIDE 28

Sequences

  • We want a machine learning model to understand sequences, not isolated samples.
  • Can an MLP do this?
  • Assume we have a sequence of temperature measurements and we want to take 3 sequential measurements and predict the next one.

Sequence: 35 32 45 48 41 39 36 …  (measurements 1 2 3 4 5 6 7 …)

Window 1 (measurements 1-4): features 35 32 45, target 48
Window 2 (measurements 2-5): features 32 45 48, target 41
Window 3 (measurements 3-6): features 45 48 41, target 39
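
A minimal NumPy sketch (not from the slides) of building this overlapping windowed dataset from the temperature values above:

```python
import numpy as np

# Temperature sequence from the slide.
temps = np.array([35, 32, 45, 48, 41, 39, 36])
window = 3  # number of past measurements used as features

# Each row of X holds `window` consecutive measurements; y holds the next one.
X = np.array([temps[i:i + window] for i in range(len(temps) - window)])
y = temps[window:]

print(X)  # [[35 32 45] [32 45 48] [45 48 41] [48 41 39]]
print(y)  # [48 41 39 36]
```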

SLIDE 29

Windowed dataset

This is called an overlapping windowed dataset, since we're windowing observations to create new samples. We can easily fit it using an MLP with 3 inputs, two hidden layers of 10 ReLU units, and 1 output.

But re-arranging the order of the inputs, e.g.,

Shuffled window 1 (measurements 3 1 2 → 4): 45 35 32, target 48
Shuffled window 2 (measurements 2 4 3 → 5): 32 48 45, target 41
Shuffled window 3 (measurements 5 4 3 → 6): 41 48 45, target 39

will produce the same results.
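
A minimal Keras sketch of the MLP described on this slide (layer sizes follow the slide; the optimizer, loss, and training details are assumptions), fit on the windowed X and y built earlier:

```python
import tensorflow as tf

# MLP from the slide: 3 inputs -> 10 ReLU -> 10 ReLU -> 1 (ReLU) output.
mlp = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),
])
mlp.compile(optimizer="adam", loss="mse")
# mlp.fit(X, y, epochs=500)
# Applying the same permutation to the columns of every row of X and refitting
# gives an equally good fit: the MLP has no notion of the order of its inputs.
```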

SLIDE 30

Why not CNNs or MLPs?

  1. MLPs/CNNs require fixed input and output sizes.
  2. MLPs/CNNs can't classify inputs that appear in multiple places in the sequence.

SLIDE 31

Windowed dataset

What follows after `I got in the car and'? `drove away'.
What follows after `In car the and I got'? It is not obvious that it should be `drove away'.

The order of words matters. This is true for most sequential data. A fully connected network will not distinguish the order and therefore misses some information.

SLIDE 32

SLIDE 33

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 34

Memory

Somehow the computational unit should remember what it has seen before.

[Figure: a unit maps input $Y_t$ to output $Z_t$; it should remember $Y_1, \dots, Y_{t-1}$]

SLIDE 35

Memory

Somehow the computational unit should remember what it has seen before.

[Figure: the unit now has an internal memory; input $Y_t$, output $Z_t$]

SLIDE 36

Memory

Somehow the computational unit should remember what it has seen before. We'll call this information the unit's state.

[Figure: the RNN unit with its internal memory; input $Y_t$, output $Z_t$]

SLIDE 37

Memory

In neural networks, once training is over, the weights do not change. This means that the network is done learning and done changing. Then, we feed in values, and it simply applies the operations that make up the network, using the weights it has learned. But RNN units can remember new information after training has completed. That is, their state is able to keep changing after training is over.

SLIDE 38

Memory

Question: How can we do this? How can we build a unit that remembers the past?

The memory or state could be written to a file, but in RNNs we keep it inside the recurrent unit, in an array or a vector!

Work with an example:
  • Anna Sofia said her shoes are too ugly. "Her" here means Anna Sofia.
  • Nikolas put his keys on the table. "His" here means Nikolas.

SLIDE 39

Memory

Question: How can we do this? How can we build a unit that remembers the past?

The memory or state could be written to a file, but in RNNs we keep it inside the recurrent unit, in an array or a vector!

[Figure: the RNN unit with its memory takes input $Y_t$ (e.g., "his") and produces output $Z_t$ (e.g., "Nikolas")]

SLIDE 40

Building an RNN

[Figure: the same RNN cell applied repeatedly, passing its memory forward: input $Y_t$ gives output $Z_t$, then $Y_{t+1} \to Z_{t+1}$, $Y_{t+2} \to Z_{t+2}$, $Y_{t+3} \to Z_{t+3}$, with the memory carried from one step to the next]

SLIDE 41

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 42

Structure of an RNN cell

[Figure: the RNN cell takes input $Y_t$ and its state, and produces output $Z_t$. Three sets of weights are involved: input weights, update (state) weights, and output weights]

SLIDE 43

Anatomy of an RNN unit

The input $y_t$ and the previous hidden state $h_{t-1}$ (through weights W and V) are combined into the pre-activation

$$A_t = W y_t + V h_{t-1} + c_h$$

which passes through the hidden activation to give the new state

$$h_t = f(A_t)$$

Each output is then computed from the state through its own weights and output activation:

$$z_t^{(1)} = g(X_1 h_t + c_1), \qquad z_t^{(2)} = g(X_2 h_t + c_2)$$
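
A minimal NumPy sketch of one step of this cell, following the equations above (the shapes, the choice of tanh for $f$ and the identity for $g$, and a single output branch are assumptions):

```python
import numpy as np

def rnn_step(y_t, h_prev, W, V, X, c_h, c_z):
    """One step of the simple RNN cell:
    A_t = W y_t + V h_{t-1} + c_h,  h_t = f(A_t),  z_t = g(X h_t + c_z)."""
    a_t = W @ y_t + V @ h_prev + c_h   # pre-activation A_t
    h_t = np.tanh(a_t)                 # hidden activation f = tanh
    z_t = X @ h_t + c_z                # output activation g = identity
    return h_t, z_t

# Example shapes: 1 input feature, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
W, V, X = rng.normal(size=(4, 1)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))
c_h, c_z = np.zeros(4), np.zeros(1)
h, z = rnn_step(np.array([35.0]), np.zeros(4), W, V, X, c_h, c_z)
```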

SLIDE 44

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 45

Backprop Through Time

  • For each input, unfold the network for the sequence length T.
  • Back-propagation: apply the forward and backward pass on the unfolded network.
  • Memory cost: O(T)
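
As a concrete illustration (an assumption, not part of the slides): in Keras, a SimpleRNN layer is unrolled over the T time steps of its input, and backpropagation through time is applied automatically when you call fit:

```python
import tensorflow as tf

# Input shape (T, features): here T = 3 time steps with 1 feature each,
# matching the 3-measurement temperature windows from earlier.
rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(3, 1)),
    tf.keras.layers.SimpleRNN(10, activation="tanh"),  # unrolled over the 3 steps
    tf.keras.layers.Dense(1),                          # predict the next measurement
])
rnn.compile(optimizer="adam", loss="mse")
# rnn.fit(X[..., None], y, epochs=500)  # X reshaped to (samples, 3, 1)
```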

SLIDE 46

Backprop Through Time

[Figure: the RNN cell with input $Y_t$, state, and output $Z_t$, annotated with its input weights, update weights, and output weights]

SLIDE 47

Backprop Through Time

[Figure: the same cell diagram as Slide 46, with its input, update, and output weights]

SLIDE 48

Backprop Through Time

[Figure: the RNN cell with input $Y_t$, hidden state $h_t$, and output $Z_t$, labelled with Update Weights: U, Output Weights: W, Input Weights: V]

SLIDE 49

Backprop Through Time

You have two activation functions: $f$, which serves as the activation for the hidden state, and $g$, which is the activation of the output. In the example shown before, $g$ was the identity.

[Figure: the network unfolded over steps $t-2$, $t-1$, $t$; at each step the input $Y$ and the previous hidden state $h$ combine through the shared weights to give the new hidden state, which produces the output $\hat{z}$]

SLIDE 50

Backprop Through Time

[Figure: the unfolded network again, over steps $t-2$, $t-1$, $t$, showing hidden states $h$, inputs $Y$, outputs $\hat{z}$, and the shared weights V, U, W]

SLIDE 51

Backprop Through Time

The total loss is the sum of the per-step losses, and each per-step loss depends on the output at that step:

$$L = \sum_t L_t, \qquad L_t = L_t(\hat{z}_t), \qquad \hat{z}_t = g(X h_t + c)$$

Gradient with respect to the output weights $X$:

$$\frac{dL}{dX} = \sum_t \frac{dL_t}{dX} = \sum_t \frac{\partial L_t}{\partial \hat{z}_t}\,\frac{\partial \hat{z}_t}{\partial X}, \qquad \frac{\partial \hat{z}_t}{\partial X} = g'\, h_t$$

SLIDE 52

Backprop Through Time

$$L = \sum_t L_t, \qquad L_t = L_t(\hat{z}_t), \qquad \hat{z}_t = g(X h_t + c), \qquad h_t = f(W y_t + V h_{t-1} + c_h)$$

so that $\hat{z}_t = g\big(X f(W y_t + V h_{t-1} + c_h) + c\big)$.

Gradient with respect to the recurrent weights $V$:

$$\frac{dL}{dV} = \sum_t \frac{\partial L_t}{\partial \hat{z}_t}\,\frac{\partial \hat{z}_t}{\partial h_t}\,\frac{\partial h_t}{\partial V}$$

where the hidden state depends on $V$ both directly and through all earlier states:

$$\frac{\partial L_t}{\partial V} = \frac{\partial L_t}{\partial \hat{z}_t}\,\frac{\partial \hat{z}_t}{\partial h_t}\left(\frac{\partial h_t}{\partial V} + \frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial V} + \frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}}\frac{\partial h_{t-2}}{\partial V} + \cdots\right)$$

$$\frac{\partial h_t}{\partial h_k} = \frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}}\cdots\frac{\partial h_{k+1}}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}, \qquad \frac{\partial h_j}{\partial h_{j-1}} = f'(A_j)\,V$$
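
A minimal scalar sketch (an illustration, not from the slides) of accumulating this gradient through time: the term $\partial h_t/\partial V$ is built recursively step by step, exactly as in the sum above, with $f = \tanh$, $g$ the identity, and a squared-error loss assumed:

```python
import numpy as np

def grad_wrt_v(ys, targets, w, v, b, x, c):
    """d(sum_t L_t)/dv for a scalar RNN  h_t = tanh(w*y_t + v*h_{t-1} + b),
    z_t = x*h_t + c,  L_t = 0.5*(z_t - target_t)**2,
    accumulating dh_t/dv through time."""
    h_prev, dh_prev_dv, grad = 0.0, 0.0, 0.0
    for y_t, tgt in zip(ys, targets):
        a_t = w * y_t + v * h_prev + b
        h_t = np.tanh(a_t)
        z_t = x * h_t + c
        # dh_t/dv = f'(a_t) * (h_{t-1} + v * dh_{t-1}/dv): the recursive sum over earlier steps
        dh_dv = (1.0 - h_t ** 2) * (h_prev + v * dh_prev_dv)
        grad += (z_t - tgt) * x * dh_dv        # dL_t/dz_t * dz_t/dh_t * dh_t/dv
        h_prev, dh_prev_dv = h_t, dh_dv
    return grad
```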

SLIDE 53

SLIDE 54

Gradient Clipping

Prevents exploding gradients. Clip the norm of the gradient before the update: for some gradient $g$ and some threshold $v$,

$$\text{if } \|g\| > v: \quad g \leftarrow \frac{g\, v}{\|g\|}$$
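
A minimal sketch of this clipping rule (the function name is an assumption):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale the gradient so its norm never exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```

In Keras the same behavior is available directly on an optimizer, e.g. tf.keras.optimizers.SGD(clipnorm=1.0).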

SLIDE 55

Outline

  • Why RNNs
  • Main Concept of RNNs
  • More Details of RNNs
  • RNN training
  • Gated RNN

SLIDE 56

Long-term Dependencies

  • Unfolded networks can be very deep.
  • Long-term interactions are given exponentially smaller weights than short-term interactions.
  • Gradients tend to either vanish or explode.
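
A small NumPy illustration (an assumption, not from the slides) of why this happens: the gradient contains a product of per-step Jacobians, and repeatedly multiplying by the same matrix shrinks or blows up depending on the size of its entries:

```python
import numpy as np

rng = np.random.default_rng(0)

for scale in (0.3, 0.9):                    # small vs. large recurrent weights
    V = rng.normal(scale=scale, size=(4, 4))
    prod = np.eye(4)
    for _ in range(50):                     # 50 unrolled time steps
        prod = prod @ V                     # product of per-step Jacobians (the f' factor is ignored)
    print(scale, np.linalg.norm(prod))      # tends toward ~0 (vanishing) or very large (exploding)
```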

SLIDE 57

Long Short-Term Memory

  • Handles long-term dependencies.
  • Leaky units where the weight on the self-loop α is context-dependent.
  • Allows the network to decide whether to accumulate or forget past information.
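
A minimal Keras sketch (an assumption, not from the slides) swapping the SimpleRNN used earlier for a gated LSTM layer; the gates let the network decide what to accumulate and what to forget:

```python
import tensorflow as tf

lstm_model = tf.keras.Sequential([
    tf.keras.Input(shape=(3, 1)),      # 3 time steps, 1 feature per step
    tf.keras.layers.LSTM(10),          # gated unit: input, forget, and output gates
    tf.keras.layers.Dense(1),
])
lstm_model.compile(optimizer="adam", loss="mse")
```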

SLIDE 58

Notation

Using conventional and convenient notation.

[Figure: the RNN cell drawn with input $Y_t$ and output $Z_t$]

SLIDE 59

SLIDE 60

Simple RNN again

[Figure: the cell diagram, with the state and the input $Y_t$ entering through weights and σ activations, and the output $Z_t$ leaving the cell]

$$A_t = W y_t + V h_{t-1} + c_h, \qquad h_t = f(A_t), \qquad z_t^{(1)} = g(X_1 h_t + c_1), \quad z_t^{(2)} = g(X_2 h_t + c_2)$$

SLIDE 61

Simple RNN again

[Figure: the same cell diagram and equations as Slide 60]

SLIDE 62

Simple RNN again: Memories

[Figure: the cell diagram, highlighting the stored state (memories) $h_t$]

SLIDE 63

Simple RNN again: Memories - Forgetting

[Figure: the cell diagram, highlighting the step where part of the stored memories is forgotten]

SLIDE 64

Simple RNN again: New Events

[Figure: the cell diagram, highlighting the step where the new input $Y_t$ enters the cell]

SLIDE 65

Simple RNN again: New Events Weighted

[Figure: the cell diagram, highlighting the step where the new input is weighted before being combined with the state]

SLIDE 66

Simple RNN again: Updated memories

[Figure: the cell diagram, highlighting the step where the weighted input and the retained memories combine into the updated state $h_t$]
