slide-1
SLIDE 1

CS11-747 Neural Networks for NLP

Recurrent Neural Networks

Graham Neubig

Site https://phontron.com/class/nn4nlp2017/

slide-2
SLIDES 2-7

NLP and Sequential Data

  • NLP is full of sequential data
  • Words in sentences
  • Characters in words
  • Sentences in discourse

slide-8
SLIDES 8-12

Long-distance Dependencies in Language

  • Agreement in number, gender, etc.
  • Selectional preference

He does not have very much confidence in himself.
She does not have very much confidence in herself.

The reign has lasted as long as the life of the queen.
The rain has lasted as long as the life of the clouds.

slide-13
SLIDES 13-19

Can be Complicated!

  • What is the referent of “it”?

The trophy would not fit in the brown suitcase because it was too big. (Trophy)
The trophy would not fit in the brown suitcase because it was too small. (Suitcase)

(from the Winograd Schema Challenge: http://commonsensereasoning.org/winograd.html)

slide-20
SLIDES 20-23

Recurrent Neural Networks (Elman 1990)

  • Tools to “remember” information

(Diagram: a feed-forward NN (lookup, transform, predict) compared with a recurrent NN, which additionally feeds its previous context back into the transform at each step)

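To make the “remember” idea concrete, here is a minimal sketch of the Elman recurrence in plain Python/NumPy (the sizes and names are illustrative, not from the slides): the new state is a squashed linear function of the current input and the previous state, so information can persist across steps.

import numpy as np

# Illustrative sizes; not from the slides
D_IN, D_HID = 64, 128
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(D_HID, D_IN))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(D_HID, D_HID))  # hidden -> hidden (the recurrence)
b = np.zeros(D_HID)

def rnn_step(x_t, h_prev):
    # Elman update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(D_HID)                      # initial state
for x_t in rng.normal(size=(4, D_IN)):   # a toy 4-step input sequence
    h = rnn_step(x_t, h)                 # h carries information forward in time
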
slide-24
SLIDES 24-33

Unrolling in Time

  • What does processing a sequence look like?

(Diagram: the RNN is applied once per word of “I hate this movie”; each step reads one word, updates the state, and makes a label prediction)

slide-34
SLIDES 34-41

Training RNNs

(Diagram: for “I hate this movie”, each RNN step makes a prediction; each prediction is compared against its label to give a per-step loss, and the per-step losses are summed into a total loss)

slide-42
SLIDES 42-47

RNN Training

  • The unrolled graph is a well-formed computation graph (a DAG), so we can run backprop as usual
  • Parameters are tied across time; derivatives are aggregated across all time steps
  • This is historically called “backpropagation through time” (BPTT)

slide-48
SLIDES 48-49

Parameter Tying

(Diagram: the same unrolled training graph as before; all four RNN steps use the same parameters)

Parameters are shared! Derivatives are accumulated.

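A minimal sketch of what the picture describes, assuming DyNet (the 2-class output layer and the names are hypothetical): one set of RNN parameters is reused at every step, per-step losses are summed, and a single backward pass through the summed loss accumulates derivatives for the shared parameters across all time steps.

import dynet as dy

model = dy.Model()
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)   # one set of RNN parameters, reused at every step
W_sm = model.add_parameters((2, 128))          # shared output layer (2 classes, hypothetical)
trainer = dy.SimpleSGDTrainer(model)

def sequence_loss(word_vec_values, labels):
    dy.renew_cg()
    W = dy.parameter(W_sm)
    s = RNN.initial_state()
    losses = []
    for x_val, y in zip(word_vec_values, labels):
        s = s.add_input(dy.inputVector(x_val))   # the same RNN parameters at each step
        losses.append(dy.pickneglogsoftmax(W * s.output(), y))
    return dy.esum(losses)                       # total loss = sum of per-step losses

# One update: a single backward pass through the summed loss (BPTT) accumulates
# each shared parameter's gradient over all time steps.
# loss = sequence_loss(xs, ys); loss.backward(); trainer.update()
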

slide-50
SLIDE 50

Applications of RNNs

slide-51
SLIDES 51-55

What Can RNNs Do?

  • Represent a sentence
  • Read whole sentence, make a prediction
  • Represent a context within a sentence
  • Read context up until that point

slide-56
SLIDES 56-60

Representing Sentences

(Diagram: read “I hate this movie” with the RNN, then make a single prediction from the final state)

  • Sentence classification
  • Conditioned generation
  • Retrieval

slide-61
SLIDES 61-65

Representing Contexts

(Diagram: read “I hate this movie” with the RNN and predict a label at every position)

  • Tagging
  • Language Modeling
  • Calculating Representations for Parsing, etc.

slide-66
SLIDES 66-81

e.g. Language Modeling

  • Language modeling is like a tagging task, where each tag is the next word!

(Diagram: feed <s> to the RNN and predict “I”; feed “I” and predict “hate”; feed “hate” and predict “this”; feed “this” and predict “movie”; feed “movie” and predict </s>)

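Concretely, a small worked example (illustrative, not from the slides) of the “tags”: the targets are just the input sequence shifted by one position.

sent = ["I", "hate", "this", "movie"]
inputs  = ["<s>"] + sent            # what the RNN reads at each step
targets = sent + ["</s>"]           # the "tag" to predict at each step
pairs = list(zip(inputs, targets))
# [('<s>', 'I'), ('I', 'hate'), ('hate', 'this'), ('this', 'movie'), ('movie', '</s>')]
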

slide-82
SLIDES 82-90

Bi-RNNs

  • A simple extension, run the RNN in both directions

(Diagram: for “I hate this movie”, a forward RNN and a backward RNN each read the sentence; at every position the two hidden states are concatenated and fed through a softmax to predict a tag: PRN, VB, DET, NN)

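A minimal Bi-RNN tagger sketch, assuming DyNet (the tag count, the word-embedding expressions wembs, and the weight names are illustrative; dy.renew_cg() is assumed to have been called by the caller): run one RNN left-to-right and one right-to-left, concatenate the two states at each position, and score the tags.

import dynet as dy

ntags = 10                                       # hypothetical tag set size
model = dy.Model()
fRNN = dy.SimpleRNNBuilder(1, 64, 128, model)    # left-to-right RNN
bRNN = dy.SimpleRNNBuilder(1, 64, 128, model)    # right-to-left RNN
W_tag = model.add_parameters((ntags, 256))       # scores over the two concatenated 128-dim states
b_tag = model.add_parameters(ntags)

def birnn_tag_scores(wembs):                     # wembs: list of word-embedding expressions
    W, b = dy.parameter(W_tag), dy.parameter(b_tag)
    # forward pass over the sentence
    f_states, s = [], fRNN.initial_state()
    for x in wembs:
        s = s.add_input(x)
        f_states.append(s.output())
    # backward pass over the reversed sentence
    b_states, s = [], bRNN.initial_state()
    for x in reversed(wembs):
        s = s.add_input(x)
        b_states.append(s.output())
    b_states.reverse()
    # concatenate per position and return one score vector per word
    return [W * dy.concatenate([f, bk]) + b for f, bk in zip(f_states, b_states)]
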

slide-91
SLIDE 91

Let’s Try it Out!

slide-92
SLIDES 92-96

Recurrent Neural Networks in DyNet

  • Based on “*Builder” class (*=SimpleRNN/LSTM)
  • Add parameters to model (once):

# Simple RNN or LSTM (layers=1, input=64, hidden=128, model)
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)   # or dy.LSTMBuilder(1, 64, 128, model)

  • Add parameters to CG and get initial state (per sentence):

s = RNN.initial_state()

  • Update state and access (per input word/character):

s = s.add_input(x_t)
h_t = s.output()

slide-97
SLIDE 97

RNNLM Example: Parameter Initialization

# Lookup parameters for word embeddings
WORDS_LOOKUP = model.add_lookup_parameters((nwords, 64))

# Word-level RNN (layers=1, input=64, hidden=128, model)
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)

# Softmax weights/biases on top of RNN outputs
W_sm = model.add_parameters((nwords, 128))
b_sm = model.add_parameters(nwords)

slide-98
SLIDE 98

RNNLM Example: Sentence Initialization

# Build the language model graph
def calc_lm_loss(wids):
    dy.renew_cg()
    # parameters -> expressions
    W_exp = dy.parameter(W_sm)
    b_exp = dy.parameter(b_sm)
    # add parameters to CG and get state
    f_init = RNN.initial_state()
    # get the word vectors for each word ID
    wembs = [WORDS_LOOKUP[wid] for wid in wids]
    # start the RNN by inputting "<s>"
    s = f_init.add_input(wembs[-1])

slide-99
SLIDE 99

RNNLM Example: Loss Calculation and State Update

    # process each word ID and embedding
    losses = []
    for wid, we in zip(wids, wembs):
        # calculate and save the softmax loss
        score = W_exp * s.output() + b_exp
        loss = dy.pickneglogsoftmax(score, wid)
        losses.append(loss)
        # update the RNN state with the input
        s = s.add_input(we)
    # return the sum of all losses
    return dy.esum(losses)
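
To put calc_lm_loss to work, a training loop along these lines would do (a sketch; the trainer choice, epoch count, and train_sentences are illustrative, not from the slides):

trainer = dy.SimpleSGDTrainer(model)
for epoch in range(10):
    for wids in train_sentences:    # each sentence as a list of word IDs (hypothetical data)
        loss = calc_lm_loss(wids)
        loss.scalar_value()          # run the forward pass (and get the loss value)
        loss.backward()              # backpropagation through time over the unrolled graph
        trainer.update()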

slide-100
SLIDE 100

Code Examples

sentiment-rnn.py

slide-101
SLIDE 101

RNN Problems and Alternatives

slide-102
SLIDES 102-103

Vanishing Gradient

  • Gradients decrease as they get pushed back
  • Why? “Squashed” by non-linearities or small weights in matrices.
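
A toy illustration (numbers are made up): if each step through the recurrence multiplies the gradient by a factor below 1 (small weights, squashing non-linearities), the signal from distant steps shrinks exponentially.

grad = 1.0
for t in range(20):
    grad *= 0.5        # each step multiplies by a Jacobian factor < 1 (small weights / squashing)
print(grad)            # ~9.5e-07: almost no learning signal reaches the earliest steps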

slide-104
SLIDES 104-107

A Solution: Long Short-term Memory
(Hochreiter and Schmidhuber 1997)

  • Basic idea: make additive connections between time steps
  • Addition does not modify the gradient, no vanishing
  • Gates to control the information flow

slide-108
SLIDE 108

LSTM Structure
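
For reference, a standard LSTM cell in NumPy (a sketch of the usual formulation; the exact variant pictured on the slide may differ, and all names and sizes here are illustrative): gates i, f, o control an additive update to the memory cell c, which is the connection that lets gradients flow.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the 4 parts: (i)nput gate, (f)orget gate, (o)utput gate, (u)pdate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    u = np.tanh(W["u"] @ x + U["u"] @ h_prev + b["u"])   # candidate update
    c = f * c_prev + i * u    # additive connection between time steps: gradient flows through "+"
    h = o * np.tanh(c)
    return h, c

D_IN, D_HID = 8, 16
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(D_HID, D_IN)) for k in "ifou"}
U = {k: rng.normal(scale=0.1, size=(D_HID, D_HID)) for k in "ifou"}
b = {k: np.zeros(D_HID) for k in "ifou"}
h, c = np.zeros(D_HID), np.zeros(D_HID)
for x in rng.normal(size=(5, D_IN)):   # toy 5-step sequence
    h, c = lstm_step(x, h, c, W, U, b)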

slide-109
SLIDES 109-112

Other Alternatives

  • Lots of variants of LSTMs (Hochreiter and Schmidhuber, 1997)
  • Gated recurrent units (GRUs; Cho et al., 2014)
  • All follow the basic paradigm of “take input, update state”

slide-113
SLIDE 113

Code Examples

sentiment-lstm.py lm-lstm.py

slide-114
SLIDE 114

Efficiency/Memory Tricks

slide-115
SLIDES 115-119

Handling Mini-batching

  • Mini-batching makes things much faster!
  • But mini-batching in RNNs is harder than in feed-forward networks
  • Each word depends on the previous word
  • Sequences are of various length

slide-120
SLIDES 120-126

Mini-batching Method

this is an      example  </s>
this is another </s>     </s>   (the last </s> is padding)

  • Padding: pad the shorter sequences up to the longest length in the batch
  • Loss Calculation: compute the per-position loss for every sequence
  • Mask: multiply each loss by 1 for real positions, 0 for padded positions
        1 1 1 1 1
        1 1 1 1 0
  • Take Sum

(Or use DyNet automatic mini-batching, much easier but a bit slower)
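
A small sketch of the idea in plain Python (the names and the stand-in loss function are illustrative): pad every sequence in the batch to the longest length, build the mask, and multiply per-position losses by the mask before summing.

batch = [["this", "is", "an", "example", "</s>"],
         ["this", "is", "another", "</s>"]]

max_len = max(len(s) for s in batch)
padded = [s + ["</s>"] * (max_len - len(s)) for s in batch]         # pad with </s>
mask   = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in batch]

def word_loss(word, position, sequence):                            # stand-in for the real LM loss
    return 1.0

total = 0.0
for seq, m in zip(padded, mask):
    for t, (w, keep) in enumerate(zip(seq, m)):
        total += keep * word_loss(w, t, seq)                        # padded positions contribute 0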

slide-127
SLIDES 127-129

Bucketing/Sorting

  • If we use sentences of different lengths, too much padding can result in wasted computation and decreased performance
  • To remedy this: sort sentences so similarly-lengthed sentences are in the same batch
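
For instance (a sketch with hypothetical toy data): sort the corpus by length before slicing it into mini-batches, so each batch needs little or no padding.

corpus = [["a", "b"], ["a"], ["a", "b", "c", "d"], ["a", "b", "c"]]   # toy sentences
batch_size = 2

by_length = sorted(corpus, key=len)                                   # similar lengths end up together
batches = [by_length[i:i + batch_size] for i in range(0, len(by_length), batch_size)]
# batches: [[['a'], ['a','b']], [['a','b','c'], ['a','b','c','d']]]  -> at most 1 pad token per batch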

slide-130
SLIDE 130

Code Example

lm-minibatch.py

slide-131
SLIDES 131-134

Handling Long Sequences

  • Sometimes we would like to capture long-term dependencies over long sequences
  • e.g. words in full documents
  • However, the fully unrolled graph may not fit in (GPU) memory

slide-135
SLIDES 135-153

Truncated BPTT

  • Backprop over shorter segments, initialize w/ the state from the previous segment

(Diagram: the 1st pass runs the RNN over “I hate this movie” and backprops within that segment; the 2nd pass runs over “It is so bad”, initialized with the final state of the 1st pass; only the state is carried over, with no backprop through it)
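
A sketch of the control flow in plain Python with an Elman-style step (the segmenting and names are illustrative): each segment would get its own backward pass, and only the numeric value of the last hidden state is handed to the next segment.

import numpy as np

D = 8
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(D, D))
W_h = rng.normal(scale=0.1, size=(D, D))

def run_segment(word_vectors, h0):
    h = h0
    for x in word_vectors:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

document = rng.normal(size=(8, D))        # a "long" sequence of 8 toy word vectors
segments = [document[:4], document[4:]]   # e.g. "I hate this movie" / "It is so bad"

h = np.zeros(D)
for seg in segments:
    h = run_segment(seg, h)
    # in a real framework you would compute this segment's loss and backprop here,
    # then carry only the numeric state value across the boundary (no gradient flows
    # back into the previous segment)
    h = h.copy()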

slide-154
SLIDE 154

Pre-training/Transfer for RNNs

slide-155
SLIDES 155-158

RNN Strengths/Weaknesses

  • RNNs, particularly deep RNNs/LSTMs, are quite powerful and flexible
  • But they require a lot of data
  • Also have trouble with weak error signals passed back from the end of the sentence

slide-159
SLIDES 159-162

Pre-training/Transfer

  • Train for one task, solve another
  • Pre-training task: Big data, easy to learn
  • Main task: Small data, harder to learn

slide-163
SLIDES 163-166

Example: LM -> Sentence Classifier
(Luong et al. 2015)

  • Train a language model first: lots of data, easy-to-learn objective
  • Sentence classification: little data, hard-to-learn objective
  • Results in much better classification, competitive or better than CNN-based methods

slide-167
SLIDES 167-170

Why Pre-training?

  • The model learns consistencies in the data (Karpathy et al. 2015)
  • Model learns syntax (Shi et al. 2017) or semantics (Radford et al. 2017)

slide-171
SLIDE 171

Questions?