SLIDE 1 CS11-747 Neural Networks for NLP
Recurrent Neural Networks
Graham Neubig
Site https://phontron.com/class/nn4nlp2017/
SLIDE 2
NLP and Sequential Data
SLIDE 7 NLP and Sequential Data
- NLP is full of sequential data
- Words in sentences
- Characters in words
- Sentences in discourse
- …
SLIDE 8
Long-distance Dependencies in Language
SLIDE 12 Long-distance Dependencies in Language
- Agreement in number, gender, etc.
- Selectional preference
He does not have very much confidence in himself. She does not have very much confidence in herself.
The reign has lasted as long as the life of the queen. The rain has lasted as long as the life of the clouds.
SLIDE 13
Can be Complicated!
SLIDE 19 Can be Complicated!
- What is the referent of “it”?
The trophy would not fit in the brown suitcase because it was too big. → “it” = trophy
The trophy would not fit in the brown suitcase because it was too small. → “it” = suitcase
(from the Winograd Schema Challenge: http://commonsensereasoning.org/winograd.html)
SLIDE 20
Recurrent Neural Networks
(Elman 1990)
SLIDE 23 Recurrent Neural Networks
(Elman 1990)
[diagram] Feed-forward NN: context → lookup → transform → predict → label
[diagram] Recurrent NN: context → lookup → transform → predict → label, with the transform fed back into the next time step
- Tools to “remember” information
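As a rough illustration (not from the original slides), the Elman recurrence can be sketched in a few lines of numpy; every name and dimension below is invented for the example:

import numpy as np

# Minimal sketch of the Elman (simple) RNN update, for illustration only.
d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(d_hid, d_in))   # input -> hidden weights
W_hh = rng.normal(size=(d_hid, d_hid))  # hidden -> hidden (recurrent) weights
b_h = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h): the recurrent term is what
    # lets the network "remember" information from earlier time steps.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(d_hid)                     # initial state
for x_t in rng.normal(size=(5, d_in)):  # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h)
print(h)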
SLIDE 33 Unrolling in Time
- What does processing a sequence look like?
[diagram] “I hate this movie”: one RNN step per word, chained left to right, each step followed by “predict label”
SLIDE 41 Training RNNs
[diagram] “I hate this movie”: each RNN step makes prediction 1–4, which is compared against label 1–4 to give loss 1–4; the four losses are summed into a total loss
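A minimal DyNet sketch of the training picture above (my own toy example, written in the style of the code slides later in this deck; the vocabulary size, label set, and word IDs are all invented):

import dynet as dy

# toy setup: sizes and data are invented for this sketch
model = dy.ParameterCollection()
EMB = model.add_lookup_parameters((100, 64))   # 100-word vocab, 64-dim embeddings
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)
W = model.add_parameters((5, 128))             # 5 possible labels
b = model.add_parameters(5)
trainer = dy.SimpleSGDTrainer(model)

words  = [7, 13, 2, 42]   # word IDs for "I hate this movie" (hypothetical)
labels = [0, 1, 1, 3]     # one gold label per step (hypothetical)

dy.renew_cg()
W_e, b_e = dy.parameter(W), dy.parameter(b)
s = RNN.initial_state()
losses = []
for wid, lab in zip(words, labels):
    s = s.add_input(EMB[wid])                        # unroll one step in time
    score = W_e * s.output() + b_e
    losses.append(dy.pickneglogsoftmax(score, lab))  # loss at this step
total_loss = dy.esum(losses)                         # sum -> total loss
total_loss.backward()                                # backprop through time
trainer.update()                                     # tied parameters get the accumulated gradients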
SLIDE 42
RNN Training
SLIDE 47 RNN Training
- The unrolled graph is a well-formed (DAG)
computation graph—we can run backprop
- Parameters are tied across time, derivatives are
aggregated across all time steps
- This is historically called “backpropagation through
time” (BPTT)
SLIDE 49 Parameter Tying
[diagram] the same unrolled training graph as above; the single set of RNN parameters is used at every time step
Parameters are shared! Derivatives are accumulated.
SLIDE 50
Applications of RNNs
SLIDE 51
What Can RNNs Do?
SLIDE 55 What Can RNNs Do?
- Represent a sentence
- Read whole sentence, make a prediction
- Represent a context within a sentence
- Read context up until that point
SLIDE 60 Representing Sentences
[diagram] “I hate this movie” → RNN → RNN → RNN → RNN → predict → prediction
- Sentence classification
- Conditioned generation
- Retrieval
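A hedged sketch of the first use case: read the whole sentence with an RNN and classify from the final state, roughly in the spirit of sentiment-rnn.py (all sizes and word IDs below are made up):

import dynet as dy

model = dy.ParameterCollection()
EMB = model.add_lookup_parameters((100, 64))
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)
W_cls = model.add_parameters((2, 128))   # 2 classes, e.g. positive/negative
b_cls = model.add_parameters(2)

def sentence_repr(wids):
    s = RNN.initial_state()
    for wid in wids:
        s = s.add_input(EMB[wid])
    return s.output()                    # final hidden state = sentence vector

dy.renew_cg()
h = sentence_repr([7, 13, 2, 42])        # hypothetical word IDs
score = dy.parameter(W_cls) * h + dy.parameter(b_cls)
loss = dy.pickneglogsoftmax(score, 1)    # gold class 1 (made up)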
SLIDE 65 Representing Contexts
[diagram] “I hate this movie”: one RNN step per word, with a “predict label” output at every position
- Tagging
- Language Modeling
- Calculating Representations for Parsing, etc.
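And a sketch of the second use case: keep one output vector per position as the representation of the context read so far (toy sizes and IDs; the transduce shortcut in the comment assumes the DyNet Python API):

import dynet as dy

model = dy.ParameterCollection()
EMB = model.add_lookup_parameters((100, 64))
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)

dy.renew_cg()
wembs = [EMB[wid] for wid in [7, 13, 2, 42]]   # hypothetical word IDs

# One context vector per position: h_1 .. h_4, each summarizing the
# sentence read up to (and including) that word.
s = RNN.initial_state()
contexts = []
for we in wembs:
    s = s.add_input(we)
    contexts.append(s.output())

# Equivalent shortcut (assuming RNNState.transduce as in the DyNet Python API):
# contexts = RNN.initial_state().transduce(wembs)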
SLIDE 81 e.g. Language Modeling
[diagram] <s> → RNN → predict “I”; I → RNN → predict “hate”; hate → RNN → predict “this”; this → RNN → predict “movie”; movie → RNN → predict “</s>”
- Language modeling is like a tagging task, where
each tag is the next word!
SLIDE 90 Bi-RNNs
- A simple extension, run the RNN in both directions
[diagram] “I hate this movie”: a forward RNN and a backward RNN run over the sentence; at each word the two hidden states are concatenated and passed through a softmax to predict a tag (I → PRN, hate → VB, this → DET, movie → NN)
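A hedged DyNet sketch of the bidirectional tagger pictured above: a left-to-right and a right-to-left RNN, concatenation at each word, then a softmax over tags (the tag inventory and sizes are my own):

import dynet as dy

model = dy.ParameterCollection()
EMB = model.add_lookup_parameters((100, 64))
fwdRNN = dy.SimpleRNNBuilder(1, 64, 128, model)   # left-to-right
bwdRNN = dy.SimpleRNNBuilder(1, 64, 128, model)   # right-to-left
W_tag = model.add_parameters((4, 256))            # 4 tags, e.g. PRN/VB/DET/NN
b_tag = model.add_parameters(4)

def tag_scores(wids):
    wembs = [EMB[wid] for wid in wids]
    # run the forward RNN
    s = fwdRNN.initial_state()
    fwd = []
    for we in wembs:
        s = s.add_input(we)
        fwd.append(s.output())
    # run the backward RNN over the reversed sentence, then re-reverse
    s = bwdRNN.initial_state()
    bwd = []
    for we in reversed(wembs):
        s = s.add_input(we)
        bwd.append(s.output())
    bwd = list(reversed(bwd))
    # concatenate forward/backward state at each word, then score the tags
    return [dy.parameter(W_tag) * dy.concatenate([f, b]) + dy.parameter(b_tag)
            for f, b in zip(fwd, bwd)]

dy.renew_cg()
probs = [dy.softmax(sc) for sc in tag_scores([7, 13, 2, 42])]  # hypothetical word IDs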
SLIDE 91
Let’s Try it Out!
SLIDE 92
Recurrent Neural Networks in DyNet
SLIDE 96 Recurrent Neural Networks in DyNet
- Based on “*Builder” class (*=SimpleRNN/LSTM)
- Add parameters to model (once):
# Simple RNN (layers=1, input=64, hidden=128, model)
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)
- Add parameters to CG and get initial state (per sentence):
s = RNN.initial_state()
- Update state and access (per input word/character):
s = s.add_input(x_t)
h_t = s.output()
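Strung together, the three steps look roughly like this (a toy sketch; the zero vector stands in for a real embedding):

import dynet as dy

model = dy.ParameterCollection()
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)   # once, when building the model

dy.renew_cg()
s = RNN.initial_state()                        # per sentence
for _ in range(3):                             # per input word/character
    x_t = dy.inputVector([0.0] * 64)           # stand-in for a 64-dim embedding
    s = s.add_input(x_t)
    h_t = s.output()                           # 128-dim hidden state at this step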
SLIDE 97 RNNLM Example: Parameter Initialization
# Lookup parameters for word embeddings
WORDS_LOOKUP = model.add_lookup_parameters((nwords, 64))
# Word-level RNN (layers=1, input=64, hidden=128, model)
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)
# Softmax weights/biases on top of RNN outputs
W_sm = model.add_parameters((nwords, 128))
b_sm = model.add_parameters(nwords)
SLIDE 98 RNNLM Example: Sentence Initialization
# Build the language model graph
def calc_lm_loss(wids):
    dy.renew_cg()
    # parameters -> expressions
    W_exp = dy.parameter(W_sm)
    b_exp = dy.parameter(b_sm)
    # add parameters to CG and get state
    f_init = RNN.initial_state()
    # get the word vectors for each word ID
    wembs = [WORDS_LOOKUP[wid] for wid in wids]
    # Start the rnn by inputting "<s>"
    s = f_init.add_input(wembs[-1])
…
SLIDE 99 RNNLM Example: Loss Calculation and State Update
    # process each word ID and embedding
    losses = []
    for wid, we in zip(wids, wembs):
        # calculate and save the softmax loss
        score = W_exp * s.output() + b_exp
        loss = dy.pickneglogsoftmax(score, wid)
        losses.append(loss)
        # update the RNN state with the input
        s = s.add_input(we)
    # return the sum of all losses
    return dy.esum(losses)
…
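The slides stop at graph construction; a hedged sketch of the surrounding training loop might look as follows (it reuses model and calc_lm_loss from above; the trainer choice and the train_data list of word-ID sequences are assumptions):

# sketch of the outer training loop around calc_lm_loss
trainer = dy.SimpleSGDTrainer(model)
for epoch in range(10):
    total, words = 0.0, 0
    for wids in train_data:          # train_data: list of word-ID lists (assumed)
        loss = calc_lm_loss(wids)    # builds the graph for this sentence
        total += loss.value()        # forward pass
        words += len(wids)
        loss.backward()              # backprop through time
        trainer.update()             # apply the gradients
    print("epoch %d: per-word loss %.4f" % (epoch, total / words))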
SLIDE 100
Code Examples
sentiment-rnn.py
SLIDE 101
RNN Problems and Alternatives
SLIDE 103 Vanishing Gradient
- Gradients decrease as they get pushed back
- Why? “Squashed” by non-linearities or small
weights in matrices.
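A toy numerical illustration (not from the slides): each step multiplies the gradient by a local derivative, and if that factor is below one the product shrinks exponentially:

# Each backward step picks up one factor of the local derivative.
grad = 1.0
local_derivative = 0.25   # a typical tanh'(x) * weight magnitude, made up
for step in range(20):
    grad *= local_derivative
print(grad)               # ~9e-13 after 20 steps: the gradient has vanished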
SLIDE 104 A Solution:
Long Short-term Memory
(Hochreiter and Schmidhuber 1997)
SLIDE 107 A Solution:
Long Short-term Memory
(Hochreiter and Schmidhuber 1997)
- Basic idea: make additive connections between
time steps
- Addition does not modify the gradient, no vanishing
- Gates to control the information flow
SLIDE 108
LSTM Structure
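The structure figure itself is not reproduced in this transcript; as a rough sketch, the standard LSTM update (following Hochreiter and Schmidhuber 1997 plus the now-common forget gate) can be written as below — note the additive update of the cell c and the three gates:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # p holds weights/biases for the input (i), forget (f), output (o) gates
    # and the candidate cell value (g).
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    g = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # candidate
    c = f * c_prev + i * g      # additive connection between time steps
    h = o * np.tanh(c)          # gated hidden state
    return h, c

# tiny demo with random parameters (dimensions invented)
d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
p = {}
for g in "ifog":
    p["W_" + g] = rng.normal(size=(d_hid, d_in))
    p["U_" + g] = rng.normal(size=(d_hid, d_hid))
    p["b_" + g] = np.zeros(d_hid)
h = c = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, p)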
SLIDE 109
Other Alternatives
SLIDE 112 Other Alternatives
- Lots of variants of LSTMs (Hochreiter and
Schmidhuber, 1997)
- Gated recurrent units (GRUs; Cho et al., 2014)
- All follow the basic paradigm of “take input, update
state”
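In DyNet this shared paradigm means the builders are interchangeable; a one-line swap is all that changes (toy dimensions):

import dynet as dy

model = dy.ParameterCollection()
# All of these follow the same "take input, update state" interface, so the
# surrounding code (initial_state / add_input / output) stays unchanged:
rnn  = dy.SimpleRNNBuilder(1, 64, 128, model)   # Elman RNN
lstm = dy.LSTMBuilder(1, 64, 128, model)        # LSTM
gru  = dy.GRUBuilder(1, 64, 128, model)         # GRU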
SLIDE 113
Code Examples
sentiment-lstm.py lm-lstm.py
SLIDE 114
Efficiency/Memory Tricks
SLIDE 115
Handling Mini-batching
SLIDE 119 Handling Mini-batching
- Mini-batching makes things much faster!
- But mini-batching in RNNs is harder than in feed-
forward networks
- Each word depends on the previous word
- Sequences are of various length
SLIDE 126 Mini-batching Method
this is an example </s>
this is another </s> </s>   ← padding
Loss calculation is multiplied by a mask (1 for real words, 0 for padded positions)
(Or use DyNet automatic mini-batching, much easier but a bit slower)
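A hedged sketch of the padding-plus-mask picture above, loosely following lm-minibatch.py; the batched calls (dy.lookup_batch, dy.pickneglogsoftmax_batch, dy.sum_batches) and the reshape-based mask trick reflect my understanding of the DyNet Python API, and all word IDs are invented:

import dynet as dy

# Two sentences padded to the same length (</s> = ID 0 here, an assumption).
batch_wids = [[8, 9, 10, 11, 0],   # "this is an example </s>"
              [8, 9, 12, 0,  0]]   # "this is another </s>" + padding
masks = [[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 0]]          # 0 = padded position, excluded from the loss

model = dy.ParameterCollection()
EMB = model.add_lookup_parameters((100, 64))
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)
W = model.add_parameters((100, 128))
b = model.add_parameters(100)

dy.renew_cg()
W_e, b_e = dy.parameter(W), dy.parameter(b)
s = RNN.initial_state()
losses = []
T = len(batch_wids[0])
for t in range(T - 1):
    cur = [sent[t] for sent in batch_wids]          # words at position t, whole batch
    nxt = [sent[t + 1] for sent in batch_wids]      # words to predict
    mask = [m[t + 1] for m in masks]
    s = s.add_input(dy.lookup_batch(EMB, cur))      # one batched lookup per step
    score = W_e * s.output() + b_e
    loss = dy.pickneglogsoftmax_batch(score, nxt)   # one loss per batch element
    if 0 in mask:                                   # zero out losses on padding
        mask_expr = dy.reshape(dy.inputVector(mask), (1,), len(mask))
        loss = dy.cmult(loss, mask_expr)
    losses.append(loss)
total_loss = dy.sum_batches(dy.esum(losses))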
SLIDE 127
Bucketing/Sorting
SLIDE 129 Bucketing/Sorting
- If we use sentences of different lengths, too much
padding and sorting can result in decreased performance
- To remedy this: sort sentences so that sentences of similar length end up in the same batch
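A small sketch of the bucketing idea (my own helper, not from the course code): sort by length, cut into batches, then shuffle the batch order so training stays stochastic:

import random

def make_batches(sentences, batch_size):
    # Group sentences of similar length into the same mini-batch.
    ordered = sorted(sentences, key=len)            # sort by length
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)   # shuffle the batch order to keep training stochastic
    return batches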
SLIDE 130
Code Example
lm-minibatch.py
SLIDE 131
Handling Long Sequences
SLIDE 134 Handling Long Sequences
- Sometimes we would like to capture long-term
dependencies over long sequences
- e.g. words in full documents
- However, this may not fit on (GPU) memory
SLIDE 153 Truncated BPTT
- Backprop over shorter segments, initialize w/ the state from the previous segment
[diagram] 1st pass: run the RNN over “I hate this movie”; 2nd pass: run it over “It is so bad”, initialized with the final state of the 1st pass (“state only, no backprop” across the boundary)
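A hedged DyNet sketch of truncated BPTT: the final state of one segment is saved as plain numbers and re-injected as the initial state of the next, so values are carried forward but gradients stop at the segment boundary (this assumes initial_state accepts a list of state expressions and RNNState.s() returns them, as I understand the DyNet Python API; the word IDs are toys):

import dynet as dy

model = dy.ParameterCollection()
EMB = model.add_lookup_parameters((100, 64))
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)
W = model.add_parameters((100, 128))
b = model.add_parameters(100)
trainer = dy.SimpleSGDTrainer(model)

# Two consecutive segments of one long sequence (toy word IDs).
segments = [[7, 13, 2, 42], [5, 9, 17, 3]]

prev_state = None                      # numeric values only, no graph attached
for seg in segments:
    dy.renew_cg()                      # fresh graph: gradients cannot flow back
    W_e, b_e = dy.parameter(W), dy.parameter(b)
    if prev_state is None:
        s = RNN.initial_state()
    else:
        # re-inject the previous segment's final state as constants
        # ("state only, no backprop" across the segment boundary)
        s = RNN.initial_state([dy.inputTensor(v) for v in prev_state])
    losses = []
    for wid, nxt in zip(seg, seg[1:]):
        s = s.add_input(EMB[wid])
        score = W_e * s.output() + b_e
        losses.append(dy.pickneglogsoftmax(score, nxt))  # predict the next word
    s = s.add_input(EMB[seg[-1]])                        # consume the last word too
    total = dy.esum(losses)
    prev_state = [h.npvalue() for h in s.s()]            # save state values, detached
    total.backward()                                     # BPTT within this segment only
    trainer.update()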
SLIDE 154
Pre-training/Transfer
for RNNs
SLIDE 155
RNN Strengths/Weaknesses
SLIDE 158 RNN Strengths/Weaknesses
- RNNs, particularly deep RNNs/LSTMs, are quite
powerful and flexible
- But they require a lot of data
- Also have trouble with weak error signals passed
back from the end of the sentence
SLIDE 159
Pre-training/Transfer
SLIDE 162 Pre-training/Transfer
- Train for one task, solve another
- Pre-training task: Big data, easy to learn
- Main task: Small data, harder to learn
SLIDE 166 Example:
LM -> Sentence Classifier
(Luong et al. 2015)
- Train a language model first: lots of data, easy-to-
learn objective
- Sentence classification: little data, hard-to-learn objective
- Results in much better classification, competitive with or better than CNN-based methods
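A hedged sketch of the transfer recipe (the general idea, not Luong et al.'s exact setup): pre-train the shared embedding and RNN parameters with a language-modeling head, then reuse the very same parameters under a classification head and fine-tune on the small labeled set:

import dynet as dy

model = dy.ParameterCollection()
EMB = model.add_lookup_parameters((100, 64))
RNN = dy.SimpleRNNBuilder(1, 64, 128, model)
W_lm = model.add_parameters((100, 128))     # LM softmax (pre-training head)
b_lm = model.add_parameters(100)
W_cls = model.add_parameters((2, 128))      # classifier head (main task)
b_cls = model.add_parameters(2)
trainer = dy.SimpleSGDTrainer(model)

def lm_loss(wids):                          # pre-training objective: next-word prediction
    dy.renew_cg()
    W_e, b_e = dy.parameter(W_lm), dy.parameter(b_lm)
    s = RNN.initial_state()
    losses = []
    for wid, nxt in zip(wids, wids[1:]):
        s = s.add_input(EMB[wid])
        losses.append(dy.pickneglogsoftmax(W_e * s.output() + b_e, nxt))
    return dy.esum(losses)

def cls_loss(wids, label):                  # main task: reuses EMB and RNN as-is
    dy.renew_cg()
    s = RNN.initial_state()
    for wid in wids:
        s = s.add_input(EMB[wid])
    score = dy.parameter(W_cls) * s.output() + dy.parameter(b_cls)
    return dy.pickneglogsoftmax(score, label)

# 1) pre-train on lots of unlabeled text with lm_loss, 2) fine-tune on the
# small labeled set with cls_loss; both stages update the shared EMB and RNN.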
SLIDE 167
Why Pre-training?
SLIDE 170 Why Pre-training?
- The model learns consistencies in the data (Karpathy et al. 2015)
- Model learns syntax (Shi et al. 2017) or semantics (Radford et al. 2017)
SLIDE 171
Questions?