CS11-747 Neural Networks for NLP
Recurrent Neural Networks
Graham Neubig
Site https://phontron.com/class/nn4nlp2020/
NLP and Sequential Data
NLP is full of sequential data: words in sentences, characters in words, sentences in discourse.
Long-distance dependencies: the right word choice can depend on words far away in the sentence.
He does not have very much confidence in himself. / She does not have very much confidence in herself.
The reign has lasted as long as the life of the queen. / The rain has lasted as long as the life of the clouds.
The trophy would not fit in the brown suitcase because it was too big. (it = the trophy)
The trophy would not fit in the brown suitcase because it was too small. (it = the suitcase)
(from the Winograd Schema Challenge: http://commonsensereasoning.org/winograd.html)
Feed-forward NN: context → lookup → transform → predict → label
Recurrent NN: context → lookup → transform → predict → label, where the transform also feeds its own output back in at the next time step
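To make the contrast concrete, here is a minimal sketch of the two update rules. It uses PyTorch rather than the DyNet mentioned later in these slides, and all dimensions and names are illustrative assumptions, not the lecture's code.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim = 64, 128

# Feed-forward: the prediction sees only the current fixed-size context.
feedforward = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.Tanh())

# Recurrent: the transform also consumes its own output from the previous
# time step, so information can persist across arbitrarily long contexts.
W_x = nn.Linear(emb_dim, hid_dim)   # input -> hidden
W_h = nn.Linear(hid_dim, hid_dim)   # previous hidden -> hidden

def rnn_step(x_t, h_prev):
    # Elman RNN update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return torch.tanh(W_x(x_t) + W_h(h_prev))
```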
[Figure: an RNN unrolled in time over "I hate this movie"; at each step an RNN cell reads one word, passes its state forward, and a label is predicted.]
[Figure: training the unrolled RNN on "I hate this movie"; predictions 1-4 are compared against labels 1-4 to give losses 1-4, which are summed into the total loss.]
The unrolled network is just a computation graph, so we can run backprop as usual, with gradients aggregated across all time steps; this is called "backpropagation through time" (BPTT).
Parameters are shared! Derivatives are accumulated.
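A minimal sketch of this training setup (PyTorch; the word/label ids and sizes are made up for illustration): one loss per step, summed into a total loss, with a single backward call doing BPTT through the shared parameters.

```python
import torch
import torch.nn as nn

vocab_size, n_labels, emb_dim, hid_dim = 1000, 4, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNNCell(emb_dim, hid_dim)    # the same cell (shared parameters) at every step
out = nn.Linear(hid_dim, n_labels)
loss_fn = nn.CrossEntropyLoss()

words = torch.tensor([3, 41, 7, 99])  # "I hate this movie" as made-up ids
labels = torch.tensor([0, 1, 2, 3])

h = torch.zeros(1, hid_dim)
total_loss = torch.zeros(())
for word, label in zip(words, labels):
    h = rnn(embed(word).unsqueeze(0), h)                    # one step of the RNN
    total_loss = total_loss + loss_fn(out(h), label.unsqueeze(0))

total_loss.backward()   # BPTT: derivatives from every step accumulate
                        # into the shared parameters
```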
[Figure: the RNN reads all of "I hate this movie" and a single prediction is made from the final state, representing the whole sentence.]
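A sketch of using the final state as a sentence representation (illustrative PyTorch; the ids, sizes, and the two-class sentiment-style classifier are assumptions):

```python
import torch
import torch.nn as nn

embed = nn.Embedding(1000, 64)
rnn = nn.RNN(64, 128, batch_first=True)
classifier = nn.Linear(128, 2)           # e.g., positive vs. negative

words = torch.tensor([[3, 41, 7, 99]])   # (batch=1, seq_len=4), made-up ids
_, h_final = rnn(embed(words))           # h_final: (1, batch, 128)
logits = classifier(h_final.squeeze(0))  # one prediction for the whole sentence
```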
[Figure: the RNN again reads "I hate this movie", but now a label is predicted at every step, representing each word in the context of the words preceding it.]
[Figure: an RNN language model. Starting from <s>, the RNN reads "I hate this movie" and at each step predicts the next word: I, hate, this, movie, </s>.]
Language modeling is sequence labeling where each tag is the next word!
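A sketch of this "each tag is the next word" setup (illustrative PyTorch; the tiny vocabulary is an assumption): inputs are the sentence shifted right with <s>, targets are the sentence shifted left with </s>.

```python
import torch
import torch.nn as nn

vocab = ["<s>", "</s>", "I", "hate", "this", "movie"]
ids = {w: i for i, w in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 32)
rnn = nn.RNN(32, 64, batch_first=True)
out = nn.Linear(64, len(vocab))

sent = ["<s>", "I", "hate", "this", "movie", "</s>"]
x = torch.tensor([[ids[w] for w in sent[:-1]]])  # inputs:  <s> I hate this movie
y = torch.tensor([[ids[w] for w in sent[1:]]])   # targets: I hate this movie </s>

states, _ = rnn(embed(x))
logits = out(states)                             # (1, 5, |V|): one softmax per step
loss = nn.CrossEntropyLoss()(logits.view(-1, len(vocab)), y.view(-1))
```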
[Figure: a bidirectional RNN tagger. A forward RNN reads "I hate this movie" left to right while a backward RNN reads it right to left ("movie this hate I"); the two states at each position are concatenated and fed through a softmax, producing the tags PRN, VB, DET, NN.]
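A sketch of such a bidirectional tagger (illustrative PyTorch; ids and sizes are made up). PyTorch's bidirectional LSTM concatenates the forward and backward states internally, matching the "concat" step in the figure:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(1000, 64)
birnn = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
tagger = nn.Linear(2 * 128, 4)           # 4 tags, e.g. PRN, VB, DET, NN

words = torch.tensor([[3, 41, 7, 99]])   # "I hate this movie" as made-up ids
states, _ = birnn(embed(words))          # (1, 4, 256): forward + backward, concatenated
tag_scores = tagger(states)              # a softmax over these gives per-word tags
```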
Vanishing gradients: as gradients are passed back through many time steps they shrink toward zero, squashed by non-linearities and small weights in matrices.
A solution: long short-term memory (LSTM; Hochreiter and Schmidhuber 1997), whose basic idea is to make additive connections between time steps.
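A sketch of one LSTM step, written out to highlight the additive cell-state update (illustrative PyTorch; the parameter names and sizes are assumptions, with biases folded into the linear layers):

```python
import torch
import torch.nn as nn

emb_dim, hid_dim = 64, 128
gate = lambda: nn.Linear(emb_dim + hid_dim, hid_dim)
W_i, W_f, W_o, W_g = gate(), gate(), gate(), gate()

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([x_t, h_prev], dim=-1)
    i = torch.sigmoid(W_i(z))       # input gate
    f = torch.sigmoid(W_f(z))       # forget gate
    o = torch.sigmoid(W_o(z))       # output gate
    g = torch.tanh(W_g(z))          # candidate update
    c = f * c_prev + i * g          # additive connection between time steps:
                                    # gradients flow through the + unsquashed
    h = o * torch.tanh(c)
    return h, c

h = c = torch.zeros(1, hid_dim)
h, c = lstm_step(torch.randn(1, emb_dim), h, c)
```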
What can LSTMs learn? Individual cells have been observed to track interpretable quantities, e.g., counting the length of the sentence, or tracking sentiment.
Mini-batching in RNNs is harder than in feed-forward networks: each time step depends on the previous one, and sentences in a batch have different lengths.
[Figure: a padded mini-batch, "this is an example </s>" and "this is another </s>" padded to the same length; a loss-calculation mask (1 for real tokens, 0 for padding) zeroes out the loss at padded positions.]
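A sketch of loss masking (illustrative PyTorch; the ids, pad index 0, and random stand-in logits are assumptions):

```python
import torch
import torch.nn as nn

# Two target sequences padded to the same length; 0 is the (assumed) pad id.
targets = torch.tensor([[4, 5, 6, 7, 1],
                        [4, 5, 8, 1, 0]])
mask = (targets != 0).float()            # 1 for real tokens, 0 for padding

logits = torch.randn(2, 5, 10)           # stand-in model outputs, |V| = 10
per_token = nn.CrossEntropyLoss(reduction="none")(
    logits.view(-1, 10), targets.view(-1)).view(2, 5)
loss = (per_token * mask).sum() / mask.sum()   # padding contributes zero loss
```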
(Or use DyNet automatic mini-batching, much easier but a bit slower)
Bucketing/sorting: excessive padding wastes computation and can result in decreased performance; to remedy this, sort the data so that similarly-lengthed sentences are in the same batch.
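A minimal sketch of length-based bucketing (plain Python; the toy data and batch size are assumptions):

```python
# Toy corpus; real data would be id sequences.
sentences = [["this", "is", "an", "example"], ["hi"], ["this", "is", "another"]]
sentences.sort(key=len)                  # similar lengths end up adjacent
batch_size = 2
batches = [sentences[i:i + batch_size]
           for i in range(0, len(sentences), batch_size)]
```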
Optimized implementations of LSTMs (Appleyard 2015): instead of one GPU call for each time step, combine the inputs into a single tensor and make one GPU call for the whole sequence, computation supported by CuDNN.
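A sketch of the difference (illustrative PyTorch): looping over an LSTMCell issues work step by step, while handing the whole tensor to nn.LSTM lets a fused CuDNN kernel handle the sequence in one call on a GPU.

```python
import torch
import torch.nn as nn

x = torch.randn(16, 50, 64)              # (batch, time, features)

# One call per time step:
cell = nn.LSTMCell(64, 128)
h = c = torch.zeros(16, 128)
for t in range(x.size(1)):
    h, c = cell(x[:, t], (h, c))

# Single call for the whole sequence (fused CuDNN kernels on a GPU):
lstm = nn.LSTM(64, 128, batch_first=True)
outputs, (h_n, c_n) = lstm(x)
```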
The LSTM's additive connections carry information between time steps without the gradient squashing caused by repeated non-linear transformations.
Many types of architectures have been tested for LSTMs (Greff et al. 2015): the standard LSTM is quite good, and simplified variants (e.g., coupled input/forget gates) are reasonable.
Handling long sequences: sometimes we want to capture dependencies over long sequences, longer than we can afford to backpropagate through. Solution: process the sequence in segments, initializing each segment with the state from the previous segment.
[Figure: a 1st pass over "I hate this movie" and a 2nd pass over "It is so bad"; only the hidden state is carried between passes (state only, no backprop).]
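A sketch of this truncated backpropagation pattern (illustrative PyTorch; the random segments and stand-in loss are assumptions): detaching the state carries it forward without backpropagating across the segment boundary.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(64, 128, batch_first=True)
segments = [torch.randn(1, 4, 64), torch.randn(1, 4, 64)]   # two passes

h = torch.zeros(1, 1, 128)
for segment in segments:
    out, h = rnn(segment, h)
    loss = out.pow(2).mean()    # stand-in for the real per-segment loss
    loss.backward()             # backprop within this segment only
    h = h.detach()              # carry the state forward, but cut the graph
```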