CS11-747 Neural Networks for NLP
Attention
Graham Neubig
Site https://phontron.com/class/nn4nlp2017/
Encoder-decoder Models (Sutskever et al. 2014)
[Figure: an encoder LSTM reads the source sentence "kono eiga ga kirai </s>" into a single vector; a decoder LSTM then generates "I hate this movie </s>" one word at a time, picking each word by argmax.]
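To make the picture concrete, here is a minimal sketch of such an encoder-decoder, assuming PyTorch; the vocabulary sizes, dimensions, and greedy decoding loop are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TRG_VOCAB, EMB, HID = 1000, 1000, 64, 128

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.trg_emb = nn.Embedding(TRG_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TRG_VOCAB)

    def greedy_decode(self, src, max_len=20, bos=1):
        # The whole source sentence is crammed into the final encoder state.
        _, state = self.encoder(self.src_emb(src))
        tok = torch.full((src.size(0), 1), bos, dtype=torch.long)
        out_tokens = []
        for _ in range(max_len):
            dec, state = self.decoder(self.trg_emb(tok), state)
            tok = self.out(dec).argmax(-1)  # argmax picks the next word
            out_tokens.append(tok)
        return torch.cat(out_tokens, dim=1)

model = Seq2Seq()
src = torch.randint(0, SRC_VOCAB, (1, 5))  # stands in for "kono eiga ga kirai </s>"
print(model.greedy_decode(src).shape)      # (1, 20)
```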
Sentence Representations
Problem! “You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!” — Ray Mooney
But what if we could use multiple vectors, based on the length of the sentence?
[Figure: "this is an example" encoded as one single vector vs. as one vector per word.]
Attention: Basic Idea (Bahdanau et al. 2015)
Encode each word in the input sentence into a vector. When decoding, perform a linear combination of these vectors, weighted by "attention weights", and use this combination in picking the next word.
Calculating Attention (1)
Use a "query" vector (decoder state) and "key" vectors (all encoder states). For each query-key pair, calculate a weight, e.g. for the query after "I hate" over the keys "kono eiga ga kirai": a1=2.1, a2=-0.1, a3=0.3, a4=-1.0.
Normalize the weights to add to one using softmax: α1=0.76, α2=0.08, α3=0.13, α4=0.03.
Calculating Attention (2)
Combine together the value vectors (usually the encoder states, like the key vectors) by taking the sum weighted by α1=0.76, α2=0.08, α3=0.13, α4=0.03, and use the result anywhere in the model you like.
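In code, the whole computation is a few lines. The following numpy sketch reproduces the toy numbers above; the value vectors are random placeholders.

```python
import numpy as np

# One unnormalized score per source word, as on the slide.
a = np.array([2.1, -0.1, 0.3, -1.0])

# Normalize to attention weights with softmax.
alpha = np.exp(a - a.max())
alpha /= alpha.sum()
print(alpha.round(2))                          # [0.76 0.08 0.13 0.03]

# Combine the value vectors by a weighted sum to get the context vector.
values = np.random.default_rng(0).standard_normal((4, 8))
context = alpha @ values                       # shape (8,)
print(context.shape)
```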
Attention Score Functions (q is the query, k is the key)
Multi-layer Perceptron (Bahdanau et al. 2015): a(q, k) = w2^T tanh(W1 [q; k]). Flexible, and often good with large amounts of data.
Bilinear (Luong et al. 2015): a(q, k) = q^T W k.
Dot Product (Luong et al. 2015): a(q, k) = q^T k. No parameters, but requires the query and key to be the same size.
Scaled Dot Product (Vaswani et al. 2017): a(q, k) = q^T k / sqrt(|k|). The scale of the dot product increases as the dimensions get larger, so divide the score by the square root of the vector size.
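A numpy sketch of the four score functions; the dimensions and random parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Multi-layer perceptron (Bahdanau et al. 2015)
W1, w2 = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
mlp_score = w2 @ np.tanh(W1 @ np.concatenate([q, k]))

# Bilinear (Luong et al. 2015)
W = rng.standard_normal((d, d))
bilinear_score = q @ W @ k

# Dot product (Luong et al. 2015): no parameters, sizes must match
dot_score = q @ k

# Scaled dot product (Vaswani et al. 2017): divide by sqrt of key size
scaled_score = q @ k / np.sqrt(d)

print(mlp_score, bilinear_score, dot_score, scaled_score)
```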
Hierarchical Attention (Yang et al. 2016)
Encode with attention over each sentence, then attention over each sentence in the document.
Intra-Attention / Self Attention (Cheng et al. 2016)
Each element in the sentence attends to the other elements → context-sensitive encodings!
[Figure: each word of "this is an example" attending to the other words of the same sentence.]
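A minimal numpy sketch of self attention with scaled dot-product scores; the input vectors are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                                   # e.g. "this is an example"
X = rng.standard_normal((T, d))               # one vector per word

# Every word queries every other word (scaled dot-product scores).
scores = X @ X.T / np.sqrt(d)                 # (T, T)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)             # each row sums to one

context = A @ X                               # context-sensitive encodings
print(context.shape)                          # (4, 8)
```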
Coverage
Problem: neural models tend to drop or repeat content.
Solution: model how much each source word has been covered, e.g. impose a penalty if attention over each word does not sum to approximately one (Cohn et al. 2015), or add embeddings indicating coverage (Mi et al. 2016). A penalty of the first kind is sketched below.
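A hedged numpy sketch of one such coverage penalty, using a squared deviation from 1 per source word; the exact form in the cited papers may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Attention over a whole translation: rows = target steps,
# columns = source words; each row sums to one.
A = rng.random((6, 4))
A /= A.sum(axis=1, keepdims=True)

coverage = A.sum(axis=0)                 # total attention each source word got
penalty = ((1.0 - coverage) ** 2).sum()  # punish words not covered about once
print(coverage.round(2), penalty.round(3))
```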
Incorporating Markov Properties (Cohn et al. 2015)
Intuition: attention from the last time step tends to be correlated with attention this time.
Approach: add information about the last attention when making the next decision.
Bidirectional Training (Cohn et al. 2015)
Intuition: attention should be roughly similar in the forward and backward directions.
Approach: train in both directions (X→Y and Y→X) so that the attention matrices match, with an agreement term based on the trace of the matrix product.
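A small numpy sketch of the trace-based agreement term, assuming A_fwd is the X→Y attention matrix and A_bwd the Y→X one; both are illustrative random matrices here.

```python
import numpy as np

rng = np.random.default_rng(0)
A_fwd = rng.random((5, 6))                    # |X| x |Y|, rows sum to one
A_fwd /= A_fwd.sum(axis=1, keepdims=True)
A_bwd = rng.random((6, 5))                    # |Y| x |X|, rows sum to one
A_bwd /= A_bwd.sum(axis=1, keepdims=True)

# tr(A_fwd @ A_bwd) is large when the two directions put attention
# on the same word pairs; adding it to the objective rewards agreement.
agreement = np.trace(A_fwd @ A_bwd)
print(agreement)
```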
Supervised Training (Mi et al. 2016)
Sometimes we can get alignments a-priori, e.g. hand-annotated alignments or pseudo-alignments from a strong alignment model, and train the attention to match them.
Attention is not Alignment! (Koehn and Knowles 2017)
Attention is often blurred, and is often off by one word.
Hard Attention
Instead of a soft interpolation, make a zero-one decision about where to attend (Xu et al. 2015). Harder to train: requires methods such as reinforcement learning (see later classes).
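A minimal numpy sketch of the zero-one decision: sample one position from the attention distribution and use that value vector alone. The sampling step is not differentiable, which is why training needs reinforcement learning.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.76, 0.08, 0.13, 0.03])  # soft attention weights from before
values = rng.standard_normal((4, 8))        # one value vector per source word

idx = rng.choice(len(alpha), p=alpha)       # zero-one decision: pick one word
context = values[idx]                       # hard: a single vector, not a mix
print(idx, context.shape)
```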
Monotonic Attention (e.g. Yu et al. 2016)
In some cases attention is more or less monotonic, e.g. in speech recognition or perhaps summarization (?), and can be biased or constrained accordingly.
Multiple Attention Heads
Idea: different heads can attend for different reasons, e.g. learning patterns like "a name comes after ‘Mr.’", etc.
Multiple independently learned heads (Vaswani et al. 2017); different heads for "copy" vs. regular attention (Allamanis et al. 2016).
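A numpy sketch of multi-headed attention with independently parameterized heads; the number of heads and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, H = 4, 8, 2                              # H independently learned heads
X = rng.standard_normal((T, d))

# One projection per head; each head attends in its own d//H-dim subspace.
Wq, Wk, Wv = (rng.standard_normal((H, d, d // H)) for _ in range(3))

heads = []
for h in range(H):
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]  # (T, d/H) each
    S = Q @ K.T / np.sqrt(d // H)              # scaled dot-product scores
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    heads.append(A @ V)

out = np.concatenate(heads, axis=1)            # concatenate back to (T, d)
print(out.shape)
```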
"Attention is All You Need" (Vaswani et al. 2017)
A sequence-to-sequence model based entirely on attention.
Strong results on standard WMT datasets.
Fast: only matrix multiplications.
Attention Tricks
Self attention: each layer combines words with others.
Multi-headed attention: 8 attention heads learned independently.
Normalized dot-product attention: remove the bias in the dot product when using large networks.
Positional encodings: make sure that even if we don't have an RNN, we can still distinguish positions (sketched below).
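One common choice here is the sinusoidal positional encoding of Vaswani et al. (2017), sketched in numpy below.

```python
import numpy as np

def sinusoidal_positions(T, d):
    # Even dimensions get sin, odd dimensions get cos, with wavelengths in a
    # geometric progression, so every position gets a distinguishable code.
    pos = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angle = pos / (10000 ** (2 * i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

print(sinusoidal_positions(4, 8).round(2))
```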
Training Tricks
Layer normalization: helps ensure that layers remain in a reasonable range.
Specialized training schedule: adjust the default learning rate of the Adam optimizer.
Label smoothing: insert some uncertainty into the training process.
Masking for efficient training: train on all decoding steps of a sentence at once, possible using big matrix multiplies (sketched after the figure below).
[Figure: masked training for the source "kono eiga ga kirai" and target "I hate this movie </s>".]
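A numpy sketch of the masking idea: an upper-triangular mask blocks attention to future words, so every decoding step of the sentence is computed in one big matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                                    # e.g. "I hate this movie </s>"
X = rng.standard_normal((T, d))

scores = X @ X.T / np.sqrt(d)
# Mask out attention to future words before the softmax.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -1e9

A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
print(A.round(2))                              # lower-triangular weights
```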