Eric Mintun
HEP-AI Journal Club May 15th, 2018
Outline
- Motivating example and definition
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
- Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]
[Figure: vanilla encoder–decoder translation. An encoder RNN reads the English sentence "The agreement on the European Economic Area was signed in 1992 . <end>" and compresses it into a single context vector c; a decoder RNN then unrolls c into the French "L' accord sur la zone économique européenne a été signé en août 1992 . <end>".]
A single fixed-length context vector works for short sentences, but fails later in long sentences.
[Figure: attention-based encoder–decoder. The encoder produces one annotation h1 … h13 per English word; at each decoding step the previous decoder state s_{i−1} is scored against every annotation via α_ji(h_j, s_{i−1}), and the weighted annotations are summed into the decoder's context.]
The context vector for output word i becomes

c_i = Σ_j α_ji h_j,   with 0 ≤ α_ji ≤ 1 and Σ_j α_ji = 1.
In general, an attention layer compares a query Q against keys K_i to produce weights w_i; a closer match lets more of the corresponding value V_i through:

w_i = Compare(K_i, Q),   Σ_i w_i = 1,   output = Σ_i w_i V_i.
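This compare-and-blend step can be sketched in a few lines; a minimal sketch, assuming a dot product as the Compare step and a softmax to make the weights non-negative and sum to 1 (all sizes and numbers here are illustrative):

```python
import numpy as np

def soft_attention(query, keys, values):
    """Compare the query against each key, then pass a weighted
    blend of the values through. Softmax weights satisfy
    0 <= w_i <= 1 and sum_i w_i = 1."""
    scores = keys @ query                   # dot-product Compare step
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ values, w

# Orthogonal keys make the match obvious: the query lines up with K[2].
K = np.eye(3)
V = np.array([[10.0, 0.0, 0.0],
              [0.0, 10.0, 0.0],
              [0.0, 0.0, 10.0]])
q = np.array([0.0, 0.0, 5.0])
context, w = soft_attention(q, K, V)
# w concentrates on index 2, so the context is close to V[2]
```

Because the weights are a distribution rather than a hard choice, a little of every value always leaks through, which is what keeps the layer differentiable.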
[Figure: attention step in detail. Encoder annotations h1 … h5 cover "The agreement on the European"; the previous decoder state s_{i−1}, having emitted "L' accord sur la zone économique", is compared against each annotation via α_ji(h_j, s_{i−1}), and the weighted sum ⊕ forms the context c_i.]
Additive Attention
- Query: s_{i−1}
- Keys: h_j
- Values: h_j
- Compare: α_ji(h_j, s_{i−1})
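The additive Compare step can be sketched as a small feed-forward network over h_j and s_{i−1}, in the style of Bahdanau et al.; the weight names (Wa, Ua, v) and all sizes below are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, Wa, Ua, v):
    """alpha_ji = softmax_j( v . tanh(Wa s_{i-1} + Ua h_j) );
    the context c_i is the alpha-weighted sum of annotations h_j."""
    scores = np.array([v @ np.tanh(Wa @ s_prev + Ua @ h) for h in H])
    alpha = softmax(scores)
    c = alpha @ H              # keys and values are both the h_j
    return c, alpha

rng = np.random.default_rng(1)
d_s, d_h, d_a, n = 4, 3, 5, 6          # illustrative sizes
Wa = rng.standard_normal((d_a, d_s))
Ua = rng.standard_normal((d_a, d_h))
v = rng.standard_normal(d_a)
H = rng.standard_normal((n, d_h))      # encoder annotations h_1 .. h_n
s_prev = rng.standard_normal(d_s)      # previous decoder state
c, alpha = additive_attention(s_prev, H, Wa, Ua, v)
# alpha is a length-6 distribution over source positions; c lives in
# the annotation space (dimension 3).
```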
What if the attention target has known structure? E.g.:
- A connected subsequence in the encoder (character to word conversion).
- A tree shared between input and output (parsing, equation input and output).
Structure is added by changing the annotation function: divide the latent variables into cliques:
c = E_{z∼p(z|x,q)}[f(x, z)] = Σ_{i=1}^n p(z = i | x, q) x_i

Standard attention is the special case z ∈ {1, …, n}, f(x, z) = x_z, with p(z = i | x, q) = α_i(k, q). Structured attention factors the distribution over cliques C:

c = Σ_C E_{z_C∼p(z_C|x,q)}[f_C(x, z_C)],   p(z | x, q; θ) = softmax(Σ_C θ_C(z_C))
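The expectation view of attention can be checked numerically in a few lines: with f(x, z) = x_z, taking the expectation over the latent target z reproduces the familiar weighted sum (all numbers below are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# n = 3 annotations x_i, and made-up scores from comparing keys
# against a query.
x = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
theta = np.array([0.2, 1.5, -0.3])
p = softmax(theta)                  # p(z = i | x, q)

# Expectation of f(x, z) = x_z over the latent attention target z:
c = sum(p[i] * x[i] for i in range(len(p)))
# c coincides with the usual weighted sum p @ x.
```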
Example: segmentation attention. Each binary variable z_i ∈ {0, 1} marks whether position i is selected:

f(x, z) = Σ_{i=1}^n 1{z_i = 1} x_i,   c = E_{z_1,…,z_n}[f(x, z)] = Σ_{i=1}^n p(z_i = 1 | x, q) x_i

Simple version, with independent selections: p(z_i = 1 | x, q) = sigmoid(θ_i). Structured version, with pairwise potentials coupling neighbors:

p(z_1, …, z_n) = softmax(Σ_{i=1}^{n−1} θ_{i,i+1}(z_i, z_{i+1}))
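For small n, the marginals p(z_i = 1 | x, q) of the chain-structured version can be computed by brute force: score every one of the 2^n segmentations with the pairwise potentials and softmax over them. This is only a sketch to make the math concrete; at realistic lengths one would use the forward–backward algorithm instead, and the potentials below are made up:

```python
import itertools
import numpy as np

def chain_marginals(theta):
    """theta: (n-1, 2, 2) pairwise potentials theta_{i,i+1}(z_i, z_{i+1}).
    Returns p(z_i = 1) under p(z) = softmax(sum_i theta_{i,i+1}(z_i, z_{i+1}))."""
    n = theta.shape[0] + 1
    configs = list(itertools.product([0, 1], repeat=n))
    scores = np.array([sum(theta[i, z[i], z[i + 1]] for i in range(n - 1))
                       for z in configs])
    p = np.exp(scores - scores.max())
    p /= p.sum()                       # softmax over all 2^n configurations
    marg = np.zeros(n)
    for prob, z in zip(p, configs):
        marg += prob * np.array(z)     # accumulate p(z_i = 1)
    return marg

# Potentials that strongly reward neighbors agreeing (z_i == z_{i+1}):
n = 5
theta = np.zeros((n - 1, 2, 2))
theta[:, 0, 0] = theta[:, 1, 1] = 2.0
marg = chain_marginals(theta)
# All-0 and all-1 segmentations dominate symmetrically, so each
# marginal is 0.5; these marginals are the weights on the x_i in c.
```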
[Figure: attention maps for variants (a), (b), (c) compared with the truth.]
Example: syntactic attention. The latent structure is a dependency tree over the source words, so it need not depend on the decoder's location. Let z_ij = 1 mean that word i is the parent of word j:

p(z | x, q) = softmax( 1{z is valid} Σ_{i≠j} 1{z_ij = 1} θ_ij ),   c_j = Σ_{i=1}^n p(z_ij = 1 | x, q) x_i
[Figure: attention maps from simple vs. structured attention.]
Can recurrence be dropped entirely for sequential tasks?
- Self-attention: each position attends over the whole sequence and returns a weighted sum of values.
- Every layer maps its input sequence to a new sequence which is fed into the next layer.
- The decoder takes the encoder output as additional input.
The Transformer:
- Regular attention: keys and values from the encoder, query from the decoder.
- Self-attention: keys, values, and queries all from the previous layer.
- Layers are stacked a fixed number N of times.
- Decoder self-attention is masked to prevent attending to words that were written later.
- Outputs probabilities for just the next word.
- Encoder input is the entire sequence, of size n × d_model; decoder input is the sequence generated so far.
- Positional encoding: pointwise add sinusoids to the input features.
- All linear layers are applied per position with weight sharing.
Multi-head attention:
- Learn linear projections into h separate vectors of size d_model/h.
- Run h separate multiplicative (dot-product) attention steps.
- Scale each dot product by 1/(d_model/h)^{1/2} before the softmax.
- After concatenation, the dimension is d_model again.
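A minimal numpy sketch of multi-head scaled dot-product self-attention, with random matrices standing in for the learned projections and illustrative sizes throughout; the optional causal mask sets scores on later positions to −∞ so a position cannot attend to words written after it:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, proj_q, proj_k, proj_v, proj_o,
                         n_heads, causal=False):
    """Self-attention: queries, keys, and values all come from X (n, d_model)."""
    n, d_model = X.shape
    d_k = d_model // n_heads
    heads = []
    for h in range(n_heads):
        Q, K, V = X @ proj_q[h], X @ proj_k[h], X @ proj_v[h]  # (n, d_k)
        scores = Q @ K.T / np.sqrt(d_k)    # scale by 1/(d_model/h)^{1/2}
        if causal:
            # Mask out positions later than the query position.
            scores = np.where(np.tril(np.ones((n, n))) == 1, scores, -np.inf)
        heads.append(softmax(scores) @ V)
    # Concatenating h heads of size d_k restores dimension d_model.
    return np.concatenate(heads, axis=-1) @ proj_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 4, 8, 2
X = rng.standard_normal((n, d_model))
shape = (n_heads, d_model, d_model // n_heads)
pq, pk, pv = (rng.standard_normal(shape) for _ in range(3))
po = rng.standard_normal((d_model, d_model))
out = multi_head_attention(X, pq, pk, pv, po, n_heads, causal=True)

# Causality check: perturbing the last input leaves earlier outputs unchanged.
X2 = X.copy()
X2[3] += 10.0
out2 = multi_head_attention(X2, pq, pk, pv, po, n_heads, causal=True)
```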
Why self-attention? It improves long-range correlations and parallelization, and sometimes complexity.
Per-layer costs, with n the sequence length, d the representation length, k the kernel size, and r the restriction size:
- Self-attention: O(n²·d) complexity, O(1) sequential operations, O(1) maximum path length.
- Recurrent: O(n·d²), O(n), O(n).
- Convolutional: O(k·n·d²), O(1), O(log_k n) using dilated convolutions.
- Restricted self-attention: O(r·n·d), O(1), O(n/r).
RNNs and CNNs need a d × d matrix of weights where attention uses only a length-d dot product, and the whole sequence attends to every position at once.
Attention also applies to images when combined with a CNN. One can see where the network is 'looking'.
Hard attention: sample a single attention target instead of taking the expectation value. This is no longer differentiable, so it is trained as an RL algorithm where choosing the attention target is an action.
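The soft/hard contrast can be sketched over the same attention weights (scores and values below are made up): soft attention takes the differentiable expectation, while hard attention samples one target, turning the choice into an action that an RL estimator such as REINFORCE can train:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([0.1, 2.0, 0.3])     # attention scores over 3 positions
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])
p = softmax(scores)

# Soft attention: differentiable expectation over all positions.
c_soft = p @ values

# Hard attention: sample a single position. Gradients no longer flow
# through the discrete choice, so log p[z] would enter a
# REINFORCE-style estimator during training.
z = rng.choice(len(p), p=p)
c_hard = values[z]
```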
Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. “Show, attend, and tell: neural image caption generation with visual attention.” In International Conference on Machine Learning, 2015. arXiv:1502.03044 [cs.LG]
Summary:
- Attention weights are interpretable, which helps with analysis.
- Attention provides a probability distribution for latent variables that annotate the input.
- Relative to RNNs, attention models are often less complex.
References
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
- Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]
- Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. “Show, attend, and tell: neural image caption generation with visual attention.” In International Conference on Machine Learning, 2015. arXiv:1502.03044 [cs.LG]
- Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. “Multiple object recognition with visual attention.” In International Conference on Learning Representations, 2015. arXiv:1412.7755 [cs.LG]