

SLIDE 1

Eric Mintun

HEP-AI Journal Club May 15th, 2018

SLIDE 2

Outline

  • Motivating example and definition
  • Generalizations and a little theory
  • Why attention might be better than RNNs and CNNs

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]
SLIDE 3

Translation

[Figure: encoder–decoder translation. An encoder RNN reads the English sentence “The agreement on the European Economic Area was signed in 1992 . <end>” one word at a time and compresses it into a single fixed-size context vector c; a decoder RNN then unrolls the French translation “L’ accord sur la zone économique européenne a été signé en août 1992 . <end>” from c alone.]

SLIDE 4

Translation

  • Fixed-size context vector struggles with long sentences; translation fails later in the sentence.
  • Underlined portion becomes ‘based on his state of health’.

SLIDE 5

Translation w/ Attention

[Figure: the same encoder–decoder, now with attention. The encoder annotations h1 … h13 and the previous decoder state si−1 feed alignment weights αji(hj, si−1); each decoder step reads its own context vector ci instead of a single fixed c.]

ci = Σj αji hj,  with 0 ≤ αji ≤ 1 and Σj αji = 1
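In code, the per-step context vector is just a convex combination of the annotations. A minimal numpy sketch of the additive alignment (the parameter names W_h, W_s, v are illustrative stand-ins for the learned alignment net, not the paper’s notation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(h, s_prev, W_h, W_s, v):
    """Bahdanau-style additive attention sketch.

    h      : (T, d) encoder annotations h_1..h_T
    s_prev : (d,)   previous decoder state s_{i-1}
    Returns the weights alpha (T,) and context vector c_i (d,).
    """
    # score per encoder position: v . tanh(W_h h_j + W_s s_{i-1})
    scores = np.tanh(h @ W_h.T + s_prev @ W_s.T) @ v
    alpha = softmax(scores)          # 0 <= alpha_ji <= 1, sum_j alpha_ji = 1
    c = alpha @ h                    # weighted sum of annotations
    return alpha, c

rng = np.random.default_rng(0)
T, d = 5, 4
h = rng.normal(size=(T, d))
s = rng.normal(size=d)
alpha, c = additive_attention(h, s, rng.normal(size=(d, d)),
                              rng.normal(size=(d, d)), rng.normal(size=d))
assert np.isclose(alpha.sum(), 1.0) and alpha.min() >= 0
```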

SLIDE 6

Translation w/ Attention

SLIDE 7

Translation w/ Attention

SLIDE 8

Attention

  • Attention consists of learned key–value pairs.
  • The input query is compared against each key. A better match lets more of that key’s value through:
  • Additive compare: Q and K fed into a neural net
  • Multiplicative compare: dot-product of Q and K

[Figure: Compare(Q, Ki) × Vi]

out = Σi wi Vi,  with Σi wi = 1
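A minimal sketch of the multiplicative compare in numpy (toy keys and values, not from any trained model): the query’s dot product with each key sets the weight wi, and the output is the weighted sum of values.

```python
import numpy as np

def multiplicative_attention(q, K, V):
    """Dot-product ('multiplicative compare') key-value attention sketch.

    q : (d,)    query
    K : (n, d)  keys
    V : (n, dv) values
    A key that matches the query better gets a larger weight w_i,
    and the output is out = sum_i w_i V_i with sum_i w_i = 1.
    """
    scores = K @ q                       # compare query against each key
    w = np.exp(scores - scores.max())
    w = w / w.sum()                      # normalize: sum_i w_i = 1
    return w @ V                         # let more of the matching values through

K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])
out = multiplicative_attention(np.array([5.0, 0.0]), K, V)
# the query matches the first key, so the first value dominates the output
assert out[0] > out[1]
```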

SLIDE 9

Keys/Values for Example

[Figure: the translation network relabeled in key/value terms: the annotations h1 … h5 feed the weights α1i(h1, si−1) and the context vector ci.]

Query: si−1
Keys: hj
Values: hj
Additive attention — Compare: αji(hj, si−1)

SLIDE 10

Outline

  • Motivating example and definition
  • Generalizations and a little theory
  • Why attention might be better than RNNs and CNNs

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]
SLIDE 11

Structured Attention

  • What if we know trained attention should have a known structure? E.g.:
  • Each output of the decoder should attend to a connected subsequence in the encoder (character-to-word conversion).
  • Output sequence is organized as a tree (sentence parsing, equation input and output).

SLIDE 12

Structured Attention

  • Attention weights define a probability distribution. Write the context vector as an expectation:

c = Ez∼p(z|x,q)[f(x, z)] = Σi=1..n p(z = i|x, q) xi,  with p(z = i|x, q) = αi(k, q), f(x, z) = xz, z ∈ {1, . . . , n}

  • Generalize this by adding more latent variables and changing the annotation function f. Add structure by dividing z into cliques C:

c = Ez∼p(z|x,q)[f(x, z)] = ΣC Ez∼p(zC|x,q)[fC(x, zC)]
p(z|x, q; θ) = softmax(ΣC θC(zC))
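The first identity is easy to check numerically: softmax attention is exactly the expectation of f(x, z) = xz under the categorical distribution p(z = i|x, q). A sketch with random data (the logits theta stand in for the output of an alignment net):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
x = rng.normal(size=(n, d))                   # input annotations x_1..x_n
theta = rng.normal(size=n)                    # attention logits from (x, q)
alpha = np.exp(theta) / np.exp(theta).sum()   # p(z = i | x, q)

# ordinary attention: c = sum_i p(z=i|x,q) x_i
c_attention = alpha @ x

# the same quantity as a Monte-Carlo estimate of E_{z~p}[f(x,z)], f(x,z) = x_z
samples = rng.choice(n, size=200_000, p=alpha)
c_expectation = x[samples].mean(axis=0)

assert np.allclose(c_attention, c_expectation, atol=0.05)
```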

SLIDE 13

Subsequence Attention

  • (a) original unstructured attention network
  • (b) one independent binary latent variable per input:

zi ∈ {0, 1},  f(x, z) = Σi=1..n 1{zi = 1} xi
c = Ez1,...,zn[f(x, z)] = Σi=1..n p(zi = 1|x, q) xi,  p(zi = 1|x, q) = sigmoid(θi)

  • (c) probability of each zi depends on its neighbors:

p(z1, . . . , zn) = softmax(Σi=1..n−1 θi,i+1(zi, zi+1))
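Variant (b) is a one-liner once written this way. A sketch with illustrative logits theta (not a trained model); note the sigmoid gates need not sum to one, so any subset of inputs can be switched on at once, unlike softmax attention:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_gate_attention(x, theta):
    """Variant (b): one independent binary latent z_i in {0, 1} per input.

    With f(x, z) = sum_i 1{z_i = 1} x_i, the context vector is
    c = E[f(x, z)] = sum_i p(z_i = 1 | x, q) x_i,
    where p(z_i = 1 | x, q) = sigmoid(theta_i).
    """
    p = sigmoid(theta)      # independent gate probabilities
    return p @ x

x = np.eye(3)
c = binary_gate_attention(x, np.array([10.0, 10.0, -10.0]))
# first two gates ~1, third ~0: both selected inputs pass through together
assert np.allclose(c, [1.0, 1.0, 0.0], atol=1e-3)
```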

SLIDE 14

Subsequence Attention

[Figure: attention maps for variants (a), (b), (c), and the ground truth.]

SLIDE 15

Tree Attention

  • Task:
  • Latent variables: zij = 1 if symbol j has parent i.
  • Context vector per symbol attends to its parent in the tree:
  • No input query in this case, since a symbol’s parent doesn’t depend on the decoder’s location.

p(z|x, q) = softmax(1{z is valid} Σi≠j 1{zij = 1} θij)
cj = Σi=1..n p(zij = 1|x, q) xi
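Dropping the tree-validity constraint for a moment, the per-symbol context is an ordinary softmax over candidate parents. A sketch of that simplified case (the structured version would additionally restrict z to valid trees):

```python
import numpy as np

def parent_attention(x, theta):
    """Per-symbol attention over candidate parents (validity constraint dropped).

    theta[i, j] scores 'i is the parent of j'; p[i, j] = p(z_ij = 1 | x),
    and c_j = sum_i p(z_ij = 1 | x) x_i attends to symbol j's likely parent.
    """
    logits = theta.astype(float).copy()
    np.fill_diagonal(logits, -np.inf)        # a symbol is not its own parent
    p = np.exp(logits - logits.max(axis=0, keepdims=True))
    p = p / p.sum(axis=0, keepdims=True)     # softmax over parents i, per child j
    return p.T @ x                           # row j is the context vector c_j

x = np.eye(3)
theta = np.zeros((3, 3))
theta[0, 2] = 8.0                            # symbol 2's parent is almost surely 0
c = parent_attention(x, theta)
assert c[2, 0] > 0.99                        # c_2 is dominated by x_0
```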

SLIDE 16

Tree Attention

[Figure: attention maps, simple vs. structured.]

SLIDE 17

Outline

  • Motivating example and definition
  • Generalizations and a little theory
  • Why attention might be better than RNNs and CNNs

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]
SLIDE 18

Attention Is All You Need

  • Can we replace CNNs and RNNs with attention for sequential tasks?
  • Self-attention: the sequence is the query, key, and value.
  • Stack attention layers: the output of an attention layer is a sequence, which is fed into the next layer.
  • Attention loses positional information; it must be inserted as an additional input.
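A minimal sketch of one self-attention layer (random projection matrices standing in for learned ones): the input and output are both n × d sequences, which is exactly what lets the layers stack.

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_layer(X, Wq, Wk, Wv):
    """One self-attention layer: the sequence supplies query, key, and value.

    X : (n, d) input sequence; output is again (n, d), so layers stack.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every position vs. every position
    return softmax_rows(scores) @ V

rng = np.random.default_rng(2)
n, d = 6, 8
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) for _ in range(3)]
Y = self_attention_layer(X, *Ws)
Z = self_attention_layer(Y, *Ws)              # output feeds the next layer directly
assert Z.shape == (n, d)
```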

SLIDE 19

Attention Is All You Need

[Figure: the Transformer architecture.]

  • Regular attention: keys and values from the encoder, query from the decoder.
  • Self-attention: keys, values, and queries all from the previous layer.
  • Stacked a fixed number N of times.
  • Masked to prevent attending to words that were written later.
  • Outputs probabilities for just the next word.
  • Encoder input is the entire sequence, of size n × dmodel; decoder input is the sequence generated so far.
  • Pointwise add sinusoids of different frequencies to the input features.
  • All linear layers are applied per position with weight sharing.
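The sinusoidal position signal added to the input features can be sketched directly (a self-contained numpy version of the paper’s formula):

```python
import numpy as np

def positional_encoding(n, d_model):
    """Sinusoidal position signal, added pointwise to the input features.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Each position gets a unique pattern of sinusoids at different frequencies,
    restoring the order information that attention itself discards.
    """
    pos = np.arange(n)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / 10000 ** (i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(50, 16)
assert pe.shape == (50, 16)
# position 0: all sines are 0, all cosines are 1
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```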

SLIDE 20

Multi-Head Attention

  • Learn linear projections into h separate vectors of size dmodel/h.
  • Run h separate multiplicative attention steps.
  • Scale each dot product by 1/(dmodel/h)1/2.
  • After concatenation, the dimension is dmodel again.
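Putting the four bullets together, a numpy sketch of multi-head attention (random matrices standing in for the learned projections):

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """Multi-head attention sketch: h separate projections of size d_model/h.

    heads : list of (Wq, Wk, Wv) triples, each (d_model, d_model // h)
    Wo    : (d_model, d_model) output projection after concatenation
    """
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        dk = K.shape[-1]                          # d_model / h
        A = softmax_rows(Q @ K.T / np.sqrt(dk))   # scaled dot-product attention
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1) @ Wo     # concat restores d_model

rng = np.random.default_rng(3)
n, d_model, h = 5, 8, 2
heads = [tuple(rng.normal(size=(d_model, d_model // h)) for _ in range(3))
         for _ in range(h)]
Wo = rng.normal(size=(d_model, d_model))
X = rng.normal(size=(n, d_model))
Y = multi_head_attention(X, heads, Wo)
assert Y.shape == (n, d_model)
```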

SLIDE 21

Self Attention

  • Why? Self-attention improves long-range correlations and parallelization, and sometimes complexity.

[Table: complexity per layer, sequential operations, and maximum path length for self-attention, recurrent, and convolutional layers.]

n: sequence length; d: representation length; k: kernel size; r: restriction size

  • The convolutional path-length entry assumes dilated convolutions; otherwise it is O(n/k).
  • RNNs and CNNs need a d × d matrix of weights; attention uses a length-d dot product.
  • The whole sequence attends to every position.

SLIDE 22

Attention is All You Need

SLIDE 23

Other Cool Things

  • Image captioning: like translation, but replace the encoder with a CNN. Can see where the network is ‘looking’.
  • Hard attention: sample from the probability distribution instead of taking the expectation value. No longer differentiable, so train as an RL algorithm where choosing the attention target is an action.

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. “Show, attend, and tell: neural image caption generation with visual attention.” In International Conference on Machine Learning, 2015. arXiv:1502.03044 [cs.LG]
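The soft/hard distinction fits in one sketch (toy distribution, not a trained model): soft attention blends positions and stays differentiable, while hard attention commits to a single sampled position, which is why its choice must be trained as an action.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])   # three attended positions
alpha = np.array([0.7, 0.2, 0.1])                    # attention distribution

soft = alpha @ x                 # soft attention: expectation over positions
z = rng.choice(len(alpha), p=alpha)
hard = x[z]                      # hard attention: one sampled position

# the hard output is one row of x, never a blend, so choosing z is a
# discrete action (no gradient flows through the sample)
assert any(np.array_equal(hard, row) for row in x)
```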

SLIDE 24

Summary

  • Attention is an architecture-level construct for sequence analysis.
  • It is essentially a learned, differentiable dictionary look-up.
  • More generally, it is an input-dependent, learned probability distribution for latent variables that annotate output values.
  • Better long-range correlation and parallelization than RNNs, often less complex.
  • Produces human-interpretable intermediate data.

SLIDE 25

References

  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
  • Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]
  • Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. “Show, attend, and tell: neural image caption generation with visual attention.” In International Conference on Machine Learning, 2015. arXiv:1502.03044 [cs.LG]
  • Hard attention example: Jimmy Lei Ba, Volodymyr Mnih, Koray Kavukcuoglu. “Multiple object recognition with visual attention.” In International Conference on Learning Representations, 2015. arXiv:1412.7755 [cs.LG]
  • Title page picture: https://eurovisionireland.net/2014/02/24/lithuania-which-version-of-attention-should-go-to-eurovision-2014/