IN5550 Neural Methods in Natural Language Processing – Attention!


slide-1
SLIDE 1

– IN5550 – Neural Methods in Natural Language Processing Attention!

Vinit Ravishankar

University of Oslo

April 4, 2019

slide-2
SLIDE 2

Coming up:

Last Week

◮ Gated RNNs
◮ Structured prediction
◮ RNN applications

Today

◮ Seq2seq
◮ Attention models

slide-3
SLIDE 3

Recap: unrolled RNNs


slide-4
SLIDE 4

Recap: unrolled RNNs

◮ Each state s_i and output y_i depend on the full previous context, e.g.

s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
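The recursion above can be sketched with a toy step function. This is a minimal NumPy sketch: the tanh nonlinearity and the weight names W_s, W_x, b are illustrative assumptions, not the course's exact parametrisation of R.

```python
import numpy as np

def R(s_prev, x, W_s, W_x, b):
    # One RNN step: combine the previous state with the current input.
    return np.tanh(W_s @ s_prev + W_x @ x + b)

rng = np.random.default_rng(0)
d = 4                                   # toy hidden/input dimensionality
W_s, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b = np.zeros(d)
xs = [rng.normal(size=d) for _ in range(4)]

s = np.zeros(d)                         # s_0
for x in xs:                            # unrolls to the nested calls above
    s = R(s, x, W_s, W_x, b)
# s now equals s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
```

Running the loop and writing out the nested applications by hand give the same state, which is the point of the "unrolled" picture.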


slide-6
SLIDE 6

Conditioned generation

◮ Generate words using an RNN and a ‘conditioning’ context vector c
◮ p(t_{j+1}) = f(RNN([t̂_j, c], s_j))
◮ Keep generating till you reach some maximum length, or generate </s>
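The generation loop can be sketched as follows. This is a minimal NumPy sketch: the toy vocabulary size, greedy argmax decoding, and all weight names (E, W, W_out) are assumptions for illustration, not the lecture's exact model.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d_e, d_c, d_s = 6, 4, 4, 8        # toy vocab / embedding / context / state sizes
E = rng.normal(size=(V, d_e))        # token embeddings
W = 0.1 * rng.normal(size=(d_s, d_e + d_c + d_s))
W_out = rng.normal(size=(V, d_s))
EOS = 0                              # index of </s>

def step(t_prev, c, s):
    # p(t_{j+1}) = f(RNN([t̂_j, c], s_j)): one RNN step on the previous
    # token and the conditioning context, then a softmax output layer f.
    s_new = np.tanh(W @ np.concatenate([E[t_prev], c, s]))
    logits = W_out @ s_new
    p = np.exp(logits - logits.max())
    return p / p.sum(), s_new

def generate(c, max_len=20):
    s, t, out = np.zeros(d_s), EOS, []
    for _ in range(max_len):         # stop at the maximum length ...
        p, s = step(t, c, s)
        t = int(np.argmax(p))        # greedy choice of t̂_{j+1}
        if t == EOS:                 # ... or when </s> is generated
            break
        out.append(t)
    return out

tokens = generate(c=rng.normal(size=d_c))
```

The only thing that changes between "basic" conditioned generation and attention (below) is where the context c comes from.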

slide-7
SLIDE 7

Seq2seq - basic mode

◮ Words go in, words come out...
◮ Traditionally uses the last RNN state as the conditioning context

slide-8
SLIDE 8

Seq2seq - why?

◮ Machine translation


slide-9
SLIDE 9

Seq2seq - why?

◮ Summarisation


slide-10
SLIDE 10

Seq2seq - why?

◮ Conversation modelling


slide-11
SLIDE 11

Seq2seq



slide-17
SLIDE 17

Seq2seq on steroids

“You can’t cram the meaning of a whole —ing sentence into a single —ing vector!” – Ray Mooney

◮ He’s not wrong; we can barely cram the meaning of a word into a single vector
◮ We could use multiple vectors, though

slide-18
SLIDE 18

Attention

Idea: use a weighted sum of input RNN states for every output RNN state




slide-22
SLIDE 22

Attention - mandatory maths

Recap, without attention: p(t_{j+1}) = f(RNN([t̂_j, c], s_j))

Now we use a separate context for every output element, i.e. a bunch of c_js for j = 1, 2, ..., T_y

◮ c_j is a weighted sum of input vectors, i.e. a weighted sum of h_1, h_2, ..., h_Tx
◮ The weights α are conditioned on the input state they are weighting (i) and the output state they’re generating (j)
◮ i.e., c_j = Σ_{i=1}^{Tx} α_ij h_i
◮ In English: the context vector that we use to generate the jth output is the weighted sum of all the input hidden states, over all input positions i
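The weighted sum can be written out directly. A NumPy sketch; the random states and weights are placeholders for one output step j:

```python
import numpy as np

rng = np.random.default_rng(2)
Tx, d = 5, 8
H = rng.normal(size=(Tx, d))         # input hidden states h_1, ..., h_Tx
alpha = rng.random(Tx)
alpha /= alpha.sum()                 # attention weights for output step j

c_j = alpha @ H                      # c_j = sum_i alpha_ij * h_i
```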


slide-26
SLIDE 26

Attention - mandatory maths

c_j = Σ_{i=1}^{Tx} α_ij h_i

◮ How do we calculate these weights?
◮ Learn them while learning to translate.
◮ Use a ‘relevance’ function a¹ that tells you how relevant an input state i is to an output token j
◮ Relevances: e_ij = a(s_{j−1}, h_i)
◮ Weights: α_ij = softmax(e_ij) = exp(e_ij) / Σ_{k=1}^{Tx} exp(e_kj)

¹Called an ‘alignment model’
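Putting the pieces together for one output step j. A NumPy sketch: the additive form of a, with parameters W_a, U_a, v_a and a tanh hidden layer, is one possible choice of alignment model (Bahdanau-style), not the only one.

```python
import numpy as np

rng = np.random.default_rng(3)
Tx, d_h, d_s, d_a = 5, 8, 8, 6
H = rng.normal(size=(Tx, d_h))       # input states h_i
s_prev = rng.normal(size=d_s)        # previous output state s_{j-1}
W_a = rng.normal(size=(d_a, d_s))    # parameters of the alignment model a
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)

# Relevances: e_ij = a(s_{j-1}, h_i), here a small MLP
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
# Weights: alpha_ij = softmax(e_ij)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()
c_j = alpha @ H                      # context for output step j
```

Because a is differentiable, the weights really are learned "while learning to translate": gradients from the translation loss flow back through α into W_a, U_a and v_a.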

slide-27
SLIDE 27

Attention - tl;dr

Pay attention to a weighted combination of input states to generate the right output state



slide-29
SLIDE 29

Self-attention

John Lennon, 1967: love is all u need
Vaswani et al., 2017: attention is all you need


slide-31
SLIDE 31

Self-attention

Simple principle: instead of a target paying attention to different parts of the source, make the source pay attention to itself. Okay, maybe that wasn’t so simple.


slide-32
SLIDE 32

Self-attention

the man crossed the street because he fancied it



slide-35
SLIDE 35

Self-attention

the man crossed the street because he fancied it

◮ By making parts of a sentence pay attention to other parts of itself, we get fancier representations
◮ This can be an RNN replacement
◮ Where an RNN carries long-term information down a chain, self-attention acts more like a tree

slide-36
SLIDE 36

Transformer


slide-37
SLIDE 37

Transformer

The important bit, the maths:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
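The formula translates almost line-for-line into code. A NumPy sketch with toy random inputs:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # relevance strengths
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # row-wise softmax -> alphas
    return w @ V                               # weighted sum of values

rng = np.random.default_rng(4)
n, d_k = 5, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the rows of V, with the mixing weights given by how strongly that query matches each key.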


slide-42
SLIDE 42

Transformer

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

What’s happening at a token level:

◮ Obtain three representations of the input, Q, K and V: query, key and value
◮ Obtain a set of relevance strengths: QKᵀ. For words i and j, Q_i · K_j represents the strength of the association, exactly like in seq2seq attention.
◮ Scale it (stabler gradients, boring maths) and softmax for αs.
◮ Unlike seq2seq, use different ‘value’ vectors to weight.

In a sense, this is exactly like seq2seq attention, except: a) non-recurrent representations, b) same source/target, c) different value vectors


slide-44
SLIDE 44

Adding heads

Revolutionary idea: if representations learn so much from attention, why not learn many attentions?

Multi-headed attention is many self-attentions.

(Simplified) transformer:
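Multi-headed attention can be sketched as several independent self-attentions over the same input. A minimal NumPy sketch; the head count, dimensions, and the final output projection W_O are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention with a row-wise softmax.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_self_attention(X, heads, W_O):
    # Each head has its own (W_Q, W_K, W_V) projections of the same input X;
    # head outputs are concatenated, then mixed by an output projection W_O.
    outs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O

rng = np.random.default_rng(5)
n, d, h, d_k = 5, 16, 4, 4           # 4 heads, each of size d/h = 4
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d))
X = rng.normal(size=(n, d))
out = multi_head_self_attention(X, heads, W_O)
```

The idea is that each head is free to learn a different notion of relevance (e.g. syntactic vs. coreference-like associations) over the same sentence.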

slide-45
SLIDE 45

Transformer - why?

◮ it’s cool


slide-46
SLIDE 46

Transformer - why?

◮ State-of-the-art for en-de NMT when released; state-of-the-art for en-fr (excluding ensembles)
◮ No recurrence, so it’s extremely fast (“1/4th the training resources for French”)
◮ Has been used in a bunch of other tasks since

slide-47
SLIDE 47

What’s next?

  • Multitask learning

