

SLIDE 1

Natural Language Processing with Deep Learning: Sequence-to-sequence Models with Attention

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception

SLIDE 2

Agenda

  • Sequence-to-sequence models
  • Attention Mechanism
  • seq2seq with Attention

Some slides are adapted from http://web.stanford.edu/class/cs224n/

SLIDE 3

Agenda

  • Sequence-to-sequence models
  • Attention Mechanism
  • seq2seq with Attention
SLIDE 4

Sequence in – sequence out!

§ Several NLP tasks are defined as:

  • Given the source sequence $Y = \{y^{(1)}, y^{(2)}, \dots, y^{(M)}\}$ …
  • … create/generate the target sequence $Z = \{z^{(1)}, z^{(2)}, \dots, z^{(U)}\}$

π‘Œ 𝑍

Was mich nicht umbringt, macht mich stΓ€rker.

  • F. Nietzsche

What does not kill me makes me stronger.

Machine Translation

Then the woman went to to the bank to deposit her cash . RB DT NN VBD TO DT NN TO VB PRP$ NN .

POS Tagging

How tall is Stephansdom? [Heightof, ., Stephansdom]

Semantic parsing

SLIDE 5

Sequence in – sequence out!

§ Tasks such as:

  • Machine Translation (source language → target language)
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Code generation (natural language → SQL/Python code)
  • Named entity recognition
  • Dependency/semantic/POS parsing (input text → output parse as sequence)

but also …

  • Image captioning (image → caption)
  • Automatic Speech Recognition (speech → transcript)

[Image captioning example: "some elephants standing around a tall tree"]

SLIDE 6

Machine Translation (MT)

§ A long history (since the 1950s)
§ Statistical Machine Translation (1990–2010) – and also Neural MT – uses large amounts of parallel data to calculate:

$\operatorname{argmax}_{Z} Q(Z \mid Y)$

§ Challenges:

  • Alignment
  • Common sense
  • Idioms!
  • Low-resource language pairs

https://en.wikipedia.org/wiki/Rosetta_Stone

SLIDE 7

Machine Translation (MT) – Evaluation

§ BLEU (Bilingual Evaluation Understudy)
§ BLEU computes a similarity score between the machine-written translation and one or several human-written translation(s), based on:

  • n-gram precision (usually for 1-, 2-, 3- and 4-grams)
  • plus a penalty for too-short machine translations

§ BLEU is precision-based, while ROUGE is recall-based

Details of how to calculate BLEU: https://www.coursera.org/lecture/nlp-sequence-models/bleu-score-optional-kC2HD
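To make the n-gram precision and brevity penalty concrete, here is a minimal sentence-level BLEU sketch with a single reference. The token lists and the 1e-9 smoothing floor are illustrative assumptions; production implementations (e.g. sacrebleu) differ in details such as smoothing and multi-reference handling:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU of a tokenized candidate vs. one reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # brevity penalty: punishes too-short machine translations
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is sitting on a mat".split(),
           "the cat is sitting on the mat".split()))   # ~0.64
```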
SLIDE 8

Sequence-to-sequence model

§ Sequence-to-sequence model (aka seq2seq) is the neural network architecture to approach …

  • given the source sequence $Y = \{y^{(1)}, y^{(2)}, \dots, y^{(M)}\}$,
  • generate the target sequence $Z = \{z^{(1)}, z^{(2)}, \dots, z^{(U)}\}$

§ A seq2seq model first creates a model to estimate the conditional probability:

$Q(Z \mid Y)$

§ and then generates a new sequence $Z^*$ by solving:

$Z^* = \operatorname{argmax}_{Z} Q(Z \mid Y)$

SLIDE 9

Seq2seq model

§ In fact, a seq2seq model is a conditional Language Model
§ It calculates the probability of the next word of the target sequence, conditioned on the previous words of the target sequence and on the source sequence:

for $z^{(1)}$: $Q(z^{(1)} \mid Y)$
for $z^{(2)}$: $Q(z^{(2)} \mid Y, z^{(1)})$
…
for $z^{(u)}$: $Q(z^{(u)} \mid Y, z^{(1)}, \dots, z^{(u-1)})$

… and for the whole target sequence:

$Q(Z \mid Y) = Q(z^{(1)} \mid Y) \times Q(z^{(2)} \mid Y, z^{(1)}) \times \dots \times Q(z^{(U)} \mid Y, z^{(1)}, \dots, z^{(U-1)}) = \prod_{u=1}^{U} Q(z^{(u)} \mid Y, z^{(1)}, \dots, z^{(u-1)})$
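As a tiny numeric illustration of this chain rule (the per-step probabilities below are made-up stand-ins for model outputs):

```python
import math

# hypothetical per-step probabilities Q(z^(u) | Y, z^(1..u-1))
step_probs = [0.4, 0.7, 0.9, 0.8]

q_z_given_y = math.prod(step_probs)            # Q(Z | Y) as a product
log_q = sum(math.log(p) for p in step_probs)   # sum of logs, numerically safer
assert abs(math.exp(log_q) - q_z_given_y) < 1e-12
print(q_z_given_y)                             # 0.2016
```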

SLIDE 10

Seq2seq – steps

§ Like Language Modeling, we …
§ … design a model that predicts the probabilities of the next words of the target sequence, one after the other (in an auto-regressive fashion): $Q(z^{(u)} \mid Y, z^{(1)}, \dots, z^{(u-1)})$
§ We train the model by maximizing these probabilities for the correct next words appearing in the training data
§ At inference time (or during decoding), we use the model to generate new target sequences that have high generation probabilities: $Q(Z \mid Y)$

SLIDE 11

𝑭 π’š($) π’Š($)

RNN' RNN' RNN' RNN'

𝑭 π’š(&) 𝑦(&) 𝑭 π’š(') 𝑦(') 𝑭 π’š(() π’Š(&) π’Š(') π’Š(() 𝑦(()

< eos >

𝑦($)

< sos >

1 𝒛(') 𝑿

RNN( RNN( RNN(

𝑽 𝒛($) 𝒕($) 𝑧($)

< sos >

𝑽 𝒛(&) 𝑧(&) 𝑽 𝒛(') 𝑧(') 𝒕(&) 𝒕(') …

) 𝒛("): predicted probability distribution of the next target word, given the source sequence and previous target words

EN ENCOD ODER ER DE DECODE DER

Seq2seq with two RNNs Probability of appearance of the next target word:

𝑄 𝑧 ) π‘Œ, 𝑧 ! , 𝑧 " , 𝑧 * = 0 𝑧+ !

(*)

SLIDE 12

Seq2seq with two RNNs – formulation

§ There are two sets of vocabularies
  • $\mathbb{W}_{src}$ is the vocabulary of the source sequences
  • $\mathbb{W}_{trg}$ is the vocabulary of the target sequences

ENCODER

§ Encoder embedding
  • Encoder embeddings for source words ($\mathbb{W}_{src}$) → $\boldsymbol{F}$
  • Embedding of the source word $y^{(m)}$ (at time step $m$) → $\boldsymbol{y}^{(m)}$

§ Encoder RNN:

$\boldsymbol{i}^{(m)} = \mathrm{RNN}_{e}(\boldsymbol{i}^{(m-1)}, \boldsymbol{y}^{(m)})$

Parameters are shown in red

SLIDE 13

Seq2seq with two RNNs – formulation

DECODER

§ Decoder embedding
  • Decoder embeddings at input for target words ($\mathbb{W}_{trg}$) → $\boldsymbol{V}$
  • Embedding of the target word $z^{(u)}$ (at time step $u$) → $\boldsymbol{z}^{(u)}$

§ Decoder RNN:

$\boldsymbol{t}^{(u)} = \mathrm{RNN}_{d}(\boldsymbol{t}^{(u-1)}, \boldsymbol{z}^{(u)})$

  • The last hidden state of the encoder RNN is passed to the initial hidden state of the decoder RNN:

$\boldsymbol{t}^{(0)} = \boldsymbol{i}^{(M)}$

Parameters are shown in red

SLIDE 14

Seq2seq with two RNNs – formulation

DECODER

§ Decoder output prediction
  • Predicted probability distribution of words at the next time step:

$\hat{\boldsymbol{z}}^{(u)} = \mathrm{softmax}(\boldsymbol{X}\boldsymbol{t}^{(u)} + \boldsymbol{c}) \in \mathbb{R}^{|\mathbb{W}_{trg}|}$

  • Probability of the next target word (at time step $u+1$):

$Q(z^{(u+1)} \mid Y, z^{(1)}, \dots, z^{(u)}) = \hat{\boldsymbol{z}}^{(u)}_{z^{(u+1)}}$

Parameters are shown in red
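Putting the formulation together, a minimal PyTorch sketch of the two-RNN seq2seq could look as follows. The GRU cells, dimensions, and the class name `Seq2Seq` are illustrative assumptions, not the lecture's reference implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal two-RNN seq2seq in the slide notation: F/V are the encoder/
    decoder embeddings, i/t the hidden states, X the output projection."""
    def __init__(self, src_vocab, trg_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.F = nn.Embedding(src_vocab, emb_dim)     # encoder embeddings F
        self.V = nn.Embedding(trg_vocab, emb_dim)     # decoder embeddings V
        self.enc_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.X = nn.Linear(hid_dim, trg_vocab)        # projection X (+ bias c)

    def forward(self, y, z):
        # y: source token ids (batch, M); z: target input ids (batch, U)
        _, i_last = self.enc_rnn(self.F(y))           # i_last = i^(M)
        t, _ = self.dec_rnn(self.V(z), i_last)        # t^(0) = i^(M)
        return self.X(t)                              # logits; softmax gives ẑ^(u)
```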

SLIDE 15

Training Seq2seq

§ Training a seq2seq is the same as training a Language Model
  • We predict the next word, calculate the loss, backpropagate, and update the parameters
  • Since seq2seq is an end-to-end model, the gradient flows from the loss to all parameters (both RNNs and the embeddings)

§ Loss function: Negative Log Likelihood of the predicted probability of the correct next target word $z^{(u+1)}$:

$\mathcal{L}^{(u)} = -\log \hat{\boldsymbol{z}}^{(u)}_{z^{(u+1)}} = -\log Q(z^{(u+1)} \mid Y, z^{(1)}, \dots, z^{(u)})$

§ Overall loss:

$\mathcal{L} = \frac{1}{U} \sum_{u=1}^{U} \mathcal{L}^{(u)}$
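One training step for the `Seq2Seq` sketch above might look like this (toy random token ids stand in for real data; `cross_entropy` fuses the softmax and the NLL loss, averaging over all steps):

```python
import torch
import torch.nn.functional as nnf

model = Seq2Seq(src_vocab=1000, trg_vocab=1200)
opt = torch.optim.Adam(model.parameters())

# toy batch: z_in starts with <sos>; z_out is shifted by one and ends with <eos>
y = torch.randint(0, 1000, (8, 12))       # source token ids
z_in = torch.randint(0, 1200, (8, 10))    # z^(0 .. U-1)
z_out = torch.randint(0, 1200, (8, 10))   # z^(1 .. U)

logits = model(y, z_in)                   # (8, 10, |W_trg|)
# average NLL of the correct next words over all U steps
loss = nnf.cross_entropy(logits.reshape(-1, 1200), z_out.reshape(-1))
loss.backward()                           # gradient flows end-to-end
opt.step()
```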

SLIDE 16

Training Seq2seq

[Diagram: the encoder RNN reads the embedded source words from <sos> to <eos>; the decoder is fed the gold target words, starting from <sos>.]

SLIDE 17

Training Seq2seq

[Diagram: decoding step 1 – from the decoder state t^(1), the projection X produces ẑ^(1); the loss ℒ^(1) is the NLL of the correct word at step 2.]

SLIDE 18

Training Seq2seq

[Diagram: decoding step 2 – ẑ^(2) is produced from t^(2); the loss ℒ^(2) is the NLL of the correct word at step 3.]

SLIDE 19

Training Seq2seq

[Diagram: decoding step 3 – ẑ^(3) is produced from t^(3); the loss ℒ^(3) is the NLL of the correct word at step 4; the per-step losses ℒ^(1), ℒ^(2), ℒ^(3), … are averaged into the overall loss.]

SLIDE 20

Parameters

§ Encoder embeddings $\boldsymbol{F} \to |\mathbb{W}_{src}| \times e_1$
§ Encoder RNN parameters
§ Decoder embeddings $\boldsymbol{V} \to |\mathbb{W}_{trg}| \times e_2$
§ Decoder RNN parameters
§ Decoder output projection $\boldsymbol{X} \to e_3 \times |\mathbb{W}_{trg}|$

§ Bias terms are omitted here
§ $e_1$, $e_2$, $e_3$ are embedding dimensions
§ RNNs can be an LSTM, GRU, or vanilla (Elman) RNN

SLIDE 21

Practical points: vocabs & embeddings

§ In Machine Translation
  • Encoder and decoder vocabularies belong to two different languages

§ In summarization
  • Encoder and decoder vocabularies are typically the same set (as they are in the same language)
  • Encoder and decoder embeddings ($\boldsymbol{F}$ and $\boldsymbol{V}$) can also share parameters

§ Weight tying
  • can be done by sharing the parameters of $\boldsymbol{V}$ and $\boldsymbol{X}$ in the decoder, as sketched below
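In the earlier `Seq2Seq` sketch, weight tying could look like this (a sketch, assuming the decoder embedding and hidden dimensions match so the two matrices have the same shape):

```python
# V.weight has shape (|W_trg|, emb_dim) and X.weight has shape
# (|W_trg|, hid_dim), so tying requires emb_dim == hid_dim.
model = Seq2Seq(src_vocab=1000, trg_vocab=1200, emb_dim=128, hid_dim=128)
model.X.weight = model.V.weight          # one shared parameter matrix
```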
SLIDE 22

Decoding

Recap

§ After training, we use the model to generate a target sequence given the source sequence (decoding). We aim to find the optimal output sequence $Z^*$ that maximizes $Q(Z \mid Y)$:

$Z^* = \operatorname{argmax}_{Z} Q(Z \mid Y)$

where $Q(Z \mid Y)$ for any arbitrary $Z = \{z^{(1)}, z^{(2)}, \dots, z^{(U)}\}$ is:

$Q(Z \mid Y) = \prod_{u=1}^{U} Q(z^{(u)} \mid Y, z^{(1)}, \dots, z^{(u-1)})$

§ Question: among all possible $Z$ sequences, how can we find $Z^*$?

SLIDE 23

A first approach: Greedy decoding

[Diagram: decoding step 1 – the selected word is the one with the highest probability in ẑ^(1); it is fed back as the decoder input for the next step.]

  • In each step, take the most probable word
  • Use the generated word for the next step, and continue (a sketch follows below)
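A greedy decoding loop for the earlier `Seq2Seq` sketch might look like this (batch size 1; `sos_id`/`eos_id` are assumed special-token ids):

```python
import torch

@torch.no_grad()
def greedy_decode(model, y, sos_id, eos_id, max_len=50):
    """Greedy decoding with the Seq2Seq sketch above (batch size 1)."""
    _, hidden = model.enc_rnn(model.F(y))            # encode the source
    token = torch.tensor([[sos_id]])                 # start from <sos>
    output = []
    for _ in range(max_len):
        t, hidden = model.dec_rnn(model.V(token), hidden)
        logits = model.X(t[:, -1])                   # scores over W_trg
        token = logits.argmax(dim=-1, keepdim=True)  # most probable word
        if token.item() == eos_id:
            break
        output.append(token.item())                  # feed it back next step
    return output
```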

SLIDE 24

A first approach: Greedy decoding

[Diagram: decoding step 2 – the previously generated word is fed in as input; the selected word is the one with the highest probability in ẑ^(2).]

SLIDE 25

A first approach: Greedy decoding

[Diagram: decoding step 3 – decoding continues, each time feeding back the word with the highest probability in ẑ^(u).]

SLIDE 26

Decoding

§ Greedy decoding
  • Fast but …
  • … decisions are only based on immediate local knowledge
  • A non-optimal local decision can get propagated
  • It does not explore other decoding possibilities

§ Exhaustive search decoding
  • We can compute all possible decodings
  • It means a decoding tree with $|\mathbb{W}_{trg}|^{U}$ leaves!
  • Far too expensive!

§ Beam search decoding
  • A compromise between exploration and exploitation!
SLIDE 27

Beam search decoding

§ Core idea: on each time step of decoding, keep only the $l$ most probable intermediary sequences (hypotheses)
  • $l$ is the beam size (in practice around 5 to 10)

§ To do this, beam search calculates the following score for each hypothesis up to time step $u$ (denoted as $z^{(1 \dots u)}$):

$\mathrm{score}(z^{(1 \dots u)}) = \log Q(z^{(1 \dots u)} \mid Y) = \sum_{v=1}^{u} \log Q(z^{(v)} \mid Y, z^{(1)}, \dots, z^{(v-1)})$

§ In each decoding step, we only keep the $l$ hypotheses with the highest scores, and don't continue the rest (see the sketch below)
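A minimal sketch of this pruning loop. The `step_logprobs` callback is an assumed stand-in for the model's per-step log-probabilities $\log Q(z^{(v)} \mid Y, z^{(1)}, \dots, z^{(v-1)})$; real decoders typically also length-normalize the scores:

```python
def beam_search(step_logprobs, beam_size=3, max_len=10, eos_id=0):
    """Keep only the `beam_size` best hypotheses at each decoding step.

    step_logprobs(prefix) returns (token_id, logprob) pairs for the next step.
    """
    beams = [([], 0.0)]              # (hypothesis prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix):
                candidates.append((prefix + [tok], score + lp))
        # keep only the beam_size highest-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # hypotheses that reach <eos> are complete; the rest continue
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```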

SLIDES 28–40

Beam search decoding – example

[Diagrams: a step-by-step worked example of beam search over thirteen slides, expanding each hypothesis and pruning to the highest-scoring ones at every step.]
SLIDE 41

Beam search decoding – last words!

§ Achieving the optimal solution is not guaranteed …
  • … but it is much more efficient than exhaustive search decoding

§ Stopping criteria:
  • Each hypothesis continues until it reaches the <eos> token
  • Usually beam search decoding continues until:
    • we reach a cutoff timestep $U$ (a hyperparameter), or
    • we have at least $o$ completed hypotheses (another hyperparameter)

SLIDE 42

Agenda

  • Sequence-to-sequence models
  • Attention Mechanism
  • seq2seq with Attention
SLIDE 43

Attention Networks

§ Attention is a general Deep Learning method to
  • obtain a composed representation (output) …
  • from an arbitrary number of representations (values) …
  • depending on a given representation (query)

§ General form of an attention network:

$\boldsymbol{P} = \mathrm{ATT}(\boldsymbol{R}, \boldsymbol{W})$

[Diagram: queries R and values W enter the ATT block, which outputs P.]

SLIDE 44

Attention Networks

$\boldsymbol{P} = \mathrm{ATT}(\boldsymbol{R}, \boldsymbol{W})$

  • Shapes: $\boldsymbol{R}: |\boldsymbol{R}| \times e_r$, $\boldsymbol{W}: |\boldsymbol{W}| \times e_w$, $\boldsymbol{P}: |\boldsymbol{R}| \times e_w$
  • $e_r$ and $e_w$ are the embedding dimensions of the query and value vectors, respectively

We sometimes say each query vector $\boldsymbol{r}$ "attends to" the values

[Diagram: queries r_1, r_2 attend to values w_1 … w_4, producing outputs p_1, p_2.]

SLIDE 45

Attention Networks – definition

Formal definition:

§ Given a set of value vectors $\boldsymbol{W}$ and a set of query vectors $\boldsymbol{R}$, attention is a technique to compute a weighted sum of the values, dependent on each query
§ The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
§ The weight in the weighted sum – for each query on each value – is called attention, and is denoted by $\beta$

SLIDE 46

Attentions!

  • $\beta_{j,k}$ is the attention of query $\boldsymbol{r}_j$ on value $\boldsymbol{w}_k$
  • $\boldsymbol{\beta}_j$ is the vector of attentions of query $\boldsymbol{r}_j$ on the value vectors $\boldsymbol{W}$
  • $\boldsymbol{\beta}_j$ is a probability distribution
  • $g$ is the attention function

[Diagram: g computes the attentions β_{1,1} … β_{2,4} between queries r_1, r_2 and values w_1 … w_4.]

SLIDE 47

Attention Networks – formulation

§ Given the query vector $\boldsymbol{r}_j$, an attention network assigns an attention $\beta_{j,k}$ to each value vector $\boldsymbol{w}_k$ using the attention function $g$:

$\beta_{j,k} = g(\boldsymbol{r}_j, \boldsymbol{w}_k)$

such that $\boldsymbol{\beta}_j$ (the vector of attentions for the $j$th query vector) forms a probability distribution

§ The output for each query is the weighted sum of the value vectors (attentions as weights):

$\boldsymbol{p}_j = \sum_{k=1}^{|\boldsymbol{W}|} \beta_{j,k}\, \boldsymbol{w}_k$

SLIDE 48

Attention variants

Basic dot-product attention

§ First, non-normalized attention scores:

$\tilde{\beta}_{j,k} = \boldsymbol{r}_j^{\top} \boldsymbol{w}_k$

  • In this variant $e_r = e_w$
  • There is no parameter to learn!

§ Then, softmax over the values: $\beta_{j,k} = \mathrm{softmax}(\tilde{\boldsymbol{\beta}}_j)_k$
§ Output (weighted sum): $\boldsymbol{p}_j = \sum_{k=1}^{|\boldsymbol{W}|} \beta_{j,k}\, \boldsymbol{w}_k$

SLIDE 49

Attention variants

Multiplicative attention

§ First, non-normalized attention scores:

$\tilde{\beta}_{j,k} = \boldsymbol{r}_j^{\top} \boldsymbol{X} \boldsymbol{w}_k$

  • $\boldsymbol{X}$ is a matrix of model parameters
  • provides a linear function for measuring relations between query and value vectors

§ Then, softmax over the values: $\beta_{j,k} = \mathrm{softmax}(\tilde{\boldsymbol{\beta}}_j)_k$
§ Output (weighted sum): $\boldsymbol{p}_j = \sum_{k=1}^{|\boldsymbol{W}|} \beta_{j,k}\, \boldsymbol{w}_k$

SLIDE 50

Attention variants

Additive attention

§ First, non-normalized attention scores:

$\tilde{\beta}_{j,k} = \boldsymbol{v}^{\top} \tanh(\boldsymbol{r}_j \boldsymbol{X}_1 + \boldsymbol{w}_k \boldsymbol{X}_2)$

  • $\boldsymbol{X}_1$, $\boldsymbol{X}_2$, and $\boldsymbol{v}$ are model parameters
  • provides a non-linear function for measuring relations between the query and value vectors

§ Then, softmax over the values: $\beta_{j,k} = \mathrm{softmax}(\tilde{\boldsymbol{\beta}}_j)_k$
§ Output (weighted sum): $\boldsymbol{p}_j = \sum_{k=1}^{|\boldsymbol{W}|} \beta_{j,k}\, \boldsymbol{w}_k$

SLIDE 51

Attention in practice

§ Attention is used to create a compositional embedding of value vectors, according to a query
  • E.g. in seq2seq models (comes next), where the values are the encoder vectors and the query is the current decoding state
  • or in document classification (Assignment 4), where the values are the document's word vectors and the query is a parameter vector

[Diagram: a single query r attends to values w_1 … w_4, producing the output p.]

SLIDE 52

Self-attention

§ In self-attention, the values are the same as the queries: $\boldsymbol{R} = \boldsymbol{W}$
§ Mainly used to encode a sequence $\boldsymbol{W}$ into another sequence $\widehat{\boldsymbol{W}}$
§ Each encoded vector is a contextual embedding of the corresponding input vector
  • $\widehat{\boldsymbol{w}}_k$ is the contextual embedding of $\boldsymbol{w}_k$

[Diagram: ATT maps the sequence W = (w_1, w_2, w_3) to contextual embeddings Ŵ = (ŵ_1, ŵ_2, ŵ_3).]
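With the `dot_product_attention` helper from the sketch above, self-attention is just the case R = W (an illustrative reuse, not the lecture's code):

```python
import torch

W_seq = torch.randn(5, 8)                           # a sequence of 5 input vectors
W_hat, betas = dot_product_attention(W_seq, W_seq)  # queries = values
print(W_hat.shape)                                  # torch.Size([5, 8]): contextual embeddings
```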

SLIDE 53

Attention – summary

§ Attention is a way to focus on particular parts of the input, and create a compositional embedding
§ It is done by defining an attention distribution over the inputs, and calculating their weighted sum
§ A more generic definition of an attention network has two inputs: key vectors $\boldsymbol{L}$ and value vectors $\boldsymbol{W}$
  • Key vectors are used to calculate the attentions
  • and, as before, the output is the weighted sum of the value vectors
  • In practice, in most cases $\boldsymbol{L} = \boldsymbol{W}$, so we stick to our (slightly simplified) definition in most parts of this course

SLIDE 54

Agenda

  • Sequence-to-sequence models
  • Attention Mechanism
  • seq2seq with Attention
SLIDE 55

Bottleneck problem in basic seq2seq

[Diagram: the ENCODER and DECODER RNNs of the basic seq2seq – the only connection between them is the encoder's last hidden state.]

All information of the source sequence must be embedded in the last hidden state: information bottleneck!
SLIDE 56

Seq2seq + Attention

§ It can be useful if we allow the decoder to directly access all elements of the source sequence, and to decide where on the source sequence to focus
§ Attention is a solution to the bottleneck problem
§ At each decoding time step, the decoder attends to the vectors of the source sequence, and thereby bypasses the bottleneck!
SLIDE 57

Seq2seq with attention

[Diagram: at decoding step 1, the decoder state t^(1) attends (ATT) over the encoder hidden states, producing the context vector i*^(1) of time step 1 of decoding; t^(1) and i*^(1) are concatenated (⨁) and projected by X to give ẑ^(1).]

SLIDE 58

Seq2seq with attention

[Diagram: decoding step 2 – t^(2) attends over the encoder hidden states, producing the context vector i*^(2) of time step 2 of decoding, which is concatenated with t^(2) to produce ẑ^(2).]
SLIDE 59

Seq2seq with attention

[Diagram: decoding step 3 – t^(3) attends over the encoder hidden states, producing the context vector i*^(3) of time step 3 of decoding, and so on.]
SLIDE 60

Seq2seq with attention – formulation

ENCODER: the same as in basic seq2seq

DECODER – input

§ Decoder embedding
  • Decoder embeddings at input for target words ($\mathbb{W}_{trg}$) → $\boldsymbol{V}$
  • Embedding of the target word $z^{(u)}$ (at time step $u$) → $\boldsymbol{z}^{(u)}$

§ Decoder RNN:

$\boldsymbol{t}^{(u)} = \mathrm{RNN}_{d}(\boldsymbol{t}^{(u-1)}, \boldsymbol{z}^{(u)})$

  • The last hidden state of the encoder RNN is passed to the initial hidden state of the decoder RNN:

$\boldsymbol{t}^{(0)} = \boldsymbol{i}^{(M)}$

Parameters are shown in red

SLIDE 61

Seq2seq with attention – formulation

DECODER – attention

§ Attention context vector:

$\boldsymbol{i}^{*(u)} = \mathrm{ATT}(\boldsymbol{t}^{(u)}, \{\boldsymbol{i}^{(1)}, \dots, \boldsymbol{i}^{(M)}\})$

For instance, if ATT is a "basic dot-product attention", this is done by:

  • First calculating the non-normalized attentions:

$\tilde{\beta}^{(u)}_{m} = \boldsymbol{t}^{(u)\top} \boldsymbol{i}^{(m)}$

  • Then, normalizing the attentions:

$\beta^{(u)}_{m} = \mathrm{softmax}(\tilde{\boldsymbol{\beta}}^{(u)})_{m}$

  • and finally calculating the weighted sum of the encoder hidden states:

$\boldsymbol{i}^{*(u)} = \sum_{m=1}^{M} \beta^{(u)}_{m}\, \boldsymbol{i}^{(m)}$

Parameters are shown in red

SLIDE 62

Seq2seq with attention – formulation

DECODER – output

§ Decoder output prediction
  • Predicted probability distribution of words at the next time step:

$\hat{\boldsymbol{z}}^{(u)} = \mathrm{softmax}(\boldsymbol{X}\,[\boldsymbol{t}^{(u)}; \boldsymbol{i}^{*(u)}] + \boldsymbol{c}) \in \mathbb{R}^{|\mathbb{W}_{trg}|}$

  • $;$ denotes the concatenation of two vectors

§ Training and inference of a seq2seq with attention are the same as for a basic seq2seq model (a sketch of one attention-decoder step follows below)

Parameters are shown in red
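A sketch of one decoding step with dot-product attention over the encoder states, in the style of the earlier PyTorch sketches (the class name `AttnDecoderStep` and the dimensions are assumptions):

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """One decoder step with dot-product attention over the encoder states."""
    def __init__(self, trg_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.V = nn.Embedding(trg_vocab, emb_dim)
        self.rnn = nn.GRUCell(emb_dim, hid_dim)
        self.X = nn.Linear(2 * hid_dim, trg_vocab)    # acts on [t; i*]

    def forward(self, z_prev, t_prev, enc_states):
        # z_prev: (B,) token ids; t_prev: (B, hid); enc_states: (B, M, hid)
        t = self.rnn(self.V(z_prev), t_prev)                   # t^(u)
        scores = (enc_states @ t.unsqueeze(-1)).squeeze(-1)    # β̃^(u)_m = t^T i^(m)
        betas = torch.softmax(scores, dim=-1)                  # β^(u)
        context = (betas.unsqueeze(-1) * enc_states).sum(1)    # i*^(u)
        logits = self.X(torch.cat([t, context], dim=-1))       # pre-softmax ẑ^(u)
        return logits, t
```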

SLIDE 63

Seq2seq with attention – summary

§ Attention on the source sequence facilitates the selection of relevant parts, and the flow of information
§ Attention in seq2seq helps avoid the vanishing gradient problem
  • Provides a shortcut to faraway states

§ Attention provides some interpretability
  • Looking at the attention distribution, we can see what the decoder is focusing on (try it here: https://distill.pub/2016/augmented-rnns/#attentional-interfaces)
  • However, it is still disputable whether attention (especially in Transformers) really provides explanation!

SLIDE 64

Recap

§ Sequence-to-sequence models generate language given an input text
§ Attention is a general deep learning approach to learn to focus on certain parts, and compose outputs
§ Attention significantly helps seq2seq models!

$\boldsymbol{P} = \mathrm{ATT}(\boldsymbol{R}, \boldsymbol{W})$