Deep Learning
Sequence to Sequence models: Attention Models
1
1
– A sequence goes in
– A different sequence comes out
– E.g. Speech recognition: Speech goes in, a word sequence comes out
– Machine translation: Word sequence goes in, word sequence comes out
– No synchrony between the input X(t) and the output Y(t)
2
3
[Figure: example — input "I ate an apple" as X(t), output "Ich habe einen apfel gegessen" as Y(t), starting from the initial state h(-1)]
4
5
6
Four score and seven years ???
A B R A H A M L I N C O L ??
7
The network assigns a probability to the next word in the sequence:
Y(t, i) = P(W_i | x(0) … x(t-1))
where W_i is the i-th symbol in the vocabulary.
Training minimizes the divergence between the output distributions and the correct next words:
Div(Y_target(1 … T), Y(1 … T)) = Σ_t Xent(Y_target(t), Y(t))
10
– Input words are represented as one-hot vectors
– The network outputs an N-valued probability distribution rather than a one-hot vector: the probability that the t-th word in the sequence is the i-th word in the vocabulary, given all previous t-1 words
– Draw the next word from this output probability distribution and set it as the next input in the series
– In some cases, e.g. generating programs, there may be a natural termination
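A minimal NumPy sketch of the drawing step (the vocabulary and the logits are invented for illustration): the network's scores are turned into a distribution with a softmax, and the next word is sampled rather than taken by argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "ate", "an", "apple", "<eos>"]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

logits = np.array([0.1, 2.0, 0.3, 1.5, 0.2])   # hypothetical network outputs
y = softmax(logits)                    # N-valued probability distribution
next_id = rng.choice(len(vocab), p=y)  # draw, rather than taking the argmax
next_word = vocab[next_id]
```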
15
16
four score and eight
– This is clearly the middle of a sentence
<sos> four score and eight
– This is a fragment from the start of a sentence
four score and eight <eos>
– This is the end of a sentence
<sos> four score and eight <eos>
– This is a full sentence
– <sos> is not strictly needed, but <eos> is required to terminate sequences
– The same symbol may be used to mark both ends of a sentence, e.g. just <eos>, or even a separate symbol, e.g. <s>
17
– And draw the next word from the output probability distribution
– Or we decide to terminate generation based on some other criterion
19
20
21
First process the input and generate a hidden representation for it
# First run the inputs through the network
# Assuming h(-1,l) is available for all layers
for t = 0:T-1  # Including both ends of the index
    [h(t),...] = RNN_input_step(x(t), h(t-1), ...)
H = h(T-1)
22
“RNN_input” may be a multi-layer RNN of any kind
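As a toy sketch of this input pass, assuming a single-layer Elman-style RNN with invented weights (here RNN_input_step is just a tanh of input plus recurrence):

```python
import numpy as np

def rnn_input_step(x, h_prev, W_xh, W_hh):
    # One recurrent step: new hidden state from current input and previous state
    return np.tanh(W_xh @ x + W_hh @ h_prev)

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(4, 5))  # input-to-hidden weights (made up)
W_hh = 0.1 * rng.normal(size=(4, 4))  # hidden-to-hidden weights (made up)

X = np.eye(5)       # a pretend input sequence of 5 one-hot vectors
h = np.zeros(4)     # h(-1)
for t in range(X.shape[0]):
    h = rnn_input_step(X[t], h, W_xh, W_hh)
H = h               # hidden representation of the entire input
```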
23
First process the input and generate a hidden representation for it; then use it to generate an output.

# First run the inputs through the network
# Assuming h(-1,l) is available for all layers
for t = 0:T-1  # Including both ends of the index
    [h(t),...] = RNN_input_step(x(t), h(t-1), ...)
H = h(T-1)

# Now generate the output yout(1), yout(2), ...
t = 0
hout(0) = H
do
    t = t+1
    [y(t), hout(t)] = RNN_output_step(hout(t-1))
    yout(t) = draw_word_from(y(t))
until yout(t) == <eos>
24
25
The output at each time is a probability distribution
We draw a word from this distribution
26
27
– Changing this output at time t does not affect the output at t+1
– E.g. if we have drawn “It was a” vs “It was an”, the probability that the next word is “dark” remains the same (dark must ideally not follow “an”)
– This is because the output at time t does not influence the computation at t+1
28
– The hidden activation at the <eos> “stores” all information about the sentence
– This representation is used to produce a sequence of outputs
– The output at each time becomes the input at the next time
– Output production continues until an <eos> is produced
29
[Figure sequence: the decoder, seeded with the encoding of “I ate an apple <eos>”, is fed <sos> and successively produces “Ich”, “habe”, “einen”, “apfel”, “gegessen”, <eos>, each drawn output becoming the next input]
# First run the inputs through the network
# Assuming h(-1,l) is available for all layers
t = 0
do
    [h(t),...] = RNN_input_step(x(t), h(t-1), ...)
    t = t+1
until x(t-1) == <eos>
H = h(T-1)  # T = number of input symbols processed

# Now generate the output yout(1), yout(2), ...
t = 0
hout(0) = H
# Note: begins with a “start of sentence” symbol
# <sos> and <eos> may be identical
yout(0) = <sos>
do
    t = t+1
    [y(t), hout(t)] = RNN_output_step(hout(t-1), yout(t-1))
    yout(t) = draw_word_from(y(t))
until yout(t) == <eos>
36
37
Drawing a different word at t will change the next output since yout(t) is fed back as input
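A sketch of this feedback loop (all weights and names invented; the only point is that the drawn word enters the next step's computation, so a different draw changes everything downstream):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<sos>", "ich", "habe", "einen", "apfel", "gegessen", "<eos>"]
V, D = len(vocab), 4

W_in = 0.1 * rng.normal(size=(D, V))   # embeds the fed-back word (made up)
W_hh = 0.1 * rng.normal(size=(D, D))
W_out = 0.1 * rng.normal(size=(V, D))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_output_step(h_prev, word_id):
    x = np.zeros(V)
    x[word_id] = 1.0                       # one-hot of the previous output
    h = np.tanh(W_in @ x + W_hh @ h_prev)  # the drawn word changes the state...
    return softmax(W_out @ h), h           # ...and hence the next distribution

h = rng.normal(size=D)        # stands in for the encoder output H
word_id = vocab.index("<sos>")
out = []
for _ in range(10):           # cap the length in case <eos> is never drawn
    y, h = rnn_output_step(h, word_id)
    word_id = int(rng.choice(V, p=y))
    out.append(vocab[word_id])
    if vocab[word_id] == "<eos>":
        break
```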
38
39
– y_t(w) = P(O_t = w | O_1, …, O_{t-1}, I_1, …, I_N)
– The probability of word w, given the entire input sequence I_1, …, I_N and the partial output sequence O_1, …, O_{t-1}
40
[Figure sequence: step-by-step decoding — at each time t, the distribution y_t is computed from the input “I ate an apple <eos>” and all previously drawn outputs, and the next word of “Ich habe einen apfel gegessen <eos>” is drawn]
– At each time the network outputs a probability distribution over words, conditioned on the entire input and all previous outputs
49
50
What is this “magic” draw_word_from operation? How do we select a word from the output distribution y(t)?
– Objective: select the output sequence O_1, …, O_L = argmax P(O_1, …, O_L | I_1, …, I_N)
– How do we select the most likely output sequence?
52
Objective: O_1, …, O_L = argmax P(O_1, …, O_L | I_1, …, I_N)
# First run the inputs through the network
# Assuming h(-1,l) is available for all layers
t = 0
do
    [h(t),...] = RNN_input_step(x(t), h(t-1), ...)
    t = t+1
until x(t-1) == <eos>
H = h(T-1)

# Now generate the output yout(1), yout(2), ...
t = 0
hout(0) = H
# Note: begins with a “start of sentence” symbol
# <sos> and <eos> may be identical
yout(0) = <sos>
do
    t = t+1
    [y(t), hout(t)] = RNN_output_step(hout(t-1), yout(t-1))
    yout(t) = argmax_i(y(t,i))
until yout(t) == <eos>
53
Select the most likely output at each time
– That may cause the distribution to be more “confused” at the next time
– Choosing a different, less likely word could cause the distribution at the next time to be more peaky, resulting in a more likely output overall
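A toy illustration of this failure mode with a two-step model (all probabilities invented, echoing the "the nose" / "he knows" example below): greedy picks the most likely word at each step, yet misses the most likely sequence.

```python
# Invented two-step distributions
p_first = {"the": 0.55, "he": 0.45}
p_second = {
    "the": {"nose": 0.4, "man": 0.6},
    "he":  {"knows": 0.9, "nose": 0.1},
}

# Greedy: most likely word at each time
w1 = max(p_first, key=p_first.get)              # "the"
w2 = max(p_second[w1], key=p_second[w1].get)    # "man"
greedy_prob = p_first[w1] * p_second[w1][w2]    # 0.55 * 0.6 = 0.33

# Exhaustive search over both steps finds a better sequence
best = max(
    ((a, b, p_first[a] * p_second[a][b]) for a in p_first for b in p_second[a]),
    key=lambda c: c[2],
)
# best is ("he", "knows", 0.405): the less likely first word wins overall
```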
54
Objective: O_1, …, O_L = argmax P(O_1, …, O_L | I_1, …, I_N)
– If the model is very confused at a step and assigns low probabilities to many words at the next time, selecting any of these will result in low probability for the entire 3-word sequence
– “he knows” is a reasonable beginning and the model assigns high probabilities to words such as “something”; selecting one of these results in higher overall probability for the 3-word sequence
55
[Figure: trellis over the vocabulary w1 … wV at times T=0,1,2, showing P(O_3 | O_1, O_2, I_1, …, I_N)]
– Should we draw “nose” or “knows”?
– The effect may not be obvious until several words down the line
– Or the choice of the wrong word early may cumulatively lead to a poorer overall score over time
56
[Figure: trellis at T=0,1,2, showing P(O_2 | O_1, I_1, …, I_N)]
What should we have chosen at t=2?? Will selecting “nose” continue to have a bad effect into the distant future?
– A locally likely word may not lead to the most promising future
– Even earlier: choosing the lower probability “the” instead of “he” at T=0 may have made a choice of “nose” more reasonable at T=1
– But we cannot know at that time that the choice was poor
57
[Figure: choices “the” vs “he” at T=0, showing P(O_1 | I_1, …, I_N)]
What should we have chosen at t=1?? Choose “the” or “he”?
58
Objective: O_1, …, O_L = argmax P(O_1, …, O_L | I_1, …, I_N)
# First run the inputs through the network
# Assuming h(-1,l) is available for all layers
t = 0
do
    [h(t),...] = RNN_input_step(x(t), h(t-1), ...)
    t = t+1
until x(t-1) == <eos>
H = h(T-1)

# Now generate the output yout(1), yout(2), ...
t = 0
hout(0) = H
# Note: begins with a “start of sentence” symbol
# <sos> and <eos> may be identical
yout(0) = <sos>
do
    t = t+1
    [y(t), hout(t)] = RNN_output_step(hout(t-1), yout(t-1))
    yout(t) = sample(y(t))
until yout(t) == <eos>
59
Randomly sample from the output distribution.
– Unfortunately, not guaranteed to give you the most likely output – May sometimes give you more likely outputs than greedy drawing though
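With the same kind of invented two-step model used in the greedy discussion, random sampling occasionally finds the sequence greedy misses, but gives no guarantee:

```python
import random

random.seed(0)
p_first = {"the": 0.55, "he": 0.45}                  # invented probabilities
p_second = {
    "the": {"nose": 0.4, "man": 0.6},
    "he":  {"knows": 0.9, "nose": 0.1},
}

def sample(dist):
    words = list(dist)
    return random.choices(words, weights=[dist[w] for w in words])[0]

def sample_sequence():
    w1 = sample(p_first)
    w2 = sample(p_second[w1])
    return (w1, w2), p_first[w1] * p_second[w1][w2]

samples = [sample_sequence() for _ in range(20)]
best_sampled_prob = max(p for _, p in samples)
greedy_prob = 0.55 * 0.6   # what greedy would achieve here ("the man")
# best_sampled_prob may (or may not) exceed greedy_prob on a given run
```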
60
Objective: O_1, …, O_L = argmax P(O_1, …, O_L | I_1, …, I_N)
[Figure sequence: beam search tree growing from <sos> — candidate first words “I”, “He”, “We”, “The”; only the top paths (“He”, “The”) are retained and extended, e.g. with “Knows”, “Nose”]
– Terminate when the current most likely path overall ends in <eos>
– To get N-best outputs, continue until the top N paths all terminate
[Figure: beam search tree; example has K = 2]
– Paths cannot continue once they output an <eos>
– Select the most likely sequence ending in <eos> across all terminating sequences
# Assuming encoder output H is available
path = <sos>
beam = {path}
pathscore[path] = 1
state[path] = h[0]  # Output of encoder
do  # Step forward
    nextbeam = {}
    nextpathscore = []
    nextstate = {}
    for path in beam:
        cfin = path[end]
        hpath = state[path]
        [y, h] = RNN_output_step(hpath, cfin)
        for c in Symbolset
            newpath = path + c
            nextstate[newpath] = h
            nextpathscore[newpath] = pathscore[path] * y[c]
            nextbeam += newpath  # Set addition
        end
    end
    beam, pathscore, state, bestpath = prune(nextstate, nextpathscore, nextbeam, bw)
until bestpath[end] == <eos>
72
# Note, there are smarter ways to implement this
function prune(state, score, beam, beamwidth)
    sortedscore = sort(score)  # descending
    threshold = sortedscore[beamwidth]
    prunedstate = {}
    prunedscore = []
    prunedbeam = {}
    bestscore = -inf
    bestpath = none
    for path in beam:
        if score[path] > threshold:
            prunedbeam += path  # set addition
            prunedstate[path] = state[path]
            prunedscore[path] = score[path]
            if score[path] > bestscore
                bestscore = score[path]
                bestpath = path
            end
        end
    end
    return prunedbeam, prunedscore, prunedstate, bestpath
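The same algorithm as a runnable toy in Python, over a fixed bigram model (probabilities invented; pruning is done inline by sorting candidates and truncating to the beam width):

```python
# Invented bigram "model": next_probs[w] is the distribution after symbol w
next_probs = {
    "<sos>": {"the": 0.55, "he": 0.45},
    "the":   {"nose": 0.4, "man": 0.6},
    "he":    {"knows": 0.9, "nose": 0.1},
    "nose":  {"<eos>": 1.0},
    "man":   {"<eos>": 1.0},
    "knows": {"<eos>": 1.0},
}

def beam_search(beam_width=2, max_len=5):
    beam = [(["<sos>"], 1.0)]
    for _ in range(max_len):
        candidates = []
        for path, score in beam:
            if path[-1] == "<eos>":            # terminated paths stay as-is
                candidates.append((path, score))
                continue
            for w, p in next_probs[path[-1]].items():
                candidates.append((path + [w], score * p))
        candidates.sort(key=lambda c: -c[1])   # prune to the top beam_width
        beam = candidates[:beam_width]
        if beam[0][0][-1] == "<eos>":          # best overall path terminated
            break
    return beam[0]

path, score = beam_search()
# With these numbers the result is <sos> he knows <eos>, score 0.405
```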
73
74
– Output will be a probability distribution over target symbol set (vocabulary)
75
76
[Figure: training — divergence (Div) computed at each output against the target “Ich habe einen apfel gegessen <eos>”]
– Compute the divergence between the output distribution and the target word sequence
– Backpropagate the derivatives through the network to learn the net
77
– Typical usage: Randomly select one word from each input training instance (comprising an input-output pair)
– Randomly select a training instance: (input, output)
– Forward pass
– Randomly select a single output y(t) and corresponding desired output d(t) for backprop
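A sketch of the per-step divergence itself — the cross-entropy between the output distribution and the one-hot target — with invented output distributions standing in for the decoder:

```python
import numpy as np

def xent(target_id, y):
    # Cross-entropy between a one-hot target and output distribution y:
    # -log of the probability assigned to the correct word
    return -np.log(y[target_id])

rng = np.random.default_rng(0)
Y = rng.dirichlet(np.ones(4), size=3)   # 3 output steps over a 4-word vocab
targets = [2, 0, 3]                     # ground-truth word ids at each step

# Total divergence: the sum of the per-step cross-entropies;
# backpropagation would start from this scalar
div = sum(xent(t, y) for t, y in zip(targets, Y))
```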
78
79
80
[Figure sequence: end-to-end training — divergence (Div) computed at each decoder output and backpropagated through the network]
83
84
– From “Sequence to Sequence Learning with Neural Networks”, Sutskever, Vinyals and Le, 2014
85
86
87
– An image captioning system: the subsequent model is just the decoder end of a seq-to-seq model
– From “Show and Tell: A Neural Image Caption Generator”, Vinyals, Toshev, Bengio and Erhan
88
CNN Image
– Process the image with a CNN to get the output of its classification layer
– Sequentially generate words by drawing from the conditional output distribution
[Figure sequence: caption words, e.g. “A boy … a surfboard <eos>”, generated one at a time from the CNN encoding]
– The image network is pretrained on a large corpus, e.g. ImageNet
– The divergence derivatives are backpropagated through the network
– All components of the network, including the final classification layer of the image classification net, are updated
– Optionally, the CNN portions of the image classifier are not modified (transfer learning)
96
[Figure sequence: training the captioning network — divergence (Div) computed at each output word of the caption]
99
100
A better model: the encoded input embedding is provided as an input at all output timesteps, not only the first.
Translating Videos to Natural Language Using Deep Recurrent Neural Networks Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko North American Chapter of the Association for Computational Linguistics, Denver, Colorado, June 2015.
101
# Assuming encoded input H (from text, image, video) is available
# Now generate the output yout(1), yout(2), ...
t = 0
hout(0) = H  # Encoder embedding
# Note: begins with a “start of sentence” symbol
# <sos> and <eos> may be identical
yout(0) = <sos>
do
    t = t+1
    [y(t), hout(t)] = RNN_output_step(hout(t-1), yout(t-1), H)
    yout(t) = generate(y(t))  # Beam search, random, or greedy
until yout(t) == <eos>
102
103
– Note: each output step also considers the encoder embedding H
104
105
106
[Figure: the entire input is represented by a single encoder embedding]
– The encoding must capture all aspects of the input, some of which may be diluted downstream
– Recall input and output may not be in sequence
– We have no way of knowing a priori which input must connect to what output
107
– Variable sized inputs and outputs
– Overparametrized
– Connection pattern ignores the actual asynchronous dependence of output on input
108
[Figure sequence: attention-based decoding — the decoder attends to the encoder states of “I ate an apple <eos>” while producing “Ich habe einen …”]
– Compute a weighted combination of all the encoder hidden outputs and use it as the input to the hidden decoder layer
– The weights vary with output time: the input to the hidden decoder layer at time t is a weighted sum of the encoder states, with time-dependent weights
– What is this weighting function? Multiple options. Simplest: a raw attention score between the previous decoder state and each encoder state, normalized with a softmax over input positions
– If the encoder and decoder states are different sizes, a small network can compute the raw score
118
[Figure sequence: at each output step the attention weights over the encoder states are recomputed, a context is formed, and the next word (“Ich”, “habe”, “einen”, …) is drawn]
– The output at each time will be a probability distribution over words
– Draw a word from the distribution
# Assuming encoded input H = [henc[0] ... henc[T]] is available
t = 0
hout[-1] = 0  # Initial decoder hidden state
# Note: begins with a “start of sentence” symbol
# <sos> and <eos> may be identical
yout[0] = <sos>
do
    t = t+1
    C = compute_context_with_attention(hout[t-1], H)
    [y[t], hout[t]] = RNN_decode_step(hout[t-1], yout[t-1], C)
    yout[t] = generate(y[t])  # Random, or greedy
until yout[t] == <eos>
134
# Takes in previous state and encoder states,
# outputs the attention-weighted context
function compute_context_with_attention(h, H)
    # First compute raw attention scores
    e = []
    for t = 1:T  # T = length of input
        e[t] = raw_attention(h, H[t])
    end
    maxe = max(e)  # subtract max(e) from everything to prevent overflow
    a[1..T] = exp(e[1..T] - maxe)  # Component-wise exponentiation
    suma = sum(a)  # Add all elements of a
    a[1..T] = a[1..T] / suma
    # Attention-weighted sum of encoder states
    C = 0
    for t = 1..T
        C += a[t] * H[t]
    end
    return C
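A NumPy version of this function, assuming the raw attention score is a plain inner product between the decoder state and each encoder state (one simple choice of many; the encoder states and decoder state here are random stand-ins):

```python
import numpy as np

def compute_context_with_attention(h, H):
    e = H @ h                    # raw attention: inner product per input position
    a = np.exp(e - e.max())      # subtract the max to prevent overflow
    a = a / a.sum()              # softmax over the T input positions
    C = a @ H                    # attention-weighted sum of encoder states
    return C, a

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))      # 6 encoder hidden states, 4-dim each (made up)
h = rng.normal(size=4)           # previous decoder hidden state (made up)
C, a = compute_context_with_attention(h, H)
```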
135
Objective: O_1, …, O_L = argmax P(O_1, …, O_L | I_1, …, I_N)
[Figure sequence: beam search with the attention decoder — candidate first words “I”, “He”, “We”, “The”; only the top paths (“He”, “The”) are retained and extended, e.g. with “Knows”, “Nose”]
– Terminate when the current most likely path overall ends in <eos>
– To get N-best outputs, continue until the top N paths all terminate
[Figure: beam search tree; example has K = 2]
– Paths cannot continue once they output an <eos>
– Select the most likely sequence ending in <eos> across all terminating sequences
# Assuming encoder output H = hin[1] ... hin[T] is available
path = <sos>
beam = {path}
pathscore[path] = 1
state[path] = h[0]  # initial state (computed using your favorite method)
do  # Step forward
    nextbeam = {}
    nextpathscore = []
    nextstate = {}
    for path in beam:
        cfin = path[end]
        hpath = state[path]
        C = compute_context_with_attention(hpath, H)
        [y, h] = RNN_decode_step(hpath, cfin, C)
        for c in Symbolset
            newpath = path + c
            nextstate[newpath] = h
            nextpathscore[newpath] = pathscore[path] * y[c]
            nextbeam += newpath  # Set addition
        end
    end
    beam, pathscore, state, bestpath = prune(nextstate, nextpathscore, nextbeam)
until bestpath[end] == <eos>
147
# Assuming encoder output H = hin[1] ... hin[T] is available
path = <sos>
beam = {path}
pathscore[path] = 1
state[path] = h[0]  # computed using your favorite method
context[path] = compute_context_with_attention(h[0], H)
do  # Step forward
    nextbeam = {}
    nextpathscore = []
    nextstate = {}
    nextcontext = {}
    for path in beam:
        cfin = path[end]
        hpath = state[path]
        C = context[path]
        [y, h] = RNN_decode_step(hpath, cfin, C)
        nextC = compute_context_with_attention(h, H)
        for c in Symbolset
            newpath = path + c
            nextstate[newpath] = h
            nextcontext[newpath] = nextC
            nextpathscore[newpath] = pathscore[path] * y[c]
            nextbeam += newpath  # Set addition
        end
    end
    beam, pathscore, state, context, bestpath = prune(nextstate, nextpathscore, nextbeam, nextcontext)
until bestpath[end] == <eos>
148
Slightly more efficient. Does not perform redundant context computation
– The attention weights capture the relative importance of each position in the input to the current output
149
[Figure: heat map of the attention weights a_i(t) for translating “I ate an apple <eos>”]
– Color shows the value (white is larger)
– Note how the most important input words for any output word get automatically highlighted
– The general trend is somewhat linear because word order is roughly similar in both languages
151
152
– At each time the output is a probability distribution over words
153
[Figure: forward pass of the attention model on input “I ate an apple <eos>”, producing the output distributions y(t) for the target “Ich habe einen apfel gegessen”]
– Backpropagate derivatives through the network
154
[Figure: divergence (Div) computed between each output distribution and the target “Ich habe einen apfel gegessen <eos>”]
– Backpropagate derivatives through the network
155
[Figure: backpropagation through the decoder, the attention function, and the encoder]
Backpropagation also updates the parameters of the “attention” function
– Backpropagate derivatives through the network
156
[Figure: training with occasional feedback of the drawn output]
– Occasionally pass the drawn output, instead of the ground truth, as the input at the next step
157
158
– Derive a “value”, and multiple “keys”, from the encoder: V_i, K_{i,l}, i = 1 … T, l = 1 … N
– Derive one or more “queries” from the decoder: Q_{j,l}, j = 1 … M, l = 1 … N
– Each query-key pair gives you one attention distribution and a corresponding context: C_l = attention(Q_l, K_{i,l}, i = 1 … T), C = [C_1, …, C_N]
– Each context may attend to different aspects of the input important for the decode
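A sketch of this multi-head idea in NumPy (the projection matrices are random stand-ins for learned parameters; each head derives its own keys, query, and values from the encoder states and produces one context apiece):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d, n_heads, d_head = 6, 8, 2, 4
H = rng.normal(size=(T, d))          # encoder states (made up)
s = rng.normal(size=d)               # current decoder state (made up)

contexts = []
for _ in range(n_heads):
    # Per-head learned projections (random stand-ins here)
    Wk = 0.1 * rng.normal(size=(d, d_head))
    Wv = 0.1 * rng.normal(size=(d, d_head))
    Wq = 0.1 * rng.normal(size=(d, d_head))
    K, V, q = H @ Wk, H @ Wv, s @ Wq
    a = softmax(K @ q)               # one attention distribution per head
    contexts.append(a @ V)           # one context per head
C = np.concatenate(contexts)         # C = [C_1, ..., C_N]
```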
159
160
– From “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, Xu et al., 2016
– Filter outputs at each location are the equivalent of the encoder states h_i in the regular sequence-to-sequence model
161
162