Lecture #08 – Attention and Memory
Aykut Erdem // Hacettepe University // Spring 2020
CMP784
DEEP LEARNING
Illustration: DeepMind
Breaking news!
− Midterm exam next week (will be a take-home exam)
− Check the midterm guide for details
2
Previously on CMP784
− Recurrent Neural Networks (RNNs)
− The Long Short-Term Memory (LSTM) unit and its variants
image: Oleg Soroko
Using RNNs to generate Super Mario Maker levels, Adam Geitgey
3
Lecture overview
Disclaimer: Much of the material and slides for this lecture were borrowed from
— Dzmitry Bahdanau’s IFT 6266 slides
— Graham Neubig’s CMU CS11-747 Neural Networks for NLP class
— Mateusz Malinowski’s lecture on Attention-based Networks
— Yoshua Bengio’s talk on From Attention to Memory and towards Longer-Term Dependencies
— Kyunghyun Cho’s slides on neural sequence modeling
— Arian Hosseini’s IFT 6135 slides
4
Deep Learning for Vision
5
Figure credit: Xiaogang Wang
Deep Learning for Speech
6
Figure credit: NVidia
Deep Learning for Text
7
“The movie was not bad at all. I had fun.” → positive
Deep Models
8
[Diagram: Input → Representation → Feature Extractor F_W1 (encoder) → Classifier/Regressor G_W2 (decoder) → Loss Function]
The feature extractor (Fully Connected Network, Convolutional Network, or Recurrent Network) can be seen as a prior on the type of transformation you want; the classifier/regressor is typically a linear projection with some non-linearity (log-soft-max).
“The movie was not bad at all. I had fun.”
Deep Models
9
− Learnable parametric function
− Inputs: generally considered i.i.d.
− Outputs: classification or regression
Encoder-Decoder Framework
= ‘universal representation’
10
For bitext data: French sentence → encoder → ‘universal representation’ → English decoder → English sentence
For unilingual data: English sentence → English encoder → English decoder → English sentence
Sequence Representations
the sequence
11
Example sequence: “this is an example”
12
A lot of things are called “attention” these days...
with variable-length inputs and outputs (typically sequential).
convolutional networks for sequential data.
The shared idea: focus on the relevant parts of the input (or output).
13
Attention in Deep Learning Applications [to Language Processing]
− machine translation
− speech recognition
− speech synthesis, summarization, …
− any sequence-to-sequence (seq2seq) task
14
Traditional deep learning approach
input → d-dimensional feature vector → layer1 → … → layerk → output
Good for: image classification, phoneme recognition, decision-making in reflex agents (ATARI)
Less good for: text classification
Not really good for: … everything else?!
15
Example: Machine Translation
[“An”, “RNN”, “example”, “.”] → [“Un”, “example”, “de”, “RNN”, “.”] Machine translation presented a challenge to vanilla deep learning
in the output
16
Vanilla seq2seq learning for machine translation
Recurrent Continuous Translation Models, Kalchbrenner et al, EMNLP 2013 Sequence to Sequence Learning with Recurrent Neural Networks, Sutskever et al., NIPS 2014 Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., EMNLP 2014
input sequence → encoder → fixed-size representation → decoder → output sequence
17
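A minimal sketch of this setup in PyTorch (not the exact architectures from the papers above; module and dimension names are mine): the encoder’s final hidden state is the single fixed-size representation the decoder conditions on.

```python
import torch
import torch.nn as nn

class VanillaSeq2Seq(nn.Module):
    """The whole source sentence is squeezed into the encoder's last hidden state,
    which is the only thing the decoder gets to see (the fixed-size bottleneck)."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt):
        _, h = self.encoder(self.src_emb(src))              # h: (1, B, hid) fixed-size summary
        dec_states, _ = self.decoder(self.tgt_emb(tgt), h)  # decoder conditioned only on h
        return self.out(dec_states)                         # (B, T_tgt, tgt_vocab) logits
```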
Problems with vanilla seq2seq
− big models are needed to cram everything into the fixed-size representation
− training can be slow and requires lots of data
− the fixed-size representation is a bottleneck
− looong term dependencies
18
Soft attention
− lets the decoder focus on the relevant encoder hidden states
− no need to squeeze everything into the last hidden state ⇒ no bottleneck!
− dynamically creates shortcuts in the computation graph that allow the gradient to flow freely ⇒ shorter dependencies!
− works best with a bidirectional encoder
19
Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al, ICLR 2015
Soft attention - math 1
At each step the decoder consumes a different weighted combination of the encoder hidden states (the context vector).
20
Soft attention - math 2
But where do the weights come from? They are computed by another network! The choice from the original paper is 1-layer MLP:
21
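A hedged sketch of that 1-layer MLP scoring (Bahdanau-style additive attention) in PyTorch; tensor shapes and names are my own:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """e_tj = v^T tanh(W_s s_{t-1} + W_h h_j); weights = softmax over j; context = weighted sum."""
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (B, dec_dim) previous decoder state; enc_states: (B, T, enc_dim)
        scores = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_states)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)               # (B, T) attention weights
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # (B, enc_dim) context vector
        return context, alpha
```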
Soft attention - computational aspects
The computational complexity of using soft attention is quadratic. But it’s not slow:
○ sum two vectors
○ apply tanh
○ compute dot product
○ add a vector to a matrix
○ apply tanh
○ compute vector-matrix product
22
Soft attention - visualization
Great visualizations at http://distill.pub/2016/augmented-rnns/#attentional-interfaces
23
Soft attention - visualization
(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)
Great visualizations at https://distill.pub/2016/augmented-rnns/#attentional-interfaces
24
Soft attention - improvements
− no performance drop on long sentences
− much better than the RNN Encoder-Decoder
− without unknown words, comparable with the SMT system
25
End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
26
(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)
Figure credit: Rico Sennrich
[Chart: BLEU scores 2013–2016 for phrase-based SMT, syntax-based SMT, and neural MT]
Soft content-based attention pros and cons
Pros
Cons
− no built-in bias toward monotonic alignment (handwriting recognition, speech recognition)
27
Location-based attention
BiRNN)
− weights are computed relative to the previous attention weights
28
Gaussian mixture location-based attention
Originally proposed for handwriting synthesis. The (unnormalized) weight of the input position u at the time step t is parametrized as a mixture of K Gaussians
29
Section 5, Generating Sequences with Recurrent Neural Networks, A. Graves 2014
Gaussian mixture location-based attention
The new locations of Gaussians are computed as a sum of the previous ones and the predicted offsets
30
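A minimal numpy sketch of one step of this mixture-of-Gaussians attention, assuming my own parameter names (alpha_hat, beta_hat, kappa_hat are the network’s unnormalized outputs):

```python
import numpy as np

def gmm_attention_step(kappa_prev, alpha_hat, beta_hat, kappa_hat, U):
    """One step of Gaussian-mixture location-based attention (after Graves).
    kappa_prev: (K,) previous component locations; alpha_hat, beta_hat, kappa_hat: (K,)
    unnormalized network outputs; U: number of input positions."""
    alpha = np.exp(alpha_hat)                    # component importances
    beta = np.exp(beta_hat)                      # component widths
    kappa = kappa_prev + np.exp(kappa_hat)       # locations = previous ones + predicted offsets
    u = np.arange(U)[None, :]                    # input positions
    phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(axis=0)
    return kappa, phi                            # phi[u]: (unnormalized) weight of position u
```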
Gaussian mixture location-based attention
The first soft attention mechanism ever! Pros:
Cons:
31
Various Soft-Attentions
attention
used in Memory Networks)
See “Attention-Based Models for Speech Recognition” by Chorowski et al (2015) for a scalability analysis of various attention mechanisms on speech recognition.
32
a(q, k) = w_2^T tanh(W_1 [q; k])   (multi-layer perceptron)
a(q, k) = q^T W k   (bilinear)
a(q, k) = q^T k   (dot product)
a(q, k) = q^T k / sqrt(|k|)   (scaled dot product)
Various Attention Score Functions
Multi-Layer Perceptron (Bahdanau et al. 2015)
− Flexible, often very good with large data
Bilinear (Luong et al. 2015)
Dot Product (Luong et al. 2015)
− No parameters! But requires sizes to be the same.
Scaled Dot Product (Vaswani et al. 2017)
− Problem: scale of dot product increases as dimensions get larger
− Fix: scale by the size of the vector
33
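The four scoring functions, sketched for a single query q and key k (W1, w2, W stand for learned parameters; this is illustrative, not any particular library’s API):

```python
import math
import torch

def mlp_score(q, k, W1, w2):
    """Multi-layer perceptron (Bahdanau et al. 2015): flexible, good with large data."""
    return w2 @ torch.tanh(W1 @ torch.cat([q, k]))

def bilinear_score(q, k, W):
    """Bilinear (Luong et al. 2015)."""
    return q @ (W @ k)

def dot_score(q, k):
    """Dot product (Luong et al. 2015): no parameters, but q and k must have the same size."""
    return q @ k

def scaled_dot_score(q, k):
    """Scaled dot product (Vaswani et al. 2017): divide by sqrt(dim) so the score's
    scale does not grow with the vector size."""
    return (q @ k) / math.sqrt(k.shape[-1])
```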
Going back in time: Connectionist Temporal Classification (CTC)
− an older approach that is still widely used
− has a built-in bias for monotonic seq2seq transduction
− sums over all ways of inserting blank tokens in the output so that it aligns with the input
34
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, Graves et al, ICML 2006
CTC
p(l | x) = Σ_{π ∈ B⁻¹(l)} p(π | x)   (sum over all labellings with blanks that collapse to the labeling l)
p(π | x) = Π_t y^t_{π_t}   (conditional probability of a labelling with blanks, using the probability of label π_t at the step t)
35
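An illustrative (deliberately brute-force) sketch of these two equations; real CTC training uses the forward-backward recursion instead of enumerating paths:

```python
import itertools
import numpy as np

BLANK = 0

def collapse(path):
    """CTC's B(.): merge repeated labels, then drop blanks."""
    merged = [label for label, _ in itertools.groupby(path)]
    return tuple(label for label in merged if label != BLANK)

def ctc_prob(probs, target):
    """Brute-force p(target | x): sum p(pi | x) over every labelling-with-blanks pi
    that collapses to the target. probs: (T, vocab) per-step label probabilities."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):      # exponential: toy sizes only
        if collapse(path) == tuple(target):
            total += np.prod([probs[t, label] for t, label in enumerate(path)])
    return total
```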
CTC
− assumes a monotonic alignment between the input and the output
− the sum over alignments can be carried out with the forward-backward algorithm (a.k.a. dynamic programming)
− widely used in speech and handwriting recognition
− assumes the outputs are conditionally independent given x, but this can be fixed
36
Soft Attention and CTC for seq2seq: summary
− soft content-based attention is the most popular form of attention, and it is very widely used, especially in natural language processing
− location-based attention helps when the input and the output can be monotonically aligned; location-based and content-based approaches can be mixed
− CTC remains a strong option for strictly monotonic alignments
37
Visual and Hard Attention
38
Models of Visual Attention
− “Humans do not process a whole scene in its entirety at once; instead they focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene” (Mnih et al, 2014)
− The model processes a glimpse of an input image at each step and combines information from multiple steps
39
Recurrent Models of Visual Attention, V. Mnih et al, NIPS 2014
A Recurrent Model of Visual Attention
− “retina-like” glimpse representation
− glimpse location (sampled from a Gaussian)
− RNN state
− action (e.g. output a class)
40
A Recurrent Model of Visual Attention - math 1
Objective: maximize the expected sum of rewards over the interaction sequence.
When used for classification the correct class is known, so instead of sampling the actions, the corresponding log-probability is used as the reward ⇒ optimizes a Jensen lower bound on the log-probability p(a*|x)!
41
A Recurrent Model of Visual Attention
The gradient of J has to be approximated (REINFORCE) Baseline is used to lower the variance of the estimator:
next action
42
A Recurrent Visual Attention Model - visualization
43
Soft and Hard Attention
RAM attention mechanism is hard - it outputs a precise location where to look. Content-based attention from neural MT is soft - it assigns weights to all input locations. CTC can be interpreted as a hard attention mechanism with tractable gradient.
44
Soft and Hard Attention
Soft
Hard
* deterministic hard attention would not have gradients ** exact gradient can be computed for models with tractable marginalization (e.g. CTC)
45
Soft and Hard Attention
Can soft content-based attention be used for vision? Yes.
Show, Attend and Tell, Xu et al, ICML 2015
Can hard attention be used for seq2seq? Yes.
Learning Online Alignments with Continuous Rewards Policy Gradient, Luo et al, NIPS 2016 (but the learning curves are a nightmare…)
46
DRAW: soft location-based attention for vision
47
Why attention?
− Dealing with gradient vanishing problem
− Attending/focusing to smaller parts of data
§ patches in images § words or phrases in sentences
− Different problems require different sizes of representations
§ LSTM with longer sentences requires larger vectors
− Focusing only on the relevant parts of images
− Scalability independent of the size of images
48
Recurrent net memory Attention mechanism
Attention on Memory Elements
− Recurrent networks cannot remember things for very long
− We need a “hippocampus” (a separate memory module)
− Memory Networks [Weston et al 2014] (FAIR), associative memory
Recall: Long-Term Dependencies
− Gradients are multiplied by a Jacobian at each step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].
50
Storing bits robustly requires eigenvalues < 1 ⇒ gradients vanish (Hochreiter 1991)
Gradient clipping (for exploding gradients)
[Diagram: LSTM cell with input gate, forget gate, and a gated self-loop (+, ×) on the state]
Gated Recurrent Units & LSTM
− Create a path where gradients can flow for longer, with a self-loop
− corresponds to a Jacobian slightly less than 1
− LSTM: heavily used (Hochreiter & Schmidhuber 1997)
− GRU (Cho et al 2014)
51
[Diagram: RNN unfolded in time, with additional delayed (skip) connections across time steps]
Delays & Hierarchies to Reach Farther
− Multiple time scales: El Hihi & Bengio NIPS 1995, Koutnik et al ICML 2014
52
Hierarchical RNNs (words / sentences): Sordoni et al CIKM 2015, Serban et al AAAI 2016
Large Memory Networks: Sparse Access Memory for Long-Term Dependencies
− Information can be stored for long durations, until evoked for a read or write
53
passive copy access
Memory Networks
− A long-term memory module, combined with a neural network that can read and write to it
− Reasoning with attention over memory (RAM)
“low level” tasks e.g. object detection.
54
Jason Weston, Sumit Chopra, Antoine Bordes. Memory Networks. ICLR 2015
Ankit Kumar et al. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. ICML 2016 Alex Graves et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626): 471–476, 2016.
Case Study: Show, Attend and Tell
55
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio. ICML 2015.
Paying Attention to Selected Parts
While Uttering Words
56
57
[Diagram: RNN decoder generating “Akiko likes Pimm’s </s>” one word per step, sampling from a softmax over the hidden state at each step]
Sutskever et al. (2014)
58
[Diagram: image-conditioned RNN decoder generating the caption “a man is rowing” one word per step]
Vinyals et al. (2014) Show and Tell: A Neural Image Caption Generator
Regions in ConvNets
− Each image region is represented by a feature vector (/matrix).
59
Each point in a “higher” level of a convnet corresponds to a region of the input image. Xu et al. call these “annotation vectors”, a_i, i ∈ {1, . . . , L}
Regions in ConvNets
60
Regions in ConvNets
61
Regions in ConvNets
62
Extension of LSTM via the context vector
− Lower convolutional layer to have the correspondence between the feature vectors and portions of the 2-D image
63
The LSTM gates are conditioned on the previous word, the previous hidden state, and the context vector:

(i_t, f_t, o_t, g_t) = (σ, σ, σ, tanh) T_{D+m+n,n} (E y_{t−1}, h_{t−1}, ẑ_t)   (1)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t   (2)
h_t = o_t ⊙ tanh(c_t)   (3)

e_{ti} = f_att(a_i, h_{t−1}),   α_{ti} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk}),   ẑ_t = φ({a_i}, {α_i})

p(y_t | a, y_{t−1}) ∝ exp(L_o (E y_{t−1} + L_h h_t + L_z ẑ_t))

E: embedding matrix; y: captions; h: previous hidden state; ẑ: context vector, a dynamic representation of the relevant part of the image
φ is the ‘attention’ (‘focus’) function – ‘soft’ / ‘hard’
f_att: an MLP conditioned on the previous hidden state
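A hedged sketch of one such decoder step in PyTorch (my module and variable names; nn.LSTMCell stands in for the explicit gate equations, and the output layer is simplified to use h_t only):

```python
import torch
import torch.nn as nn

class ShowAttendTellStep(nn.Module):
    """One decoding step: attend over annotation vectors a_i, then feed [E y_{t-1}; z_t]
    into an LSTM cell, and predict the next word from the new hidden state."""
    def __init__(self, vocab, emb_dim, ann_dim, hid_dim, att_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.f_att = nn.Sequential(nn.Linear(ann_dim + hid_dim, att_dim), nn.Tanh(),
                                   nn.Linear(att_dim, 1))      # MLP conditioned on h_{t-1}
        self.cell = nn.LSTMCell(emb_dim + ann_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, y_prev, h_prev, c_prev, a):
        # a: (B, L, ann_dim) annotation vectors; h_prev, c_prev: (B, hid_dim); y_prev: (B,) word ids
        h_rep = h_prev.unsqueeze(1).expand(-1, a.size(1), -1)
        e = self.f_att(torch.cat([a, h_rep], dim=-1)).squeeze(-1)   # e_{ti}
        alpha = torch.softmax(e, dim=-1)                            # alpha_{ti}
        z = torch.bmm(alpha.unsqueeze(1), a).squeeze(1)             # soft context vector z_t
        h, c = self.cell(torch.cat([self.embed(y_prev), z], dim=-1), (h_prev, c_prev))
        return self.out(h), h, c, alpha                             # word logits and new state
```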
Hard attention
64
We have two indices: ‘i’ runs over locations, ‘t’ runs over words. Stochastic decisions are discrete here, so derivatives are zero.

e_{ti} = f_att(a_i, h_{t−1}),   α_{ti} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk}),   ẑ_t = φ({a_i}, {α_i})

Attention locations are sampled: s_t ∼ Multinoulli_L({α_i}),
p(s_{t,i} = 1 | s_{j<t}, a) = α_{t,i},   ẑ_t = Σ_i s_{t,i} a_i

The loss is a variational lower bound on the marginal log-likelihood (due to Jensen’s inequality, E[log X] ≤ log E[X]):

L_s = Σ_s p(s | a) log p(y | s, a) ≤ log Σ_s p(s | a) p(y | s, a) = log p(y | a)

Its gradient,

∂L_s/∂W = Σ_s p(s | a) [ ∂ log p(y | s, a)/∂W + log p(y | s, a) ∂ log p(s | a)/∂W ],

is approximated by a Monte Carlo average over sampled attention sequences s̃_n:

∂L_s/∂W ≈ (1/N) Σ_{n=1}^{N} [ ∂ log p(y | s̃_n, a)/∂W + log p(y | s̃_n, a) ∂ log p(s̃_n | a)/∂W ]

To reduce the estimator variance, an entropy term H[s̃] and a bias (baseline) b are added [1,2]:

∂L_s/∂W ≈ (1/N) Σ_{n=1}^{N} [ ∂ log p(y | s̃_n, a)/∂W + λ_r (log p(y | s̃_n, a) − b) ∂ log p(s̃_n | a)/∂W + λ_e ∂H[s̃_n]/∂W ]

[1] J. Ba et al. “Multiple object recognition with visual attention”
[2] A. Mnih et al. “Neural variational inference and learning in belief networks”
Hard attention
65
− makes a “zero-one” (hard) decision about where to attend
− trained with reinforcement learning
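A minimal sketch, under my own simplifications, of the sampled (REINFORCE-style) estimator above as a surrogate loss in PyTorch; `log_p_y_given_s`, `baseline`, and the λ coefficients are placeholders, not the paper’s exact training code:

```python
import torch

def hard_attention_surrogate_loss(alpha, log_p_y_given_s, baseline,
                                  lambda_r=1.0, lambda_e=0.01, n_samples=5):
    """Monte-Carlo surrogate whose gradient matches the sampled estimator above.
    alpha: (B, L) attention probabilities for one step.
    log_p_y_given_s(s): (B,) caption log-likelihood given a sampled one-hot location s."""
    dist = torch.distributions.OneHotCategorical(probs=alpha)
    total = 0.0
    for _ in range(n_samples):
        s = dist.sample()                              # hard, zero-one attention decision
        log_p_y = log_p_y_given_s(s)                   # pathwise term: d log p(y|s,a) / dW
        reward = (log_p_y - baseline).detach()         # centered "reward" (baseline lowers variance)
        total = total + log_p_y + lambda_r * reward * dist.log_prob(s)
    entropy_bonus = lambda_e * dist.entropy().mean()   # entropy term H[s] for variance reduction
    return -(total / n_samples).mean() - entropy_bonus # negate: we minimize the loss
```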
Soft attention
66
Instead of sampling s_t and computing ẑ_t = Σ_i s_{t,i} a_i, we take the expected context vector:

E_{p(s_t|a)}[ẑ_t] = Σ_{i=1}^{L} α_{t,i} a_i,   i.e.   φ({a_i}, {α_i}) = Σ_{i=1}^{L} α_i a_i

This corresponds to the soft attention of Bahdanau et al. (2014). The whole model is smooth and differentiable under the deterministic attention; learning works via standard backprop.
Soft attention
67
Theoretical argument: the normalized weighted geometric mean (NWGM) of the softmax outputs can be computed with a single forward prop using the expected values:

NWGM[p(y_t = k | a)] = Π_i exp(n_{t,k,i})^{p(s_{t,i}=1|a)} / Σ_j Π_i exp(n_{t,j,i})^{p(s_{t,i}=1|a)} = exp(E_{p(s_t|a)}[n_{t,k}]) / Σ_j exp(E_{p(s_t|a)}[n_{t,j}])

E[n_t] = L_o(E y_{t−1} + L_h E[h_t] + L_z E[ẑ_t]),   and NWGM[p(y_t = k | a)] ≈ E[p(y_t = k | a)] [1]

[1] P. Baldi et al. “The dropout learning algorithm”
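The soft branch is just a weighted average; a one-line sketch (assumed tensor shapes in the comments):

```python
import torch

def soft_context(alpha, a):
    """Deterministic soft attention: E_{p(s_t|a)}[z_t] = sum_i alpha_{t,i} a_i.
    alpha: (B, L) attention weights, a: (B, L, D) annotation vectors -> (B, D)."""
    return torch.bmm(alpha.unsqueeze(1), a).squeeze(1)
```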
How soft/hard attention works
68
How soft/hard attention works
69
− Hard attention: sample regions of attention, maximize a variational lower bound on the log-likelihood
− Soft attention: compute the expected attention (context vector)
70
Hard Attention
71
Soft Attention
The Good
72
And the Bad
73
Quantitative results
74
Human judgments (M1, M2) and automatic metrics (BLEU, CIDEr):
Model           | Human M1 | Human M2 | BLEU  | CIDEr
Human           | 0.638    | 0.675    | 0.471 | 0.91
Google          | 0.273    | 0.317    | 0.587 | 0.946
MSR             | 0.268    | 0.322    | 0.567 | 0.925
Attention-based | 0.262    | 0.272    | 0.523 | 0.878
Captivator      | 0.250    | 0.301    | 0.601 | 0.937
Berkeley LRCN   | 0.246    | 0.268    | 0.534 | 0.891
M1: humans preferred (or rated as equal) the method’s caption over the human annotation; M2: Turing test
+2 BLEU
+4 BLEU
Video Description Generation
75
− Context set consists of per-frame context vectors, and attention mechanism that selects one of those vectors for each output symbol being decoded – capturing the global temporal structure across frames − 3-D conv-net that applies local filters across spatio-temporal dimensions working on motion statistics
3D ConvNet
Internal self-attention in deep learning models
In addition to connecting the decoder with the encoder, attention can be used inside the model, replacing RNN and CNN! Transformer from Google
Attention Is All You Need, Vaswani et al, NIPS 2017
76
Parametrization – Recurrent Neural Nets
− Source word embeddings are encoded into contextualized vectors by a bidirectional RNN
BERT [Devlin et al., 2019] and all the other muppets
77
Encoder: the source tokens x_1, x_2, …, x_{T_x} are read by a bidirectional RNN:
h_t = [→h_t; ←h_t],   →h_t = RNN(x_t, →h_{t−1}),   ←h_t = RNN(x_t, ←h_{t+1})
Decoder: trained with the negative log-likelihood (NLL) of p(y_l | y_{<l}, X), conditioning on the ground-truth previous words y*_1, y*_2, …, y*_{l−1}
Parametrization – Recurrent Neural Nets
− the decoder attends over the source vectors
− a key component in many recent advances
− …
78
α_{t'} ∝ exp(score(h_{t'}, z_{t−1}, y_{t−1})),   c_t = Σ_{t'=1}^{T_x} α_{t'} h_{t'}
z_t = RNN([y_{t−1}; c_t], z_{t−1}),   p(y_t = v | y_{<t}, X) ∝ exp(score(z_t, v))
Side-note: gated recurrent units to attention
79
h_t = u_t ⊙ h_{t−1} + (1 − u_t) ⊙ h̃_t,   h̃_t = f(x_t, h_{t−1})
Side-note: gated recurrent units to attention
Unrolling the gated update expresses h_t as a weighted combination of the candidate hidden vectors:
80
h_t = u_t ⊙ h_{t−1} + (1 − u_t) ⊙ h̃_t
    = u_t ⊙ (u_{t−1} ⊙ h_{t−2} + (1 − u_{t−1}) ⊙ h̃_{t−1}) + (1 − u_t) ⊙ h̃_t
    = u_t ⊙ (u_{t−1} ⊙ (u_{t−2} ⊙ h_{t−3} + (1 − u_{t−2}) ⊙ h̃_{t−2}) + (1 − u_{t−1}) ⊙ h̃_{t−1}) + (1 − u_t) ⊙ h̃_t
    = Σ_{i=1}^{t} (Π_{j=i+1}^{t} u_j) ⊙ (1 − u_i) ⊙ h̃_i
Side-note: gated recurrent units to attention
− what if the weights were predicted directly by the network?
− what about multiple heads?
81
Generalizing the unrolled gated update step by step:
1. h_t = Σ_{i=1}^{t} (Π_{j=i+1}^{t} u_j) ⊙ (1 − u_i) ⊙ h̃_i   (the gates define the weights)
2. h_t = Σ_{i=1}^{t} α_i h̃_i,   α_i ∝ exp(score(h̃_i, x_t))   (let a small network compute the weights)
3. h_t = Σ_{i=1}^{t} α_i f(x_i),   α_i ∝ exp(score(f(x_i), x_t))
4. h_t = Σ_{i=1}^{t} α_i V(f(x_i)),   α_i ∝ exp(score(K(f(x_i)), Q(x_t)))   (values, keys, queries)
Multiple heads: h_t = [h_t^1; …; h_t^K],   h_t^k = Σ_{i=1}^{t} α_i^k V^k(f(x_i)),   α_i^k ∝ exp(score(K^k(f(x_i)), Q^k(x_t)))
Generalized dot-product attention - vector form
82
Generalized dot-product attention - matrix form
queries, keys, values stacked into matrices
83
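A sketch of the matrix form, softmax(Q Kᵀ / √d_k) V, with an optional mask (shapes noted in the comments):

```python
import math
import torch

def attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for all queries at once.
    Q: (..., T_q, d_k), K: (..., T_k, d_k), V: (..., T_k, d_v)."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))    # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                     # attention weights
    return weights @ V, weights
```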
Three types of attention in Transformer
− Encoder-decoder attention: Q = [current decoder state], K = V = [encoder states] (as in seq2seq with a BiRNN encoder)
− Encoder self-attention: Q = K = V = [encoder states]
− Masked decoder self-attention: Q = K = V = [decoder states] (but a state can only attend to previous states)
84
Other tricks in Transformer
− positional encodings (trainable parameter embeddings also work)
85
decoder
the encoder and the decoder
Transformer Full Model and Performance
86
87
Attention Is All You Need, Vaswani et al, NIPS 2017
Transformer Model
− the full model (figure from the original paper)
− a stack of encoders (6 in the paper)
− also a stack of decoders of the same number
88
Transformer Model: Encoder
− each encoder block can be broken down into 2 parts: a self-attention layer and a feed-forward network
89
Transformer Model: Encoder
90
Transformer Model: Encoder
− example: “The animal didn’t cross the street because it was too tired” – what does “it” refer to?
91
Self-Attention: Step 1 (Create Vectors)
92
Self-Attention: Step 2 (Calculate score), 3 and 4
93
Self-Attention: Steps 5 and 6
− multiply each value vector by the softmax score
− sum up the weighted value vectors
94
Self-Attention: Matrix Form
95
Self-Attention:
Multiple Heads
96
Self-Attention: Multiple Heads
97
Self-Attention: Multiple Heads
− focusing (the model’s representation of “it” has some of “the animal” and some of “tired” in it)
− with all heads in the picture, though, it becomes harder to interpret
98
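A hedged PyTorch sketch of multi-head self-attention (my module and parameter names), reusing the scaled dot-product `attention` sketch from earlier in these notes:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Project the input into n_heads smaller (Q, K, V) spaces, attend in each head
    independently, then concatenate and project back (a sketch, not the reference code)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def split_heads(self, x):                      # (B, T, d_model) -> (B, heads, T, d_head)
        B, T, _ = x.shape
        return x.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x, mask=None):
        q = self.split_heads(self.q_proj(x))
        k = self.split_heads(self.k_proj(x))
        v = self.split_heads(self.v_proj(x))
        out, _ = attention(q, k, v, mask)          # scaled dot-product sketch from earlier
        B, _, T, _ = out.shape
        return self.out_proj(out.transpose(1, 2).reshape(B, T, -1))
```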
Positional Embeddings
99
What does it look like?
Positional Embeddings
100
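A common way to realize the sinusoidal positional encodings from the Transformer paper; a small numpy sketch (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle).
    Returns a (max_len, d_model) array that is added to the token embeddings."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # even dimensions
    angles = pos / np.power(10000.0, i / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```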
The Residuals
− each sub-layer in the encoder has a residual connection around it, followed by a layer normalization
101
The Residuals
− the same holds for the sub-layers in the decoder as well
102
The Decoder
− the self-attention layers in the decoder may only attend to earlier positions in the output sequence
− this is done by masking future positions (setting them to −inf before the softmax in the calculation)
103
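A tiny sketch of such a mask, compatible with the scaled dot-product `attention` sketch above:

```python
import torch

def causal_mask(T):
    """Lower-triangular boolean mask: position t may attend to positions <= t only.
    Entries that are False get set to -inf before the softmax (see the attention sketch above)."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

# e.g. out, w = attention(Q, K, V, mask=causal_mask(Q.size(-2)))
```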
Final Layer
104
Results
105
− reducing the attention key size hurts the model
− a bigger model is better
− dropout is very helpful
− learned positional embeddings give nearly the same results
106
Image Transformer
107
Image Transformer
108
Task: image generation
Image Transformer
109
Unconditional Image Generation CelebA Super-resolution
Image Transformer
CelebA Super-resolution
110
Image Transformer
Cifar10 Super-resolution
111
Image Transformer
112
Conditional Image Completion Cifar10 Samples
Music Transformer Generating Music With Long-Term Structure
113
Music language model:
Prior work: Performance RNN (Simon & Oore, 2016)
Music Transformer Generating Music With Long-Term Structure
114
Let’s hear some samples!
Music Transformer: Self-Similarity
115
Music Transformer: Samples
116
Continuations to a given initial motif: given motif vs. RNN-LSTM vs. vanilla Transformer
(vs WaveNet)
− generalizes beyond the length of the training examples
Music Transformer
Summary
○ connecting the encoder and the decoder in sequence-to-sequence tasks
○ achieving scale-invariance and focus in image processing
○ self-attention can be a basic building block for neural nets, often replacing RNNs and CNNs [recent research, take it with a grain of salt]
117
118