Lecture #3 – Sequential Processing with NNs and Attention
Aykut Erdem // Hacettepe University // Spring 2019
CMP722
ADVANCED COMPUTER VISION
Illustration: DeepMind
Previously on CMP722
Illustration: Koma Zhang // Quanta Magazine
Good news, everyone!
Your project proposals are due next week!
Paper presentations start in two weeks' time. Choose your papers!
3
Lecture overview
— Bill Freeman, Antonio Torralba and Phillip Isola's MIT 6.869 class
— Aaron van den Oord's talk on "Neural Discrete Representation Learning"
— Dzmitry Bahdanau's IFT 6266 slides
— Arian Hosseini's IFT 6135 slides
4
Sequences
5
[http://moviebarcode.tumblr.com/]
time
6
Convolutions in time
time
7
It bothered him that the dog at three fourteen (seen from the side) should have the same name as the dog at three fifteen (seen from the front). — "Funes the Memorious", Borges 1962
“The Persistence of Memory”, Dali 1931
8
[https://www.youtube.com/watch?v=wxfGT-kKxiM]
9
time
Rufus
10
time
Douglas
11
time
Rufus Memory unit
12
time
Rufus Memory unit Rufus!
13
Recurrent Neural Networks (RNNs)
Hidden Outputs Inputs
14
Hidden Outputs Inputs
time
Recurrent Neural Networks (RNNs)
15
time
Hidden Outputs Inputs
Recurrent Neural Networks (RNNs)
16
Hidden Outputs Inputs Recurrent!
Recurrent Neural Networks (RNNs)
17
time
Hidden Outputs Inputs
Recurrent Neural Networks (RNNs)
18
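To make the recurrence concrete, here is a minimal sketch of one step of a vanilla RNN (the function and variable names are illustrative, not from the slides): the new hidden state mixes the previous hidden state and the current input through a shared set of weights, and an output is read off the hidden state at every step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN (illustrative sketch).

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state, shape (hidden_dim,)
    """
    # new hidden state: combine previous state and current input
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    # output read off the hidden state
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# unroll over a toy sequence, reusing the same weights at every time step
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 4, 8, 3, 5
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)
for t in range(T):
    x_t = rng.normal(size=input_dim)
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
```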
time
Hidden Outputs Inputs …
Deep Recurrent Neural Networks (RNNs)
19
Backprop through time
time
Hidden Outputs Inputs
20
time
Hidden Outputs Inputs
21
Recurrent linear layer
22
The problem of long-range dependencies
time
Hidden Outputs Inputs
Gradients have to flow through a long chain of dependencies,
and can vanish or explode along the way.
23
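A tiny numerical illustration (my own sketch, not from the slides) of why this happens: backpropagating through T recurrent steps multiplies the gradient by the recurrent Jacobian T times, so its norm scales roughly like the T-th power of the largest singular value of the recurrent weights.

```python
import numpy as np

def backprop_norm_through_time(W_hh, T):
    """Norm of a gradient pushed back through T linear recurrent steps."""
    g = np.ones(W_hh.shape[0])          # some upstream gradient
    for _ in range(T):
        g = W_hh.T @ g                  # (ignoring the tanh derivative for simplicity)
    return np.linalg.norm(g)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
for scale in (0.05, 0.5):               # small vs. large recurrent weights
    print(scale, [round(backprop_norm_through_time(scale * W, T), 3) for T in (1, 10, 50)])
# small weights: the norm shrinks towards 0 (vanishing gradients)
# large weights: the norm blows up (exploding gradients)
```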
time
Rufus Memory unit Rufus!
24
time
Memory units …
25
The problem of long-range dependencies
Why not remember everything?
The hidden state is finite, limiting the parameters we can use to model it.
Each hidden state only directly depends on the immediately preceding hidden state.
Yet the data can contain dependencies that are arbitrarily far apart.
26
The problem of long-range dependencies
Other methods exist that do directly link old “memories” (observations or hidden states) to future predictions:
27
Long Short-Term Memory (LSTM)
A special kind of RNN designed to avoid forgetting. Related to ResNets: the inductive bias is that the state transition is an identity function, so the default behavior is to keep an old state rather than forget it. Instead of forgetting by default, the network has to learn to forget.
28
29
[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
30
[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Ct = Cell state
31
[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Decide what information to throw away from the cell state. Each element of cell state is multiplied by ~1 (remember) or ~0 (forget).
32
[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Decide what new information to add to the cell state: which indices to write to, and what to write to those indices.
33
[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Forget selected old information, write selected new information.
34
[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
After having updated the cell state’s information, decide what to output.
35
[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
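A compact sketch of the gating equations described above (NumPy, with illustrative parameter names): the forget gate f scales the old cell state, the input gate i selects which new candidate values g get written, and the output gate o decides what part of the cell state is exposed as the hidden state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four gates,
    stacked as [forget, input, candidate, output]."""
    z = W @ x_t + U @ h_prev + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])            # forget gate: ~1 remember, ~0 forget
    i = sigmoid(z[H:2*H])          # input gate: which indices to write to
    g = np.tanh(z[2*H:3*H])        # candidate values: what to write
    o = sigmoid(z[3*H:4*H])        # output gate: what to expose
    c_t = f * c_prev + i * g       # update the cell state
    h_t = o * np.tanh(c_t)         # hidden state read off the cell state
    return h_t, c_t

# toy usage
rng = np.random.default_rng(0)
D, H = 4, 8
W = rng.normal(size=(4 * H, D)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```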
[Figure: synthesizing a pixel by non-parametric sampling from an input image]
[Efros & Leung 1999]
Models
Texture synthesis by non-parametric sampling
36
[PixelRNN, PixelCNN, van den Oord et al. 2016]
Input: partial image → predicted color: "white"
Texture synthesis with a deep net
37
Input: partial image → predicted color: "white", …
38
[PixelRNN, PixelCNN, van den Oord et al. 2016]
…
Prediction for a single pixel i,j
green gray blue teal brown red violet
Idea: We can represent colors as discrete classes
39
Softmax regression (a.k.a. multinomial logistic regression) gives the predicted probability of each class given input x. It is a maximum-likelihood learner: the loss picks out the -log likelihood of the true class under the model prediction.
And we can interpret the learner as modeling P(next pixel | previous pixels):
40
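As a sketch of this softmax-regression view (illustrative color classes and scores only): the network outputs one score per discrete color class, a softmax turns the scores into P(next pixel | previous pixels), and training minimizes the negative log-likelihood of the true color.

```python
import numpy as np

COLORS = ["green", "gray", "blue", "teal", "brown", "red", "violet"]

def softmax(scores):
    scores = scores - scores.max()            # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def nll_loss(scores, true_class):
    """Negative log-likelihood of the true color under the model prediction."""
    probs = softmax(scores)
    return -np.log(probs[true_class])

# pretend these scores came from the network, given the previous pixels
scores = np.array([0.2, 1.5, 0.1, -0.3, 0.0, 2.0, -1.0])
probs = softmax(scores)                       # P(next pixel | previous pixels)
print(dict(zip(COLORS, probs.round(3))))
print("loss if the true pixel is 'red':", nll_loss(scores, COLORS.index("red")))
```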
[Figure: network output, a probability distribution over the color classes (turquoise, blue, green, red, gray, black, white, …): P(next pixel | previous pixels)]
41
46
General product rule
The sampling procedure we defined above takes exact samples from the learned probability distribution (pmf). Multiplying all conditionals evaluates the probability of a full joint configuration
Autoregressive probability model
47
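Written out, the product rule the slide refers to factorizes the joint distribution over pixels into per-pixel conditionals (ordering the pixels, e.g., in raster-scan order):

$$p(x) \;=\; \prod_{i=1}^{n} p\left(x_i \mid x_1, \dots, x_{i-1}\right)$$

Sampling each pixel from its conditional in turn therefore yields an exact sample from p(x), and multiplying the conditionals of an observed image evaluates its probability.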
Models that allow us to sample, i.e. generate, images from scratch are called generative models. We will see more examples in a future lecture.
Autoregressive probability model
48
[PixelRNN, van den Oord et al. 2016]
Samples from PixelRNN
49
[PixelRNN, van den Oord et al. 2016]
Image completions (conditional samples) from PixelRNN
completions
50
Modeling Audio
51
Causal Convolution
52
[Figure: causal convolution stack, Input → Hidden Layer → Hidden Layer → Hidden Layer → Output]
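A minimal sketch of a 1-D causal convolution (my own helper, NumPy only): the input is padded on the left so that each output position depends only on the current and earlier inputs, never on the future.

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: output[t] depends only on x[t], x[t-1], ..."""
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])   # pad on the left only
    return np.array([x_padded[t:t + k] @ w[::-1] for t in range(len(x))])

x = np.arange(8, dtype=float)
w = np.array([0.5, 0.3, 0.2])        # kernel of size 3
y = causal_conv1d(x, w)
print(y)    # y[0] uses only x[0]; y[1] uses x[0], x[1]; and so on
```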
Causal Dilated Convolution
57
[Figure: dilated causal convolution stack, Input → Hidden Layers → Output, with dilation = 1, 2, 4, 8 at successive layers]
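Stacking causal convolutions with dilations 1, 2, 4, 8, as in the figure, grows the receptive field exponentially with depth while keeping the network shallow. A sketch assuming kernel size 2 (a WaveNet-style choice; the helper name is hypothetical):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal convolution with gaps of size `dilation` between the taps."""
    k = len(w)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * x_padded[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.random.default_rng(0).normal(size=32)
h = x
for d in (1, 2, 4, 8):                  # one layer per dilation
    h = np.tanh(dilated_causal_conv1d(h, np.array([0.6, 0.4]), d))

# receptive field of the stack: 1 + sum over layers of (kernel_size - 1) * dilation
print(1 + sum((2 - 1) * d for d in (1, 2, 4, 8)))   # -> 16 input samples
```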
Multiple Stacks
63
Sampling
64
Sampling
65
[Audio: sample speech, sample music]
66
A lot of things are called "attention" these days...
attention in models with variable-length inputs and outputs (typically sequential)
attention in convolutional networks for sequential data
The shared idea: focus on the relevant parts of the input (or output).
67
Attention in Deep Learning Applications [to Language Processing]
machine translation, speech recognition, speech synthesis, summarization, … any sequence-to-sequence (seq2seq) task
68
Traditional deep learning approach
input → d-dimensional feature vector → layer 1 → … → layer k → output
Good for: image classification, phoneme recognition, decision-making in reflex agents (ATARI)
Less good for: text classification
Not really good for: … everything else?!
69
Example: Machine Translation
["An", "RNN", "example", "."] → ["Un", "exemple", "de", "RNN", "."]
Machine translation presented a challenge to vanilla deep learning: the number of words varies both in the input and in the output.
70
Vanilla seq2seq learning for machine translation
Recurrent Continuous Translation Models, Kalchbrenner et al., EMNLP 2013
Sequence to Sequence Learning with Neural Networks, Sutskever et al., NIPS 2014
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., EMNLP 2014
input sequence → fixed-size representation
71
Problems with vanilla seq2seq
big models are needed
training can be slow and require lots of data
the fixed-size representation is a bottleneck
looong-term dependencies
72
Soft attention
lets the decoder focus on the relevant encoder hidden states
no need to squeeze all the information into the last hidden state ⇒ no bottleneck!
dynamically creates shortcuts in the computation graph that allow the gradient to flow freely ⇒ shorter dependencies!
works best with a bidirectional encoder
73
Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al, ICLR 2015
Soft attention - math 1
At each step the decoder consumes a different weighted combination of the encoder hidden states.
74
Soft attention - math 2
But where do the weights come from? They are computed by another network! The choice from the original paper is a 1-layer MLP:
75
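A sketch of this additive (MLP) attention with illustrative names: a 1-layer MLP scores each encoder state h_j against the previous decoder state s, the scores are softmax-normalized into weights, and the context vector is the weighted combination of the encoder states.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """s_prev: previous decoder state, shape (d_dec,)
       H:      encoder states, shape (T, d_enc)"""
    # e_j = v^T tanh(W_s s_prev + W_h h_j): one score per input position
    scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v
    weights = softmax(scores)            # attention weights over the input
    context = weights @ H                # weighted combination of encoder states
    return context, weights

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 6, 8, 8, 16
H = rng.normal(size=(T, d_enc))
s = rng.normal(size=d_dec)
W_s = rng.normal(size=(d_att, d_dec)) * 0.1
W_h = rng.normal(size=(d_att, d_enc)) * 0.1
v = rng.normal(size=d_att) * 0.1
context, weights = additive_attention(s, H, W_s, W_h, v)
print(weights.round(3), weights.sum())   # the weights sum to 1
```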
Soft attention - computational aspects
The computational complexity of using soft attention is quadratic in the sequence length. But it's not slow. Per input position one only needs to:
○ sum two vectors
○ apply tanh
○ compute a dot product
and all positions can be processed at once:
○ add a vector to a matrix
○ apply tanh
○ compute a vector-matrix product
76
Soft attention - visualization
Great visualizations at http://distill.pub/2016/augmented-rnns/#attentional-interfaces
77
Soft attention - improvements
no performance drop on long sentences
much better than RNN Encoder-Decoder
without unknown words, comparable with the SMT system
78
Soft content-based attention pros and cons
Pros:
Cons:
cannot exploit a monotonic alignment (handwriting recognition, speech recognition)
79
Location-based attention
does not look at the content of the encoder states (BiRNN)
the attention weights are computed relative to the previous attention weights
80
Gaussian mixture location-based attention
Originally proposed for handwriting synthesis. The (unnormalized) weight of the input position u at the time step t is parametrized as a mixture of K Gaussians
81
Section 5, Generating Sequences with Recurrent Neural Networks, A. Graves 2014
Gaussian mixture location-based attention
The new locations of Gaussians are computed as a sum of the previous ones and the predicted offsets
82
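For reference, the formulation from the paper (reconstructed here, so treat the exact symbols as an approximation): the unnormalized weight of input position u at step t is a mixture of K Gaussians whose locations can only move forward,

$$\phi(t, u) \;=\; \sum_{k=1}^{K} \alpha_t^{k}\, \exp\!\left(-\beta_t^{k}\,\bigl(\kappa_t^{k} - u\bigr)^{2}\right),
\qquad
\kappa_t^{k} \;=\; \kappa_{t-1}^{k} + \exp\!\left(\hat{\kappa}_t^{k}\right)$$

where the mixture weights α, widths β, and location offsets κ̂ are predicted by the network at each step.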
Gaussian mixture location-based attention
The first soft attention mechanism ever! Pros:
Cons:
83
Various soft-attentions
… attention (used in Memory Networks)
See “Attention-Based Models for Speech Recognition” by Chorowski et al (2015) for a scalability analysis of various attention mechanisms on speech recognition.
84
Going back in time: Connectionist Temporal Classification (CTC)
an approach from 2006 that is still widely used
an inductive bias for monotonic seq2seq transduction
considers all ways of inserting blank tokens in the output so that it aligns with the input
85
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, Graves et al, ICML 2006
CTC
$p(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} \prod_t p(a_t \mid x)$, where $y$ is the labeling, $x$ is the input, the sum runs over all labellings with blanks $a$ that collapse to $y$, and $p(a_t \mid x)$ is the probability of the token at the step $t$.
86
CTC
assumes a monotonic alignment
the sum over labellings can be carried out with the forward-backward algorithm (a.k.a. dynamic programming)
widely used in speech and handwriting recognition
assumes the per-step outputs are conditionally independent given x, but this can be fixed
87
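A small sketch of the CTC collapsing function B mentioned above (whose inverse image B⁻¹(y) appears in the sum): a per-step labeling with blanks is mapped to the final labeling by first merging repeated tokens and then removing the blanks.

```python
BLANK = "-"

def ctc_collapse(path):
    """Map a per-step labeling with blanks to the output labeling:
    merge repeated tokens, then drop blanks."""
    out = []
    prev = None
    for token in path:
        if token != prev and token != BLANK:
            out.append(token)
        prev = token
    return "".join(out)

# all of these per-step paths collapse to the same labeling "cat";
# CTC sums their probabilities to get p("cat" | x)
print(ctc_collapse("cc-aa-t-"))   # -> cat
print(ctc_collapse("c-a-tt"))     # -> cat
print(ctc_collapse("-c--at--"))   # -> cat
```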
Soft Attention and CTC for seq2seq: summary
soft content-based attention is the most popular flavour of attention and it is very widely used, especially in natural language processing
location-based attention is useful when the input and the output can be monotonically aligned; location-based and content-based approaches can be mixed
CTC is a good fit for strictly monotonic alignments
88
Visual and Hard Attention
89
Models of Visual Attention
humans do not process a whole scene at once and in full resolution; they "focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene" (Mnih et al, 2014)
the model processes a small part of an input image at each step and combines information from multiple steps
90
Recurrent Models of Visual Attention, V. Mnih et al, NIPS 2014
A Recurrent Model of Visual Attention
"retina-like" representation, glimpse location (sampled from a Gaussian), RNN state, action (e.g. output a class)
91
A Recurrent Model of Visual Attention - math 1
Objective: maximize the expected sum of rewards over the interaction sequence. When used for classification, the correct class is known; instead of sampling the actions, the following expression is used as a reward ⇒ optimizes a Jensen lower bound on the log-probability p(a*|x)!
92
A Recurrent Model of Visual Attention
The gradient of J has to be approximated (REINFORCE). A baseline is used to lower the variance of the estimator.
93
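Written out in its standard form (not copied from the slide), the REINFORCE estimator with a baseline is

$$\nabla_\theta J \;\approx\; \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T}
\nabla_\theta \log \pi_\theta\!\left(a_t^{m} \mid s_{1:t}^{m}\right)\,\bigl(R^{m} - b_t\bigr)$$

where the M interaction sequences are sampled from the current policy, R is the (sum of) rewards, and the baseline b_t reduces the variance without biasing the gradient.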
A Recurrent Visual Attention Model - visualization
94
Soft and Hard Attention
RAM's attention mechanism is hard: it outputs a precise location where to look. Content-based attention from neural MT is soft: it assigns weights to all input locations. CTC can be interpreted as a hard attention mechanism with a tractable gradient.
95
Soft and Hard Attention
[Table: comparison of soft vs. hard attention]
* deterministic hard attention would not have gradients
** an exact gradient can be computed for models with tractable marginalization (e.g. CTC)
96
Soft and Hard Attention
Can soft content-based attention be used for vision? Yes.
Show, Attend and Tell, Xu et al, ICML 2015
Can hard attention be used for seq2seq? Yes.
Learning Online Alignments with Continuous Rewards Policy Gradient, Luo et al, NIPS 2016 (but the learning curves are a nightmare…)
97
DRAW: soft location-based attention for vision
98
Internal self-attention in deep learning models
In addition to connecting the decoder with the encoder, attention can be used inside the model, replacing RNNs and CNNs! The Transformer from Google:
Attention Is All You Need, Vaswani et al, NIPS 2017
99
keys values queries
Generalized dot-product attention - vector form
100
Generalized dot-product attention - matrix form
queries, values
101
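In matrix form, the generalized dot-product attention is Attention(Q, K, V) = softmax(QKᵀ/√d_k) V. A minimal NumPy sketch (illustrative shapes only):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # each output is a weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
print(dot_product_attention(Q, K, V).shape)   # (3, 16)
```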
Three types of attention in Transformer
encoder-decoder attention: Q = [current state], K = V = [BiRNN states]
encoder self-attention: Q = K = V = [encoder states]
decoder self-attention (masked: a state can only attend to previous states): Q = K = V = [decoder states]
102
Summary
○ attention for connecting the encoder and the decoder in sequence-to-sequence tasks
○ attention for achieving scale-invariance and focus in image processing
○ self-attention can be a basic building block for neural nets, often replacing RNNs and CNNs [recent research, take it with a grain of salt]
105
106
Attention Is All You Need, Vaswani et al, NIPS 2017
Transformer Model
The model (from the original paper) is a stack of encoders (6 in the paper) together with a stack of decoders of the same number.
107
Transformer Model: Encoder
Each encoder breaks down into 2 parts: a self-attention layer and a feed-forward network.
108
Transformer Model: Encoder
109
Transformer Model: Encoder
"The animal didn't cross the street because it was too tired": which word does "it" refer to?
110
Self-Attention: Step 1 (Create Vectors)
111
Self-Attention: Step 2 (Calculate score), 3 and 4
112
Self-Attention: Step 5
multiply each value vector by the softmax score
sum up the weighted value vectors
113
Self-Attention: Matrix Form
114
Self-Attention: Multiple Heads
115
Self-Attention: Multiple Heads
116
Self-Attention: Multiple Heads
attention is less focused on a single position (the model's representation of "it" has some of "the animal" and some of "tired" in it)
harder to interpret
117
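A sketch of multiple heads on top of the single-head attention above (illustrative shapes and names): the model projects the inputs, splits them into several lower-dimensional subspaces, runs attention independently in each, and concatenates the results.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (n, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        scores = q @ k.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        heads.append(w @ v)                      # each head attends in its own subspace
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate the heads, project back

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 32, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)   # (6, 32)
```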
Positional Embeddings
118
What does it look like?
Positional Embeddings
119
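A sketch of the sinusoidal positional encodings used in the original paper: each position gets a fixed vector of sines and cosines at geometrically spaced frequencies, which is added to the token embedding.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16), added to input embeddings of the same shape
```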
The Residuals
Each sub-layer in the encoder has a residual connection around it, followed by a layer normalization.
120
The Residuals
The same holds for the sub-layers in the decoder as well.
121
The Decoder
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence.
This is done by masking future positions (setting them to -inf before the softmax calculation).
122
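The masking described above can be implemented by adding -inf to the attention scores of all future positions before the softmax, as in this small sketch:

```python
import numpy as np

def masked_self_attention_weights(scores):
    """scores: (n, n) decoder self-attention scores (query position x key position)."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future positions
    scores = np.where(mask, -np.inf, scores)           # set future positions to -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)           # each position attends only to itself and earlier ones

print(masked_self_attention_weights(np.zeros((4, 4))).round(2))
# row t has non-zero weights only for positions <= t
```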
Final Layer
123
Results
124
Generating Wikipedia by Summarizing Long Sequences (msaleh@ et al., submission to ICLR'18):
                               ROUGE
  seq2seq-attention            12.7
  Transformer-ED (L=500)       34.2
  Transformer-DMCA (L=11000)   36.2

Machine Translation: WMT-2014 BLEU (Attention Is All You Need, NeurIPS 2017; Vaswani*, Shazeer*, Parmar*, Uszkoreit*, Jones*, Kaiser*, Gomez*, Polosukhin*):
                 EN-DE   EN-FR
  GNMT (orig)     24.6    39.9
  ConvSeq2Seq     25.2    40.5
  Transformer*    28.4    41.8
  *Transformer models trained >3x faster than the others.
Attention Is All You Need, Vaswani et al, NIPS 2017
Results
What Matters
reducing the attention key size hurts the model
a bigger model is better
dropout is helpful
sinusoidal and learned positional embeddings give the same results
125
126