

slide-1
SLIDE 1

CMP722 ADVANCED COMPUTER VISION

Lecture #3 – Sequential Processing with NNs and Attention

Aykut Erdem // Hacettepe University // Spring 2019

Illustration: DeepMind

slide-2
SLIDE 2
  • deep learning
  • computation in a neural net
  • optimization
  • backpropagation
  • training tricks
  • convolutional neural networks

Previously on CMP722

Illustration: Koma Zhang // Quanta Magazine

slide-3
SLIDE 3

Good news, everyone!

  • We will be hearing about your project proposals next week!
  • Paper presentations will start in two weeks' time. Choose your papers!

3

slide-4
SLIDE 4

Lecture overview

  • sequential data
  • convolutions in time
  • recurrent neural networks (RNNs)
  • autoregressive generative models
  • attention models
  • case study: transformer model
  • Disclaimer: Much of the material and slides for this lecture were borrowed from:
    - Bill Freeman, Antonio Torralba and Phillip Isola's MIT 6.869 class
    - Aaron van den Oord's talk on "Neural Discrete Representation Learning"
    - Dzmitry Bahdanau's IFT 6266 slides
    - Arian Hosseini's IFT 6135 slides

4

slide-5
SLIDE 5

Sequences

5

slide-6
SLIDE 6

[http://moviebarcode.tumblr.com/]

time

6

slide-7
SLIDE 7

Convolutions in time

time

7

slide-8
SLIDE 8

It bothered him that the dog at three fourteen (seen from the side) should have the same name as the dog at three fifteen (seen from the front). — “Funes the Memorious”, Borges 1962

“The Persistence of Memory”, Dali 1931

8

slide-9
SLIDE 9

[https://www.youtube.com/watch?v=wxfGT-kKxiM]

9

slide-10
SLIDE 10

time

Rufus

10

slide-11
SLIDE 11

time

Douglas

11

slide-12
SLIDE 12

time

Rufus → Memory unit

12

slide-13
SLIDE 13

time

Rufus → Memory unit → Rufus!

13

slide-14
SLIDE 14

Recurrent Neural Networks (RNNs)

[diagram: Inputs → Hidden → Outputs]

14

slide-15
SLIDE 15

[diagram: Inputs → Hidden → Outputs]

time

Recurrent Neural Networks (RNNs)

15

slide-16
SLIDE 16

time

[diagram: Inputs → Hidden → Outputs]

Recurrent Neural Networks (RNNs)

16

slide-17
SLIDE 17

[diagram: Inputs → Hidden → Outputs; the hidden-to-hidden connection is recurrent!]

Recurrent Neural Networks (RNNs)

17

slide-18
SLIDE 18

time

[diagram: Inputs → Hidden → Outputs, unrolled in time]

Recurrent Neural Networks (RNNs)

18
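A minimal NumPy sketch of the recurrence the diagrams above depict (a vanilla RNN with a tanh nonlinearity; the shapes and random weights are illustrative, not from the slides):

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
        # the same weights are reused at every time step
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # hidden state mixes input and previous state
        y_t = W_hy @ h_t                           # output is read out from the hidden state
        return h_t, y_t

    rng = np.random.default_rng(0)
    d_in, d_h, d_out = 4, 8, 3
    W_xh = rng.normal(size=(d_h, d_in))
    W_hh = rng.normal(size=(d_h, d_h))
    W_hy = rng.normal(size=(d_out, d_h))

    h = np.zeros(d_h)                              # initial hidden state
    for x_t in rng.normal(size=(5, d_in)):         # unroll over a length-5 input sequence
        h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)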

slide-19
SLIDE 19

time

[diagram: Inputs → Hidden → … → Outputs, with stacked hidden layers]

Deep Recurrent Neural Networks (RNNs)

19

slide-20
SLIDE 20

Backprop through time

time

[diagram: Inputs → Hidden → Outputs, unrolled in time]

20

slide-21
SLIDE 21

time

[diagram: Inputs → Hidden → Outputs, unrolled in time]

21

slide-22
SLIDE 22

Recurrent linear layer

22

slide-23
SLIDE 23

The problem of long-range dependencies

[diagram: inputs, hidden states and outputs unrolled in time]

  • Capturing long-range dependencies requires propagating information through a long chain of dependencies.
  • Old observations are forgotten.
  • Stochastic gradients become high variance (noisy), and gradients may vanish or explode.

23

slide-24
SLIDE 24

time

Rufus → Memory unit → Rufus!

24

slide-25
SLIDE 25

time

Memory units …

25

slide-26
SLIDE 26

The problem of long-range dependencies

Why not remember everything?

  • Memory size grows with t.
  • This kind of memory is nonparametric: there is no finite set of parameters we can use to model it.
  • RNNs make a Markov assumption: the future hidden state depends only on the immediately preceding hidden state.
  • By putting the right info into the hidden state, RNNs can model dependencies that are arbitrarily far apart.

26

slide-27
SLIDE 27

The problem of long-range dependencies

Other methods exist that directly link old “memories” (observations or hidden states) to future predictions:

  • Temporal convolutions
  • Attention (see https://arxiv.org/abs/1706.03762)
  • Memory networks (see https://arxiv.org/abs/1410.3916)

27

slide-28
SLIDE 28

Long Short Term Memory (LSTM)

A special kind of RNN designed to avoid forgetting. Related to resnets: the inductive bias is that the state transition is an identity function, so the default behavior is to keep an old state rather than forget it. Instead of forgetting by default, the network has to learn to forget.

28

slide-29
SLIDE 29

29

[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

slide-30
SLIDE 30

30

[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

slide-31
SLIDE 31

C_t = cell state

31

[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

slide-32
SLIDE 32

Decide what information to throw away from the cell state. Each element of the cell state is multiplied by ~1 (remember) or ~0 (forget).

32

[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

slide-33
SLIDE 33

Decide what new information to add to the cell state: which indices to write to, and what to write to those indices.

33

[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

slide-34
SLIDE 34

Forget selected old information, write selected new information.

34

[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

slide-35
SLIDE 35

After having updated the cell state’s information, decide what to output.

35

[Slide derived from Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
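Putting the gates from slides 32–35 together, a minimal sketch of one LSTM step (following the notation of Olah's post; the packed weight layout and shapes are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W maps the concatenated [h_prev; x_t] to the four stacked gate pre-activations
        z = W @ np.concatenate([h_prev, x_t]) + b
        d = h_prev.size
        f = sigmoid(z[0*d:1*d])   # forget gate: ~1 keeps, ~0 erases cell entries
        i = sigmoid(z[1*d:2*d])   # input gate: which indices to write
        g = np.tanh(z[2*d:3*d])   # candidate values to write
        o = sigmoid(z[3*d:4*d])   # output gate: what to expose
        c_t = f * c_prev + i * g  # near-identity transition: by default the old state is kept
        h_t = o * np.tanh(c_t)    # decide what to output after updating the cell
        return h_t, c_t

    d_h, d_in = 8, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4 * d_h, d_h + d_in)) * 0.1
    b = np.zeros(4 * d_h)
    h, c = np.zeros(d_h), np.zeros(d_h)
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)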

slide-36
SLIDE 36

Models

Texture synthesis by non-parametric sampling

[diagram: synthesizing a pixel p by non-parametric sampling from an input image]

[Efros & Leung 1999]

36

slide-37
SLIDE 37

[PixelRNN, PixelCNN, van den Oord et al. 2016]

Texture synthesis with a deep net

[diagram: input partial image → f → predicted color of next pixel (“white”)]

37

slide-38
SLIDE 38

[diagram: input partial image → f → predicted color of next pixel (“white”) …]

38

[PixelRNN, PixelCNN, van den Oord et al. 2016]

slide-39
SLIDE 39

Idea: We can represent colors as discrete classes

Prediction for a single pixel i,j

[histogram over color classes: green, gray, blue, teal, brown, red, violet, orange]

39

slide-40
SLIDE 40

Softmax regression (a.k.a. multinomial logistic regression): the network outputs the predicted probability of each class given input x. A max likelihood learner! The loss picks out the -log likelihood of the ground truth class under the model prediction.

And we can interpret the learner as modeling P(next pixel | previous pixels):

40
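As a concrete sketch of this loss (the color classes and logit values below are made up for illustration):

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()          # stabilized softmax
        e = np.exp(z)
        return e / e.sum()

    # the network emits one logit per color class for the next pixel
    logits = np.array([2.0, 0.5, -1.0, 0.1])   # e.g. [white, gray, blue, red]
    p = softmax(logits)                        # P(next pixel | previous pixels)
    loss = -np.log(p[0])                       # -log likelihood of the ground-truth class ("white")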

slide-41
SLIDE 41

Network output

P(next pixel | previous pixels)

[bar chart: probability p over the color classes turquoise, blue, green, red, orange, gray, black, white]

41

slide-42
SLIDE 42

Network output

[bar chart: probability p over the color classes turquoise, blue, green, red, orange, gray, black, white]

42

slide-43
SLIDE 43

Network output

[bar chart: probability over the color classes turquoise, blue, green, red, orange, gray, black, white]

43

slide-44
SLIDE 44

Network output

[bar chart: probability over the color classes turquoise, blue, green, red, orange, gray, black, white]

44

slide-45
SLIDE 45

Network output

[bar chart: probability over the color classes turquoise, blue, green, red, orange, gray, black, white]

45

slide-46
SLIDE 46

46

slide-47
SLIDE 47

General product rule

The sampling procedure we defined above takes exact samples from the learned probability distribution (pmf). Multiplying all conditionals evaluates the probability of a full joint configuration of pixels.

Autoregressive probability model

47
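Written out (a standard identity, matching what the slide names), the general product rule is

    p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

Sampling pixel by pixel from the conditionals therefore gives an exact sample from the joint, and multiplying the conditional probabilities of an observed image evaluates its joint probability.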

slide-48
SLIDE 48

Models that allow us to sample, i.e. generate, images from scratch are called generative models. We will see more examples in a future lecture.

Autoregressive probability model

48

slide-49
SLIDE 49

[PixelRNN, van den Oord et al. 2016]

Samples from PixelRNN

49

slide-50
SLIDE 50

[PixelRNN, van den Oord et al. 2016]

Image completions (conditional samples) from PixelRNN

[figure columns: occluded, completions, original]

50

slide-51
SLIDE 51

Modeling Audio

51

slide-52
SLIDE 52

Causal Convolution

52

[diagram: Input → Hidden Layer]

slide-53
SLIDE 53

Causal Convolution

53

[diagram: Input → Hidden Layer → Hidden Layer]

slide-54
SLIDE 54

Causal Convolution

54

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer]

slide-55
SLIDE 55

Causal Convolution

55

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer → Output]

slide-56
SLIDE 56

Causal Convolution

56

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer → Output]

slide-57
SLIDE 57

Causal Dilated Convolution

57

[diagram: Input]

slide-58
SLIDE 58

Causal Dilated Convolution

58

[diagram: Input → Hidden Layer]

slide-59
SLIDE 59

Causal Dilated Convolution

59

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2)]

slide-60
SLIDE 60

Causal Dilated Convolution

60

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2) → Hidden Layer (dilation=4)]

slide-61
SLIDE 61

Causal Dilated Convolution

61

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2) → Hidden Layer (dilation=4) → Output (dilation=8)]

slide-62
SLIDE 62

Causal Dilated Convolution

62

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2) → Hidden Layer (dilation=4) → Output (dilation=8)]
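A minimal NumPy sketch of a 1-D causal dilated convolution as in the diagrams above (kernel size 2; the weights and input are illustrative):

    import numpy as np

    def causal_dilated_conv(x, w, dilation):
        # output at time t sees only x[t], x[t-d], x[t-2d], ... - never the future
        k = len(w)
        pad = dilation * (k - 1)
        xp = np.concatenate([np.zeros(pad), x])   # left-pad so nothing leaks from the future
        return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                         for t in range(len(x))])

    # stacking dilations 1, 2, 4, 8 grows the receptive field exponentially with depth
    x = np.random.randn(32)
    h = x
    for d in (1, 2, 4, 8):
        h = causal_dilated_conv(h, np.array([0.5, 0.5]), d)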

slide-63
SLIDE 63

Multiple Stacks

63

slide-64
SLIDE 64

Sampling

64

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer → Output]

slide-65
SLIDE 65

Sampling

65

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer → Output; audio samples: speech, music]

slide-66
SLIDE 66

Attention Models in Deep Learning

66

slide-67
SLIDE 67

A lot of things are called “attention” these days...

  • 1. Attention (alignment) models used in applications of deep supervised learning with variable-length inputs and outputs (typically sequential).
  • 2. Models of visual attention that process a region of an image at high resolution or the whole image at low resolution.
  • 3. Internal self-attention mechanisms that can be used to replace recurrent and convolutional networks for sequential data.
  • 4. Addressing schemes of memory-augmented neural networks.

The shared idea: focus on the relevant parts of the input (output).

67

slide-68
SLIDE 68

Attention in Deep Learning Applications [to Language Processing]

machine translation, speech recognition, speech synthesis, summarization, … any sequence-to-sequence (seq2seq) task

68

slide-69
SLIDE 69

Traditional deep learning approach

input → d-dimensional feature vector → layer1 → … → layerk → output

Good for: image classification, phoneme recognition, decision-making in reflex agents (ATARI)

Less good for: text classification

Not really good for: … everything else?!

69

slide-70
SLIDE 70

Example: Machine Translation

[“An”, “RNN”, “example”, “.”] → [“Un”, “example”, “de”, “RNN”, “.”]

Machine translation presented a challenge to vanilla deep learning:

  • input and output are sequences
  • the lengths vary
  • input and output may have different lengths
  • no obvious correspondence between positions in the input and in the output

70

slide-71
SLIDE 71

Vanilla seq2seq learning for machine translation

Recurrent Continuous Translation Models, Kalchbrenner et al., EMNLP 2013
Sequence to Sequence Learning with Neural Networks, Sutskever et al., NIPS 2014
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., EMNLP 2014

[diagram: input sequence → fixed size representation → output sequence]

71

slide-72
SLIDE 72

Problems with vanilla seq2seq

  • training the network to encode 50 words in a vector is hard ⇒ very big models are needed
  • gradients have to flow back through 50 steps without vanishing ⇒ training can be slow and require lots of data

[diagram annotations: bottleneck; looong term dependencies]

72

slide-73
SLIDE 73

Soft attention

  • lets the decoder focus on the relevant hidden states of the encoder, avoiding squeezing everything into the last hidden state ⇒ no bottleneck!
  • dynamically creates shortcuts in the computation graph that allow the gradient to flow freely ⇒ shorter dependencies!
  • best with a bidirectional encoder

73

Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al, ICLR 2015

slide-74
SLIDE 74

Soft attention - math 1

At each step the decoder consumes a different weighted combination of the encoder states, called the context vector or glimpse.

74

slide-75
SLIDE 75

Soft attention - math 2

But where do the weights come from? They are computed by another network! The choice from the original paper is a 1-layer MLP:

75
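A sketch of both steps in NumPy (the 1-layer MLP scorer of Bahdanau et al.; the matrix names and shapes are illustrative assumptions):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def soft_attention(s_prev, H, W, U, v):
        # H: (T, d_h) encoder states; s_prev: (d_s,) previous decoder state
        scores = np.array([v @ np.tanh(W @ s_prev + U @ h) for h in H])  # 1-layer MLP score per position
        alpha = softmax(scores)      # attention weights over the input positions
        context = alpha @ H          # weighted combination of encoder states ("glimpse")
        return context, alpha

    T, d_h, d_s, d_a = 6, 8, 8, 16
    rng = np.random.default_rng(0)
    H = rng.normal(size=(T, d_h))
    s_prev = rng.normal(size=d_s)
    W, U, v = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a)
    context, alpha = soft_attention(s_prev, H, W, U, v)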

slide-76
SLIDE 76

Soft attention - computational aspects

The computational complexity of using soft attention is quadratic. But it’s not slow:

  • for each pair of i and j:
    - sum two vectors
    - apply tanh
    - compute a dot product
  • this can be done in parallel for all j, i.e.:
    - add a vector to a matrix
    - apply tanh
    - compute a vector-matrix product
  • softmax is cheap
  • the weighted combination is another vector-matrix product
  • in summary: just vector-matrix products = fast!

76

slide-77
SLIDE 77

Soft attention - visualization

Great visualizations at https://distill.pub/2016/augmented-rnns/#attentional-interfaces

slide-78
SLIDE 78

Soft attention - improvements

  • no performance drop on long sentences
  • much better than the RNN Encoder-Decoder
  • without unknown words, comparable with the SMT system

78

slide-79
SLIDE 79

Soft content-based attention pros and cons

Pros

  • faster training, better performance
  • good inductive bias for many tasks ⇒ lowers sample complexity

Cons

  • not good enough inductive bias for tasks with monotonic alignment (handwriting recognition, speech recognition)
  • chokes on sequences of length >1000

79

slide-80
SLIDE 80

Location-based attention

  • in content-based attention, the attention weights depend on the content at different positions of the input (hence the BiRNN)
  • in location-based attention, the current attention weights are computed relative to the previous attention weights

80

slide-81
SLIDE 81

Gaussian mixture location-based attention

Originally proposed for handwriting synthesis. The (unnormalized) weight of the input position u at the time step t is parametrized as a mixture of K Gaussians

81

Section 5, Generating Sequences with Recurrent Neural Networks, A. Graves 2014

slide-82
SLIDE 82

Gaussian mixture location-based attention

The new locations of Gaussians are computed as a sum of the previous ones and the predicted offsets

82
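Putting the two slides together in formulas (following Section 5 of Graves, 2014), the unnormalized weight of input position u at step t and the update of the Gaussian locations are

    \phi(t, u) = \sum_{k=1}^{K} \alpha_t^k \exp\!\left( -\beta_t^k \left( \kappa_t^k - u \right)^2 \right),
    \qquad
    \kappa_t^k = \kappa_{t-1}^k + \exp\!\left( \hat{\kappa}_t^k \right)

where the network predicts the importances \alpha_t^k, the (inverse) widths \beta_t^k, and the offsets \hat{\kappa}_t^k; the exp keeps the offsets positive, which is what makes the alignment monotonic.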

slide-83
SLIDE 83

Gaussian mixture location-based attention

The first soft attention mechanism ever!

Pros:

  • good for problems with monotonic alignment

Cons:

  • predicting the offset can be challenging
  • only monotonic alignment (although the exp could in theory be removed)

83

slide-84
SLIDE 84

Various soft-attentions

  • use a dot-product or non-linearity of choice instead of tanh in content-based attention
  • use a unidirectional RNN instead of a Bi-RNN (but not pure word embeddings!)
  • explicitly remember past alignments with an RNN
  • use a separate embedding for each of the positions of the input (heavily used in Memory Networks)
  • mix content-based and location-based attention

See "Attention-Based Models for Speech Recognition" by Chorowski et al. (2015) for a scalability analysis of various attention mechanisms on speech recognition.

84

slide-85
SLIDE 85

Going back in time: Connectionist Temporal Classification (CTC)

  • CTC is a predecessor of soft attention that is still widely used
  • has a very successful inductive bias for monotonic seq2seq transduction
  • core idea: sum over all possible ways of inserting blank tokens in the output so that it aligns with the input

85

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, Graves et al, ICML 2006

slide-86
SLIDE 86

CTC

The marginalization the figure annotates, written out: with \mathbf{y} the labeling and \mathbf{x} the input,

    p(\mathbf{y} \mid \mathbf{x}) = \sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{y})} p(\boldsymbol{\pi} \mid \mathbf{x}) = \sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x}, t)

where the sum runs over all labellings with blanks \boldsymbol{\pi} that collapse to \mathbf{y}, p(\boldsymbol{\pi} \mid \mathbf{x}) is the conditional probability of a labeling with blanks, and p(\pi_t \mid \mathbf{x}, t) is the probability of outputting \pi_t at the step t.

86

slide-87
SLIDE 87

CTC

  • can be viewed as modelling p(y|x) as a sum of all p(y|a,x), where a is a monotonic alignment
  • thanks to the monotonicity assumption, the marginalization of a can be carried out with the forward-backward algorithm (a.k.a. dynamic programming)
  • hard stochastic monotonic attention
  • popular in speech and handwriting recognition
  • the y_i are conditionally independent given a and x, but this can be fixed

87

slide-88
SLIDE 88

Soft Attention and CTC for seq2seq: summary

  • the most flexible and general is content-based soft attention, and it is very widely used, especially in natural language processing
  • location-based soft attention is appropriate when the input and the output can be monotonically aligned; location-based and content-based approaches can be mixed
  • CTC is less generic but can be hard to beat on tasks with monotonic alignments

88

slide-89
SLIDE 89

Visual and Hard Attention

89

slide-90
SLIDE 90

Models of Visual Attention

  • Convnets are great! But they process the whole image at a high resolution.
  • “Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene” (Mnih et al., 2014)
  • hence the idea: build a recurrent network that focuses on a patch of an input image at each step and combines information from multiple steps
90

Recurrent Models of Visual Attention, V. Mnih et al, NIPS 2014

slide-91
SLIDE 91

A Recurrent Model of Visual Attention

[diagram labels: “retina-like” representation; glimpse location (sampled from a Gaussian); RNN state; action (e.g. output a class)]

91

slide-92
SLIDE 92

A Recurrent Model of Visual Attention - math 1

Objective: maximize the expected sum of rewards over the interaction sequence. When used for classification, the correct class is known; instead of sampling the actions, the log-probability of the correct class is used as a reward ⇒ this optimizes a Jensen lower bound on the log-probability p(a*|x)!

92

slide-93
SLIDE 93

A Recurrent Model of Visual Attention

The gradient of J has to be approximated (REINFORCE). A baseline is used to lower the variance of the estimator.

93
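In formulas, this is the standard REINFORCE estimator with a baseline:

    \nabla_\theta J \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T}
        \nabla_\theta \log \pi_\theta\!\left( a_t^m \mid s_{1:t}^m \right) \left( R^m - b_t \right)

where M interaction sequences are sampled from the current policy, R^m is the sum of rewards of sequence m, and the baseline b_t lowers the variance without biasing the gradient.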

slide-94
SLIDE 94

A Recurrent Visual Attention Model - visualization

94

slide-95
SLIDE 95

Soft and Hard Attention

The RAM attention mechanism is hard: it outputs a precise location where to look. Content-based attention from neural MT is soft: it assigns weights to all input locations. CTC can be interpreted as a hard attention mechanism with a tractable gradient.

95

slide-96
SLIDE 96

Soft and Hard Attention

Soft

  • deterministic
  • exact gradient
  • O(input size)
  • typically easy to train

Hard

  • stochastic*
  • gradient approximation**
  • O(1)
  • harder to train

* deterministic hard attention would not have gradients
** exact gradient can be computed for models with tractable marginalization (e.g. CTC)

96

slide-97
SLIDE 97

Soft and Hard Attention

Can soft content-based attention be used for vision? Yes.

Show, Attend and Tell, Xu et al, ICML 2015

Can hard attention be used for seq2seq? Yes.

Learning Online Alignments with Continuous Rewards Policy Gradient, Luo et al, NIPS 2016 (but the learning curves are a nightmare…)

97

slide-98
SLIDE 98

DRAW: soft location-based attention for vision

98

slide-99
SLIDE 99

Internal self-attention in deep learning models

In addition to connecting the decoder with the encoder, attention can be used inside the model, replacing RNNs and CNNs! The Transformer from Google:

Attention Is All You Need, Vaswani et al, NIPS 2017

99

slide-100
SLIDE 100

Generalized dot-product attention - vector form

[diagram labels: queries, keys, values, outputs]

100

slide-101
SLIDE 101

Generalized dot-product attention - matrix form

  • the rows of Q, K, V are the queries, keys, values
  • the softmax acts row-wise

101
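A minimal NumPy sketch of the matrix form (scaled dot-product attention as in Vaswani et al.; the example shapes are illustrative):

    import numpy as np

    def dot_product_attention(Q, K, V):
        # rows of Q are queries, rows of K keys, rows of V values
        scores = Q @ K.T / np.sqrt(K.shape[-1])               # scaled dot products
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)                 # softmax acts row-wise
        return w @ V                                          # each output row is a weighted sum of values

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 6))
    out = dot_product_attention(Q, K, V)   # shape (3, 6): one output row per query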

slide-102
SLIDE 102

Three types of attention in Transformer

  • the usual attention between encoder and decoder: Q = [current state], K = V = [BiRNN states]
  • self-attention in the encoder (the encoder attends to itself!): Q = K = V = [encoder states]
  • masked self-attention in the decoder (attends to itself, but a state can only attend to previous states): Q = K = V = [decoder states]

102

slide-103
SLIDE 103

Summary

  • attention is used to focus on parts of inputs/outputs
  • it can be content/location based and hard/soft
  • its three main distinct uses are:
    - connecting encoder and decoder in sequence-to-sequence tasks
    - achieving scale-invariance and focus in image processing
    - self-attention as a basic building block for neural nets, often replacing RNNs and CNNs [recent research, take it with a grain of salt]

105

slide-104
SLIDE 104

Case Study: Transformer Model

106

Attention Is All You Need, Vaswani et al, NIPS 2017

slide-105
SLIDE 105

Transformer Model

  • It is a sequence-to-sequence model (from the original paper)
  • the encoding component is a stack of encoders (6 in the paper)
  • the decoding component is a stack of decoders of the same number

107

slide-106
SLIDE 106

Transformer Model: Encoder

  • The encoder can be broken down into 2 parts

108

slide-107
SLIDE 107

Transformer Model: Encoder

109

slide-108
SLIDE 108

Transformer Model: Encoder

  • Example: “The animal didn't cross the street because it was too tired”
  • Associate “it” with “animal”
  • look for clues when encoding
110

slide-109
SLIDE 109

Self-Attention: Step 1 (Create Vectors)

  • Create query, key, and value vectors: abstractions useful for calculating and thinking about attention

111

slide-110
SLIDE 110

Self-Attention: Step 2 (Calculate score), 3 and 4

112

slide-111
SLIDE 111

Self-Attention:

Step 5

  • multiply each value vector by the softmax score
  • sum up the weighted value vectors
  • this produces the output

113

slide-112
SLIDE 112

Self-Attention: Matrix Form

114

slide-113
SLIDE 113

Self-Attention:

Multiple Heads

115

slide-114
SLIDE 114

Self-Attention: Multiple Heads

116

slide-115
SLIDE 115

Self-Attention: Multiple Heads

  • Where different attention heads are focusing (the model’s representation of “it” has some of “animal” and “tired” in it)
  • With all heads in the picture, things are harder to interpret

117

slide-116
SLIDE 116

Positional Embeddings

  • To give the model a sense of order
  • Learned or predefined

118
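A sketch of the predefined (sinusoidal) variant from the Transformer paper:

    import numpy as np

    def sinusoidal_positions(max_len, d_model):
        # each position gets a unique pattern of sines and cosines at geometrically spaced frequencies
        pos = np.arange(max_len)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / (10000 ** (2 * i / d_model))
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)   # even dimensions
        pe[:, 1::2] = np.cos(angles)   # odd dimensions
        return pe                      # one row per position, added to the input embeddings

    pe = sinusoidal_positions(50, 16)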

slide-117
SLIDE 117

Positional Embeddings

  • What does it look like?

119

slide-118
SLIDE 118

The Residuals

  • Each sub-layer in each encoder has a residual connection around it, followed by a layer normalization

120

slide-119
SLIDE 119

The Residuals

  • This goes for the sub-layers in the decoder as well

121

slide-120
SLIDE 120

The Decoder

  • The self-attention can only attend to earlier positions in the output sequence.
  • Done by masking the future positions (setting them to -inf before the softmax calculation)
122
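A sketch of the masking trick in NumPy (the mask is added to the attention scores before the row-wise softmax):

    import numpy as np

    def causal_mask(T):
        # 0 on and below the diagonal, -inf above: future positions get zero weight after the softmax
        mask = np.triu(np.ones((T, T)), k=1)
        return np.where(mask == 1.0, -np.inf, 0.0)

    # usage: scores = Q @ K.T / np.sqrt(d_k) + causal_mask(T), then softmax row-wise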

slide-121
SLIDE 121

Final Layer

123


slide-122
SLIDE 122

Results

  • Machine Translation: WMT-2014 BLEU

                  EN-DE   EN-FR
    GNMT (orig)    24.6    39.9
    ConvSeq2Seq    25.2    40.5
    Transformer*   28.4    41.8

    *Transformer models trained >3x faster than the others.

  • Generating Wikipedia by Summarizing Long Sequences (msaleh@ et al., submission to ICLR’18)

                                  ROUGE
    seq2seq-attention              12.7
    Transformer-ED (L=500)         34.2
    Transformer-DMCA (L=11000)     36.2

124

Attention Is All You Need, Vaswani et al, NIPS 2017

slide-123
SLIDE 123

Results

What Matters

  • row B: reducing the attention key size hurts the model
  • row C: a bigger model is better
  • row D: dropout is helpful
  • sinusoidal and learned positional embeddings give the same results

125

slide-124
SLIDE 124

Next Lecture: Multimodality

126