
SLIDE 1

Attention

SLIDE 2

The problem

  • For very long sentences, the score for machine translation drops sharply after 30-40 words.

[Figure: translation score vs. sentence length, showing performance degradation without attention and sustained performance with attention]

Bahdanau et al. 2014. Neural machine translation by jointly learning to align and translate.

SLIDE 3

Basic structure of an RNN

  • We want to have a notion of "time" or "sequence"

[Figure: RNN cell with input, previous hidden state, and hidden state]

[Christopher Olah] Understanding LSTMs

SLIDE 4

Basic structure of an RNN

  • We want to have a notion of "time" or "sequence"

[Figure: RNN cell, highlighting the hidden state and the parameters to be learned]

SLIDE 5

Basic structure of an RNN

  • We want to have a notion of "time" or "sequence"

[Figure: RNN cell with hidden state and output; the same parameters are used for each time step = generalization!]

SLIDE 6

Basic structure of an RNN

  • Unrolling RNNs

[Figure: unrolled RNN; the hidden state update is the same at every step]

[Christopher Olah] Understanding LSTMs

SLIDE 7

Basic structure of an RNN

  • Unrolling RNNs
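Slides 3-7 are compact enough to mirror in code. Below is a minimal NumPy sketch of a vanilla RNN step, assuming the standard update h_t = tanh(W_hh h_{t-1} + W_xh x_t + b); the parametrization, names, and dimensions are illustrative rather than taken from the slides. The loop at the end is exactly the "unrolling" of slides 6-7: the same parameters are reused at every time step.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_dim, input_dim = 4, 3

    # Parameters to be learned; the SAME ones are used at every time step.
    W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
    W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden
    b = np.zeros(hidden_dim)

    def rnn_step(h_prev, x):
        # One RNN step: new hidden state from previous state and current input.
        return np.tanh(W_hh @ h_prev + W_xh @ x + b)

    # Unrolling: iterate the same step function over the whole sequence.
    h = np.zeros(hidden_dim)
    for x_t in rng.normal(size=(5, input_dim)):  # toy sequence of length 5
        h = rnn_step(h, x_t)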

SLIDE 8

Long-term dependencies

"I moved to Germany … so I speak German fluently"

SLIDE 9

Attention: intuition

"I moved to Germany … so I speak German fluently"

ATTENTION: which hidden states are more important to predict my output?

SLIDE 10

Attention: intuition

[Figure: encoder hidden states over "I moved to Germany … so I speak German fluently", combined into a context vector with attention weights α_{1,t+1}, …, α_{t,t+1}, α_{t+1,t+1}]

SLIDE 11

Attention: architecture

  • A decoder processes the information
  • Decoders take as input:
    – Previous decoder hidden state
    – Previous output
    – Attention

[Figure: chain of decoders D fed by a context vector built with attention weights α_{t,t+1}, α_{t+1,t+1}]

SLIDE 12

Attention

  • α_{1,t+1} indicates how important the word in position 1 is for translating the word in position t+1
  • The context aggregates the attention:

    c_{t+1} = Σ_{k=1}^{t+1} α_{k,t+1} a_k

  • Soft attention: all attention weights α sum up to 1

SLIDE 13

Computing the attention mask

  • We can train a small neural network: f_{1,t+1} = NN(a_1, d_t), where a_1 is a hidden state of the encoder and d_t is the previous state of the decoder
  • Normalize with a softmax:

    α_{1,t+1} = exp(f_{1,t+1}) / Σ_{k=1}^{t+1} exp(f_{k,t+1})
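Slides 12-13 fully specify soft attention, so a small sketch fits here. This assumes a one-layer tanh scoring network (the slides only ask for "a small neural network", so the architecture, names, and dimensions below are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    T, d_enc, d_dec = 6, 4, 4
    a = rng.normal(size=(T, d_enc))    # encoder hidden states a_1 .. a_{t+1}
    d_prev = rng.normal(size=d_dec)    # previous decoder state d_t

    # Scoring network f_{k,t+1} = NN(a_k, d_t); one tanh layer is an assumption.
    w = rng.normal(scale=0.1, size=d_enc + d_dec)
    f = np.array([np.tanh(np.concatenate([a_k, d_prev])) @ w for a_k in a])

    # Softmax normalization: soft attention, so the weights sum to 1.
    alpha = np.exp(f - f.max())
    alpha /= alpha.sum()

    # Context vector of slide 12: attention-weighted sum of encoder states.
    c = alpha @ a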

SLIDE 14

Attention for vision

SLIDE 15

Why do we need attention?

[Figure: image classified as BIRD]

  • We use the whole image to make the classification
  • Are all pixels equally important?

SLIDE 16

Why do we need attention?

  • Wouldn't it be easier and computationally more efficient to just run our classification network on the patch?

SLIDE 17

Soft attention for captioning
SLIDE 18

Image captioning

Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

SLIDE 19

Image captioning

  • Input: an image
  • Output: a sentence describing the image
  • Encoder: a classification CNN (VGGNet, AlexNet) that computes feature maps over the image
  • Decoder: an attention-based RNN
    – In each time step, the decoder computes an attention map over the entire image, effectively deciding which regions to focus on
    – It receives a context vector, which is the weighted average of the conv-net features

SLIDE 20

Conventional captioning

[Figure: encoder-decoder captioning pipeline; the LSTM only sees the image once!]

Image from: https://blog.heuritech.com/2016/01/20/attention-mechanism/

SLIDES 21-24

Attention mechanism

A girl is throwing a frisbee in the park

[Figure sequence across slides 21-24: at each decoding step the attention map highlights the image region corresponding to the word being generated]

SLIDE 25

Attention mechanism

  • y: output of the encoder, i.e. the image features, which still retain spatial information (no FC layer!)
  • z: output of the attention model
  • h: hidden state of the LSTM

SLIDE 26

Attention mechanism

What does the attention model look like?

SLIDE 27

Attention model

  • Attention architecture

[Figure: attention module taking any past hidden state and the visual features, and producing the output attention]

Image: https://blog.heuritech.com/2016/01/20/attention-mechanism/

SLIDE 28

Attention model

  • Inputs = one feature descriptor for each image patch

SLIDE 29

Attention model

  • Inputs = one feature descriptor for each image patch, still tied to a spatial location in the image

SLIDE 30

Attention model

  • We want a bounded output, hence the tanh:

    f_i = tanh(W h + U y_i)

SLIDE 31

Attention model

  • Softmax to create attention values between 0 and 1

SLIDE 32

Attention model

  • Multiplied by the image features → a ranking by importance
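Slides 28-32 walk through the soft-attention module step by step; here is a compact NumPy sketch of the same pipeline, simplified so that each patch gets a scalar score (weight shapes and names are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(2)
    n_patches, d_feat, d_hid = 9, 8, 8
    y = rng.normal(size=(n_patches, d_feat))  # one feature descriptor per patch
    h = rng.normal(size=d_hid)                # a past hidden state

    # Slide 30: bounded scores via tanh, f_i = tanh(W h + U y_i), reduced here
    # to scalar scores with weight vectors instead of matrices.
    W = rng.normal(scale=0.1, size=d_hid)
    U = rng.normal(scale=0.1, size=d_feat)
    f = np.tanh(h @ W + y @ U)

    # Slide 31: softmax squashes the scores into attention values in (0, 1).
    alpha = np.exp(f - f.max())
    alpha /= alpha.sum()

    # Slide 32: multiply the image features by their attention values and
    # aggregate; important patches dominate the context.
    z = alpha @ y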

SLIDE 33

Hard attention model

  • Choose one of the features by sampling with the attention probabilities
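For contrast, a minimal sketch of the hard variant under an assumed setup, where alpha is any valid probability vector over the patches:

    import numpy as np

    rng = np.random.default_rng(3)
    n_patches, d_feat = 9, 8
    y = rng.normal(size=(n_patches, d_feat))   # per-patch features
    alpha = rng.dirichlet(np.ones(n_patches))  # attention probabilities

    # Hard attention: sample ONE patch index with probabilities alpha and keep
    # only its features. The sampling is stochastic, so it cannot be
    # backpropagated through directly (see slide 34).
    idx = rng.choice(n_patches, p=alpha)
    z_hard = y[idx]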

SLIDE 34

Types of attention

  • Soft attention: a deterministic process that can be backpropagated through
  • Hard attention: a stochastic process; the gradient is estimated through Monte Carlo sampling
  • Soft attention is the most commonly used, since it can be incorporated into the optimization more easily

SLIDE 35

Types of attention

  • Soft vs. hard attention

[Figure: soft vs. hard attention maps]

SLIDE 36

Types of attention: soft

[Figure: attention module producing the final context]

  • Can be backpropagated through
  • Uses the whole image

Image: Stanford CS231n lecture
SLIDE 37

Types of attention: hard

  • You can view it as image cropping!
  • If we cannot use gradient descent, what alternative could we use to train this function? Reinforcement learning

Image: Stanford CS231n lecture

SLIDE 38

Image captioning with attention

Xu et al. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

SLIDE 39

Interesting works on attention

  • Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015
  • Chan et al., "Listen, Attend and Spell", arXiv 2015
  • Chorowski et al., "Attention-Based Models for Speech Recognition", NIPS 2015
  • Yao et al., "Describing Videos by Exploiting Temporal Structure", ICCV 2015
  • Xu and Saenko, "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering", arXiv 2015
  • Zhu et al., "Visual7W: Grounded Question Answering in Images", arXiv 2015
  • Chu et al., "Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism", ICCV 2017

SLIDE 40

Conditioning
SLIDE 41

When do we need conditioning?

  • Scene understanding from an image and an audio source: both need to be processed!

SLIDES 42-43

When do we need conditioning?

  • Visual Question Answering: the sentence (the question) needs to be understood, and the image is needed to create the answer

SLIDE 44

When do we need conditioning?

  • We have two sources: can we process one in the context of the other?
  • Conditioning: the computation carried out by a model is conditioned or modulated by information extracted from an auxiliary input
  • Note: something similar can be achieved with attention (see slide 39)

SLIDE 45

When do we need conditioning?

  • Generate images based on a word
  • Do we need to retrain the model for each word?

Image: https://distill.pub/2018/feature-wise-transformations/

SLIDES 46-48

Concatenation-based conditioning

[Figure sequence: the conditioning representation is concatenated to the network input or to intermediate features]

Image: https://distill.pub/2018/feature-wise-transformations/
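A minimal sketch of concatenation-based conditioning, assuming a feature vector x from the main input and a conditioning vector c from the auxiliary input (all names and shapes are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    d_x, d_c, d_out = 16, 4, 8
    x = rng.normal(size=d_x)   # features of the main input (e.g. an image)
    c = rng.normal(size=d_c)   # conditioning input (e.g. a pose, a word)

    # Concatenate both sources and let the next layer mix them; its weights
    # see x and c jointly.
    W = rng.normal(scale=0.1, size=(d_out, d_x + d_c))
    h = np.tanh(W @ np.concatenate([x, c]))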

SLIDE 49

Concatenation-based conditioning

  • Sources: an image (high-dimensional) and a pose (low-dimensional) → the pose is expressed as an image (same dimensionality)

L. Ma et al., "Pose Guided Person Image Generation", NIPS 2017
SLIDE 50

Concatenation-based conditioning

  • Sources: an image (high-dimensional) and a pose (low-dimensional) → the pose is expressed as an image (same dimensionality)
  • Wait for the GAN intro in a few weeks!

L. Ma et al., "Pose Guided Person Image Generation", NIPS 2017

SLIDE 51

Concatenation-based conditioning

  • Sources: an image (high-dimensional) and measurements (low-dimensional)

A. Dosovitskiy and V. Koltun, "Learning to Act by Predicting the Future", ICLR 2017
SLIDE 52

Conditional biasing

  • Think about the similarities with concatenation-based conditioning

Image: https://distill.pub/2018/feature-wise-transformations/
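In the same toy setting as above, conditional biasing predicts a per-feature bias from the conditioning input and adds it (a sketch; the single linear projection is an assumption):

    import numpy as np

    rng = np.random.default_rng(5)
    d_x, d_c = 16, 4
    x, c = rng.normal(size=d_x), rng.normal(size=d_c)

    # Project c to one bias per feature and ADD it to x. If a linear layer
    # follows, this is closely related to concatenating c to x, which is the
    # similarity the slide asks about.
    B = rng.normal(scale=0.1, size=(d_x, d_c))
    h = x + B @ c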

SLIDE 53

Conditional scaling

Image: https://distill.pub/2018/feature-wise-transformations/
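Conditional scaling is the multiplicative counterpart, again as a hedged sketch with an assumed linear projection:

    import numpy as np

    rng = np.random.default_rng(6)
    d_x, d_c = 16, 4
    x, c = rng.normal(size=d_x), rng.normal(size=d_c)

    # Project c to one scale per feature and MULTIPLY, a gating-like
    # operation (cf. the LSTM gates mentioned on the next slide).
    S = rng.normal(scale=0.1, size=(d_x, d_c))
    h = x * (S @ c)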

SLIDE 54

Conditional scaling

  • Reminds you of… gating
    – Long Short-Term Memory units
  • Gating lets you learn which inputs from the two sources are most related
  • All conditioning so far works at the feature level → efficient and effective: the number of parameters to be learned scales linearly with the number of features of the NN

SLIDE 55

Conditional scaling

  • Can one do both conditional scaling and biasing? Yes: a conditional affine transformation

SLIDE 56

[Figure: feature-wise affine transformation; the scale and bias come from information extracted from e.g. the other source]

E. Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer", AAAI 2018
Image: https://distill.pub/2018/feature-wise-transformations/

SLIDE 57

E. Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer", AAAI 2018
Image: https://distill.pub/2018/feature-wise-transformations/
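Scaling and biasing combined give the FiLM-style conditional affine transformation of slides 55-57. A minimal sketch; real FiLM predicts gamma and beta with a small network, so the single linear projections here are a simplification:

    import numpy as np

    rng = np.random.default_rng(7)
    d_x, d_c = 16, 4
    x = rng.normal(size=d_x)   # features of the main input
    c = rng.normal(size=d_c)   # conditioning input (e.g. the question in VQA)

    # Predict a per-feature scale gamma and bias beta from c, then apply
    # the conditional affine transformation.
    W_gamma = rng.normal(scale=0.1, size=(d_x, d_c))
    W_beta = rng.normal(scale=0.1, size=(d_x, d_c))
    gamma, beta = W_gamma @ c, W_beta @ c
    h = gamma * x + beta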
SLIDE 58

What can we do with conditioning?

  • Strub et al., "Visual Reasoning with Multi-hop Feature Modulation", ECCV 2018
  • de Vries et al., "GuessWhat?! Visual Object Discovery through Multi-modal Dialogue", CVPR 2017
  • Dumoulin et al., "A Learned Representation for Artistic Style", ICLR 2017
  • van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016

SLIDE 59

Visual Question Answering

SLIDE 60

Attention vs. conditioning

Image: https://distill.pub/2018/feature-wise-transformations/

SLIDE 61

Attention vs. conditioning

  • Attention assumes that specific locations contain the most useful information
  • Conditioning assumes that specific feature maps contain the most useful information

Image: https://distill.pub/2018/feature-wise-transformations/

SLIDE 62

Next lecture

  • No session on Friday
  • Next Monday: no lecture (CVPR break)