The problem
- For very long sentences, the score for machine translation really goes down after 30-40 words.
- Prof. Leal-Taixé and Prof. Niessner
Bahdanau et al. 2014. Neural machine translation by jointly learning to align and translate.
(Plot: translation performance vs. sentence length, comparing a model with attention against the performance degradation without it.)
Basic structure of a RNN
- We want to have a notion of “time” or “sequence”.
[Christopher Olah] Understanding LSTMs
(Diagram: the hidden state is computed from the input and the previous hidden state.)
(Diagram: the hidden state and the parameters to be learned.)
- Same parameters for each time step = generalization!
- Unrolling RNNs: the hidden state is the same.
[Christopher Olah] Understanding LSTMs
Long-term dependencies
I moved to Germany … so I speak German fluently
Attention: intuition
I moved to Germany … so I speak German fluently
ATTENTION: Which hidden states are more important to predict my output?
- A context vector is built from the attention weights α_{1,t+1}, …, α_{t,t+1}, α_{t+1,t+1} over the encoder hidden states.
Attention: architecture
- A decoder processes the information.
- Decoders take as input:
  – Previous decoder hidden state
  – Previous output
  – Attention (the context vector)
Attention
- α_{k,t+1} indicates how important the word in position k is for translating the word in position t+1.
- The context aggregates the attention:

  c_{t+1} = Σ_{k=1}^{t+1} α_{k,t+1} a_k

- Soft attention: all attention weights α sum up to 1.
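The context computation above can be sketched in a few lines of plain Python; the encoder states and attention weights below are made-up toy values:

```python
def soft_attention_context(alphas, encoder_states):
    """Context vector c = sum_k alpha_k * a_k (weighted average of encoder states)."""
    assert abs(sum(alphas) - 1.0) < 1e-9, "soft attention weights must sum to 1"
    dim = len(encoder_states[0])
    return [sum(a * state[d] for a, state in zip(alphas, encoder_states))
            for d in range(dim)]

# Toy example: three encoder hidden states, attention focused on the first word.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
alphas = [0.7, 0.2, 0.1]
context = soft_attention_context(alphas, states)
# context is approximately [0.75, 0.25]: mostly the first state, a little of the rest.
```

Because the context is a weighted average, it stays in the same space as the encoder states, whichever state dominates.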
Computing the attention mask
- We can train a small neural network that takes the hidden state of the encoder a_1 and the previous state of the decoder d_t and outputs a score f_{1,t+1}.
- Normalize with a softmax:

  α_{1,t+1} = exp(f_{1,t+1}) / Σ_{k=1}^{t+1} exp(f_{k,t+1})
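This scoring-plus-normalization step can be sketched as follows; the dot-product scorer is a hypothetical stand-in for the small trained network:

```python
import math

def score(a_k, d_t):
    # Stand-in for the small trained NN: here simply a dot product of
    # the encoder hidden state a_k and the previous decoder state d_t.
    return sum(x * y for x, y in zip(a_k, d_t))

def attention_weights(encoder_states, d_t):
    """Softmax-normalize the scores f_{k,t+1} into weights alpha_{k,t+1}."""
    f = [score(a_k, d_t) for a_k in encoder_states]
    m = max(f)                       # subtract the max for numerical stability
    e = [math.exp(x - m) for x in f]
    z = sum(e)
    return [x / z for x in e]

# Toy example: three encoder states, a decoder state aligned with two of them.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_t = [1.0, 0.0]
alphas = attention_weights(states, d_t)
# The weights are positive and sum to 1 (soft attention).
```

States with equal scores receive equal weight, and the middle (lowest-scoring) state is down-weighted but never zeroed out.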
Attention for vision

Why do we need attention?
- We use the whole image to make the classification (e.g. “BIRD”).
- Are all pixels equally important?
- Wouldn’t it be easier and computationally more efficient to just run our classification network on the patch?
Soft attention for captioning

Image captioning
Xu et al. 2015. Show, attend and tell: neural image caption generation with visual attention.
Image captioning
- Input: an image.
- Output: a sentence describing the image.
- Encoder: a classification CNN (VGGNet, AlexNet). This computes a feature map over the image.
- Decoder: an attention-based RNN.
  – In each time step, the decoder computes an attention map over the entire image, effectively deciding which regions to focus on.
  – It receives a context vector, which is the weighted average of the conv net features.
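One decoder time step over the spatial feature map can be sketched as follows; the 2x2 grid, the linear scoring function, and all values are toy stand-ins for the real CNN features and the learned attention network:

```python
import math

def attend_over_grid(features, h, score_w):
    """One attention step over a flattened feature grid.

    features: list of region feature vectors (a flattened H x W grid)
    h:        decoder hidden state
    score_w:  weights of a toy linear scoring function over [feature, h]
    Returns (attention map, context = weighted average of region features).
    """
    f = [sum(w * x for w, x in zip(score_w, feat + h)) for feat in features]
    m = max(f)
    e = [math.exp(x - m) for x in f]
    z = sum(e)
    alphas = [x / z for x in e]
    dim = len(features[0])
    context = [sum(a * feat[d] for a, feat in zip(alphas, features))
               for d in range(dim)]
    return alphas, context

# A 2x2 grid of 2-D region features, e.g. from the last conv layer (no FC layer).
grid = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2], [0.9, 0.1]]
h = [0.5]                   # toy 1-D decoder hidden state
w = [1.0, -1.0, 0.0]        # toy scoring weights
alphas, context = attend_over_grid(grid, h, w)
```

The attention map has one weight per spatial region, so it can be reshaped back to the grid and visualized over the image, as in the captioning figures.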
Conventional captioning
- Encoder-decoder pipeline: the LSTM only sees the image once!
Image from: https://blog.heuritech.com/2016/01/20/attention-mechanism/
Attention mechanism
A girl is throwing a frisbee in the park
Attention mechanism
- y: output of the encoder, i.e. the image features, which still retain spatial information (no FC layer!)
- z: output of the attention model
- h: hidden state of the LSTM
What does the attention model look like?
Attention model
- Attention architecture: the inputs are any past hidden state and the visual features; the output is the attention.
Image: https://blog.heuritech.com/2016/01/20/attention-mechanism/
- Inputs = a feature descriptor for each image patch, still related to the spatial location in the image.
Attention model
- We want a bounded output, e.g.

  m_i = tanh(W_h h + W_y y_i)
- Softmax to create the attention values between 0 and 1.
- Multiplied by the image features → ranking by importance.
Hard attention model
- Choosing one of the features by sampling with the attention probabilities.
Types of attention
- Soft attention: a deterministic process that can be backpropagated through.
- Hard attention: a stochastic process; the gradient is estimated through Monte Carlo sampling.
- Soft attention is the most commonly used, since it can be incorporated into the optimization more easily.
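The two selection rules can be sketched in a few lines; the features, weights, and the fixed random seed are toy values:

```python
import random

def soft_select(alphas, features):
    """Soft attention: a deterministic weighted average (differentiable)."""
    dim = len(features[0])
    return [sum(a * f[d] for a, f in zip(alphas, features)) for d in range(dim)]

def hard_select(alphas, features, rng):
    """Hard attention: sample ONE feature with probabilities alpha (stochastic)."""
    idx = rng.choices(range(len(features)), weights=alphas, k=1)[0]
    return features[idx]

feats = [[1.0, 0.0], [0.0, 1.0]]
alphas = [0.9, 0.1]
soft = soft_select(alphas, feats)                     # a blend of both features
hard = hard_select(alphas, feats, random.Random(0))   # exactly one of the features
```

The soft output mixes all features, so gradients flow through every weight; the hard output is one untouched feature, which is why its gradient has to be estimated by sampling instead.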
- Soft vs. hard attention (comparison figure).
Types of attention: soft
- Can be backpropagated through.
- Uses all of the image to compute the final context.
Image: Stanford CS231n lecture
Types of attention: hard
- You can view it as image cropping!
- If we cannot use gradient descent, what alternative could we use to train this function? Reinforcement learning.
Image: Stanford CS231n lecture
Image captioning with attention
Xu et al. 2015. Show, attend and tell: neural image caption generation with visual attention.
Interesting works on attention
- Luong et al. “Effective Approaches to Attention-based Neural Machine Translation”. EMNLP 2015.
- Chan et al. “Listen, Attend and Spell”. arXiv 2015.
- Chorowski et al. “Attention-based Models for Speech Recognition”. NIPS 2015.
- Yao et al. “Describing Videos by Exploiting Temporal Structure”. ICCV 2015.
- Xu and Saenko. “Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering”. arXiv 2015.
- Zhu et al. “Visual7W: Grounded Question Answering in Images”. arXiv 2015.
- Chu et al. “Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism”. ICCV 2017.
Conditioning

When do we need conditioning?
- Scene understanding from an image and an audio source: both need to be processed!
- Visual Question Answering: the sentence (question) needs to be understood, and the image is needed to create the answer.
- We have two sources; can we process one in the context of the other?
- Conditioning: the computation carried out by a model is conditioned or modulated by information extracted from an auxiliary input.
- Note: a similar effect can be obtained with attention (see p. 39).
- Generate images based on a word.
- Do we need to retrain a model for each word?
Image: https://distill.pub/2018/feature-wise-transformations/
Concatenation-based conditioning
Image: https://distill.pub/2018/feature-wise-transformations/
- Source: an image (high-dimensional) and a pose (low-dimensional) → the pose is expressed as an image (same dimensionality).
L. Ma et al. “Pose guided person image generation”. NIPS 2017.
Wait for the GAN intro in a few weeks!
- Sources: an image (high-dimensional) and measurements (low-dimensional).
A. Dosovitskiy and V. Koltun. “Learning to act by predicting the future”. ICLR 2017.
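Concatenation-based conditioning can be sketched as appending the low-dimensional source to every feature vector before the next layer; the names and values below are illustrative only:

```python
def concat_condition(features, conditioning):
    """Append the conditioning vector to each feature vector.

    The following layer then sees both sources jointly, so the weights
    acting on the appended part modulate the computation on the features.
    """
    return [feat + conditioning for feat in features]

image_feats = [[0.3, 0.7], [0.1, 0.5]]   # high-dimensional source (toy)
measurements = [1.0]                     # low-dimensional source (toy)
conditioned = concat_condition(image_feats, measurements)
# conditioned == [[0.3, 0.7, 1.0], [0.1, 0.5, 1.0]]
```

Since a linear layer applied to the concatenation splits into a term on the features plus a term on the conditioning vector, this is closely related to conditional biasing.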
Conditional biasing
- Think about the similarities with concatenation-based conditioning.
Image: https://distill.pub/2018/feature-wise-transformations/
Conditional scaling
Image: https://distill.pub/2018/feature-wise-transformations/
- Reminds you of… gating!
  – Long Short-Term Memory units
- Gating allows you to learn which inputs of, e.g., the two sources are more related to each other.
- All conditioning so far is on a feature level → efficient and effective → the number of parameters to be learned scales linearly with the number of features of the NN.
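Conditional scaling as gating can be sketched with a sigmoid gate computed from the auxiliary source; the linear gate weights below are a toy stand-in for a learned layer:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conditional_scaling(features, conditioning, gate_w):
    """Scale each feature by a gate in (0, 1) computed from the other source."""
    gates = [sigmoid(sum(w * c for w, c in zip(row, conditioning)))
             for row in gate_w]
    return [g * f for g, f in zip(gates, features)]

features = [2.0, -1.0]              # one activation per feature (toy)
conditioning = [1.0, 0.0]           # auxiliary-source embedding (toy)
gate_w = [[5.0, 0.0], [-5.0, 0.0]]  # toy per-feature gating weights
scaled = conditional_scaling(features, conditioning, gate_w)
# The first gate is near 1 (feature kept), the second near 0 (feature suppressed).
```

Note there is one gate per feature, which is why the number of conditioning parameters scales linearly with the number of features.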
- Can one do both conditional scaling and biasing?
Conditional affine transformation
- The scaling and biasing information comes from, e.g., the other source.
Image: https://distill.pub/2018/feature-wise-transformations/
E. Perez et al. “FiLM: Visual Reasoning with a General Conditioning Layer”. AAAI 2018.
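A FiLM-style feature-wise affine transformation can be sketched as follows; in the real layer, gamma and beta are predicted by a small network from the conditioning input, while here they are given directly as toy values:

```python
def film(features, gamma, beta):
    """Feature-wise affine transformation: out_c = gamma_c * x_c + beta_c.

    gamma (conditional scaling) and beta (conditional biasing) come from
    the auxiliary source, one pair per feature (channel).
    """
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

x = [1.0, 2.0, 3.0]       # toy per-channel activations
gamma = [0.0, 1.0, 2.0]   # conditional scaling (gamma = 0 switches a channel off)
beta = [0.5, 0.0, -1.0]   # conditional biasing
out = film(x, gamma, beta)
# out == [0.5, 2.0, 5.0]
```

Setting gamma to 1 recovers pure conditional biasing, and beta to 0 recovers pure conditional scaling, so the affine form subsumes both.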
What can we do with conditioning?
- Strub et al. “Visual Reasoning with Multi-hop Feature Modulation”. ECCV 2018.
- de Vries et al. “GuessWhat?! Visual object discovery through multi-modal dialogue”. CVPR 2017.
- Dumoulin et al. “A learned representation for artistic style”. ICLR 2017.
- van den Oord et al. “Conditional image generation with PixelCNN decoders”. NIPS 2016.
Visual Question Answering
Attention vs. Conditioning
- Attention: assumes that specific locations contain the most useful information.
- Conditioning: assumes that specific feature maps contain the most useful information.
Image: https://distill.pub/2018/feature-wise-transformations/
Next lecture
- No session on Friday.
- Next Monday: no lecture (CVPR break).