Lecture #08 – Attention and Memory
Aykut Erdem // Hacettepe University // Spring 2018
CMP784
DEEP LEARNING
Sherlock Holmes’ mind palace, BBC/Masterpiece's Sherlock
Breaking news!
− Practical 2 is due April 6, 23:59
− Midterm exam: check the midterm guide for details
− Language modeling with RNNs: due Sunday, April 22, 23:59
Previously on CMP784
− Recurrent neural networks (RNNs)
− The long short-term memory (LSTM) unit and its variants
Image credits: Oleg Soroko; "Using RNNs to generate Super Mario Maker levels", Adam Geitgey
Lecture overview
Disclaimer: Much of the material and slides for this lecture were borrowed from
— Mateusz Malinowski's lecture on Attention-based Networks
— Graham Neubig's CMU CS11-747 Neural Networks for NLP class
— Chris Dyer's Oxford Deep NLP class
— Yoshua Bengio's talk on From Attention to Memory and towards Longer-Term Dependencies
— Sumit Chopra's lecture on Reasoning, Attention and Memory
— Jason Weston's tutorial on Memory Networks for Language Understanding
— Richard Socher's talk on Dynamic Memory Networks
Deep Learning for Vision
Figure credit: Xiaogang Wang
Deep Learning for Speech
Figure credit: NVidia
Deep Learning for Text
[Figure: a recurrent network maps the words x1 … x5 through hidden layers z to a prediction ŷ]
"The movie was not bad at all. I had fun." → positive
Deep Models
Input Representation → Feature Extractor (encoder) FW1 → Classifier/Regressor (decoder) GW2 → Loss Function
"The movie was not bad at all. I had fun."
− Encoder: Fully Connected Network, Convolution Network, or Recurrent Network; can be seen as a prior on the type of transformation you want
− Decoder: typically a linear projection with some non-linearity (log-soft-max)
− Learnable parametric function
− Inputs: generally considered i.i.d.
− Outputs: classification or regression
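To make the encoder/decoder picture concrete, here is a minimal PyTorch sketch for the sentiment example above; the module choices and sizes are illustrative assumptions, not taken from the lecture.

```python
import torch
import torch.nn as nn

class SentimentModel(nn.Module):
    """Encoder/decoder view of a deep model (an illustrative sketch):
    a recurrent encoder extracts features, a linear + log-softmax head decodes a label."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                 # input representation
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # feature extractor F_W1
        self.decoder = nn.Linear(hidden_dim, num_classes)                # classifier/regressor G_W2
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, tokens):                               # tokens: (B, T) word ids
        _, h = self.encoder(self.embed(tokens))              # h: (1, B, hidden_dim)
        return self.log_softmax(self.decoder(h.squeeze(0)))  # log p(label | sentence)

# Training would use nn.NLLLoss on the log-probabilities (the "loss function" box).
```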
Encoder-Decoder Framework
− c = "universal representation"
[Figure: the encoder maps x1 … xT to a fixed vector c; the decoder generates y1 … yT′ from c]
− For bitext data: French encoder → English decoder (French sentence → English sentence)
− For unilingual data: English encoder → English decoder (English sentence → English sentence)
Sentence Representations
− A single fixed-size vector must represent the whole sentence
[Figure: "this is an example" encoded as one vector for the whole sentence vs. one vector per word]
Basic Idea
− Encode each word in the source sentence into a vector
− While decoding, combine these encoder vectors into a single vector, weighted by "attention weights" (which say where to look)
[Example: the decoder query vector (after "I hate") is scored against key vectors for "kono eiga ga kirai", giving a1=2.1, a2=-0.1, a3=0.3, a4=-1.0; a softmax turns these scores into weights α1=0.76, α2=0.08, α3=0.13, α4=0.03]
Calculating Attention
− Use a "query" vector (the decoder state) and "key" vectors (all encoder states)
− For each query-key pair, calculate a weight
− Normalize the weights to add to one using softmax

Calculating Attention
− Combine the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum
[Example: the value vectors for "kono eiga ga kirai" are multiplied by α1=0.76, α2=0.08, α3=0.13, α4=0.03 and summed to form the context vector]
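A minimal NumPy sketch of these steps; the vectors and sizes are toy values, purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())           # subtract max for numerical stability
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: score each key against the query,
    normalize with softmax, and return the weighted sum of values."""
    scores = keys @ query             # one score per source position
    alphas = softmax(scores)          # attention weights, sum to 1
    context = alphas @ values         # weighted combination of value vectors
    return context, alphas

# Toy example: 4 source positions ("kono eiga ga kirai"), hidden size 8
rng = np.random.default_rng(0)
keys = values = rng.normal(size=(4, 8))   # encoder states serve as keys and values
query = rng.normal(size=8)                # current decoder state
context, alphas = attend(query, keys, values)
print(alphas, alphas.sum())               # weights over the 4 source words
```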
A Graphical Example
(Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015, Jean et al. 2015)

End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
(Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015, Jean et al. 2015)
[Figure: translation quality (BLEU) from 2013 to 2016 for phrase-based SMT, syntax-based SMT, and neural MT. Figure credit: Rico Sennrich]
Attention Score Functions
− Multi-Layer Perceptron (Bahdanau et al. 2015)
    a(q, k) = w2ᵀ tanh(W1 [q; k])
  − Flexible, often very good with large data
− Bilinear (Luong et al. 2015)
    a(q, k) = qᵀ W k
− Dot Product (Luong et al. 2015)
    a(q, k) = qᵀ k
  − No parameters! But requires the sizes to be the same.
− Scaled Dot Product (Vaswani et al. 2017)
    a(q, k) = qᵀ k / √|k|
  − Problem: the scale of the dot product increases as the dimensions get larger
  − Fix: scale by the size of the vector
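The four score functions side by side, as a rough NumPy sketch; the parameter shapes and random initialization are placeholders, not from the slides.

```python
import numpy as np

d = 8                                   # hidden size (illustrative)
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
W1 = rng.normal(size=(d, 2 * d))        # MLP first layer
w2 = rng.normal(size=d)                 # MLP second layer
W  = rng.normal(size=(d, d))            # bilinear matrix

def mlp_score(q, k):          # Bahdanau et al. 2015
    return w2 @ np.tanh(W1 @ np.concatenate([q, k]))

def bilinear_score(q, k):     # Luong et al. 2015
    return q @ W @ k

def dot_score(q, k):          # Luong et al. 2015 (no parameters)
    return q @ k

def scaled_dot_score(q, k):   # Vaswani et al. 2017
    return q @ k / np.sqrt(len(k))
```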
Case Study: Show, Attend and Tell
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio. ICML 2015.
Paying Attention to Selected Parts While Uttering Words
[Figure: a recurrent decoder unrolled over time, sampling one word per step through a softmax, e.g. "<s> Akiko likes Pimm's </s>". Sutskever et al. (2014)]
[Figure: an image-conditioned recurrent decoder generating "a man is rowing" word by word. Vinyals et al. (2014), Show and Tell: A Neural Image Caption Generator]
Regions in ConvNets
− Each point in a "higher" level of a convnet defines a spatially localized feature vector (/matrix)
− Xu et al. call these "annotation vectors", ai, i ∈ {1, . . . , L}
[Figure: the L annotation vectors ai extracted from a convolutional feature map F of the image]
Extension of LSTM via the Context Vector
− Use a lower convolutional layer so that there is a correspondence between the feature vectors and portions of the 2-D image

    (it, ft, ot, gt)ᵀ = (σ, σ, σ, tanh)ᵀ TD+m+n,n (Eyt−1, ht−1, ẑt)ᵀ
    ct = ft ⊙ ct−1 + it ⊙ gt
    ht = ot ⊙ tanh(ct)

    eti = fatt(ai, ht−1),    αti = exp(eti) / Σk=1..L exp(etk)
    ẑt = φ({ai}, {αi})
    p(yt | a, y1:t−1) ∝ exp(Lo(Eyt−1 + Lhht + Lzẑt))

− E: embedding matrix; y: captions; h: previous hidden state; ẑ: context vector, a dynamic representation of the relevant part of the image
− φ is the "attention" ("focus") function: "soft" / "hard"
− fatt is an MLP conditioned on the previous hidden state
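A rough PyTorch sketch of one decoding step with soft attention over the annotation vectors; the module names and sizes are assumptions, and the way ẑt enters the LSTM is simplified relative to the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionDecoderStep(nn.Module):
    """One decoding step in the spirit of Show, Attend and Tell (soft attention).
    Sizes and module names are illustrative, not the paper's exact configuration."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, annot_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_a = nn.Linear(annot_dim, hidden_dim)    # part of f_att: project annotations
        self.att_h = nn.Linear(hidden_dim, hidden_dim)   # part of f_att: project h_{t-1}
        self.att_v = nn.Linear(hidden_dim, 1)            # f_att output score e_ti
        self.lstm = nn.LSTMCell(embed_dim + annot_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)     # plays the role of L_o

    def forward(self, prev_word, h, c, annotations):
        # prev_word: (B,) token ids; h, c: (B, hidden); annotations: (B, L, annot_dim)
        e = self.att_v(torch.tanh(self.att_a(annotations) +
                                  self.att_h(h).unsqueeze(1))).squeeze(-1)   # (B, L)
        alpha = F.softmax(e, dim=1)                         # α_t over the L regions
        z = (alpha.unsqueeze(-1) * annotations).sum(dim=1)  # soft ẑ_t = Σ_i α_ti a_i
        x = torch.cat([self.embed(prev_word), z], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c, alpha                     # logits for p(y_t | ·)
```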
Hard Attention
− We have two sequences: 'i' runs over localizations and 't' runs over words
− As before, eti = fatt(ai, ht−1),  αti = exp(eti) / Σk=1..L exp(etk),  ẑt = φ({ai}, {αi})
− The attention location st is now a discrete latent variable:
    p(st,i = 1 | sj<t, a) = αt,i,    ẑt = Σi st,i ai,    st ∼ MultinoulliL({αi})
− Stochastic decisions are discrete here, so derivatives are zero
− The loss is a variational lower bound on the marginal log-likelihood (by Jensen's inequality, E[log X] ≤ log E[X]):
    Ls = Σs p(s | a) log p(y | s, a) ≤ log Σs p(s | a) p(y | s, a) = log p(y | a)
− Its gradient is
    ∂Ls/∂W = Σs p(s | a) [ ∂ log p(y | s, a)/∂W + log p(y | s, a) ∂ log p(s | a)/∂W ]
  and is approximated by Monte Carlo sampling s̃ⁿ ∼ MultinoulliL({αi}):
    ∂Ls/∂W ≈ (1/N) Σn=1..N [ ∂ log p(y | s̃ⁿ, a)/∂W + log p(y | s̃ⁿ, a) ∂ log p(s̃ⁿ | a)/∂W ]
− To reduce the estimator variance, an entropy term H[s̃ⁿ] and a bias b are added [1, 2]:
    ∂Ls/∂W ≈ (1/N) Σn=1..N [ ∂ log p(y | s̃ⁿ, a)/∂W + λr (log p(y | s̃ⁿ, a) − b) ∂ log p(s̃ⁿ | a)/∂W + λe ∂H[s̃ⁿ]/∂W ]
[1] J. Ba et al. "Multiple object recognition with visual attention"
[2] A. Mnih et al. "Neural variational inference and learning in belief networks"
− Hard attention makes a zero-one decision about where to attend, trained with reinforcement learning
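A hedged PyTorch-style sketch of a single-sample surrogate loss for hard attention; log_p_y_given_s, the baseline, and λe are placeholders, and this simplifies the N-sample estimator above rather than reproducing the paper's exact training code.

```python
import torch
from torch.distributions import Categorical

def hard_attention_loss(alpha, annotations, log_p_y_given_s, baseline=0.0, lambda_e=0.01):
    """Single-step REINFORCE-style surrogate for hard attention (a sketch).
    alpha: (L,) attention probs; annotations: (L, D) annotation vectors;
    log_p_y_given_s: function mapping a context vector ẑ_t to a scalar tensor log p(y_t | s, a)."""
    dist = Categorical(probs=alpha)
    s = dist.sample()                          # s_t ~ Multinoulli_L({α_i})
    z = annotations[s]                         # ẑ_t = a_{s_t}: one region, not a blend
    log_py = log_p_y_given_s(z)                # reward signal: log-likelihood of the word
    reinforce = (log_py.detach() - baseline) * dist.log_prob(s)   # score-function term
    entropy = dist.entropy()                   # entropy bonus to reduce variance / explore
    # Maximize log_py + reinforce + λ_e * entropy, i.e. minimize the negative
    return -(log_py + reinforce + lambda_e * entropy)
```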
Soft Attention
− Instead of making hard decisions, we take the expected context vector:
    Ep(st|a)[ẑt] = Σi=1..L αt,i ai,    i.e.    φ({ai}, {αi}) = Σi αi ai
  This corresponds to feeding in a soft, α-weighted context, as in Bahdanau et al. (2014)
− The whole model is smooth and differentiable under the deterministic attention; learning via standard backprop
− Theoretical argument: let nt = Lo(Eyt−1 + Lhht + Lzẑt) be the softmax input, and nt,k,i its k-th component when ẑt = ai. The normalized weighted geometric mean (NWGM) of the softmax output is
    NWGM[p(yt = k | a)] = Πi exp(nt,k,i)^p(st,i=1|a) / Σj Πi exp(nt,j,i)^p(st,i=1|a)
                        = exp(Ep(st|a)[nt,k]) / Σj exp(Ep(st|a)[nt,j])
  with E[nt] = Lo(Eyt−1 + LhE[ht] + LzE[ẑt]), so the NWGM of a softmax unit is obtained by a single forward prop with the expected vectors E[ht] and E[ẑt]
− Moreover, NWGM[p(yt = k | a)] ≈ E[p(yt = k | a)] [1]
[1] P. Baldi et al. "The dropout learning algorithm"
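A tiny NumPy check (with the example weights from earlier, purely illustrative) that the soft context vector is the expectation of the hard-sampled one.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 8
a = rng.normal(size=(L, D))                   # annotation vectors a_i (toy values)
alpha = np.array([0.76, 0.08, 0.13, 0.03])    # attention weights (example values)

soft_z = alpha @ a                            # soft attention: expected context vector

# Hard attention: sample locations s_t ~ Multinoulli(alpha) and pick a_{s_t}
samples = rng.choice(L, size=100000, p=alpha)
hard_z_mean = a[samples].mean(axis=0)         # Monte Carlo average of sampled contexts

print(np.allclose(soft_z, hard_z_mean, atol=0.05))   # average of hard samples ≈ soft ẑ_t
```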
How Soft/Hard Attention Works
− Hard attention: sample regions of attention; optimize a variational lower bound of the maximum likelihood
− Soft attention: compute the expected attention
Hard Attention
Soft Attention
The Good
And the Bad
Quantitative Results
Model           | M1 (human) | M2 (human) | BLEU  | CIDEr
Human           | 0.638      | 0.675      | 0.471 | 0.910
Google          | 0.273      | 0.317      | 0.587 | 0.946
MSR             | 0.268      | 0.322      | 0.567 | 0.925
Attention-based | 0.262      | 0.272      | 0.523 | 0.878
Captivator      | 0.250      | 0.301      | 0.601 | 0.937
Berkeley LRCN   | 0.246      | 0.268      | 0.534 | 0.891
M1: humans preferred (or rated as equal) the method's caption over the human annotation; M2: Turing test
+2 BLEU
+4 BLEU
Video Description Generation
− The context set consists of per-frame context vectors; the attention mechanism selects one of those vectors for each output symbol being decoded, capturing the global temporal structure across frames
− A 3-D conv-net applies local filters across the spatio-temporal dimensions, working on motion statistics
Why attention?
− Dealing with the vanishing gradient problem
− Attending/focusing on smaller parts of the data
  § patches in images
  § words or phrases in sentences
− Different problems require different sizes of representations
  § an LSTM with longer sentences requires larger vectors
− Focusing only on parts of images
− Scalability independent of the size of images
[Figure: an attention mechanism selecting elements from a recurrent net's memory]
Attention on Memory Elements
− Recurrent networks cannot remember things for very long
− We need a "hippocampus" (a separate memory module)
− Memory networks [Weston et al. 2014] (FAIR), associative memory
Recall: Long-Term Dependencies
− The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].
− Storing bits robustly requires eigenvalues < 1 (Hochreiter 1991)
− Exploding gradients can be mitigated by gradient clipping
Gated Recurrent Units & LSTM
− Create a path where gradients can flow for longer, with a self-loop on the state
− Corresponds to a Jacobian with eigenvalues slightly less than 1
− LSTM is heavily used (Hochreiter & Schmidhuber 1997); GRU (Cho et al. 2014)
[Figure: LSTM cell with input, input gate, forget gate, output gate, and a self-loop on the state]
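A minimal NumPy sketch of one LSTM step, highlighting the additive self-loop on the cell state; biases are omitted and the parameter names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, params):
    """One LSTM step (biases omitted for brevity). The additive self-loop
    c = f * c_prev + i * g is what lets gradients flow over long spans."""
    Wi, Wf, Wo, Wg = params            # each maps [x; h_prev] to the hidden size
    xh = np.concatenate([x, h_prev])
    i = sigmoid(Wi @ xh)               # input gate
    f = sigmoid(Wf @ xh)               # forget gate (weight of the self-loop)
    o = sigmoid(Wo @ xh)               # output gate
    g = np.tanh(Wg @ xh)               # candidate update
    c = f * c_prev + i * g             # gated self-loop on the cell state
    h = o * np.tanh(c)
    return h, c
```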
Delays & Hierarchies to Reach Farther
− Delays and multiple time scales: El Hihi & Bengio NIPS 1995, Koutnik et al. ICML 2014
− Hierarchical RNNs (words / sentences): Sordoni et al. CIKM 2015, Serban et al. AAAI 2016
[Figure: an RNN unfolded in time with additional delayed (skip) connections between states further apart, alongside the usual one-step connections]
Large Memory Networks: Sparse Access Memory for Long-Term Dependencies
− A state stored in an external memory can stay for arbitrarily long durations, until evoked for read or write
[Figure: memory holds a passive copy of the state; attention provides read/write access]
Memory Networks
− A class of models that combine a large memory with a learning component that can read and write to it
− Incorporates reasoning with attention over memory (RAM)
− Most ML has limited memory, which is more or less all that is needed for "low level" tasks, e.g. object detection
Scenario 1
Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to the office. Joe left the milk. Joe went to the bathroom.
Where is the milk now? A: office
Where is Joe? A: bathroom
Where was Joe before the office? A: kitchen
Scenario 2
[Image: Baxter robot]
Scenario 2
Shaolin Soccer directed by Stephen Chow
Shaolin Soccer written by Stephen Chow
Shaolin Soccer starred actors Stephen Chow
Shaolin Soccer release year 2001
Shaolin Soccer has genre comedy
Shaolin Soccer has tags martial arts, kung fu soccer, stephen chow
Kung Fu Hustle directed by Stephen Chow
Kung Fu Hustle written by Stephen Chow
Kung Fu Hustle starred actors Stephen Chow
Kung Fu Hustle has genre comedy action
Kung Fu Hustle has imdb votes famous
Kung Fu Hustle has tags comedy, action, martial arts, kung fu, china, soccer, hong kong, stephen chow
The God of Cookery directed by Stephen Chow
The God of Cookery written by Stephen Chow
The God of Cookery starred actors Stephen Chow
The God of Cookery has tags hong kong Stephen Chow
From Beijing with Love directed by Stephen Chow
From Beijing with Love written by Stephen Chow
From Beijing with Love starred actors Stephen Chow, Anita Yuen
... <and more> ...
1) I'm looking for a fun comedy to watch tonight, any ideas?
Scenario 3
Who wrote Kung Fu Hustle?
I'm interested in watching a Stephen Chow movie other than Kung Fu Hustle. Can you suggest something?
(Both questions are asked against the same knowledge base of movie facts shown above.)
What is required?
Not all problems can be mapped to y = f(x)
[Diagram: X → fW → Y, i.e. Y = fW(X)]
What is a Memory Network?
Original paper description of the class of models. MemNNs have four component networks (which may or may not have shared parameters):
I: (input feature map) convert incoming data to the internal feature representation.
G: (generalization) update memories given new input.
O: (output) produce new output (in feature representation space) given the memories.
R: (response) convert output O into a response seen by the outside world.
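A skeleton of the four components in Python; the embedding function, the append-only storage, and the dot-product scoring are placeholder choices for illustration, not the exact modules of Weston et al.

```python
import numpy as np

class MemoryNetwork:
    """Skeleton of the four MemNN components (I, G, O, R). Bag-of-features
    encoding and dot-product retrieval are simplifications for illustration."""
    def __init__(self, embed):
        self.embed = embed          # callable: text -> feature vector (assumed given)
        self.memory = []            # list of (text, feature) pairs

    def I(self, x):                 # input feature map
        return self.embed(x)

    def G(self, x, feat):           # generalization: here, simply store the new memory
        self.memory.append((x, feat))

    def O(self, q_feat):            # output: retrieve the best-matching memory
        scores = [feat @ q_feat for _, feat in self.memory]
        return self.memory[int(np.argmax(scores))]

    def R(self, q, best):           # response: turn the retrieved memory into an answer
        return best[0]              # here: just return the supporting sentence

    def tell(self, sentence):
        self.G(sentence, self.I(sentence))

    def ask(self, question):
        return self.R(question, self.O(self.I(question)))
```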
Memory Networks: Some Early Publications
§ J. Weston, S. Chopra, A. Bordes. Memory Networks. ICLR 2015 (and arXiv:1410.3916).
§ S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. End-To-End Memory Networks. NIPS 2015 (and arXiv:1503.08895).
§ J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698.
§ A. Bordes, N. Usunier, S. Chopra, J. Weston. Large-scale Simple Question Answering with Memory Networks.
§ J. Dodge, A. Gane, X. Zhang, A. Bordes, S. Chopra, A. Miller, A. Szlam, J. Weston. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. arXiv:1511.06931.
§ F. Hill, A. Bordes, S. Chopra, J. Weston. The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. arXiv:1511.02301.
§ J. Weston. Dialog-based Language Learning. arXiv:1604.06045.
§ A. Bordes, J. Weston. Learning End-to-End Goal-Oriented Dialog. arXiv:1605.07683.
Memory Module
[Figure: a controller module receives the input and holds an internal state vector (initially the query q); it repeatedly addresses and reads from the memory vectors m, then produces the output; supervision is direct or reward-based]
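A simplified NumPy sketch of one addressing-and-read hop in the spirit of end-to-end memory networks; sharing the same vectors for addressing and reading, and updating the state by a plain sum, are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(state, memories):
    """One addressing-and-read step over the memory vectors."""
    p = softmax(memories @ state)    # addressing: match internal state against each memory
    o = p @ memories                 # read: attention-weighted sum of memory vectors
    return state + o                 # updated internal state for the next hop

# Toy usage: 6 memory slots, dimension 16, two hops starting from the query
rng = np.random.default_rng(0)
memories = rng.normal(size=(6, 16))
state = rng.normal(size=16)          # internal state, initially the encoded query q
for _ in range(2):
    state = memory_hop(state, memories)
```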
Memory Network Models
Some implemented models…
Figure credit: Sainbayar Sukhbaatar

Variants of the class…
Some options and extensions:
− Representation of inputs and memories: bag of words, RNN-style reading at word or character level, etc.
− Response module: instead of a single word, use an RNN to output sentences
− If the memory is huge, organize it (e.g. via hashing) so that memory addressing and reading doesn't operate on all memories; see the sketch below
− If the memory is full, it could "forget" somehow, e.g. by replacing the memory deemed most useless; that would require a scoring function of the utility of each memory
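As a toy illustration of organizing memories by word hashing (a sketch of the idea, not the exact scheme from the papers above):

```python
from collections import defaultdict

# Inverted index over words: addressing only touches memories that share a
# word with the query, instead of scoring every memory.
index = defaultdict(set)
memories = ["Joe went to the kitchen", "Fred went to the kitchen",
            "Joe picked up the milk", "Joe travelled to the office"]

for i, m in enumerate(memories):
    for w in m.lower().split():
        index[w].add(i)

def candidates(query):
    ids = set()
    for w in query.lower().split():
        ids |= index.get(w, set())
    return [memories[i] for i in sorted(ids)]

print(candidates("milk"))   # only the memories mentioning "milk" are retrieved
```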