Attention Models
Olof Mogren, Chalmers University of Technology, Feb 2016


  1. Attention Models
     Olof Mogren, Chalmers University of Technology, Feb 2016
     • Focus on parts of the input
     • Improves NN performance on different tasks
     • IBM1 attention mechanism (1980s)
     • "One of the most exciting advancements" - Ilya Sutskever, Dec 2015
     A selection of attention papers on arXiv, 2016:
     • Multi-Way, Multilingual Neural Machine Translation with a Shared ...
     • Incorporating Structural Alignment Biases into an Attentional Neural ...
     • Language to Logical Form with Neural Attention
     • Human Attention Estimation for Natural Images: An Automatic Gaze ...
     • Implicit Distortion and Fertility Models for Attention-based ...
     • Survey on the attention based RNN model and its applications in ...
     • From Softmax to Sparsemax: A Sparse Model of Attention and ...
     • A Convolutional Attention Network for Extreme Summarization ...
     • Learning Efficient Algorithms with Hierarchical Attentive Memory
     • Attentive Pooling Networks
     • Attention-Based Convolutional Neural Network for Machine ...

  2. Modelling Language using RNNs
     • Language models: P(word_i | word_1, ..., word_{i-1})
     • Recurrent Neural Networks
     • Gated additive sequence modelling: LSTM (and variants) - details
     Encoder-Decoder Framework
     • Sequence to Sequence Learning with Neural Networks, Sutskever, Vinyals, Le, NIPS 2014
     • Neural Machine Translation (NMT)
     • Fixed vector representation for sequences
     • Reversed input sentence!
     NMT with Attention (see the sketch below)
     • p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
     • s_i = f(s_{i-1}, y_{i-1}, c_i)
     • c_i = Σ_{j=1..T_x} α_{ij} h_j
     • α_{ij} = exp(e_{ij}) / Σ_{k=1..T_x} exp(e_{ik})
     • e_{ij} = a(s_{i-1}, h_j)
     Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau, Cho, Bengio, ICLR 2015
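
     A minimal NumPy sketch of one attention step (not the authors' code): the alignment model a(s_{i-1}, h_j) is assumed to be the small feed-forward network used in Bahdanau et al., and all parameter names and shapes here are illustrative.

import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(s_prev, H, W_a, U_a, v_a):
    """One decoding step of the attention mechanism.

    s_prev : previous decoder state s_{i-1}, shape (n,)
    H      : encoder annotations h_1 .. h_{T_x}, shape (T_x, m)
    W_a (n, d), U_a (m, d), v_a (d,) : parameters of the alignment model a(., .)
    Returns the context vector c_i and the attention weights alpha_i.
    """
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # e_ij = a(s_{i-1}, h_j)
    alpha = softmax(e)                          # alpha_ij, normalized over j
    c = alpha @ H                               # c_i = sum_j alpha_ij h_j
    return c, alpha

# Toy usage with random dimensions:
rng = np.random.default_rng(0)
T_x, n, m, d = 5, 4, 6, 3
c, alpha = attention_step(rng.normal(size=n), rng.normal(size=(T_x, m)),
                          rng.normal(size=(n, d)), rng.normal(size=(m, d)),
                          rng.normal(size=d))
print(alpha.sum())  # ≈ 1: a distribution over the source positions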

  3. Alignment - (more)
     [Alignment matrices (a) and (b) from Bahdanau et al.: soft alignments between the English sentences
      "The agreement on the European Economic Area was signed in August 1992." and
      "It should be noted that the marine environment is the least known of environments."
      and their French translations.]
     Caption Generation
     • "Translating" from images to natural language
     • Convolutional network: Oxford net, 19 layers, stacks of 3x3 conv-layers, max-pooling
     • Annotation vectors: a = {a_1, ..., a_L}, a_i ∈ R^D
     • Attention over a (see the sketch below)
     Attention Visualization
     [Figure: image regions attended to while generating each word of the caption]
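
     As with translation, the caption decoder attends over the annotation vectors. A minimal sketch, assuming the annotations come from reshaping a conv-layer feature map and that the scoring function is a small MLP; names and shapes are illustrative, not the paper's exact model.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def image_glimpse(feature_map, h_prev, W_h, W_a, v):
    """Soft attention over annotation vectors a_1 .. a_L taken from a conv feature map.

    feature_map : (H_f, W_f, D) conv-layer output -> L = H_f * W_f annotation vectors in R^D
    h_prev      : previous decoder (LSTM) state
    W_h, W_a, v : parameters of the assumed MLP scoring function
    """
    H_f, W_f, D = feature_map.shape
    a = feature_map.reshape(H_f * W_f, D)            # a = {a_1, ..., a_L}
    scores = np.tanh(h_prev @ W_h + a @ W_a) @ v     # one score per image location
    alpha = softmax(scores)
    z = alpha @ a                                    # expected annotation ("glimpse") fed to the decoder
    return z, alpha.reshape(H_f, W_f)                # alpha reshaped to the grid gives the heat-map visualization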

  4. Source Code Summarization
     • Predict function names given the function body
     • Convolutional attention mechanism; 1D patterns (see the sketch below)
     • Out-of-vocabulary terms handled (copy mechanism) - details
     [Figure: attention vectors α and copy vectors κ over the tokenized body
      "<s> { return ( mFlags & eBulletFlag ) == eBulletFlag ; } </s>" for each predicted name subtoken:
      m_1 = "is" (λ = 0.012), m_2 = "bullet" (λ = 0.436), m_3 = "END" (λ = 0.174).]
     A Convolutional Attention Network for Extreme Summarization of Source Code, Allamanis et al., Feb 2016 (arXiv draft)
     Memory Networks (out of scope today)
     • Attention refers back to internal memory; state of encoder
     • Neural Turing Machines
     • (End-To-End) Memory Networks: explicit memory mechanisms
     mogren@chalmers.se   http://mogren.one/   http://www.cse.chalmers.se/research/lab/
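
     A much-simplified NumPy sketch of the convolutional attention idea (not the full Allamanis et al. architecture): embed the body subtokens, apply stacked 1D convolutions to pick up local patterns, and normalize the result into attention weights over body positions. All names, shapes, and the single attention head are illustrative assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conv1d_same(X, K):
    """SAME-padded 1D convolution along the token axis. X: (T, d_in), K: (w, d_in, d_out)."""
    w = K.shape[0]
    pad = w // 2
    Xp = np.pad(X, ((pad, w - 1 - pad), (0, 0)))
    return np.stack([np.einsum('wd,wdo->o', Xp[t:t + w], K) for t in range(X.shape[0])])

def conv_attention(subtoken_ids, E, K1, K2, k_att):
    """Attend over a function body when predicting one name subtoken.

    subtoken_ids : indices of the body subtokens ('{', 'return', 'm', 'Flags', ...)
    E            : subtoken embedding matrix
    K1, K2       : convolution kernels (local patterns -> higher-level abstractions)
    k_att        : projects conv features to one attention score per position
    """
    X = E[subtoken_ids]                          # (T, d) embedded body
    L1 = np.maximum(conv1d_same(X, K1), 0.0)     # 1D patterns in the input
    L2 = np.maximum(conv1d_same(L1, K2), 0.0)    # higher-level abstractions
    alpha = softmax(L2 @ k_att)                  # attention weights over body positions
    context = alpha @ X                          # weighted sum of subtoken embeddings
    return context, alpha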

  5. Appendix
     Teaching Machines to Read and Comprehend, Hermann, Kocisky, Grefenstette, Espeholt, Kay, Suleyman, Blunsom, Dec 2015
     Alignment - (back)
     [Alignment matrices (c) from Bahdanau et al.: "Destruction of the equipment means that Syria can no
      longer produce new chemical weapons." / "La destruction de l'équipement signifie que la Syrie ne peut
      plus produire de nouvelles armes chimiques."; "This will change my future with my family," the man said.
      / « Cela va changer mon avenir avec ma famille », a dit l'homme.]
     DRAW: A Recurrent Neural Network For Image Generation, Gregor, Danihelka, Graves, Rezende, Wierstra, 2015

  6. LSTM - (back)
     [LSTM cell diagram; illustration by Christopher Olah]
     Source Code Summarization - (back)
     • K_l1: patterns in the input
     • K_l2 (and K_α, K_κ): higher-level abstractions
     • α, κ: attention over the input subtokens
     • Simple version: only K_α, for decoding
     • Complete version: uses K_λ for deciding between generation and copying
     A Convolutional Attention Network for Extreme Summarization of Source Code, Allamanis et al., Feb 2016 (arXiv draft)
     IBM Model 1: The first translation attention model! - (back)
     A simple generative model for p(s|t) is derived by introducing a latent variable a into the conditional probability:
         p(s|t) = p(J|I) / (I+1)^J  ·  Σ_a  Π_{j=1..J} p(s_j | t_{a_j})
     where:
     • s and t are the input (source) and output (target) sentences, of length J and I respectively,
     • a is a vector of length J consisting of integer indexes into the target sentence, known as the alignment,
     • p(J|I) is not important for training the model, and we treat it as a constant ε.
     To learn this model we use the EM algorithm to find the MLE values for the parameters p(s_j | t_{a_j}).
     Soft vs Hard Attention (see the sketch below)
     Soft:
     • Weighted average of the whole input
     • Differentiable loss
     • Increased computational cost
     Hard:
     • Sample parts of the input
     • Policy gradient / variational methods / reinforcement learning
     • Decreased computational cost
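
     To make the soft/hard distinction concrete, a minimal NumPy sketch, assuming a vector of attention scores and a matrix H whose rows are the encoder annotations; only the forward pass is shown, since the policy-gradient or variational training of the hard variant is what the bullets above refer to.

import numpy as np

rng = np.random.default_rng(0)

def soft_attention(scores, H):
    """Soft: differentiable weighted average over all annotations (reads the whole input)."""
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H                       # expected annotation

def hard_attention(scores, H):
    """Hard: sample one annotation; non-differentiable, but only a single
    annotation is propagated downstream, which lowers the computational cost."""
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    j = rng.choice(len(H), p=alpha)        # stochastic choice of an input position
    return H[j]

scores = np.array([0.1, 2.0, -1.0])
H = np.eye(3)
print(soft_attention(scores, H))   # a mixture of all rows
print(hard_attention(scores, H))   # exactly one row of H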
