SLIDE 1

Attention Models

Olof Mogren

Chalmers University of Technology

Feb 2016

Attention Models

  • Focus on parts of input
  • Improves NN performance on different tasks
  • IBM1 attention mechanism (1980s)

Attention Models

  • “One of the most exciting advancements”
  • Ilya Sutskever, Dec 2015

Arxiv 2016

  • Multi-Way, Multilingual Neural Machine Translation with a Shared ...
  • Incorporating Structural Alignment Biases into an Attentional Neural ...
  • Language to Logical Form with Neural Attention
  • Human Attention Estimation for Natural Images: An Automatic Gaze ...
  • Implicit Distortion and Fertility Models for Attention-based ...
  • Survey on the attention based RNN model and its applications in ...
  • From Softmax to Sparsemax: A Sparse Model of Attention and ...
  • A Convolutional Attention Network for Extreme Summarization ...
  • Learning Efficient Algorithms with Hierarchical Attentive Memory
  • Attentive Pooling Networks
  • Attention-Based Convolutional Neural Network for Machine ...
SLIDE 2

Modelling Language using RNNs

[Figure: unrolled RNN reading inputs x_1, x_2, x_3 and producing outputs y_1, y_2, y_3]

  • Language models: P(word_i | word_1, ..., word_{i−1}) (a toy sketch follows below)
  • Recurrent Neural Networks
  • Gated additive sequence modelling: LSTM (and variants) (details)
  • Fixed vector representation for sequences
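
To make the bullets above concrete, here is a minimal sketch, not taken from the talk, of one step of a vanilla RNN language model; the dimensions and weight matrices are made-up toy values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

# Toy sizes (assumptions): vocabulary V, embedding/hidden size d
V, d = 10, 8
rng = np.random.default_rng(0)
E   = rng.normal(scale=0.1, size=(V, d))   # word embeddings
W_h = rng.normal(scale=0.1, size=(d, d))   # recurrent weights
W_x = rng.normal(scale=0.1, size=(d, d))   # input weights
W_o = rng.normal(scale=0.1, size=(V, d))   # output projection

def rnn_lm_step(h, word_id):
    """One step of a plain RNN language model: update the hidden
    state and return P(word_i | word_1, ..., word_{i-1})."""
    h = np.tanh(W_h @ h + W_x @ E[word_id])
    return h, softmax(W_o @ h)

h = np.zeros(d)
for w in [1, 4, 2]:                 # toy word-id sequence
    h, p_next = rnn_lm_step(h, w)   # p_next sums to 1 over the vocabulary
```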

Encoder-Decoder Framework

[Figure: encoder reading the (reversed) input x_3, x_2, x_1, followed by the decoder emitting the output sequence y]

  • Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le, NIPS 2014

  • Neural Machine Translation (NMT)
  • Reversed input sentence!
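
In equations (a sketch using the same notation as the attention slides that follow), the framework factorizes as below; c is the fixed-vector bottleneck that attention later replaces:

```latex
% Encoder: read x_1..x_{T_x} and compress everything into one vector c
h_t = f(h_{t-1}, x_t), \qquad c = q(h_1, \ldots, h_{T_x}) \quad (\text{e.g. } c = h_{T_x})

% Decoder: every output word is conditioned on the same fixed c
p(y_1, \ldots, y_{T_y} \mid x) = \prod_{t=1}^{T_y} p(y_t \mid y_1, \ldots, y_{t-1}, c)
```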

NMT with Attention

p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)

s_i = f(s_{i-1}, y_{i-1}, c_i)

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}

e_{ij} = a(s_{i-1}, h_j)

[Figure: bidirectional encoder states h_1, h_2, h_3, ..., h_T over the input x_1, x_2, x_3, ..., x_T; attention weights α_{t,1}, α_{t,2}, α_{t,3}, ..., α_{t,T} combine them into the context vector used by decoder states s_{t-1}, s_t when emitting y_{t-1}, y_t]

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, Bengio, ICLR 2015
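
A minimal numpy sketch of the context-vector computation defined above (not the authors' code); the alignment model a is assumed to be the small feed-forward scorer of the paper, and all sizes are toy values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy sizes (assumptions): T_x source positions, hidden size n, alignment size m
T_x, n, m = 5, 4, 3
rng = np.random.default_rng(0)
H      = rng.normal(size=(T_x, n))   # encoder states h_1 .. h_{T_x}
s_prev = rng.normal(size=n)          # previous decoder state s_{i-1}
W_a = rng.normal(size=(m, n))        # alignment model parameters (assumed MLP form)
U_a = rng.normal(size=(m, n))
v_a = rng.normal(size=m)

# e_ij = a(s_{i-1}, h_j): one relevance score per source position
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ H[j]) for j in range(T_x)])

# alpha_ij = softmax_j(e_ij);  c_i = sum_j alpha_ij * h_j
alpha = softmax(e)
c_i = alpha @ H                      # context vector fed to the decoder step
```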

NMT with Attention

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}

e_{ij} = a(s_{i-1}, h_j)

[Figure: the alignment model a(s_{i-1}, h_j) scores each encoder state h_j against the previous decoder state s_{i-1}, yielding the weights α_{t,1}, α_{t,2}, ..., α_{t,T} in the same encoder-decoder diagram as before]

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, Bengio, ICLR 2015

SLIDE 3

Alignment - (more)

[Figure: soft alignment matrices from Bahdanau et al.
(a) "The agreement on the European Economic Area was signed in August 1992 . <end>" ↔ "L' accord sur la zone économique européenne a été signé en août 1992 . <end>"
(b) "It should be noted that the marine environment is the least known of environments . <end>" ↔ "Il convient de noter que l' environnement marin est le moins connu de l' environnement . <end>"]

Caption Generation

  • “Translating” from images to natural language

Caption Generation

  • Convolutional network: Oxford net, 19 layers, stacks of 3x3 conv-layers, max-pooling.
  • Annotation vectors: a = {a_1, ..., a_L}, a_i ∈ R^D
  • Attention over a (a minimal sketch follows below).
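
A minimal sketch of the attention step over the annotation vectors; the dot-product scorer and the shared dimension are simplifications for brevity (the model in the paper uses a learned scorer conditioned on the decoder state):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed conv-net output: D feature maps on a 14x14 grid
D, G = 512, 14
rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(D, G, G))

# Annotation vectors a_1..a_L, a_i in R^D, one per spatial location (L = 196)
a = feature_maps.reshape(D, G * G).T       # shape (L, D)

h_dec  = rng.normal(size=D)                # decoder state (same size only for brevity)
scores = a @ h_dec                         # one score per image location (dot-product scorer: an assumption)
alpha  = softmax(scores)                   # where in the image to look
z      = alpha @ a                         # context vector used when emitting the next caption word
```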

Attention Visualization

SLIDE 4

Source Code Summarization

  • Predict function names given function body
  • Convolutional attention mechanism; 1D patterns (sketched below, after the citation)
  • Out-of-vocabulary terms handled (copy mechanism) (details)

A Convolutional Attention Network for Extreme Summarization of Source Code, Allamanis et al., Feb 2016 (arxiv draft)
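
A toy sketch of the 1D convolutional attention idea (assumed shapes, a single kernel, and no copy mechanism; the actual model stacks kernels K_l1, K_l2, ... and adds κ/λ for copying):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed setup: a function body as T subtoken embeddings of size d, window k
T, d, k = 12, 16, 3
rng = np.random.default_rng(0)
tokens = rng.normal(size=(T, d))         # embedded body subtokens
kernel = rng.normal(size=(k, d))         # one 1D attention kernel

# Slide the kernel over the body: one score per window position
scores = np.array([np.sum(kernel * tokens[t:t + k]) for t in range(T - k + 1)])
alpha  = softmax(scores)                 # attention over body positions (cf. alpha in the next figure)

# Weighted summary of the body, used when predicting the next name subtoken
context = alpha @ tokens[:T - k + 1]
```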

Source Code Summarization

[Figure: target attention vectors while naming the body "{ return (mFlags & eBulletFlag) == eBulletFlag ; }". For each emitted name subtoken m1 = "is" (λ = 0.012), m2 = "bullet" (λ = 0.436), m3 = END (λ = 0.174), the attention vector α and the copy vector κ highlight different body subtokens.]

A Convolutional Attention Network for Extreme Summarization of Source Code, Allamanis et al., Feb 2016 (arxiv draft)

Memory Networks

  • Attention refers back to internal memory; state of encoder
  • Neural Turing Machines
  • (End-To-End) Memory Networks:

explicit memory mechanisms (out of scope today)

mogren@chalmers.se
http://mogren.one/
http://www.cse.chalmers.se/research/lab/

SLIDE 5

Appendix

Teaching Machines to Read and Comprehend, Dec 2015. Hermann, Kocisky, Grefenstette, Espeholt, Kay, Suleyman, Blunsom

DRAW

DRAW: A Recurrent Neural Network For Image Generation, 2015. Gregor, Danihelka, Graves, Rezende, Wierstra

Alignment - (back)

[Figure: soft alignment matrices, continued.
(c) "Destruction of the equipment means that Syria can no longer produce new chemical weapons . <end>" ↔ "La destruction de l' équipement signifie que la Syrie ne peut plus produire de nouvelles armes chimiques . <end>"
" This will change my future " ... ↔ "Cela va changer mon avenir avec ma famille " , a dit l' homme . <end>]

SLIDE 6

LSTM

Figure credit: Christopher Olah. (back)
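
For reference, the standard LSTM cell that the linked figures depict (the usual formulation; σ is the logistic sigmoid and ⊙ the elementwise product):

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)             % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)             % input gate
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)      % candidate cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t    % gated additive state update
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)             % output gate
h_t = o_t \odot \tanh(C_t)
```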

Source Code Summarization

  • K_l1: patterns in input
  • K_l2 (and K_α, K_κ): higher-level abstractions
  • α, κ: attention over input subtokens
  • Simple version: only K_α, for decoding
  • Complete version: uses K_λ for deciding on generation or copying (back)

A Convolutional Attention Network for Extreme Summarization of Source Code Allamanis et al. Feb 2016 (arxiv draft)

IBM Model 1: The first translation attention model!

A simple generative model for p(s|t) is derived by introducing a latent variable a into the conditional probability:

p(s \mid t) = \sum_{a} \frac{p(J \mid I)}{(I + 1)^J} \prod_{j=1}^{J} p(s_j \mid t_{a_j}),

where:

  • s and t are the input (source) and output (target) sentences of length J and I respectively,
  • a is a vector of length J consisting of integer indexes into the target sentence, known as the alignment,
  • p(J|I) is not important for training the model and we'll treat it as a constant ε.

To learn this model we use the EM algorithm to find the MLE values for the parameters p(s_j | t_{a_j}). (back)
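
A toy Python sketch of this EM procedure on a made-up two-sentence corpus; p(J|I) drops out of the E-step exactly because it is treated as the constant ε:

```python
from collections import defaultdict

# Toy parallel corpus (assumption): (source sentence s, target sentence t) pairs.
# A NULL token is prepended to t so source words may align to "nothing".
corpus = [("la maison".split(), "the house".split()),
          ("la fleur".split(),  "the flower".split())]
corpus = [(s, ["NULL"] + t) for s, t in corpus]

# Initialise the translation table p(s_j | t_i) uniformly
src_vocab = {w for s, _ in corpus for w in s}
p = defaultdict(lambda: 1.0 / len(src_vocab))

for _ in range(10):                               # EM iterations
    count = defaultdict(float)                    # expected counts c(s, t)
    total = defaultdict(float)
    for s, t in corpus:
        for sw in s:                              # E-step: posterior over alignments of sw
            norm = sum(p[(sw, tw)] for tw in t)
            for tw in t:
                delta = p[(sw, tw)] / norm
                count[(sw, tw)] += delta
                total[tw] += delta
    for (sw, tw), c in count.items():             # M-step: re-estimate p(s|t)
        p[(sw, tw)] = c / total[tw]

print(p[("maison", "house")])                     # grows toward 1 as EM converges
```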

Soft vs Hard Attention

Soft

  • Weighted average of whole input
  • Differentiable loss
  • Increased computational cost

Hard

  • Sample parts of input
  • Policy gradient
  • Variational methods
  • Reinforcement Learning
  • Decreased computational cost
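
A minimal sketch contrasting the two regimes above, with toy numbers; in practice the hard branch is trained with REINFORCE or variational methods, as listed:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))          # six annotation vectors (hypothetical input parts)
scores = rng.normal(size=6)          # scores from some alignment model (toy values)
alpha = softmax(scores)

# Soft attention: differentiable weighted average over the whole input
c_soft = alpha @ H

# Hard attention: sample one part; gradients require policy-gradient/variational training
idx = rng.choice(len(alpha), p=alpha)
c_hard = H[idx]
```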