SLIDE 1

Lecture #08 – Attention and Memory

Aykut Erdem // Hacettepe University // Spring 2018

CMP784

DEEP LEARNING

Sherlock Holmes’ mind palace, BBC/Masterpiece's Sherlock

SLIDE 2

Breaking news!

  • Practical 2 is due April 6, 23:59
  • Midterm exam in class next week (April 12)

− Check the midterm guide for details

  • Practical 3 will be out tomorrow!

− Language modeling with RNNs
− Due Sunday, April 22, 23:59

2


SLIDE 3

Previously on CMP784

  • Sequence modeling
  • Recurrent Neural Networks (RNNs)
  • The Vanilla RNN unit
  • How to train RNNs
  • The Long Short-Term Memory (LSTM) unit and its variants
  • Gated Recurrent Unit (GRU)

3

Image credits: Oleg Soroko; Using RNNs to generate Super Mario Maker levels, Adam Geitgey

SLIDE 4

Lecture overview

  • Attention Mechanism for Deep Learning
  • Attention for Image Captioning
  • Memory Networks
  • End-to-end Memory Networks
  • Dynamic Memory Networks

Disclaimer: Much of the material and slides for this lecture were borrowed from
  − Mateusz Malinowski’s lecture on Attention-based Networks
  − Graham Neubig’s CMU CS11-747 Neural Networks for NLP class
  − Chris Dyer’s Oxford Deep NLP class
  − Yoshua Bengio’s talk on From Attention to Memory and towards Longer-Term Dependencies
  − Sumit Chopra’s lecture on Reasoning, Attention and Memory
  − Jason Weston’s tutorial on Memory Networks for Language Understanding
  − Richard Socher’s talk on Dynamic Memory Networks

4
SLIDE 5

Deep Learning for Vision

5

Figure credit: Xiaogang Wang


SLIDE 6

Deep Learning for Speech

6

Figure credit: NVidia

SLIDE 7

Deep Learning for Text

7

(Figure: a deep network mapping the input words x1…x5 through hidden layers to a prediction ŷ)

Example: “The movie was not bad at all. I had fun.” → positive

SLIDE 8

Deep Models

8

Input Representation → Feature Extractor (encoder) → Classifier/Regressor (decoder) → Loss Function

  • Encoder: a Fully Connected, Convolutional, or Recurrent Network; can be seen as a prior on the type of transformation you want
  • Decoder: typically a Linear Projection with some non-linearity (log-soft-max)

Example input: “The movie was not bad at all. I had fun.”

SLIDE 9

Deep Models

9

Input Representation → Feature Extractor (encoder) → Classifier/Regressor (decoder) → Loss Function

  • Encoder: a Fully Connected, Convolutional, or Recurrent Network; can be seen as a prior on the type of transformation you want
  • Decoder: typically a Linear Projection with some non-linearity (log-soft-max)
  • The whole pipeline is a learnable parametric function; inputs are generally considered I.I.D.; outputs are a classification or a regression

Example input: “The movie was not bad at all. I had fun.”

SLIDE 10

Encoder-Decoder Framework

  • Intermediate representation of meaning

= ‘universal representation’

  • Encoder: from word sequence to sentence representation
  • Decoder: from representation to word sequence distribution
10

(Figure: x1 … xT → Encoder → c → Decoder → y1 … yT′)

  • For bitext data: French encoder + English decoder (French sentence → English sentence)
  • For unilingual data: English encoder + English decoder (English sentence → English sentence)
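To make the pipeline concrete, here is a minimal numpy sketch (ours, not the lecture’s code): a simple RNN encoder folds the input word vectors into one sentence vector c, and an RNN decoder is unrolled from c to emit a word distribution at each step. All sizes and parameter names are illustrative and randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T_in, T_out = 8, 16, 10, 5, 4

# Illustrative parameters (random; a real model would learn these).
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))   # encoder input-to-hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))    # encoder hidden-to-hidden
U_hh = rng.normal(scale=0.1, size=(d_h, d_h))    # decoder hidden-to-hidden
W_hy = rng.normal(scale=0.1, size=(d_out, d_h))  # decoder hidden-to-output

def encode(xs):
    """Encoder: fold the word vectors x_1..x_T into one sentence vector c."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h  # c: the fixed-size 'universal representation'

def decode(c, steps):
    """Decoder: unroll from c and emit a distribution over words at each step
    (previous-word feedback is omitted to keep the sketch short)."""
    h, outputs = c, []
    for _ in range(steps):
        h = np.tanh(U_hh @ h)
        logits = W_hy @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()                     # softmax over the vocabulary
        outputs.append(p)
    return outputs

xs = [rng.normal(size=d_in) for _ in range(T_in)]   # input word vectors
ys = decode(encode(xs), T_out)
print(ys[0].shape, round(float(ys[0].sum()), 3))    # (10,) 1.0
```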

SLIDE 11

Sentence Representations

  • But what if we could use multiple vectors, based on the length of the sentence?

11

(Figure: the sentence “this is an example” encoded as a single vector vs. one vector per word)

SLIDE 12

Attention

12
SLIDE 13

Basic Idea

  • Encode each word in the sentence into a vector
  • When decoding, perform a linear combination of these vectors, weighted by “attention weights” (where to look)
  • Use this combination in picking the next item
13
SLIDE 14

Calculating Attention

  • Use the query vector (decoder state) and key vectors (all encoder states)
  • For each query-key pair, calculate a weight
  • Normalize to add to one using softmax

(Figure: key vectors for “kono eiga ga kirai”, query vector for “I hate”; scores a1=2.1, a2=-0.1, a3=0.3, a4=-1.0 become α1=0.76, α2=0.08, α3=0.13, α4=0.03 after the softmax)

14
SLIDE 15

Calculating Attention

  • Combine together value vectors (usually encoder states, like key vectors) by taking the weighted sum
  • Use this in any part of the model you like

15

(Figure: value vectors for “kono eiga ga kirai” weighted by α1=0.76, α2=0.08, α3=0.13, α4=0.03 and summed)
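Putting the last two slides together, a minimal numpy sketch (illustrative, not the lecture’s code) of one attention step: the query is the decoder state, the keys/values are the encoder states, dot-product scores are normalized with a softmax, and the context is the weighted sum of the values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4                                   # hidden size (illustrative)
keys   = rng.normal(size=(4, d))        # encoder states, e.g. for "kono eiga ga kirai"
values = keys                           # values are usually the encoder states too
query  = rng.normal(size=d)             # current decoder state, e.g. after "I hate"

scores  = keys @ query                  # one score per query-key pair
alphas  = softmax(scores)               # normalize to sum to one
context = alphas @ values               # weighted sum of the value vectors

print(np.round(alphas, 2), alphas.sum())   # attention weights, summing to 1.0
print(context.shape)                       # (4,) -- usable anywhere in the model
```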

SLIDE 16

A Graphical Example

16

(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)

SLIDE 17

(Chart: machine translation quality from 2013 to 2016 for Phrase-based SMT, Syntax-based SMT, and Neural MT)

End-to-End Machine Translation with Recurrent Nets and Attention Mechanism

17

(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)

Figure credit: Rico Sennrich


SLIDE 18

Attention Score Functions

  • q is the query and k is the key

  • Multi-layer Perceptron (Bahdanau et al. 2015): a(q, k) = w2ᵀ tanh(W1 [q; k])
    − Flexible, often very good with large data
  • Bilinear (Luong et al. 2015): a(q, k) = qᵀ W k
  • Dot Product (Luong et al. 2015): a(q, k) = qᵀ k
    − No parameters! But requires the sizes to be the same.
  • Scaled Dot Product (Vaswani et al. 2017): a(q, k) = qᵀ k / √|k|
    − Problem: the scale of the dot product increases as dimensions get larger
    − Fix: scale by the size of the vector

18
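The four scoring functions above, written out as a small numpy sketch (parameter shapes are illustrative; in a real model W1, w2 and W would be learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)

# Multi-layer Perceptron (Bahdanau et al. 2015): a(q, k) = w2^T tanh(W1 [q; k])
W1, w2 = rng.normal(size=(d, 2 * d)), rng.normal(size=d)
a_mlp = w2 @ np.tanh(W1 @ np.concatenate([q, k]))

# Bilinear (Luong et al. 2015): a(q, k) = q^T W k
W = rng.normal(size=(d, d))
a_bilinear = q @ W @ k

# Dot product (Luong et al. 2015): a(q, k) = q^T k  (no parameters, equal sizes)
a_dot = q @ k

# Scaled dot product (Vaswani et al. 2017): divide by sqrt(|k|) so the scale
# does not grow with the dimensionality
a_scaled = q @ k / np.sqrt(d)

print(a_mlp, a_bilinear, a_dot, a_scaled)
```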
SLIDE 19

Case Study: Show, Attend and Tell

19

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio. ICML 2015
SLIDE 20

Paying Attention to Selected Parts of the Image While Uttering Words

20
SLIDE 21

21

(Figure: a neural sequence decoder; at each step the hidden state h_t feeds a softmax that predicts the next word, e.g. <s> → Akiko → likes → Pimm’s → </s>)

Sutskever et al. (2014)

SLIDE 22

22

(Figure: the same decoder conditioned on an image for captioning; at each step the hidden state feeds a softmax that predicts the next word, e.g. <s> → a → man → is → rowing)

Vinyals et al. (2014) Show and Tell: A Neural Image Caption Generator

SLIDE 23

Regions in ConvNets

  • Each point in a “higher” level of a convnet defines spatially localized feature vectors (/matrices).
  • Xu et al. call these “annotation vectors”, a_i, i ∈ {1, . . . , L}

23

SLIDE 24

Regions in ConvNets

24


SLIDE 25

Regions in ConvNets

25
SLIDE 26

Regions in ConvNets

26
SLIDE 27

Extension of LSTM via the context vector

  • Extract L D-dimensional annotations

− Lower convolutional layer to have the correspondence between the feature vectors and portions of the 2-D image

27

(i_t, f_t, o_t, g_t) = (σ, σ, σ, tanh) T_{D+m+n, n} [E y_{t−1}; h_{t−1}; ẑ_t]   (1)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t   (2)
h_t = o_t ⊙ tanh(c_t)   (3)

e_{ti} = f_att(a_i, h_{t−1}),   α_{ti} = exp(e_{ti}) / Σ_{k=1..L} exp(e_{tk}),   ẑ_t = φ({a_i}, {α_i})

p(y_t | a, y_1^{t−1}) ∝ exp(L_o(E y_{t−1} + L_h h_t + L_z ẑ_t))

E: embedding matrix; y: captions; h: previous hidden state; ẑ: context vector, a dynamic representation of the relevant part of the image input at time t
φ is the ‘attention’ (‘focus’) function – ‘soft’ / ‘hard’
f_att: an MLP conditioned on the previous hidden state
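A rough numpy sketch of one decoding step of this model (soft-attention variant, ours for illustration): f_att scores each annotation vector against the previous hidden state, the softmax gives α_t, ẑ_t is the weighted sum, and the gates are computed from [E y_{t−1}; h_{t−1}; ẑ_t]. The single-matrix f_att and the random parameters are simplifications of the paper’s MLP.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
L, D, m, n = 6, 5, 3, 4          # L annotations of size D, embedding size m, hidden size n

a      = rng.normal(size=(L, D))          # annotation vectors from the conv layer
h_prev = np.zeros(n)                      # h_{t-1}
c_prev = np.zeros(n)                      # c_{t-1}
Ey     = rng.normal(size=m)               # E y_{t-1}: embedding of the previous word

W_att = rng.normal(size=(D + n,))         # f_att simplified to a linear scorer (the paper uses an MLP)
T_mat = rng.normal(scale=0.1, size=(4 * n, D + m + n))  # maps [E y; h; z] to the four gate pre-activations

# Attention: scores, weights, expected context vector
e     = np.array([W_att @ np.concatenate([a_i, h_prev]) for a_i in a])   # e_{t,i}
alpha = softmax(e)                                                       # alpha_{t,i}
z_hat = alpha @ a                                                        # z_t = sum_i alpha_{t,i} a_i

# LSTM gates from [E y_{t-1}; h_{t-1}; z_t]
pre = T_mat @ np.concatenate([Ey, h_prev, z_hat])
i, f, o, g = np.split(pre, 4)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
i, f, o, g = sig(i), sig(f), sig(o), np.tanh(g)

c = f * c_prev + i * g                   # c_t = f * c_{t-1} + i * g
h = o * np.tanh(c)                       # h_t = o * tanh(c_t)
print(np.round(alpha, 2), h.shape)       # attention over the L image regions, new state
```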

SLIDE 28

Hard attention

28

We have two sequences: ‘i’ runs over locations and ‘t’ runs over words. Stochastic decisions are discrete here, so derivatives are zero.

e_{ti} = f_att(a_i, h_{t−1}),   α_{ti} = exp(e_{ti}) / Σ_{k=1..L} exp(e_{tk}),   ẑ_t = φ({a_i}, {α_i})

The attention location is sampled: p(s_{t,i} = 1 | s_{j<t}, a) = α_{t,i},   ẑ_t = Σ_i s_{t,i} a_i

The loss is a variational lower bound on the marginal log-likelihood, due to Jensen’s inequality (E[log X] ≤ log E[X]):

L_s = Σ_s p(s | a) log p(y | s, a) ≤ log Σ_s p(s | a) p(y | s, a) = log p(y | a)

∂L_s/∂W = Σ_s p(s | a) [ ∂ log p(y | s, a)/∂W + log p(y | s, a) ∂ log p(s | a)/∂W ]

Monte Carlo estimate with samples s̃_t ∼ Multinoulli_L({α_i}):

∂L_s/∂W ≈ (1/N) Σ_{n=1..N} [ ∂ log p(y | s̃ⁿ, a)/∂W + log p(y | s̃ⁿ, a) ∂ log p(s̃ⁿ | a)/∂W ]

To reduce the estimator variance, an entropy term H[s̃] and a baseline b are added [1,2]:

∂L_s/∂W ≈ (1/N) Σ_{n=1..N} [ ∂ log p(y | s̃ⁿ, a)/∂W + λ_r (log p(y | s̃ⁿ, a) − b) ∂ log p(s̃ⁿ | a)/∂W + λ_e ∂H[s̃ⁿ]/∂W ]

[1] J. Ba et al. “Multiple object recognition with visual attention”
[2] A. Mnih et al. “Neural variational inference and learning in belief networks”

SLIDE 29

Hard attention

29


  • Instead of a soft interpolation, make a zero-one decision about where to attend
  • Harder to train, requires methods such as reinforcement learning
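A toy numpy sketch of the Monte Carlo estimator from the previous slide (ours, everything illustrative): the reward r = log p(y | s̃, a) is treated as a black box, and only the score-function (“REINFORCE”) part of the gradient is shown, with a baseline b for variance reduction. For a Multinoulli with softmax probabilities α, ∂ log p(s̃ | a)/∂e = onehot(s̃) − α.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
L = 5
e = rng.normal(size=L)            # attention scores e_{t,i} (pretend they came from f_att)
alpha = softmax(e)                # alpha_{t,i} = p(s_{t,i} = 1 | a)

def log_p_y_given(s):
    """Stand-in for log p(y | s, a): pretend region 2 is the 'right' place to look."""
    return 0.0 if s == 2 else -2.0

N, b, lam_r = 1000, -1.0, 1.0     # samples, baseline, lambda_r
grad_e = np.zeros(L)
for _ in range(N):
    s = rng.choice(L, p=alpha)                      # sample s ~ Multinoulli_L({alpha_i})
    r = log_p_y_given(s)                            # log p(y | s, a)
    dlogp_de = np.eye(L)[s] - alpha                 # d log p(s | a) / d e
    grad_e += lam_r * (r - b) * dlogp_de            # score-function term with baseline
grad_e /= N

print(np.round(alpha, 2))
print(np.round(grad_e, 3))   # on average, pushes the score of region 2 up and the others down
```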

SLIDE 30

Soft attention

30

Instead of making hard decisions, we take the expected context vector:

E_{p(s_t|a)}[ẑ_t] = Σ_{i=1..L} α_{t,i} a_i,   i.e.   φ({a_i}, {α_i}) = Σ_i α_i a_i

This corresponds to the deterministic attention of Bahdanau et al. (2014). The whole model is smooth and differentiable under the deterministic attention; learning uses standard backprop.

Theoretical arguments:
  • Computing E_{p(s_t|a)}[h_t] equals computing h_t using a single forward prop with the expected context vector E_{p(s_t|a)}[ẑ_t]
  • Normalized Weighted Geometric Mean approximation [1]:

NWGM[p(y_t = k | a)] = Π_i exp(n_{t,k,i})^{p(s_{t,i}=1|a)} / Σ_j Π_i exp(n_{t,j,i})^{p(s_{t,i}=1|a)} = exp(E_{p(s_t|a)}[n_{t,k}]) / Σ_j exp(E_{p(s_t|a)}[n_{t,j}])

where E[n_t] = L_o(E y_{t−1} + L_h E[h_t] + L_z E[ẑ_t]), i.e. the NWGM of a softmax unit is obtained with a single forward prop using the expected inputs.

  • Finally, NWGM[p(y_t = k | a)] ≈ E[p(y_t = k | a)]: the expectation of the softmax output can be computed with one forward pass.

[1] P. Baldi et al. “The dropout learning algorithm”
SLIDE 31

How soft/hard attention works

31
SLIDE 32

How soft/hard attention works

32

Hard attention: samples regions of attention and maximizes a variational lower bound of the likelihood. Soft attention: computes the expected attention.

SLIDE 33

33

Hard Attention

SLIDE 34

34

Soft Attention

SLIDE 35

The Good

35
SLIDE 36

And the Bad

36
SLIDE 37

Quantitative results

37

Model            | M1 (human) | M2 (human) | BLEU  | CIDEr
Human            | 0.638      | 0.675      | 0.471 | 0.91
Google           | 0.273      | 0.317      | 0.587 | 0.946
MSR              | 0.268      | 0.322      | 0.567 | 0.925
Attention-based  | 0.262      | 0.272      | 0.523 | 0.878
Captivator       | 0.250      | 0.301      | 0.601 | 0.937
Berkeley LRCN    | 0.246      | 0.268      | 0.534 | 0.891

M1: humans preferred (or rated equal) the method’s caption over the human annotation; M2: Turing test

  • Add soft attention to image captioning: +2 BLEU
  • Add hard attention to image captioning: +4 BLEU

SLIDE 38

Video Description Generation

38
  • Two encoders

− Context set consists of per-frame context vectors, with an attention mechanism that selects one of those vectors for each output symbol being decoded, capturing the global temporal structure across frames
− 3-D conv-net that applies local filters across spatio-temporal dimensions, working on motion statistics

  • Both encoders are complementary

TABLE IV: Performance of the video description generation models on Youtube2Text and Montreal DVS (METEOR: higher is better; Perplexity: lower is better).

Model           | Youtube2Text METEOR | Youtube2Text Perplexity | Montreal DVS METEOR | Montreal DVS Perplexity
Enc-Dec         | 0.2868              | 33.09                   | 0.044               | 88.28
+ 3-D CNN       | 0.2832              | 33.42                   | 0.051               | 84.41
+ Per-frame CNN | 0.2900              | 27.89                   | 0.040               | 66.63
+ Both          | 0.2960              | 27.55                   | 0.057               | 65.44

L. Yao et al., “Describing videos by exploiting temporal structure”

SLIDE 39

Memory Networks

39
SLIDE 40

Why attention?

  • Long term memories - attending to memories
    − Dealing with the gradient vanishing problem
  • Exceeding limitations of a global representation
    − Attending/focusing to smaller parts of data
      § patches in images
      § words or phrases in sentences
  • Decoupling representation from a problem
    − Different problems require different sizes of representations
      § LSTM with longer sentences requires larger vectors
  • Overcoming computational limits for visual data
    − Focusing only on parts of images
    − Scalability independent of the size of images
  • Adds some interpretability to the models (error inspection)

40
SLIDE 41

(Figure labels: recurrent net, memory, attention mechanism)

Attention on Memory Elements

  • Recurrent networks cannot remember things for very long
  • The cortex only remembers things for 20 seconds
  • We need a “hippocampus” (a separate memory module)
  • LSTM [Hochreiter 1997], registers
  • Memory networks [Weston et al. 2014] (FAIR), associative memory
  • NTM [Graves et al. 2014], “tape”.
SLIDE 42

Recall: Long-Term Dependencies

  • The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].

  • Problems:
    − sing. values of Jacobians > 1 → gradients explode
    − or sing. values < 1 → gradients shrink & vanish
    − or random → variance grows exponentially

42

Storing bits robustly requires sing. values < 1 (Hochreiter 1991)

Gradient clipping
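A tiny numpy illustration of the singular-value point above (ours, not from the lecture): multiplying many Jacobians whose singular values are below or above 1 makes the back-propagated gradient shrink towards zero or blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 10
grad = rng.normal(size=d)                 # gradient arriving from the last time step

for scale, label in [(0.9, "sing. values < 1"), (1.1, "sing. values > 1")]:
    g = grad.copy()
    for _ in range(T):
        # Jacobian of one step: an orthogonal matrix rescaled so all singular values equal `scale`
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        g = (scale * Q) @ g
    print(f"{label}: |grad| after {T} steps = {np.linalg.norm(g):.2e}")
# The norm shrinks by ~0.9^50 in the first case and grows by ~1.1^50 in the second.
```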

SLIDE 43

(Figure: LSTM cell with input, input gate, forget gate, output gate, output, and a state self-loop)

Gated Recurrent Units & LSTM

  • Create a path where gradients can flow for longer with a self-loop
  • Corresponds to an eigenvalue of the Jacobian slightly less than 1
  • LSTM is heavily used (Hochreiter & Schmidhuber 1997)
  • GRU: light-weight version (Cho et al 2014)

43
SLIDE 44

(Figure: an RNN unfolded in time, with additional delayed (skip) connections reaching across several time steps)

Delays & Hierarchies to Reach Farther

  • Delays and multiple time scales, Elhihi & Bengio NIPS 1995, Koutnik et al ICML 2014

44

Hierarchical RNNs (words / sentences): Sordoni et al CIKM 2015, Serban et al AAAI 2016

SLIDE 45

Large Memory Networks: Sparse Access Memory for Long-Term Dependencies

  • A mental state stored in an external memory can stay for arbitrarily long durations, until evoked for read or write

  • Forgetting = vanishing gradient.
  • Memory = larger state, avoiding the need for forgetting/vanishing
45

(Figure labels: passive copy, access)

SLIDE 46

Memory Networks

  • Class of models that combine a large memory with a learning component that can read and write to it.
  • Incorporates reasoning with attention over memory (RAM).
  • Most ML has limited memory, which is more-or-less all that’s needed for “low level” tasks, e.g. object detection.

46
SLIDE 47

Scenario 1

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to the office. Joe left the milk. Joe went to the bathroom.

47
SLIDE 48

Scenario 1

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to the office. Joe left the milk. Joe went to the bathroom. Where is the milk now? Where is Joe? Where was Joe before the office?

48
SLIDE 49

Scenario 1

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to the office. Joe left the milk. Joe went to the bathroom. Where is the milk now? A: office Where is Joe? Where was Joe before the office?

49
SLIDE 50

Scenario 1

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to the office. Joe left the milk. Joe went to the bathroom. Where is the milk now? A: office Where is Joe? A: bathroom Where was Joe before the office?

50
SLIDE 51

Scenario 1

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to the office. Joe left the milk. Joe went to the bathroom. Where is the milk now? A: office Where is Joe? A: bathroom Where was Joe before the office? A: kitchen

51
SLIDE 52

Scenario 2

52

Scenario 2

SLIDE 53

53

Scenario 2

Scenario 2

SLIDE 54

54

Scenario 2

Baxter

Scenario 2

SLIDE 55

55

Shaolin Soccer directed by Stephen Chow Shaolin Soccer written by Stephen Chow Shaolin Soccer starred actors Stephen Chow Shaolin Soccer release year 2001 Shaolin Soccer has genre comedy Shaolin Soccer has tags martial arts, kung fu soccer, stephen chow Kung Fu Hustle directed by Stephen Chow Kung Fu Hustle written by Stephen Chow Kung Fu Hustle starred actors Stephen Chow Kung Fu Hustle has genre comedy action Kung Fu Hustle has imdb votes famous Kung Fu Hustle has tags comedy, action, martial arts, kung fu, china, soccer, hong kong, stephen chow The God of Cookery directed by Stephen Chow The God of Cookery written by Stephen Chow The God of Cookery starred actors Stephen Chow The God of Cookery has tags hong kong Stephen Chow From Beijing with Love directed by Stephen Chow From Beijing with Love written by Stephen Chow From Beijing with Love starred actors Stephen Chow, Anita Yuen ...<and more> ... 1) I’m looking a fun comedy to watch tonight, any ideas?

Scenario 3

SLIDE 56

Scenario 3

Who wrote Kung Fu Hustle?

56

Shaolin Soccer directed by Stephen Chow Shaolin Soccer written by Stephen Chow Shaolin Soccer starred actors Stephen Chow Shaolin Soccer release year 2001 Shaolin Soccer has genre comedy Shaolin Soccer has tags martial arts, kung fu soccer, stephen chow Kung Fu Hustle directed by Stephen Chow Kung Fu Hustle written by Stephen Chow Kung Fu Hustle starred actors Stephen Chow Kung Fu Hustle has genre comedy action Kung Fu Hustle has imdb votes famous Kung Fu Hustle has tags comedy, action, martial arts, kung fu, china, soccer, hong kong, stephen chow The God of Cookery directed by Stephen Chow The God of Cookery written by Stephen Chow The God of Cookery starred actors Stephen Chow The God of Cookery has tags hong kong Stephen Chow From Beijing with Love directed by Stephen Chow From Beijing with Love written by Stephen Chow From Beijing with Love starred actors Stephen Chow, Anita Yuen ...<and more> ... 1) I’m looking a fun comedy to watch tonight, any ideas?

SLIDE 57

Scenario 3

I’m interested in watching a Stephen Chow movie other than Kung Fu Hustle. Can you suggest something?

57

Shaolin Soccer directed by Stephen Chow Shaolin Soccer written by Stephen Chow Shaolin Soccer starred actors Stephen Chow Shaolin Soccer release year 2001 Shaolin Soccer has genre comedy Shaolin Soccer has tags martial arts, kung fu soccer, stephen chow Kung Fu Hustle directed by Stephen Chow Kung Fu Hustle written by Stephen Chow Kung Fu Hustle starred actors Stephen Chow Kung Fu Hustle has genre comedy action Kung Fu Hustle has imdb votes famous Kung Fu Hustle has tags comedy, action, martial arts, kung fu, china, soccer, hong kong, stephen chow The God of Cookery directed by Stephen Chow The God of Cookery written by Stephen Chow The God of Cookery starred actors Stephen Chow The God of Cookery has tags hong kong Stephen Chow From Beijing with Love directed by Stephen Chow From Beijing with Love written by Stephen Chow From Beijing with Love starred actors Stephen Chow, Anita Yuen ...<and more> ... 1) I’m looking a fun comedy to watch tonight, any ideas?

SLIDE 58

What is required?

  • Not all problems can be mapped to y = f(x)
  • The model needs to remember external context
  • Given an input, the model needs to know where to look in the context
  • It needs to know what to look for in the context
  • It needs to know how to reason using this external context
  • It needs to handle the potentially changing external context
59

(Figure: X → f_W → Y = f_W(X))

SLIDE 59

What is a Memory Network?

Original paper description of the class of models. MemNNs have four component networks (which may or may not have shared parameters):

  • I: (input feature map) converts incoming data to the internal feature representation.
  • G: (generalization) updates memories given new input.
  • O: (output feature map) produces a new output (in feature representation space) given the memories.
  • R: (response) converts the output O into a response seen by the outside world.

60
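A minimal sketch of the four components (our own toy, not the paper’s implementation), with bag-of-words features, a G that simply appends to memory, an O that scores memories against the question by inner product, and an R that returns a single word from the best-matching memory. The class name ToyMemNN and every heuristic in it are hypothetical.

```python
import numpy as np

class ToyMemNN:
    """Toy MemNN with the I / G / O / R decomposition (illustrative only)."""

    def __init__(self, vocab):
        self.vocab = {w: i for i, w in enumerate(vocab)}
        self.memories = []          # list of (raw sentence, feature vector)

    def I(self, text):
        """Input feature map: bag-of-words vector."""
        v = np.zeros(len(self.vocab))
        for w in text.lower().strip(".?").split():
            if w in self.vocab:
                v[self.vocab[w]] += 1.0
        return v

    def G(self, text):
        """Generalization: store the new memory (here: just append)."""
        self.memories.append((text, self.I(text)))

    def O(self, q_feat):
        """Output: retrieve the best-supporting memory for the question
        (ties broken in favor of the most recent memory)."""
        scores = [q_feat @ m_feat for _, m_feat in self.memories]
        best = max(range(len(scores)), key=lambda i: (scores[i], i))
        return self.memories[best][0]

    def R(self, supporting, question):
        """Response: reply with the last word of the supporting memory
        that does not already appear in the question (crude heuristic)."""
        q_words = set(question.lower().strip("?").split())
        candidates = [w for w in supporting.strip(".").split() if w.lower() not in q_words]
        return candidates[-1] if candidates else ""

vocab = "joe fred went to the kitchen office bathroom milk picked up left travelled where is".split()
net = ToyMemNN(vocab)
for s in ["Joe went to the kitchen.", "Joe travelled to the office.", "Joe went to the bathroom."]:
    net.G(s)
q = "Where is Joe?"
print(net.R(net.O(net.I(q)), q))   # -> "bathroom"
```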
SLIDE 60

Memory Networks- Some early publications

§ J. Weston, S. Chopra, A. Bordes. Memory Networks. ICLR 2015 (and arXiv:1410.3916).
§ S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. End-To-End Memory Networks. NIPS 2015 (and arXiv:1503.08895).
§ J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698.
§ A. Bordes, N. Usunier, S. Chopra, J. Weston. Large-scale Simple Question Answering with Memory Networks. arXiv:1506.02075.
§ J. Dodge, A. Gane, X. Zhang, A. Bordes, S. Chopra, A. Miller, A. Szlam, J. Weston. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. arXiv:1511.06931.
§ F. Hill, A. Bordes, S. Chopra, J. Weston. The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. arXiv:1511.02301.
§ J. Weston. Dialog-based Language Learning. arXiv:1604.06045.
§ A. Bordes, J. Weston. Learning End-to-End Goal-Oriented Dialog. arXiv:1605.07683.

61
SLIDE 61

Memory Module

(Figure: a controller module holds an internal state vector, initially the query q; it repeatedly addresses and reads memory vectors m from the memory module, and finally produces the output; supervision is direct or reward-based. Figure: Sainbayar Sukhbaatar)

Memory Network Models: implemented models

62
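The addressing-and-read loop in the figure, as a small numpy sketch in the style of end-to-end memory networks (covered next): the controller state u scores every memory vector, a softmax gives the addressing weights, the read is the weighted sum, and the state is updated before the next hop. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_mem, d, hops = 10, 6, 3

memory = rng.normal(size=(n_mem, d))   # memory vectors m_1..m_N
u = rng.normal(size=d)                 # internal state vector (initially: the query q)

for hop in range(hops):
    p = softmax(memory @ u)            # addressing: soft weights over all memories
    o = p @ memory                      # read: weighted sum of memory vectors
    u = u + o                           # controller update before the next lookup
    print(f"hop {hop}: most attended memory = {int(p.argmax())}")

# u now feeds the output module (e.g. a softmax over answer words)
```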
SLIDE 62

Variants of the class…

Some options and extensions:

  • Representation of inputs and memories could use all kinds of encodings: bag of words, RNN-style reading at word or character level, etc.
  • Different possibilities for the output module: e.g. a multi-class classifier, or an RNN that outputs sentences.
  • If the memory is huge (e.g. Wikipedia), we need to organize the memories. Solution: hash the memories into buckets (topics); then memory addressing and reading doesn’t operate on all memories.
  • If the memory is full, there could be a way of removing the memory it thinks is most useless, i.e. it “forgets” somehow. That would require a scoring function of the utility of each memory.

63