SLIDE 1

Task (1) Factoid QA with Single Supporting Fact (“where is actor”)

(Very Simple) Toy reading comprehension task:

John was in the bedroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

SUPPORTING FACT

SLIDE 2

Memory Networks (Fully Supervised)


Context: John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

SLIDE 3

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

Memory Networks (Fully Supervised)


Context Question, Answer Pair

SLIDE 4

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

Memory Networks (Fully Supervised)

Step 1

  • Store the representations of facts in the memory
  • Free to choose what representations you store
  • Individual words - window of words - full sentences
  • Bag-of-words - CNN - RNN - LSTM

Supporting Fact

SLIDE 5

Memory Networks (Fully Supervised)

Step 1

  • Store the representations of facts in the memory
  • Free to choose what representations you store
  • Individual words - window of words - full sentences
  • Bag-of-words - CNN - RNN - LSTM

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen
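To make Step 1 concrete, here is a minimal Python sketch of a bag-of-words choice for f; the vocabulary, embedding size and random initialization are illustrative assumptions, not the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)
words = ("john bob was in the bathroom office went to kitchen "
         "travelled back home where is").split()
vocab = {w: i for i, w in enumerate(words)}
d = 32                                           # embedding size (hypothetical)
E = rng.normal(scale=0.1, size=(len(vocab), d))  # learned word embeddings

def f(sentence):
    """Bag-of-words memory representation: sum the word embeddings."""
    tokens = [w.strip(".?!,").lower() for w in sentence.split()]
    return E[[vocab[t] for t in tokens if t in vocab]].sum(axis=0)

memories = [f(s) for s in [
    "John was in the bathroom.", "Bob was in the office.",
    "John went to the kitchen.", "Bob travelled back home.",
]]
```

Swapping f for a CNN, RNN or LSTM encoder changes only this function; the rest of the pipeline stays the same.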

SLIDE 6

Memory Networks (Fully Supervised)

Step 2

  • Represent the question using a similar function.

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

x = f(Where is John?)

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

SLIDE 7

Memory Networks (Fully Supervised)

Step 3

  • Define a scoring function S and score the memories with the question

  • Scoring function should be such that it gives a high score to the relevant memories:

S(Where is John?, John went to the kitchen.) > S(Where is John?, Bob travelled back home.)

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

x = f(Where is John?)

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

SLIDE 8

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

Memory Networks (Fully Supervised)

Step 3

  • Define a scoring function S and score the memories with the question

  • Scoring function should be such that it gives a high score to the relevant memories:

S(Where is John?, John went to the kitchen.) > S(Where is John?, Bob travelled back home.)

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

x = f(Where is John?)

Example Choices

S(q, d) = q^T U^T U d
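Continuing the hypothetical sketch from Step 1, one concrete choice of scoring function is this bilinear embedding model; the projection size is an assumption.

```python
U = rng.normal(scale=0.1, size=(16, d))  # projects BoW vectors to a shared space

def S(q_vec, m_vec):
    """Bilinear score q^T U^T U m: a dot product in the embedded space."""
    return float((U @ q_vec) @ (U @ m_vec))

x = f("Where is John?")
scores = [S(x, m) for m in memories]  # after training, the relevant fact should win
```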

SLIDE 9

Memory Networks (Fully Supervised)

Step 4

  • Define another parametric function which maps the current question and relevant memories to the final response
  • In the first experiments, this was another scoring function which scored all possible responses against the given input and memories

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

x = f(Where is John?)

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

SLIDE 10

Memory Networks (Fully Supervised)

Inference

  • Given the question, pick the memory which scores the highest
  • Use the selected memory and the question to generate the answer

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

x = f(Where is John?)

John was in the bathroom. Bob was in the office. John went to the kitchen. Bob travelled back home. Where is John? A: kitchen

SLIDE 11

Memory Networks (Fully Supervised)

Training

  • It involves training the memory representations and the scoring functions to generate the answer
  • We do so by minimizing the following loss

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

x = f(Where is John?)

L = Σ_{f̄ ≠ m_o1} max(0, γ − S_O(x, m_o1) + S_O(x, f̄))
  + Σ_{r̄ ≠ r} max(0, γ − S_R([x, m_o1], r) + S_R([x, m_o1], r̄))
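A hedged numpy rendering of this margin loss, reusing f, S, x and memories from the earlier sketches. One shared scorer stands in for both S_O and S_R, and the pair [x, m_o1] is summed rather than concatenated, purely to keep the sketch short.

```python
gamma = 0.1
mo1 = memories[2]                        # true supporting fact: "John went to the kitchen."
r_true, candidates = "kitchen", ["kitchen", "office", "home"]  # hypothetical responses

loss = sum(max(0.0, gamma - S(x, mo1) + S(x, m))                        # memory term
           for i, m in enumerate(memories) if i != 2)
loss += sum(max(0.0, gamma - S(x + mo1, f(r_true)) + S(x + mo1, f(r)))  # response term
            for r in candidates if r != r_true)
```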

SLIDE 12

Memory Networks (Fully Supervised)

Training

  • It involves training the memory representations and the scoring functions to generate the answer
  • We do so by minimizing the following loss

m_i = f(John was in the bathroom.)
m_i+1 = f(Bob was in the office.)
m_i+2 = f(John went to the kitchen.)
m_i+3 = f(Bob travelled back home.)

Memories

x = f(Where is John?)

L = Σ_{f̄ ≠ m_o1} max(0, γ − S_O(x, m_o1) + S_O(x, f̄))
  + Σ_{r̄ ≠ r} max(0, γ − S_R([x, m_o1], r) + S_R([x, m_o1], r̄))

We had access to the true supporting fact during training; that's what we mean by "Fully Supervised".

S_O: scoring function for memories. S_R: scoring function for responses. This was the case when we have a single supporting fact!

SLIDE 13

(2) Factoid QA with Two Supporting Facts (“where is actor+object”)


John is in the playground. Bob is in the office. John picked up the football. Bob went to the kitchen. Where is the football? A:playground Where was Bob before the kitchen? A:office

A harder (toy) task is to answer questions where two supporting statements have to be chained to answer the question:

SLIDE 14

A harder (toy) task is to answer questions where two supporting statements have to be chained to answer the question:

John is in the playground. Bob is in the office. John picked up the football. Bob went to the kitchen. Where is the football? A:playground Where was Bob before the kitchen? A:office

To answer the first question Where is the football? both John picked up the football and John is in the playground are supporting facts.

SUPPORTING FACT 2 SUPPORTING FACT 1


(2) Factoid QA with Two Supporting Facts (“where is actor+object”)

SLIDE 15

Memory Networks (Fully Supervised)


John is in the playground. Bob is in the office. John picked up the football. Bob went to the kitchen. Where is the football? A: playground

Supporting Fact 1, Supporting Fact 2

The current loss function will not work!

L = Σ_{f̄ ≠ m_o1} max(0, γ − S_O(x, m_o1) + S_O(x, f̄))
  + Σ_{r̄ ≠ r} max(0, γ − S_R([x, m_o1], r) + S_R([x, m_o1], r̄))

But the cool thing is that we can iterate!

SLIDE 16

Memory Networks (Fully Supervised)


John is in the playground. Bob is in the office. John picked up the football. Bob went to the kitchen. Where is the football? A: playground

Supporting Fact 1, Supporting Fact 2

The current loss function will not work!

L = Σ_{f̄ ≠ m_o1} max(0, γ − S_O(x, m_o1) + S_O(x, f̄))
  + Σ_{r̄ ≠ r} max(0, γ − S_R([x, m_o1], r) + S_R([x, m_o1], r̄))

But the cool thing is that we can iterate!

SLIDE 17

Memory Network Models (implemented models)

[Figure: generic memory network architecture. A controller module with an internal state vector (initially the query q) repeatedly addresses and reads memory vectors m; the input feeds the controller, and the output receives supervision (direct or reward-based). Figure: Sainbayar Sukhbaatar]

SLIDE 18

The First MemNN Implementation

  • I (input): converts to bag-of-word-embeddings x.
  • G (generalization): stores x in next available slot mN.
  • O (output): Loops over all memories k=1 or 2 times:
  • 1st loop max: finds best match mi with x.
  • 2nd loop max: finds best match mj with (x, mi).
  • The output o is represented with (x, mi, mj).
  • R (response): ranks all words in the dictionary given o and returns the best single word. (Or: use a full RNN here.)
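A hedged sketch of the O module's hard-attention loops; s_o stands for a trained matching score, and all names and shapes here are illustrative rather than the original implementation.

```python
def O(x, mems, s_o, hops=2):
    """Greedy hard attention: pick the best-scoring memory at each hop.

    x: question representation; mems: list of memory representations;
    s_o(query_parts, memory) -> float is a trained matching function.
    """
    chosen = []
    for _ in range(hops):
        rest = [i for i in range(len(mems)) if i not in chosen]
        # 1st hop matches against x alone; the 2nd against (x, m_i)
        best = max(rest, key=lambda i: s_o([x] + [mems[j] for j in chosen], mems[i]))
        chosen.append(best)
    return (x, *(mems[i] for i in chosen))   # o = (x, m_i, m_j), handed to R
```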


SLIDE 19

Matching function

  • For a given Q, we want a good match to the relevant memory slot(s) containing the answer, e.g.: Match(Where is the football ?, John picked up the football)

  • We use a q^T U^T U d embedding model with word embedding features:
− LHS features: Q:Where Q:is Q:the Q:football Q:?
− RHS features: D:John D:picked D:up D:the D:football QDMatch:the QDMatch:football

(QDMatch:football is a feature to say there's a Q&A word match, which can help.)

The parameters U are trained with a margin ranking loss: supporting facts should score higher than non-supporting facts.


SLIDE 20

Matching function: 2nd hop

  • On the 2nd hop we match the question & 1st hop to a new fact: Match([Where is the football ?, John picked up the football], John is in the playground)

  • We use the same q^T U^T U d embedding model:
− LHS features: Q:Where Q:is Q:the Q:football Q:? Q2:John Q2:picked Q2:up Q2:the Q2:football
− RHS features: D:John D:is D:in D:the D:playground QDMatch:the QDMatch:is .. Q2DMatch:John


SLIDE 21

Objective function

Minimize

where S_O is the matching function for the Output component, S_R is the matching function for the Response component, x is the input question, m_O1 is the first true supporting memory (fact), m_O2 is the second true supporting memory (fact), and r is the response. True facts and responses m_O1, m_O2 and r should have higher scores than all other facts and responses by a given margin.


SLIDE 22

Comparing triples

  • We also need time information for the bAbI tasks. We tried adding absolute time as a feature: it works, but the following idea can be better:

  • Seems to work better if we compare triples:
  • Match(Q, D, D′) returns < 0 if D is better than D′, and > 0 if D′ is better than D. We can loop through memories, keeping the best m_i at each step.
  • Now the features include relative time features:
− L.H.S.: same as before
− R.H.S.: features(D), DbeforeQ: 0-or-1, features(D′), D′beforeQ: 0-or-1, DbeforeD′: 0-or-1


SLIDE 23

Comparing triples: Objective and Inference

Similar to before, except now for both m_o1 and m_o2 we need to have two terms considering them as the second or third argument to S_O, as they may appear on either side during inference:


SLIDE 24

bAbI tasks: what reasoning tasks would we like models to work on?

  • We define 20 tasks (generated by the simulation) that we can test new models on. (See: http://fb.ai/babi)

  • The idea is they are a bit like software tests: each task checks if an ML system has a certain skill.

  • We would like each "skill" we check to be a natural task for humans; w.r.t. text understanding & reasoning, humans should be able to get 100%.

  • J. Weston, A. Bordes, S. Chopra, T. Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698.

SLIDE 25

SLIDE 26

SLIDE 27

Task (1) Factoid QA with Single Supporting Fact (“where is actor”)

Our first task consists of questions where a single supporting fact, previously given, provides the answer. We test the simplest case of this by asking for the location of a person. A small sample of the task is thus:

We could use supporting facts for supervision at training time, but they are not known at test time (we call this "strong supervision"). However, weak supervision is much better!

John is in the playground. Bob is in the office. Where is John? A:playground

SUPPORTING FACT

SLIDE 28

(2) Factoid QA with Two Supporting Facts (“where is actor+object”)

A harder task is to answer questions where two supporting statements have to be chained to answer the question:

John is in the playground. Bob is in the office. John picked up the football. Bob went to the kitchen. Where is the football? A:playground

To answer the question Where is the football? both John picked up the football and John is in the playground are supporting facts.

SUPPORTING FACT SUPPORTING FACT

SLIDE 29

(3) Factoid QA with Three Supporting Facts

Similarly, one can make a task with three supporting facts:

John picked up the apple. John went to the office. John went to the kitchen. John dropped the apple. Where was the apple before the kitchen? A:office

The first three statements are all required to answer this.

SLIDE 30

(4) Two Argument Relations: Subject vs. Object

To answer questions, the ability to differentiate and recognize subjects and objects is crucial. We consider the extreme case: sentences feature re-ordered words:

The office is north of the bedroom. The bedroom is north of the bathroom. What is north of the bedroom? A:office What is the bedroom north of? A:bathroom

Note that the two questions above have exactly the same words, but in a different order, and different answers. So a bag-of-words will not work.

SLIDE 31

(6) Yes/No Questions

  • This task tests, in the simplest case possible (with a single supporting fact), the ability of a model to answer true/false type questions:

John is in the playground. Daniel picks up the milk. Is John in the classroom? A:no Does Daniel have the milk? A:yes

SLIDE 32

(7) Counting and (8) Lists/Sets

  • (7) tests the ability to count sets:

Daniel picked up the football. Daniel dropped the football. Daniel got the milk. Daniel took the apple. How many objects is Daniel holding? A:two

  • (8) tests the ability to produce lists/sets:

Daniel picks up the football. Daniel drops the newspaper. Daniel picks up the milk. What is Daniel holding? A:milk,football

SLIDE 33

(11) Basic Coreference (nearest referent) and (13) Compound Coreference

  • (11): Daniel was in the kitchen. Then he went to the studio. Sandra was in the office. Where is Daniel? A:studio

  • (13): Daniel and Sandra journeyed to the office. Then they went to the garden. Sandra and John travelled to the kitchen. After that they moved to the hallway. Where is Daniel? A:garden

SLIDE 34

(14) Time manipulation

  • While our tasks so far have included time implicitly in the order of the statements, this task tests understanding the use of time expressions within the statements:

In the afternoon Julie went to the park. Yesterday Julie was at school. Julie went to the cinema this evening. Where did Julie go after the park? A:cinema

Much harder difficulty: adapt a real time expression labeling dataset into a question-answer format, e.g. UzZaman et al., '12.

SLIDE 35

(15) Basic Deduction

  • This task tests basic deduction via inheritance of properties:

Sheep are afraid of wolves. Cats are afraid of dogs. Mice are afraid of cats. Gertrude is a sheep. What is Gertrude afraid of? A:wolves

Deduction should prove difficult for MemNNs because it effectively involves search, although our setup might be simple enough for it.

SLIDE 36

(17) Positional Reasoning

  • This task tests spatial reasoning, one of many components of the classical SHRDLU system:

The triangle is to the right of the blue square. The red square is on top of the blue square. The red sphere is to the right of the blue square. Is the red sphere to the right of the blue square? A:yes Is the red square to the left of the triangle? A:yes

SLIDE 37

(18) Reasoning about size

  • This task requires reasoning about the relative size of objects and is inspired by the commonsense reasoning examples in the Winograd schema challenge:

The football fits in the suitcase. The suitcase fits in the cupboard. The box of chocolates is smaller than the football. Will the box of chocolates fit in the suitcase? A:yes

Tasks 3 (three supporting facts) and 6 (Yes/No) are prerequisites.

SLIDE 38

(19) Path Finding

  • In this task the goal is to find the path between locations:

The kitchen is north of the hallway. The den is east of the hallway. How do you go from den to kitchen? A:west,north

This is going to prove difficult for MemNNs because it effectively involves search.

SLIDE 39

End-to-end Memory Networks


End-to-end Memory Networks, S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS 2015

SLIDE 40

End-to-end Memory Network (MemN2N)

  • New end-to-end (MemN2N) model (Sukhbaatar '15):
  • Reads from memory with soft attention
  • Performs multiple lookups (hops) on memory
  • End-to-end training with backpropagation
  • Only needs supervision on the final output
  • It is based on "Memory Networks" by [Weston, Chopra & Bordes ICLR 2015], but that had:
  • Hard attention
  • Requires explicit supervision of attention during training
  • Only feasible for simple tasks

SLIDE 41

MemN2N architecture

[Figure: the controller module receives the input and holds an internal state vector; it repeatedly addresses and reads from unordered memory vectors, and produces the output. Supervision is applied only at the output.]

SLIDE 42

Memory Module

[Figure: the addressing signal (controller state vector) is compared against the memory vectors by dot product; a softmax turns the scores into attention weights (a soft address); the weighted sum of the memory vectors is returned to the controller and added to the controller state.]
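A minimal numpy sketch of this read operation (dot-product addressing, softmax, weighted sum); variable names and the additive controller update are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_read(u, M):
    """u: controller state, shape (d,); M: memory vectors, shape (n, d)."""
    p = softmax(M @ u)   # attention weights / soft address
    o = p @ M            # weighted sum of memory vectors
    return o, p

# one hop: the read vector is added back to the controller state
# u_next = u + memory_read(u, M)[0]
```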

SLIDE 43

Question & Answering

[Figure: the input story (1: Sam moved to garden, 2: Sam went to kitchen, 3: Sam drops apple there) is written into the memory module; the question "Where is Sam?" drives the controller, which reads memory via dot product + softmax and a weighted sum, and outputs the answer "kitchen".]

SLIDE 44

Memory Vectors

E.g. constructing memory vectors with Bag-of-Words (BoW):
  • 1. Embed each word
  • 2. Sum the embedding vectors

E.g. temporal structure: add special words for time and include them in the BoW (memory vector = embedding vectors + time embedding).

SLIDE 45

Positional Encoding of Words

Representation of inputs and memories could use all kinds of encodings: bag of words, RNN-style reading at word or character level, etc.

We also built a positional encoding variant:

Words are represented by vectors as before. But instead of a bag, position is modeled by a multiplicative term on each word vector, with weights depending on the position in the sentence.
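A numpy sketch of such a position-weighted bag. The specific weighting below is the l_jk scheme from the MemN2N paper, shown here as one concrete instance of a multiplicative positional term.

```python
import numpy as np

def position_weights(J, d):
    """l_jk = (1 - j/J) - (k/d)(1 - 2j/J) for word j of J, dimension k of d."""
    j = np.arange(1, J + 1)[:, None]
    k = np.arange(1, d + 1)[None, :]
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

def pe_memory(word_vectors):
    """word_vectors: (J, d) word embeddings of one sentence -> (d,) memory."""
    J, d = word_vectors.shape
    return (position_weights(J, d) * word_vectors).sum(axis=0)
```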

SLIDE 46

Training on 1k stories. N-grams, LSTMs and MemN2N are weakly supervised; Memory Networks and StructSVM+coref+srl are supervised with supporting facts.

Task | N-grams | LSTMs | MemN2N | Memory Networks | StructSVM+coref+srl
T1. Single supporting fact | 36 | 50 | PASS | PASS | PASS
T2. Two supporting facts | 2 | 20 | 87 | PASS | 74
T3. Three supporting facts | 7 | 20 | 60 | PASS | 17
T4. Two argument relations | 50 | 61 | PASS | PASS | PASS
T5. Three argument relations | 20 | 70 | 87 | PASS | 83
T6. Yes/no questions | 49 | 48 | 92 | PASS | PASS
T7. Counting | 52 | 49 | 83 | 85 | 69
T8. Sets | 40 | 45 | 90 | 91 | 70
T9. Simple negation | 62 | 64 | 87 | PASS | PASS
T10. Indefinite knowledge | 45 | 44 | 85 | PASS | PASS
T11. Basic coreference | 29 | 72 | PASS | PASS | PASS
T12. Conjunction | 9 | 74 | PASS | PASS | PASS
T13. Compound coreference | 26 | PASS | PASS | PASS | PASS
T14. Time reasoning | 19 | 27 | PASS | PASS | PASS
T15. Basic deduction | 20 | 21 | PASS | PASS | PASS
T16. Basic induction | 43 | 23 | PASS | PASS | 24
T17. Positional reasoning | 46 | 51 | 49 | 65 | 61
T18. Size reasoning | 52 | 52 | 89 | PASS | 62

SLIDE 47

Attention during memory lookups

Story (1: 1 supporting fact) | Support | Hop 1 | Hop 2 | Hop 3
Daniel went to the bathroom. |  | 0.00 | 0.00 | 0.03
Mary travelled to the hallway. |  | 0.00 | 0.00 | 0.00
John went to the bedroom. |  | 0.37 | 0.02 | 0.00
John travelled to the bathroom. | yes | 0.60 | 0.98 | 0.96
Mary went to the office. |  | 0.01 | 0.00 | 0.00
Where is John? Answer: bathroom Prediction: bathroom

Story (2: 2 supporting facts) | Support | Hop 1 | Hop 2 | Hop 3
John dropped the milk. |  | 0.06 | 0.00 | 0.00
John took the milk there. | yes | 0.88 | 1.00 | 0.00
Sandra went back to the bathroom. |  | 0.00 | 0.00 | 0.00
John moved to the hallway. | yes | 0.00 | 0.00 | 1.00
Mary went back to the bedroom. |  | 0.00 | 0.00 | 0.00
Where is the milk? Answer: hallway Prediction: hallway

Story (16: basic induction) | Support | Hop 1 | Hop 2 | Hop 3
Brian is a frog. | yes | 0.00 | 0.98 | 0.00
Lily is gray. |  | 0.07 | 0.00 | 0.00
Brian is yellow. | yes | 0.07 | 0.00 | 1.00
Julius is green. |  | 0.06 | 0.00 | 0.00
Greg is a frog. | yes | 0.76 | 0.02 | 0.00
What color is Greg? Answer: yellow Prediction: yellow

Story (18: size reasoning) | Support | Hop 1 | Hop 2 | Hop 3
The suitcase is bigger than the chest. | yes | 0.00 | 0.88 | 0.00
The box is bigger than the chocolate. |  | 0.04 | 0.05 | 0.10
The chest is bigger than the chocolate. | yes | 0.17 | 0.07 | 0.90
The chest fits inside the container. |  | 0.00 | 0.00 | 0.00
The chest fits inside the box. |  | 0.00 | 0.00 | 0.00
Does the suitcase fit in the chocolate? Answer: no Prediction: no

Samples from toy QA tasks

Model | Test Acc | Failed tasks (of 20)
MemNN | 93.3% | 4
LSTM | 49% | 20
MemN2N, 1 hop | 74.82% | 17
MemN2N, 2 hops | 84.4% | 11
MemN2N, 3 hops | 87.6% | 11

20 bAbI Tasks

SLIDE 48

How about on real data?

  • Toy AI tasks are important for developing innovative methods.
  • But they do not give all the answers.
  • How do these models work on real data?

− Classic Language Modeling (Penn TreeBank, Text8)
− Story understanding (Children's Book Test, News articles)
− Open Question Answering (WebQuestions, WikiQA)
− Goal-Oriented Dialog and Chit-Chat (Movie Dialog, Ubuntu)

SLIDE 49

SLIDE 50

SLIDE 51

SLIDE 52

Self-Supervision Memory Network

Two tricks together make things work a bit better:

1) Bypass module: instead of the last output module being a linear layer from the output of the memory, assume the answer is one of the memories. Sum the scores of identical memories.

2) Self-Supervision: we know what the right answer is on the training data, so directly train memories containing the answer word to be supporting facts (i.e. to have high probability).

SLIDE 53

Results on Children’s Book Test

SLIDE 54

CNN/Daily Mail Datasets


Slide credit: Danqi Chen

(Hermann et al., 2015)

P: ( @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel " @entity11 " will feature a capable but flawed @entity13 official named @entity14 who " also happens to be a lesbian . " the character is the first gay figure in the official @entity6 -- the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 -- according to @entity24 , editor of " @entity6 "

Q: characters in " @placeholder " movies have gradually become more diverse

A: @entity6

CNN: 380k, Daily Mail: 879k training examples - for free!

SLIDE 55

SQuAD Dataset

  • 100K question-answer pairs
  • Answers to all of the questions are segments of text, or spans, in the passage

(Rajpurkar et al., 2016)

Image credit: Pranav Rajpurkar

SLIDE 56

Dynamic Memory Networks


Dynamic Memory Networks for Visual and Textual Question Answering, Caiming Xiong, Stephen Merity, Richard Socher. ICML 2016

SLIDE 57

DMN Overview

[Figure: DMN overview. The Input Module reads the story sentences s1..s8 (Mary got the milk there. John moved to the bedroom. Sandra went back to the kitchen. Mary travelled to the hallway. John got the football there. John went to the hallway. John put down the football. Mary went to the garden.) from GloVe word vectors w1..wT. The Question Module encodes "Where is the football?" into q. The Episodic Memory Module makes two gated passes e1, e2 over the facts, concentrating on the relevant sentences and producing memories m1, m2; a Semantic Memory Module sits alongside. The Answer Module outputs "hallway <EOS>".]

SLIDE 58

The Modules: Input


[Figure: the DMN overview again, zoomed on the Input Module: the story sentences s1..s8 are read word by word (w1..wT, GloVe vectors) and encoded into fact representations.]
SLIDE 59

Gated Recurrent Units in RNN

with h_t = GRU(x_t, h_{t−1}):

z_t = σ(W^(z) x_t + U^(z) h_{t−1} + b^(z))
r_t = σ(W^(r) x_t + U^(r) h_{t−1} + b^(r))
h̃_t = tanh(W x_t + r_t ∘ (U h_{t−1}) + b^(h))
h_t = z_t ∘ h_{t−1} + (1 − z_t) ∘ h̃_t

(Cho et al. 2014)
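A direct numpy transcription of the GRU step above; parameter names mirror the equations, with ∘ as elementwise multiplication.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, P):
    """One GRU step; P holds Wz, Uz, bz, Wr, Ur, br, W, U, bh."""
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev + P["bz"])
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev + P["br"])
    h_tilde = np.tanh(P["W"] @ x_t + r * (P["U"] @ h_prev) + P["bh"])
    return z * h_prev + (1.0 - z) * h_tilde
```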

SLIDE 60

The Modules: Question

Each question's word vectors v_t are fed through q_t = GRU(v_t, q_{t−1}); the question vector q is the final hidden state.


[Figure: the DMN overview again, zoomed on the Question Module: "Where is the football?" is encoded by a GRU into the question vector q.]
SLIDE 61

The Modules: Episodic Memory

h_t^i = g_t^i GRU(c_t, h_{t−1}^i) + (1 − g_t^i) h_{t−1}^i

[Figure: the DMN overview again, zoomed on the Episodic Memory Module: per-fact gates g_t^i modulate a GRU over the facts on each pass, producing episode memories m1, m2.]
SLIDE 62

The Modules: Episodic Memory

  • Gates are activated if relevant to the question
  • When the end of the input is reached, the relevant facts are summarized

z(s, m, q) = [s ∘ q, s ∘ m, |s − q|, |s − m|, s, m, q, s^T W^(b) q, s^T W^(b) m]

G(s, m, q) = σ(W^(2) tanh(W^(1) z(s, m, q) + b^(1)) + b^(2))
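A hedged numpy sketch of this gate; W1, b1, Wb are assumed parameter arrays, W2 a vector and b2 a scalar, so G returns a number in (0, 1).

```python
import numpy as np

def gate(s, m, q, W1, b1, W2, b2, Wb):
    """G(s, m, q): attention gate for fact s, given memory m and question q."""
    z = np.concatenate([s * q, s * m, np.abs(s - q), np.abs(s - m),
                        s, m, q, [s @ Wb @ q], [s @ Wb @ m]])
    h = np.tanh(W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

# the episode GRU then mixes its update by the gate:
# h_t = g * GRU(c_t, h_prev) + (1 - g) * h_prev
```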

SLIDE 63

The Modules: Episodic Memory

  • If the summary is insufficient to answer the question, repeat the sequence over the input


[Figure: the DMN overview again, highlighting the Episodic Memory Module's repeated passes e1, e2 and memories m1, m2 over the eight facts.]
SLIDE 64

Inspiration from Neuroscience

  • Episodic memory is the memory of autobiographical events (times, places, etc.): a collection of past personal experiences that occurred at a particular time and place.

  • The hippocampus, the seat of episodic memory in humans, is active during transitive inference

  • In the DMN, repeated passes over the input are needed for transitive inference

SLIDE 65

The Modules: Answer

a_t = GRU([y_{t−1}, q], a_{t−1}),   y_t = softmax(W^(a) a_t)
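A hedged sketch of this decoder, reusing gru_step from the GRU slide; initializing the state with the final episodic memory and decoding greedily are assumptions for illustration.

```python
import numpy as np

def decode_answer(q, m_final, P, W_a, max_len=5, eos_id=0):
    """a_t = GRU([y_{t-1}, q], a_{t-1}); y_t = softmax(W_a a_t)."""
    V = W_a.shape[0]                  # vocabulary size
    a, y, tokens = m_final, np.zeros(V), []
    for _ in range(max_len):
        a = gru_step(np.concatenate([y, q]), a, P)
        p = np.exp(W_a @ a); p /= p.sum()
        idx = int(p.argmax())
        tokens.append(idx)
        if idx == eos_id:
            break
        y = np.eye(V)[idx]            # feed the previous output back in
    return tokens
```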

[Figure: the DMN overview again, highlighting the Answer Module emitting "hallway <EOS>".]

SLIDE 66

Comparison to MemNets

Similarities:

  • MemNets and DMNs have input, scoring, attention and response mechanisms

Differences:

  • For input representations, MemNets use bag-of-words, nonlinear or linear embeddings that explicitly encode position

  • MemNets iteratively run functions for attention and response
  • DMNs show that neural sequence models can be used for input representation, attention and response mechanisms
− naturally captures position and temporality
  • Enables a broader range of applications

SLIDE 67

Experiments: QA on bAbI (1k)

Task | MemNN | DMN
1: Single Supporting Fact | 100 | 100
2: Two Supporting Facts | 100 | 98.2
3: Three Supporting Facts | 100 | 95.2
4: Two Argument Relations | 100 | 100
5: Three Argument Relations | 98 | 99.3
6: Yes/No Questions | 100 | 100
7: Counting | 85 | 96.9
8: Lists/Sets | 91 | 96.5
9: Simple Negation | 100 | 100
10: Indefinite Knowledge | 98 | 97.5
11: Basic Coreference | 100 | 99.9
12: Conjunction | 100 | 100
13: Compound Coreference | 100 | 99.8
14: Time Reasoning | 99 | 100
15: Basic Deduction | 100 | 100
16: Basic Induction | 100 | 99.4
17: Positional Reasoning | 65 | 59.6
18: Size Reasoning | 95 | 95.3
19: Path Finding | 36 | 34.5
20: Agent's Motivations | 100 | 100
Mean Accuracy (%) | 93.3 | 93.6

SLIDE 68

Experiments: Sentiment Analysis

SLIDE 69

Experiments: Sentiment Analysis

  • Stanford Sentiment Treebank
  • Test accuracies:

− MV-RNN and RNTN: Socher et al. (2013)
− DCNN: Kalchbrenner et al. (2014)
− PVec: Le & Mikolov (2014)
− CNN-MC: Kim (2014)
− DRNN: Irsoy & Cardie (2015)
− CT-LSTM: Tai et al. (2015)


Model | Binary | Fine-grained
MV-RNN | 82.9 | 44.4
RNTN | 85.4 | 45.7
DCNN | 86.8 | 48.5
PVec | 87.8 | 48.7
CNN-MC | 88.1 | 47.4
DRNN | 86.6 | 49.8
CT-LSTM | 88.0 | 51.0
DMN | 88.6 | 52.1

SLIDE 70

Analysis of Number of Episodes

  • How many attention + memory passes are needed in the episodic memory?


Max passes | task 3 (three facts) | task 7 (count) | task 8 (lists/sets) | sentiment (fine-grained)
0 passes | - | 48.8 | 33.6 | 50.0
1 pass | - | 48.8 | 54.0 | 51.5
2 passes | 16.7 | 49.1 | 55.6 | 52.1
3 passes | 64.7 | 83.4 | 83.4 | 50.1
5 passes | 95.2 | 96.9 | 96.5 | N/A

SLIDE 71

Analysis of Attention for Sentiment

  • Sharper attention when 2 passes are allowed.
  • Examples that are wrong with just one pass

SLIDE 72

Analysis of Attention for Sentiment

SLIDE 73

Analysis of Attention for Sentiment

  • Examples where full sentence context from the first pass changes attention to words more relevant for the final prediction

SLIDE 74

DMN for Visual Question Answering

[Figure: a CNN-based Visual Input Module feeds image-region features through a GRU; the Question Module encodes "What kind of event is likely taking place here?"; the Episodic Memory Module makes gated passes over the regions (MLP-scored attention, memories m1, m2); the Answer Module outputs "wine tasting <EOS>". Figure: Caiming Xiong]

SLIDE 75

SLIDE 76

Attention Visualization


What is the main color on the bus ? Answer: blue How many pink flags are there ? Answer: 2 What type of trees are in the background ? Answer: pine Is this in the wild ? Answer: no

SLIDE 77

Attention Visualization


Which man is dressed more flamboyantly ? Answer: right What time of day was this picture taken ? Answer: night What is the boy holding ? Answer: surfboard Who is on both photos ? Answer: girl

(Each image is shown with the attention of the episodic memory module.)

SLIDE 78

Attention Visualization


What is this sculpture made out of ? Answer: metal What is the pattern on the cat ' s fur on its tail ? Answer: stripes Did the player hit the ball ? Answer: yes What color are the bananas ? Answer: green

Figure 4. Examples of qualitative results of attention for VQA. Each image (left) is shown with the attention of the episodic memory module.

SLIDE 79

Other Related Models

SLIDE 80

Similar to standard neural nets, the controller interacts with the external world via input/output vectors.

Neural Turing Machines

  • Extend the capabilities of neural nets by coupling them to external memory resources
− Enrich an RNN with a large addressable memory
− Differentiable model of attention
  • Infers simple algorithms like copying


Neural Turing Machines, Graves et al. 2014

SLIDE 81

Neural Turing Machines


Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with

SLIDE 82

Stack Augmented RNNs


a_t = f(A h_t)
s_t[0] = a_t[PUSH] σ(D h_t) + a_t[POP] s_{t−1}[1]
s_t[i] = a_t[PUSH] s_{t−1}[i − 1] + a_t[POP] s_{t−1}[i + 1]
h_t = σ(U x_t + R h_{t−1} + P s^k_{t−1})

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets, Joulin and Mikolov, 2015
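A hedged numpy sketch of one stack-augmented step under these equations, with a single continuous stack whose top element (k = 1) feeds back into the hidden state; parameter names and shapes are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stack_rnn_step(x_t, h_prev, s_prev, P):
    """P: dict with U, R, Ptop (matrices), A (2 x dim), D (vector)."""
    h = sigmoid(P["U"] @ x_t + P["R"] @ h_prev + P["Ptop"] @ s_prev[:1])
    a = np.exp(P["A"] @ h); a /= a.sum()               # a[0]=PUSH, a[1]=POP
    s = np.empty_like(s_prev)
    s[0] = a[0] * sigmoid(P["D"] @ h) + a[1] * s_prev[1]
    s[1:] = a[0] * s_prev[:-1] + a[1] * np.append(s_prev[2:], 0.0)
    return h, s
```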

SLIDE 83

Key-Value MemNNs


Key Value Memory Networks for Directly Reading Documents. Miller et al., 2016
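A minimal sketch of the key-value read that gives these models their name: address the memory with keys, read out values (e.g. key = a window of words, value = the window's centre word or a linked entity). Shapes and the softmax addressing are illustrative.

```python
import numpy as np

def kv_read(q, K, V):
    """q: query (d,); K: keys (n, d); V: values (n, d)."""
    scores = K @ q
    p = np.exp(scores - scores.max())
    p /= p.sum()             # softmax addressing over the keys
    return p @ V             # read a weighted sum of the values
```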

SLIDE 84

Next lecture: Representation Learning