

SLIDE 1

Learning and Knowledge Transfer with Memory Networks for Machine Comprehension

Mohit Yadav, Lovekesh Vig, Gautam Shroff

TCS Research, New Delhi

Presented by Kyo Kim

April 24, 2018

Mohit Yadav, Lovekesh Vig, Gautam Shroff (TCS Research, New Delhi) Presented by Kyo Kim April 24, 2018 1 / 27

SLIDE 2

Overview

1. Motivation

2. Background

3. Proposed Method

4. Dataset and Experiment Results

5. Summary

SLIDE 3

Motivation

SLIDE 4

Problem

Obtaining high performance in "machine comprehension" requires an abundant human-annotated dataset.

Performance is measured by question-answering accuracy.

In real-world datasets with small amounts of data, a wider range of vocabulary is observed and the grammatical structure is often complex.

SLIDE 5

High-level Overview of Proposed Method

1. A curriculum-based training procedure.

2. Knowledge transfer to increase performance on datasets with less abundant labeled data.

3. A memory network pre-trained for the small dataset.

SLIDE 6

Background

SLIDE 7

End-to-end Memory Networks

1. Vectorize the problem tuple.

2. Retrieve the corresponding memory attention vector.

3. Use the retrieved memory to answer the question.

SLIDE 8

End-to-end Memory Networks Cont.

Vectorize the problem tuple

Problem tuple: (q, C, S, s)

q: question; C: context text; S: set of answer choices; s: correct answer (s ∈ S)

Question and context embedding matrix: A ∈ ℝ^{p×d}

Query vector: q⃗ = AΦ(q), where Φ is a bag-of-words representation

Memory vectors: m⃗_i = AΦ(c_i) for i = 1, …, n, where n = |C| and c_i ∈ C
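As a sketch of this vectorization step, a bag-of-words Φ and the embedded query and memory vectors might look as follows. The vocabulary size, embedding dimension, random matrix A, and token ids are all illustrative; A is stored here as d×p so that AΦ(·) yields a d-dimensional vector.

```python
import numpy as np

# Hypothetical sizes: vocabulary p, embedding dimension d.
p, d = 1000, 50
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d, p))  # question/context embedding matrix

def phi(token_ids, vocab_size=p):
    """Bag-of-words count vector for a token-id sequence."""
    v = np.zeros(vocab_size)
    for t in token_ids:
        v[t] += 1.0
    return v

# Query vector q = A @ phi(q); one memory vector per context sentence c_i.
q_ids = [3, 17, 42]                      # toy question token ids
context = [[5, 9], [12, 3, 7], [42, 1]]  # toy context sentences
q_vec = A @ phi(q_ids)
memory = np.stack([A @ phi(c) for c in context])  # shape (n, d), n = |C|
print(memory.shape)  # (3, 50)
```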

SLIDE 9

End-to-end Memory Networks Cont.

Retrieve the corresponding memory attention vector

Attention distribution: a_i = softmax(m⃗_iᵀ q⃗)

Second memory vector: r⃗_i = BΦ(c_i), where B is another embedding matrix similar to A

Aggregated vector: r⃗_o = Σ_{i=1}^{n} a_i · r⃗_i

Prediction vector: â_i = softmax((r⃗_o + q⃗)ᵀ UΦ(s_i)), where U is the embedding matrix for the answers
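A minimal NumPy sketch of the attention and prediction steps, using random stand-ins for the embedded memories m⃗_i, second memories r⃗_i, and answer embeddings UΦ(s_i); all sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n, d = 4, 8
q_vec = rng.normal(size=d)               # query vector q
memory = rng.normal(size=(n, d))         # m_i = A @ phi(c_i)
second_memory = rng.normal(size=(n, d))  # r_i = B @ phi(c_i)

# Attention over memories: a_i = softmax(m_i^T q)
a = softmax(memory @ q_vec)
# Aggregated vector: r_o = sum_i a_i * r_i
r_o = a @ second_memory
# Prediction over answer choices: softmax((r_o + q)^T U phi(s_i))
answers_emb = rng.normal(size=(3, d))    # stand-in for U @ phi(s_i), one row per choice
scores = softmax(answers_emb @ (r_o + q_vec))
print(scores)  # a distribution over the 3 answer choices
```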

SLIDE 10

End-to-end Memory Networks Cont.

Answer the question

Pick the s_i that corresponds to the highest â_i.

Cross-entropy loss:

L(P, D) = −(1/N_D) Σ_{n=1}^{N_D} [ a_n · log(â_n(P, D)) + (1 − a_n) · log(1 − â_n(P, D)) ]
SLIDE 11

Curriculum Learning

First proposed by Bengio et al. (2009). Samples are introduced in order of increasing "difficulty". This yields better local minima even under a non-convex loss.

SLIDE 12

Pre-training and Joint-training

Pre-training: use a model pre-trained on a similar domain to guide the initial stages of training.

Joint-training: exploit the similarity between two different domains by training the model on both domains simultaneously.

SLIDE 13

Proposed Method

SLIDE 14

Curriculum Inspired Training (CIT)

Difficulty Measurement

SF(q, S, C, s) = ( Σ_{word ∈ q ∪ S ∪ C} log(Freq(word)) ) / #{q ∪ S ∪ C}

Partition the dataset into a fixed number of chapters of increasing difficulty. Each chapter consists of ∪_{i=1}^{current chapter} partition[i]. The model is trained for a fixed number of epochs per chapter.
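Assuming that a higher average log frequency marks an easier example, the scoring and cumulative chapter construction can be sketched as follows; the corpus frequencies and toy examples are invented.

```python
import math

# Hypothetical corpus word frequencies; each example stands for the set q ∪ S ∪ C.
freq = {"the": 100, "cat": 40, "sat": 30, "ontology": 2, "epistemic": 1}

def sf(words):
    """Average log corpus frequency over the example's word set (higher = easier)."""
    ws = set(words)
    return sum(math.log(freq[w]) for w in ws) / len(ws)

examples = [
    ["the", "ontology", "epistemic"],
    ["the", "cat", "sat"],
    ["cat", "ontology"],
]

# Order easy-to-hard (decreasing SF), then split into contiguous partitions.
ordered = sorted(examples, key=sf, reverse=True)
n_chapters = 3
size = len(ordered) // n_chapters
partition = [ordered[i * size:(i + 1) * size] for i in range(n_chapters)]
# Chapter c trains on the cumulative union of partitions 1..c.
curriculum = [sum(partition[:c + 1], []) for c in range(n_chapters)]
print([len(ch) for ch in curriculum])  # [1, 2, 3]
```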

SLIDE 15

CIT Cont.

Loss Function

L(P, D, en) = −(1/N_D) Σ_{n=1}^{N_D} [ a_n · log(â_n(P, D)) + (1 − a_n) · log(1 − â_n(P, D)) ] · 1[en ≥ c(n) · epc]

en: current epoch; c(n): chapter to which example n is assigned; epc: epochs per chapter
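A toy sketch of how the indicator gates examples by epoch; the chapter assignments, epc, and per-example loss terms below are invented.

```python
import numpy as np

# Epoch-gated curriculum loss: example n contributes only once the current
# epoch `en` has reached its chapter's unlock point c(n) * epc.
epc = 5                                          # epochs per chapter
chapter = np.array([0, 0, 1, 2])                 # c(n) for four toy examples
per_example_ce = np.array([0.7, 0.2, 0.9, 1.1])  # stand-in cross-entropy terms

def cit_loss(en):
    mask = (en >= chapter * epc).astype(float)   # indicator 1[en >= c(n) * epc]
    return (per_example_ce * mask).mean()        # averaged over all N_D examples

print(cit_loss(en=0))   # only chapter-0 examples contribute
print(cit_loss(en=12))  # chapters 0, 1 and 2 all contribute
```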

SLIDE 16

Joint-Training

General Joint Loss Function

L̂(P, TD, SD) = 2λ · L(P, TD) + 2(1 − λ) · L(P, SD) · F(N_TD, N_SD)

TD: target dataset; SD: source dataset; N_D: number of examples in dataset D; λ: tunable weight parameter

SLIDE 17

Loss Functions

Joint-training

λ = 1/2 and F(N_TD, N_SD) = 1:

L̂(P, TD, SD) = L(P, TD) + L(P, SD)

Weighted joint-training

λ ∈ (0, 1) and F(N_TD, N_SD) = N_TD / N_SD:

L̂(P, TD, SD) = 2λ · L(P, TD) + 2(1 − λ) · L(P, SD) · (N_TD / N_SD)
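A minimal sketch of the general joint loss as a plain function over already-computed per-domain losses; `lam` plays the role of λ and the ratio n_td/n_sd the role of F, with all numeric values illustrative.

```python
# General joint loss from the slides: 2*lam*L_TD + 2*(1-lam)*L_SD*F.
# loss_td / loss_sd are stand-ins for per-domain cross-entropy losses.
def joint_loss(loss_td, loss_sd, lam, n_td=1, n_sd=1, weighted=False):
    f = (n_td / n_sd) if weighted else 1.0
    return 2 * lam * loss_td + 2 * (1 - lam) * loss_sd * f

# lam = 1/2 with F = 1 recovers plain joint-training: L_TD + L_SD.
plain = joint_loss(0.8, 0.4, lam=0.5)
# A weighted variant down-weights the source loss when the target set is small.
weighted = joint_loss(0.8, 0.4, lam=0.75, n_td=100, n_sd=400, weighted=True)
print(plain, weighted)  # ~1.2 and ~1.25
```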

SLIDE 18

Loss Functions Cont.

Curriculum joint-training

λ = 1/2 and F(N_TD, N_SD) = 1:

L̂(P, TD, SD) = L(P, TD, en) + L(P, SD, en)

Weighted curriculum joint-training

λ ∈ (0, 1) and F(N_TD, N_SD) = N_TD / N_SD:

L̂(P, TD, SD) = 2λ · L(P, TD, en) + 2(1 − λ) · L(P, SD, en) · (N_TD / N_SD)

SLIDE 19

Source only

λ = 0 and c ∈ ℝ⁺:

L̂(P, TD, SD) = c · L(P, SD)

SLIDE 20

Dataset and Experiment Results

SLIDE 21

Dataset

Figure: Dataset used for experiments.

SLIDE 22

Experiment Results

Figure: The table has two major row groups: the upper rows are models trained on the target dataset only, and the lower rows are models trained on both the target and source datasets.

SLIDE 23

Experiment Results

Figure: Categorical performance measurement on CNN-11K. The upper rows are models trained on the target dataset only, and the lower rows are models trained on both the target and source datasets.

SLIDE 24

Experiment Results

Figure: Knowledge transfer performance result.

SLIDE 25

Experiment Results

Figure: Loss convergence comparison between models trained with and without CIT.

SLIDE 26

Summary

MemNN is often used in QA. Ordering the samples by difficulty leads to better local minima. Joint-training is useful for obtaining better performance on a small target dataset. Using a pre-trained model improves performance.

SLIDE 27

The End
