

SLIDE 1

Learning and Knowledge Transfer with Memory Networks for Machine Comprehension

Mohit Yadav, Lovekesh Vig, Gautam Shroff

TCS Research, New Delhi

Presented by Kyo Kim

April 24, 2018

Mohit Yadav, Lovekesh Vig, Gautam Shroff (TCS Research, New Delhi) Presented by Kyo Kim April 24, 2018 1 / 27

SLIDE 2

Overview

1. Motivation

2. Background

3. Proposed Method

4. Dataset and Experiment Results

5. Summary

SLIDE 3

Motivation

SLIDE 4

Problem

Obtaining high performance in "machine comprehension" requires an abundant human-annotated dataset.

Performance is measured by question-answering accuracy.

In real-world datasets with small amounts of data, a wider range of vocabulary is observed and the grammatical structure is often complex.

SLIDE 5

High-level Overview of Proposed Method

1. A curriculum-based training procedure.

2. Knowledge transfer to increase performance on datasets with less abundant labeled data.

3. A memory network pre-trained for the small dataset.

SLIDE 6

Background

SLIDE 7

End-to-end Memory Networks

1. Vectorize the problem tuple.

2. Retrieve the corresponding memory attention vector.

3. Use the retrieved memory to answer the question.

SLIDE 8

End-to-end Memory Networks Cont.

Vectorize the problem tuple

Problem tuple: (q, C, S, s)

q: question; C: context text; S: set of answer choices; s: correct answer (s ∈ S)

Question and context embedding matrix: A ∈ ℝ^{p×d}

Query vector: q⃗ = AΦ(q), where Φ is a bag-of-words representation

Memory vectors: m⃗_i = AΦ(c_i) for i = 1, …, n, where n = |C| and c_i ∈ C
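As a sketch of this vectorization step, a bag-of-words Φ and the embedded query and memory vectors might look as follows. The vocabulary size, embedding dimension, random matrix A, and token ids are all illustrative; A is stored here as d×p so that AΦ(·) yields a d-dimensional vector.

```python
import numpy as np

# Hypothetical sizes: vocabulary p, embedding dimension d.
p, d = 1000, 50
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d, p))  # question/context embedding matrix

def phi(token_ids, vocab_size=p):
    """Bag-of-words count vector for a token-id sequence."""
    v = np.zeros(vocab_size)
    for t in token_ids:
        v[t] += 1.0
    return v

# Query vector q = A @ phi(q); one memory vector per context sentence c_i.
q_ids = [3, 17, 42]                      # toy question token ids
context = [[5, 9], [12, 3, 7], [42, 1]]  # toy context sentences
q_vec = A @ phi(q_ids)
memory = np.stack([A @ phi(c) for c in context])  # shape (n, d), n = |C|
print(memory.shape)  # (3, 50)
```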

SLIDE 9

End-to-end Memory Networks Cont.

Retrieve the corresponding memory attention vector

Attention distribution: a_i = softmax(m⃗_iᵀ q⃗)

Second memory vector: r⃗_i = BΦ(c_i), where B is another embedding matrix similar to A

Aggregated vector: r⃗_o = Σ_{i=1}^{n} a_i · r⃗_i

Prediction vector: â_i = softmax((r⃗_o + q⃗)ᵀ UΦ(s_i)), where U is the embedding matrix for the answers
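A minimal NumPy sketch of the attention and prediction steps, using random stand-ins for the embedded memories m⃗_i, second memories r⃗_i, and answer embeddings UΦ(s_i); all sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n, d = 4, 8
q_vec = rng.normal(size=d)               # query vector q
memory = rng.normal(size=(n, d))         # m_i = A @ phi(c_i)
second_memory = rng.normal(size=(n, d))  # r_i = B @ phi(c_i)

# Attention over memories: a_i = softmax(m_i^T q)
a = softmax(memory @ q_vec)
# Aggregated vector: r_o = sum_i a_i * r_i
r_o = a @ second_memory
# Prediction over answer choices: softmax((r_o + q)^T U phi(s_i))
answers_emb = rng.normal(size=(3, d))    # stand-in for U @ phi(s_i), one row per choice
scores = softmax(answers_emb @ (r_o + q_vec))
print(scores)  # a distribution over the 3 answer choices
```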

SLIDE 10

End-to-end Memory Networks Cont.

Answer the question

Pick the s_i that corresponds to the highest â_i.

Cross-entropy loss:

L(P, D) = −(1/N_D) Σ_{n=1}^{N_D} [ a_n · log(â_n(P, D)) + (1 − a_n) · log(1 − â_n(P, D)) ]
SLIDE 11

Curriculum Learning

First proposed by Bengio et al. (2009). Samples are introduced in order of increasing "difficulty". This yields better local minima even under a non-convex loss.

SLIDE 12

Pre-training and Joint-training

Pre-training: use a model pre-trained on a similar domain to guide the initial stages of training.

Joint-training: exploit the similarity between two different domains by training the model on both domains simultaneously.

SLIDE 13

Proposed Method

SLIDE 14

Curriculum Inspired Training (CIT)

Difficulty Measurement

SF(q, S, C, s) = ( Σ_{word ∈ q ∪ S ∪ C} log(Freq(word)) ) / #{q ∪ S ∪ C}

Partition the dataset into a fixed number of chapters of increasing difficulty. Each chapter consists of ∪_{i=1}^{current chapter} partition[i]. The model is trained for a fixed number of epochs per chapter.
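Assuming that a higher average log frequency marks an easier example, the scoring and cumulative chapter construction can be sketched as follows; the corpus frequencies and toy examples are invented.

```python
import math

# Hypothetical corpus word frequencies; each example stands for the set q ∪ S ∪ C.
freq = {"the": 100, "cat": 40, "sat": 30, "ontology": 2, "epistemic": 1}

def sf(words):
    """Average log corpus frequency over the example's word set (higher = easier)."""
    ws = set(words)
    return sum(math.log(freq[w]) for w in ws) / len(ws)

examples = [
    ["the", "ontology", "epistemic"],
    ["the", "cat", "sat"],
    ["cat", "ontology"],
]

# Order easy-to-hard (decreasing SF), then split into contiguous partitions.
ordered = sorted(examples, key=sf, reverse=True)
n_chapters = 3
size = len(ordered) // n_chapters
partition = [ordered[i * size:(i + 1) * size] for i in range(n_chapters)]
# Chapter c trains on the cumulative union of partitions 1..c.
curriculum = [sum(partition[:c + 1], []) for c in range(n_chapters)]
print([len(ch) for ch in curriculum])  # [1, 2, 3]
```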

SLIDE 15

CIT Cont.

Loss Function

L(P, D, en) = −(1/N_D) Σ_{n=1}^{N_D} [ a_n · log(â_n(P, D)) + (1 − a_n) · log(1 − â_n(P, D)) ] · 1[en ≥ c(n) · epc]

en: current epoch; c(n): chapter to which example n is assigned; epc: epochs per chapter
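A toy sketch of how the indicator gates examples by epoch; the chapter assignments, epc, and per-example loss terms below are invented.

```python
import numpy as np

# Epoch-gated curriculum loss: example n contributes only once the current
# epoch `en` has reached its chapter's unlock point c(n) * epc.
epc = 5                                          # epochs per chapter
chapter = np.array([0, 0, 1, 2])                 # c(n) for four toy examples
per_example_ce = np.array([0.7, 0.2, 0.9, 1.1])  # stand-in cross-entropy terms

def cit_loss(en):
    mask = (en >= chapter * epc).astype(float)   # indicator 1[en >= c(n) * epc]
    return (per_example_ce * mask).mean()        # averaged over all N_D examples

print(cit_loss(en=0))   # only chapter-0 examples contribute
print(cit_loss(en=12))  # chapters 0, 1 and 2 all contribute
```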

SLIDE 16

Joint-Training

General Joint Loss Function

L̂(P, TD, SD) = 2λ · L(P, TD) + 2(1 − λ) · L(P, SD) · F(N_TD, N_SD)

TD: target dataset; SD: source dataset; N_D: number of examples in dataset D; λ: tunable weight parameter

SLIDE 17

Loss Functions

Joint-training

λ = 1/2 and F(N_TD, N_SD) = 1:

L̂(P, TD, SD) = L(P, TD) + L(P, SD)

Weighted joint-training

λ ∈ (0, 1) and F(N_TD, N_SD) = N_TD / N_SD:

L̂(P, TD, SD) = 2λ · L(P, TD) + 2(1 − λ) · L(P, SD) · (N_TD / N_SD)
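A minimal sketch of the general joint loss as a plain function over already-computed per-domain losses; `lam` plays the role of λ and the ratio n_td/n_sd the role of F, with all numeric values illustrative.

```python
# General joint loss from the slides: 2*lam*L_TD + 2*(1-lam)*L_SD*F.
# loss_td / loss_sd are stand-ins for per-domain cross-entropy losses.
def joint_loss(loss_td, loss_sd, lam, n_td=1, n_sd=1, weighted=False):
    f = (n_td / n_sd) if weighted else 1.0
    return 2 * lam * loss_td + 2 * (1 - lam) * loss_sd * f

# lam = 1/2 with F = 1 recovers plain joint-training: L_TD + L_SD.
plain = joint_loss(0.8, 0.4, lam=0.5)
# A weighted variant down-weights the source loss when the target set is small.
weighted = joint_loss(0.8, 0.4, lam=0.75, n_td=100, n_sd=400, weighted=True)
print(plain, weighted)  # ~1.2 and ~1.25
```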

SLIDE 18

Loss Functions Cont.

Curriculum joint-training

λ = 1/2 and F(N_TD, N_SD) = 1:

L̂(P, TD, SD) = L(P, TD, en) + L(P, SD, en)

Weighted curriculum joint-training

λ ∈ (0, 1) and F(N_TD, N_SD) = N_TD / N_SD:

L̂(P, TD, SD) = 2λ · L(P, TD, en) + 2(1 − λ) · L(P, SD, en) · (N_TD / N_SD)

SLIDE 19

Source only

λ = 0 and c ∈ ℝ⁺:

L̂(P, TD, SD) = c · L(P, SD)

SLIDE 20

Dataset and Experiment Results

SLIDE 21

Dataset

Figure: Dataset used for experiments.

SLIDE 22

Experiment Results

Figure: The table has two major row groups: the upper rows are models trained on the target dataset only, and the lower rows are models trained on both the target and source datasets.

SLIDE 23

Experiment Results

Figure: Categorical performance measurement on CNN-11K. The upper rows are models trained on the target dataset only, and the lower rows are models trained on both the target and source datasets.

SLIDE 24

Experiment Results

Figure: Knowledge transfer performance result.

SLIDE 25

Experiment Results

Figure: Loss convergence comparison between models trained with and without CIT.

SLIDE 26

Summary

MemNN is often used in QA. Ordering the samples by difficulty leads to better local minima. Joint-training is useful for obtaining better performance on a small target dataset. Using a pre-trained model improves performance.

SLIDE 27

The End
