Learning and Knowledge Transfer with Memory Networks for Machine Comprehension

Mohit Yadav, Lovekesh Vig, Gautam Shroff
TCS Research New-Delhi
y.mohit@tcs.com, lovekesh.vig@tcs.com, gautam.shroff@tcs.com

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 850–859, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics

Abstract

Enabling machines to read and comprehend unstructured text remains an unfulfilled goal for NLP research. Recent research efforts on the “machine comprehension” task have managed to achieve close to ideal performance on simulated data. However, achieving similar levels of performance on small real world datasets has proved difficult; major challenges stem from the large vocabulary size, complex grammar, and the frequent ambiguities in linguistic structure. On the other hand, the requirement of human generated annotations for training, in order to ensure a sufficiently diverse set of questions, is prohibitively expensive. Motivated by these practical issues, we propose a novel curriculum inspired training procedure for Memory Networks to improve the performance for machine comprehension with relatively small volumes of training data. Additionally, we explore various training regimes for Memory Networks to allow knowledge transfer from a closely related domain having larger volumes of labelled data. We also suggest the use of a loss function to incorporate the asymmetric nature of knowledge transfer. Our experiments demonstrate improvements on Dailymail, CNN, and MCTest datasets.

1 Introduction

A long-standing goal of NLP is to imbue machines with the ability to comprehend text and answer natural language questions. The goal is still distant and yet generates a tremendous amount of interest due to the large number of potential NLP applications that are currently stymied because of their inability to deal with unstructured text. Also, the next generation of search engines is aiming to provide precise and semantically relevant answers in response to questions-as-queries, similar to the functionality of digital assistants like Cortana and Siri. This will require text understanding at a non-superficial level, in addition to reasoning and making complex inferences about the text.

As pointed out by Weston et al. (2016), the Question Answering (QA) task on unstructured text is a sound benchmark on which to evaluate machine comprehension. The authors also introduced bAbI: a simulation dataset for QA with multiple toy tasks. These toy tasks require a machine to perform simple induction, deduction, multiple chaining of facts, and complex reasoning, which makes them a sound benchmark to measure progress towards AI-complete QA (Weston et al., 2016). The recently proposed Memory Network architecture and its variants have achieved close to ideal performance, i.e., more than 95% accuracy on 16 out of a total of 20 QA tasks (Sukhbaatar et al., 2015; Weston et al., 2016).

While this performance is impressive, and is indicative of the memory network having sufficient capacity for the machine comprehension task, the performance does not translate to real world text (Hill et al., 2016). Challenges in real-world datasets stem from the much larger vocabulary, the complex grammar, and the often ambiguous linguistic structure, all of which further impede high levels of generalization performance, especially with small datasets. For instance, the empirical results reported by Hill et al. (2016) show that an end-to-end memory network with a single hop surpasses the performance achieved using multiple hops (i.e., higher capacity), when the model is trained with a simple heuristic. Similarly, Tapaswi et al. (2015) show that a memory network heavily overfits on the MovieQA dataset and yields near random performance. These results suggest that achieving good performance may not always be merely a matter of training high capacity models with large volumes of data. In addition to exploring new models, there is a pressing need for innovative training methods, especially when dealing with real world sparsely labelled datasets.

With the advent of deep learning, the state-of-the-art performance for various semantic NLP tasks has seen a significant boost (Collobert and Weston, 2008). However, most of these techniques are data-hungry, and require a large number of sufficiently diverse labeled training samples, e.g., for QA, training samples should not only encompass an entire range of possible questions but also have them in sufficient quantity (Bordes et al., 2015). Generating annotations for training deep models requires a tremendous amount of manual effort and is often too expensive. Hence, it is necessary to develop effective techniques to exploit data from a related domain in order to reduce dependence on annotations. Recently, Memory Networks have been successfully applied to QA and dialogue-systems to work with a variety of disparate data sources such as movies, images, structured, and unstructured text (Weston et al., 2016; Weston, 2016; Tapaswi et al., 2015; Bordes et al., 2015). Inspired by the recent success of Memory Networks, we study methods to train memory networks with small datasets by allowing for knowledge transfer from related domains where labelled data is more abundantly available.

The focus of this paper is to improve generalization performance of memory networks via an improved learning procedure for small real-world datasets and knowledge transfer from a related domain. In the process, this paper makes the following major contributions:

(i) A curriculum inspired training procedure for memory networks is introduced, which yields superior performance with smaller datasets.

(ii) The exploration of knowledge transfer methods such as pre-training, joint-training and the proposed curriculum joint-training with a related domain having abundant labeled data.

(iii) A modified loss function for joint-training that incorporates the asymmetric nature of knowledge transfer, along with an investigation of the application of a pre-trained memory network to very small datasets such as MCTest.

The remainder of the paper is organized as follows: Firstly, we provide a summary of related work in Section 2. Next, in Section 3, we describe the machine comprehension task and the datasets utilized in our experiments. An introduction to memory networks for machine comprehension is presented in Section 4. Section 5 outlines the proposed methods for learning and knowledge transfer. Experimental details are provided in Section 6. We summarize our conclusions in Section 7.

2 Related Work

Memory Networks have been successfully applied to a broad range of NLP and machine learning tasks. These tasks include but are not limited to: performing reasoning over a simulated environment for QA (Weston et al., 2016), factoid and non-factoid based QA using both knowledge bases and unstructured text (Kumar et al., 2015; Hill et al., 2016; Chandar et al., 2016; Bordes et al., 2015), goal driven dialog (Bordes and Weston, 2016; Dodge et al., 2016; Weston, 2016), automatic story comprehension from both video and text (Tapaswi et al., 2015), and transferring knowledge from one knowledge-base while learning to answer questions on a different knowledge base (Bordes et al., 2015). Recently, various other attention based neural models (similar to Memory Networks) have been proposed to tackle the machine comprehension task by QA from unstructured text (Kadlec et al., 2016; Sordoni et al., 2016; Chen et al., 2016). To the best of our knowledge, knowledge transfer from an unstructured text dataset to another unstructured text dataset for machine comprehension has not yet been explored.

Training deep networks is known to be a notoriously hard problem, and often the success of these techniques hinges upon achieving higher generalization performance with high capacity models (Blundell et al., 2015; Larochelle et al., 2009; Glorot and Bengio, 2010). To address this issue, curriculum learning was first introduced by Bengio et al. (2009), who showed that training with gradually increasing difficulty leads to better local minima, especially when working with non-convex loss functions. Although devising a universal curriculum strategy is hard, as even humans do not converge to one particular order in which concepts should be introduced (Rohde and Plaut, 1999), some notion of concept difficulty is normally utilized. With similar motivations, this pa-
