Self-Training for Jointly Learning to Ask and Answer Questions

Mrinmaya Sachan, Eric P. Xing
School of Computer Science, Carnegie Mellon University
{mrinmays, epxing}@cs.cmu.edu

Proceedings of NAACL-HLT 2018, pages 629–640, New Orleans, Louisiana, June 1 - 6, 2018. © 2018 Association for Computational Linguistics

Abstract

Building curious machines that can answer as well as ask questions is an important challenge for AI. The two tasks of question answering and question generation are usually tackled separately in the NLP literature. At the same time, both require significant amounts of supervised data, which is hard to obtain in many domains. To alleviate these issues, we propose a self-training method for jointly learning to ask as well as answer questions, leveraging unlabeled text along with labeled question-answer pairs for learning. We evaluate our approach on four benchmark datasets: SQUAD, MS MARCO, WikiQA and TrecQA, and show significant improvements over a number of established baselines on both question answering and question generation tasks. We also achieve new state-of-the-art results on two competitive answer sentence selection tasks: WikiQA and TrecQA.

1 Introduction

Question Answering (QA) is a well-studied problem in NLP which focuses on answering questions using some structured or unstructured sources of knowledge. Alongside question answering, there has also been some work on question generation (QG) (Heilman, 2011; Du et al., 2017; Tang et al., 2017), which focuses on generating questions based on given sources of knowledge.

QA and QG are closely related tasks [1]. However, the NLP literature views the two as entirely separate tasks. In this paper, we explore the relationship between the two tasks by jointly learning to generate as well as answer questions. An improved ability to generate as well as answer questions will help us build curious machines that can interact with humans in a better manner. Joint modeling of QA and QG is useful as the two can be used in conjunction to generate novel questions from free text and then answers for the generated questions. We use this idea to perform self-training (Nigam and Ghani, 2000) and leverage free text to augment the training of QA and QG models.

QA and QG models are typically trained on question-answer pairs, which are expensive to obtain in many domains. However, it is cheaper to obtain large quantities of free text. Our self-training procedure leverages unlabeled text to boost the quality of our QA and QG models. This is achieved by a careful data augmentation procedure which uses pre-trained QA and QG models to generate additional labeled question-answer pairs. This additional data is then used to retrain our QA and QG models, and the procedure is repeated.

This addition of synthetic labeled data needs to be performed carefully. During self-training, typically the most confident samples are added to the training set (Zhu, 2005) in each iteration. We use the performance of our QA and QG models as a proxy for estimating the confidence value of the questions. We describe a suite of heuristics inspired by curriculum learning (Bengio et al., 2009) to select the questions to be generated and added to the training set at each epoch. Curriculum learning is inspired by the incremental nature of human learning and orders training samples on an easiness scale, so that easy samples are introduced to the learning algorithm first and harder samples are introduced successively. We show that introducing questions in increasing order of hardness leads to improvements over a baseline that introduces questions randomly.

We use a seq2seq model with soft attention (Sutskever et al., 2014; Bahdanau et al., 2014) for QG and a neural model inspired by the Attentive Reader (Hermann et al., 2015; Chen et al., 2016) for QA. However, these can be any QA and QG models. We evaluate our approach on four datasets: SQUAD, MS MARCO, WikiQA and TrecQA. We use a corpus of English Wikipedia as unlabeled text. Our experiments show that the self-training approach leads to significant improvements over a number of established approaches in QA and QG on these benchmarks. On the two answer sentence selection QA tasks (WikiQA and TrecQA), we obtain state-of-the-art results.

[1] We can think of QA and QG as inverses of each other.

2 Problem Setup

In this work, we focus on the task of machine comprehension, where the goal is to answer a question $q$ based on a passage $p$. We model this as an answer sentence selection task, i.e., given the set of sentences in the passage $p$, the task is to select the sentence $s \in p$ that contains the answer $a$. Treating QA as an answer sentence selection task is quite common in the literature (e.g. see Yu et al., 2014). We model QG as the task of transforming a sentence in the passage into a question. Previous work in QG (Heilman and Smith, 2009) transforms text sentences into questions via some set of manually engineered rules. However, we take an end-to-end neural approach.

Let $D_0$ be a labeled dataset of (passage, question, answer) triples where the answer is given by selecting a sentence in the passage. We also assume access to unlabeled text $T$ which will be used to augment the training of the two models.

3 The Question Answering Model

Since we model QA as the task of selecting an answer sentence from the passage, we treat each sentence in the corresponding passage as a candidate answer for every input question.

We employ a neural network model inspired by the Attentive Reader framework proposed in Hermann et al. (2015); Chen et al. (2016). We map all words in the vocabulary to corresponding $d$-dimensional vector representations via an embedding matrix $E \in \mathbb{R}^{d \times V}$. Thus, the input passage $p$ can be denoted by the word sequence $\{p_1, p_2, \ldots, p_{|p|}\}$ and the question $q$ can similarly be denoted by the word sequence $\{q_1, q_2, \ldots, q_{|q|}\}$, where each token $p_i \in \mathbb{R}^d$ and $q_i \in \mathbb{R}^d$.

We use a bi-directional LSTM (Graves et al., 2005) with dropout regularization as in Zaremba et al. (2014) to encode contextual embeddings of each word in the passage:

$$\overrightarrow{h}_t = \mathrm{LSTM}_1\left(p_t, \overrightarrow{h}_{t-1}\right), \qquad \overleftarrow{h}_t = \mathrm{LSTM}_2\left(p_t, \overleftarrow{h}_{t+1}\right)$$

The final contextual embeddings $h_t$ are given by concatenation of the forward and backward pass embeddings: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. Similarly, we use another bi-directional LSTM to encode contextual embeddings of each word in the question.

Then, we use an attention mechanism (Bahdanau et al., 2014) to compute the alignment distribution $a$ based on the relevance among passage words and the question: $a_i = \mathrm{softmax}\left(q^T W h_i\right)$. The output vector $o$ is a weighted combination of all contextual embeddings: $o = \sum_i a_i h_i$. Finally, the correct answer $a^*$ among the set of candidate answers $\mathcal{A}$ is given by: $a^* = \arg\max_{a \in \mathcal{A}} w^T o$.

We learn the model by maximizing the log-likelihood of correct answers. Given the training set $\{p^{(i)}, q^{(i)}, a^{(i)}\}_{i=1}^{N}$, the log-likelihood is:

$$\mathcal{L}_{QA} = \sum_{i=1}^{N} \log P\left(a^{(i)} \mid p^{(i)}, q^{(i)}; \theta\right)$$

Here, $\theta$ represents all the model parameters to be estimated.

4 The Question Generation Model

We use a seq2seq model (Sutskever et al., 2014) with soft attention (Bahdanau et al., 2014) as our QG model. The model transduces an input sequence $x$ to an output sequence $y$. Here, the input sequence is a sentence in the passage and the output sequence is a generated question. Let $x = \{x_1, x_2, \ldots, x_{|x|}\}$, $y = \{y_1, y_2, \ldots, y_{|y|}\}$ and $\mathcal{Y}$ be the space of all possible output questions. Thus, we can represent the QG task as finding $\hat{y} \in \mathcal{Y}$ such that: $\hat{y} = \arg\max_{y} P(y \mid x)$. Here, $P(y \mid x)$ is the conditional probability of a question sequence $y$ given input sequence $x$.

Decoder: Following Sutskever et al. (2014), the conditional factorizes over token-level predictions:

$$P(y \mid x) = \prod_{t=1}^{|y|} P\left(y_t \mid y_{<t}, x\right)$$

Here, $y_{<t}$ represents the subsequence of words generated prior to the time step $t$. For the decoder, we again follow Sutskever et al. (2014):

$$P\left(y_t \mid y_{<t}, x\right) = \mathrm{softmax}\left(W \tanh\left(W_t \left[h_t^{(d)} ; c_t\right]\right)\right)$$
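The attention read-out of Section 3 and the decoder equations of Section 4 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`attentive_readout`, `decoder_step`, `sequence_log_prob`), the array shapes, and the use of plain arrays in place of the LSTM encoder and decoder states are all hypothetical choices made here for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def attentive_readout(q, W, H):
    """Attention read-out: a_i = softmax(q^T W h_i), o = sum_i a_i h_i.

    q: (d_q,) question vector, W: (d_q, d_h) bilinear weights,
    H: (n, d_h) contextual passage embeddings (stand-ins for BiLSTM outputs).
    """
    scores = H @ (W.T @ q)   # scores[i] = q^T W h_i
    a = softmax(scores)      # alignment distribution over passage words
    o = a @ H                # weighted combination of the h_i
    return a, o


def decoder_step(W_out, W_t, h_d, c):
    """One step: softmax(W tanh(W_t [h_d ; c])).

    h_d and c are assumed to be the decoder state and context vector.
    """
    return softmax(W_out @ np.tanh(W_t @ np.concatenate([h_d, c])))


def sequence_log_prob(step_probs, target_ids):
    """log P(y | x) = sum_t log P(y_t | y_<t, x).

    step_probs: (T, V) array whose row t is the distribution at step t.
    target_ids: the T token ids of the question y.
    """
    return float(sum(np.log(step_probs[t, y]) for t, y in enumerate(target_ids)))
```

Working in log space in `sequence_log_prob` turns the token-level product into a sum, which avoids numerical underflow for long questions; the arg max over questions is unchanged because the logarithm is monotonic.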
