Self-Training for Jointly Learning to Ask and Answer Questions

Mrinmaya Sachan, Eric P. Xing
School of Computer Science, Carnegie Mellon University
{mrinmays, epxing}@cs.cmu.edu

Proceedings of NAACL-HLT 2018, pages 629-640, New Orleans, Louisiana, June 1-6, 2018. © 2018 Association for Computational Linguistics.

Abstract

Building curious machines that can answer as well as ask questions is an important challenge for AI. The two tasks of question answering and question generation are usually tackled separately in the NLP literature. At the same time, both require significant amounts of supervised data, which is hard to obtain in many domains. To alleviate these issues, we propose a self-training method for jointly learning to ask as well as answer questions, leveraging unlabeled text along with labeled question-answer pairs for learning. We evaluate our approach on four benchmark datasets: SQuAD, MS MARCO, WikiQA and TrecQA, and show significant improvements over a number of established baselines on both question answering and question generation tasks. We also achieve new state-of-the-art results on two competitive answer sentence selection tasks: WikiQA and TrecQA.

1 Introduction

Question Answering (QA) is a well-studied problem in NLP which focuses on answering questions using some structured or unstructured sources of knowledge. Alongside question answering, there has also been some work on question generation (QG) (Heilman, 2011; Du et al., 2017; Tang et al., 2017), which focuses on generating questions based on given sources of knowledge.

QA and QG are closely related tasks: we can think of them as inverses of each other. However, the NLP literature views the two as entirely separate tasks. In this paper, we explore the relationship between the two tasks by jointly learning to generate as well as answer questions.

An improved ability to generate as well as answer questions will help us build curious machines that can interact with humans in a better manner. Joint modeling of QA and QG is useful as the two can be used in conjunction to generate novel questions from free text and then answers for the generated questions. We use this idea to perform self-training (Nigam and Ghani, 2000) and leverage free text to augment the training of QA and QG models.

QA and QG models are typically trained on question-answer pairs, which are expensive to obtain in many domains. However, it is cheaper to obtain large quantities of free text. Our self-training procedure leverages unlabeled text to boost the quality of our QA and QG models. This is achieved by a careful data augmentation procedure which uses pre-trained QA and QG models to generate additional labeled question-answer pairs. This additional data is then used to retrain the QA and QG models, and the procedure is repeated.
This addition of synthetic labeled data needs to be performed carefully. During self-training, typically the most confident samples are added to the training set in each iteration (Zhu, 2005). We use the performance of our QA and QG models as a proxy for estimating the confidence value of the questions. We describe a suite of heuristics inspired by curriculum learning (Bengio et al., 2009) to select the questions to be generated and added to the training set at each epoch. Curriculum learning draws on the incremental nature of human learning and orders training samples on an easiness scale, so that easy samples are introduced to the learning algorithm first and harder samples are introduced successively. We show that introducing questions in increasing order of hardness leads to improvements over a baseline that introduces questions randomly.

We use a seq2seq model with soft attention (Sutskever et al., 2014; Bahdanau et al., 2014) for QG and a neural model inspired by the Attentive Reader (Hermann et al., 2015; Chen et al., 2016) for QA; however, these can be any QA and QG models. We evaluate our approach on four datasets: SQuAD, MS MARCO, WikiQA and TrecQA, using a corpus of English Wikipedia as unlabeled text. Our experiments show that the self-training approach leads to significant improvements over a number of established approaches in QA and QG on these benchmarks. On the two answer sentence selection QA tasks (WikiQA and TrecQA), we obtain state-of-the-art results.
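As a rough illustration of this procedure, the sketch below outlines the self-training loop with curriculum-style ordering. It is not the authors' code: the model objects and their fit/generate/answer/score methods are hypothetical placeholders, and the way confidence scores are combined is an assumption; only the control flow (pre-train, generate, rank by easiness, augment, retrain) mirrors the description above.

```python
# Illustrative sketch of the self-training procedure described in the text.
# All model interfaces (fit, generate, answer, score) are hypothetical.

def self_train(qa_model, qg_model, labeled_data, unlabeled_sentences,
               n_epochs=10, n_new_per_epoch=1000):
    """Alternate between data augmentation from free text and retraining.

    labeled_data: list of (passage, question, answer_sentence) triples (D_0).
    unlabeled_sentences: list of (passage, sentence) pairs drawn from free text (T).
    """
    train_set = list(labeled_data)
    qa_model.fit(train_set)          # pre-train QA on the labeled data D_0
    qg_model.fit(train_set)          # pre-train QG on the labeled data D_0

    for epoch in range(n_epochs):
        candidates = []
        for passage, sentence in unlabeled_sentences:
            question = qg_model.generate(sentence)        # ask a question
            answer = qa_model.answer(passage, question)   # then answer it
            # Use the models' own scores as a proxy for confidence / easiness;
            # the paper's curriculum heuristics refine this choice.
            score = (qa_model.score(passage, question, answer)
                     + qg_model.score(sentence, question))
            candidates.append((score, (passage, question, answer)))

        # Curriculum ordering: add the easiest (most confident) synthetic
        # pairs first, leaving harder ones for later epochs.
        candidates.sort(key=lambda c: c[0], reverse=True)
        new_pairs = [triple for _, triple in candidates[:n_new_per_epoch]]

        train_set.extend(new_pairs)
        qa_model.fit(train_set)      # retrain both models on augmented data
        qg_model.fit(train_set)
    return qa_model, qg_model
```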

2 Problem Setup

In this work, we focus on the task of machine comprehension, where the goal is to answer a question q based on a passage p. We model this as an answer sentence selection task, i.e., given the set of sentences in the passage p, the task is to select the sentence s ∈ p that contains the answer a. Treating QA as an answer sentence selection task is quite common in the literature (e.g., see Yu et al., 2014). We model QG as the task of transforming a sentence in the passage into a question. Previous work in QG (Heilman and Smith, 2009) transforms text sentences into questions via a set of manually engineered rules; we instead take an end-to-end neural approach.

Let D_0 be a labeled dataset of (passage, question, answer) triples, where the answer is given by selecting a sentence in the passage. We also assume access to unlabeled text T which will be used to augment the training of the two models.

3 The Question Answering Model

Since we model QA as the task of selecting an answer sentence from the passage, we treat each sentence in the corresponding passage as a candidate answer for every input question.

We employ a neural network model inspired by the Attentive Reader framework proposed in Hermann et al. (2015) and Chen et al. (2016). We map all words in the vocabulary to corresponding d-dimensional vector representations via an embedding matrix E ∈ R^{d×V}. Thus, the input passage p can be denoted by the word sequence {p_1, p_2, ..., p_{|p|}} and the question q can similarly be denoted by the word sequence {q_1, q_2, ..., q_{|q|}}, where each token p_i ∈ R^d and q_i ∈ R^d. We use a bi-directional LSTM (Graves et al., 2005) with dropout regularization as in Zaremba et al. (2014) to encode contextual embeddings of each word in the passage:

    \overrightarrow{h}_t = \mathrm{LSTM}_1(p_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \mathrm{LSTM}_2(p_t, \overleftarrow{h}_{t+1})

The final contextual embedding h_t is given by the concatenation of the forward and backward pass embeddings: h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]. Similarly, we use another bi-directional LSTM to encode contextual embeddings of each word in the question.

Then, we use an attention mechanism (Bahdanau et al., 2014) to compute an alignment distribution a based on the relevance between the passage words and the question. The output vector o is a weighted combination of all contextual embeddings, and the correct answer a* among the set of candidate answers A is given by a linear scorer:

    a_i = \mathrm{softmax}(q^\top W h_i), \qquad o = \sum_i a_i h_i, \qquad a^* = \arg\max_{a \in A} w^\top o

We learn the model by maximizing the log-likelihood of correct answers. Given the training set {p^{(i)}, q^{(i)}, a^{(i)}}_{i=1}^{N}, the log-likelihood is:

    \mathcal{L}_{QA} = \sum_{i=1}^{N} \log P(a^{(i)} \mid p^{(i)}, q^{(i)}; \theta)

Here, θ represents all the model parameters to be estimated.
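For orientation, here is a compact PyTorch sketch of this kind of scorer; it is not the authors' implementation. In particular, the question vector q is taken to be the mean of the question's contextual embeddings, each candidate sentence is scored independently by attending over its own words, and all hyperparameters and names are illustrative assumptions made where the text leaves details open.

```python
import torch
import torch.nn as nn

class AttentiveSentenceSelector(nn.Module):
    """Sketch of an Attentive-Reader-style answer sentence scorer:
    bi-LSTM encoders, bilinear attention a_i = softmax(q^T W h_i),
    and a linear scorer w^T o over the attended representation o."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)       # embedding matrix E
        self.sent_rnn = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.question_rnn = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.W = nn.Linear(2 * hidden_dim, 2 * hidden_dim, bias=False)  # bilinear attention
        self.w = nn.Linear(2 * hidden_dim, 1, bias=False)               # answer scorer
        self.dropout = nn.Dropout(0.3)   # dropout regularization as in Zaremba et al. (2014)

    def score(self, sent_ids, question_ids):
        # Contextual embeddings h_t = [forward; backward] for one candidate sentence.
        h, _ = self.sent_rnn(self.dropout(self.embed(sent_ids)))        # (1, |s|, 2H)
        # Question vector q: mean of its contextual embeddings (an assumption).
        u, _ = self.question_rnn(self.dropout(self.embed(question_ids)))
        q = u.mean(dim=1)                                               # (1, 2H)
        # Attention a_i = softmax(q^T W h_i); output o = sum_i a_i h_i.
        scores = torch.bmm(self.W(h), q.unsqueeze(2)).squeeze(2)        # (1, |s|)
        a = torch.softmax(scores, dim=1)
        o = torch.bmm(a.unsqueeze(1), h).squeeze(1)                     # (1, 2H)
        return self.w(o).squeeze(1)                                     # w^T o

    def forward(self, candidate_sentences, question_ids):
        # One score per candidate sentence in the passage; a* is the argmax.
        return torch.cat([self.score(s, question_ids) for s in candidate_sentences])
```

Training then amounts to a softmax cross-entropy over these candidate scores for each question, which corresponds to the log-likelihood objective L_QA above.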
4 The Question Generation Model

We use a seq2seq model (Sutskever et al., 2014) with soft attention (Bahdanau et al., 2014) as our QG model. The model transduces an input sequence x into an output sequence y. Here, the input sequence is a sentence in the passage and the output sequence is a generated question. Let x = {x_1, x_2, ..., x_{|x|}}, y = {y_1, y_2, ..., y_{|y|}}, and let Y be the space of all possible output questions. Thus, we can represent the QG task as finding ŷ ∈ Y such that ŷ = \arg\max_y P(y \mid x), where P(y | x) is the conditional probability of a question sequence y given an input sequence x.

Decoder: Following Sutskever et al. (2014), the conditional factorizes over token-level predictions:

    P(y \mid x) = \prod_{t=1}^{|y|} P(y_t \mid y_{<t}, x)

Here, y_{<t} represents the subsequence of words generated prior to time step t. For the decoder, we again follow Sutskever et al. (2014):

    P(y_t \mid y_{<t}, x) = \mathrm{softmax}\big(W \tanh(W_t [h_t^{(d)}; c_t])\big)
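To make the decoder equation concrete, the sketch below shows a single decoding step in PyTorch, assuming the input sentence has already been encoded into hidden states of the same dimensionality as the decoder. The class name, the bilinear form of the attention score, and all hyperparameters are assumptions of this sketch rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """Sketch of one QG decoder step: an LSTM decoder with soft attention and
    output distribution P(y_t | y_<t, x) = softmax(W tanh(W_t [h_t^(d); c_t]))."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim, hidden_dim)        # decoder state h_t^(d)
        self.attn = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_t = nn.Linear(2 * hidden_dim, hidden_dim)    # maps [h_t^(d); c_t]
        self.W = nn.Linear(hidden_dim, vocab_size)          # output projection

    def forward(self, prev_token, state, encoder_states):
        """prev_token: (batch,) ids of y_{t-1}; state: (h, c) of the LSTM cell;
        encoder_states: (batch, |x|, hidden_dim) encodings of the input sentence."""
        h, c = self.cell(self.embed(prev_token), state)
        # Soft attention over encoder states; a bilinear score stands in for
        # the Bahdanau-style alignment model (an assumption of this sketch).
        scores = torch.bmm(encoder_states, self.attn(h).unsqueeze(2)).squeeze(2)
        alpha = torch.softmax(scores, dim=1)
        context = torch.bmm(alpha.unsqueeze(1), encoder_states).squeeze(1)   # c_t
        logits = self.W(torch.tanh(self.W_t(torch.cat([h, context], dim=1))))
        return torch.log_softmax(logits, dim=1), (h, c)
```

Greedy decoding (or beam search) repeatedly feeds the most probable token back in as prev_token; multiplying the per-step probabilities recovers P(y | x) from the factorization above.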
