Improving Domain Independent Question Parsing with Synthetic Treebanks

Improving Domain Independent Question Parsing with Synthetic Treebanks COLING 2018: LAW-MWE-CxG Halim-Antoine Boukaram , Nizar Habash, Micheline Ziadee, and Majd Sakr American University of Science and Technology, Lebanon New York


  1. Improving Domain Independent Question Parsing with Synthetic Treebanks
     COLING 2018: LAW-MWE-CxG
     Halim-Antoine Boukaram, Nizar Habash†, Micheline Ziadee, and Majd Sakr‡
     American University of Science and Technology, Lebanon
     †New York University Abu Dhabi, UAE
     ‡Carnegie Mellon University, USA
     {hboukaram,mziadee}@aust.edu.lb, nizar.habash@nyu.edu, msakr@cs.cmu.edu

  2. Problem & Solution
     ● Automatic parsers do not perform well on question constructions
        ○ Most treebanks used for training are in the news domain, which lacks question constructions
     ● Our proposed solution is to synthetically create syntactic trees of questions on which to train parsers
     ● We present our results on Standard Arabic, a morphologically rich and relatively low-resource language

  3. Example of Question Parsing Errors
     To where do I go to submit the application?
     ؟ بلطلا مداق هجا نًا ىلا
     [Figure: automatically parsed tree vs. human-parsed tree]

  4. Example of Question Parsing Errors
     What time will the celebration start?
     ؟ لافتاقا اهيف أدبيس ةعاس يأ
     [Figure: automatically parsed tree vs. human-parsed tree]

  5. Research Questions
     ● We explore two effective and low-cost techniques to add more annotated questions to the training corpus
        ○ Automatically Generating Questions from Existing Treebanks
        ○ Automatically Generating Questions from Question Templates
     ● Research questions:
        ○ How do these techniques compare with manual annotation of additional questions?
        ○ Do combinations of synthetic and manual data improve accuracy?

  6. Technique #1: QGen
     ● Automatically transform an annotated sentence into a number of annotated questions (4.75 on average)
        ○ (S (NP-SBJ the boy) (VP ate (NP-OBJ the apple)))
        ○ → (SBARQ (WHNP who) (S (VP ate (NP-OBJ the apple))))
        ○ → (SQ (VP did) (NP-SBJ the boy) (VP eat (NP-OBJ the apple)))
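
The subject-to-"who" transformation on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' QGen implementation: the helper names (`parse`, `render`, `who_question`) are made up for this sketch, the subject label is assumed to be the Penn-style `NP-SBJ`, and the morphological changes that the real system applies to Arabic verbs are ignored.

```python
from typing import List, Union

# A tree is either a leaf word or a list [label, child, child, ...].
Tree = Union[str, List["Tree"]]


def parse(text: str) -> Tree:
    """Parse a bracketed tree such as "(S (NP-SBJ the boy) ...)" into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a plain word
        node: List[Tree] = [tokens[pos]]  # the token after "(" is the node label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()


def render(tree: Tree) -> str:
    """Write a nested-list tree back out in bracketed form."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(render(child) for child in tree) + ")"


def who_question(tree: Tree, subject_label: str = "NP-SBJ") -> Tree:
    """Drop the subject NP of a declarative S and front a WHNP "who" instead."""
    if isinstance(tree, str) or tree[0] != "S":
        raise ValueError("expected an S node at the root")
    rest = [
        child for child in tree[1:]
        if not (isinstance(child, list) and child[0] == subject_label)
    ]
    if len(rest) == len(tree) - 1:
        raise ValueError(f"no {subject_label} child to question")
    return ["SBARQ", ["WHNP", "who"], ["S"] + rest]


if __name__ == "__main__":
    source = "(S (NP-SBJ the boy) (VP ate (NP-OBJ the apple)))"
    print(render(who_question(parse(source))))
    # prints: (SBARQ (WHNP who) (S (VP ate (NP-OBJ the apple))))
```

The yes/no (SQ) variant would additionally need an auxiliary to be inserted and the main verb to be re-inflected ("ate" → "did ... eat"), which is why the sketch stops at the simpler "who" case.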

  7. Technique #1: QGen
     ● Words of the input tree are modified morphologically depending on the type of generated question
        ○ Arabic Who questions
           ■ Sentences with gender- and number-specific verbs → Questions with masculine-singular verbs

  8. QGen Examples
     [Figure: original phrase structure vs. simple SQ structure]

  9. QGen Examples
     [Figure: original phrase structure vs. simple SBARQ structure]

  10. QGen Examples
     [Figure: original phrase structure vs. modified SQ structure]

  11. QGen Examples
     [Figure: original phrase structure vs. modified SBARQ structure]

  12. QGen Examples
     [Figure: original phrase structure vs. modified SBARQ structure (who)]

  13. Limitations of QGen
     ● Overgeneration introduces errors into the synthetic data in the form of nonsensical questions
     ● Limited coverage of modeled question structures
     ● Input domain might be different from desired question domain

  14. Technique #2: QTemp
     ● Generate question templates
     ● Fill the template's placeholder elements with terms from a desired domain
        ○ Where is %place%?
        ○ Where is the bathroom?
     ● Annotate question templates
        ○ Where is a bathroom?
        ○ Where is the dean’s office?
        ○ Where is the finance office?
        ○ Where is ...?
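
A minimal sketch of the template-filling idea, assuming an annotated template is stored as a bracketed tree whose `%place%` slot stands for a whole annotated subtree. The tree labels, the filler list, and the function names (`fill`, `words`) are hypothetical and chosen only for illustration; the slides do not show the actual annotation of the templates.

```python
import re
from typing import List

# Hypothetical annotated template: %place% is a slot for an annotated NP subtree.
TEMPLATE = "(SBARQ (WHADVP where) (SQ (VP is) %place%))"

# Hypothetical annotated fillers for the %place% slot.
PLACES = [
    "(NP the bathroom)",
    "(NP the finance office)",
]


def fill(template: str, slot: str, fillers: List[str]) -> List[str]:
    """Return one annotated question per filler by substituting it for %slot%."""
    placeholder = f"%{slot}%"
    if placeholder not in template:
        raise ValueError(f"template has no {placeholder} slot")
    return [template.replace(placeholder, filler) for filler in fillers]


def words(tree: str) -> List[str]:
    """Return the leaf words of a bracketed tree (tokens that do not follow a "(")."""
    return re.findall(r"\s([^\s()]+)", tree)


if __name__ == "__main__":
    for question in fill(TEMPLATE, "place", PLACES):
        print(" ".join(words(question)), "=>", question)
```

Because the template and each filler are annotated once and then combined, every generated question comes with a complete tree at no further annotation cost.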

  15. QTemp Examples
     [Figure: annotated question template + annotated token = annotated question]

  16. QTemp Examples
     [Figure: annotated question template + annotated token = annotated question]

  17. Experimental Setup
     ● The baseline treebank is the Penn Arabic Treebank (PATB)
     ● Two Synthetic Treebanks
        ○ QGen and QTemp
     ● Two Manually Annotated Treebanks
        ○ TalkShow and Chatbot
     ● Test accuracy of parser trained using:
        ○ Synthetic vs Manual
        ○ Combined vs Synthetic or Manual

  18. Data Sets
     Treebank       Domain                       Train # Sentences (# Words)   Test # Sentences (# Words)
     PATB (part3)   News articles                10,836 (320,998)              794 (12,884)
     PATBQ          News articles                N/A                           67 (1,054)
     TalkShow       Political talk show          544 (2,691)                   143 (692)
     Chatbot        Conversational               239 (1,505)                   62 (441)
     QGEN PATB      News articles (Synthetic)    962 (8,140)                   N/A
     QTemp          Conversational (Synthetic)   1,607 (13,099)                N/A

  19. Results
     Corpus             Baseline   +Synthetic          Manual               All
     Train              PATB       QGEN PATB + QTemp   TalkShow + Chatbot
     Test: PATB         80.6       80.6                80.6                 80.9
     Test: PATBQ        73.8       74.0                74.9                 75.9
     Test: TalkShow     88.2       87.3                91.4                 92.9
     Test: Chatbot      90.5       93.6                93.3                 94.1
     Macro Average Q    84.2       84.9                86.5                 87.6
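
The "Macro Average Q" row appears to be the unweighted mean of the three question test sets (PATBQ, TalkShow, Chatbot), leaving out the general PATB test set. That reading is an inference from the numbers, not something the slides state. The short check below recomputes the row under that assumption; the variable and function names are made up for this sketch.

```python
from typing import Dict, List

SETTINGS = ["Baseline", "+Synthetic", "Manual", "All"]

# Scores copied from the results table, question test sets only.
QUESTION_SCORES: Dict[str, List[float]] = {
    "PATBQ":    [73.8, 74.0, 74.9, 75.9],
    "TalkShow": [88.2, 87.3, 91.4, 92.9],
    "Chatbot":  [90.5, 93.6, 93.3, 94.1],
}

# The "Macro Average Q" row as reported on the slide.
REPORTED = [84.2, 84.9, 86.5, 87.6]


def macro_average(scores: Dict[str, List[float]]) -> List[float]:
    """Unweighted mean over test sets, computed separately for each training setting."""
    n_sets = len(scores)
    return [sum(column) / n_sets for column in zip(*scores.values())]


if __name__ == "__main__":
    for setting, computed, reported in zip(SETTINGS, macro_average(QUESTION_SCORES), REPORTED):
        print(f"{setting:<11} computed {computed:.2f}   reported {reported}")
```

Each recomputed value lands within 0.1 of the reported one. The small gaps are what one would expect if the slide averaged unrounded scores. Because the mean is unweighted, the 62-sentence Chatbot set counts as much as the 143-sentence TalkShow set.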

  24. Conclusions and Future Work
     ● Synthetic question treebanks are useful for improving question parsing
     ● The domain of the synthetic treebanks must match the domain of the questions we are interested in parsing
     ● We will investigate how applicable the synthetic techniques are to other languages
     ● We will write more question-generating procedures
     ● The Manual and Synthetic Treebanks will be published through the Linguistic Data Consortium

  25. Thank You
