Improving Domain Independent Question Parsing with Synthetic - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Improving Domain Independent Question Parsing with Synthetic Treebanks

COLING 2018: LAW-MWE-CxG Halim-Antoine Boukaram, Nizar Habash,† Micheline Ziadee, and Majd Sakr‡

American University of Science and Technology, Lebanon

†New York University Abu Dhabi, UAE ‡Carnegie Mellon University, USA

{hboukaram,mziadee}@aust.edu.lb, nizar.habash@nyu.edu, msakr@cs.cmu.edu

slide-2
SLIDE 2

Problem & Solution

  • Automatic parsers do not perform well on question constructions
      ○ Most treebanks used for training are in the news domain, which lacks question constructions
  • Our proposed solution is to synthetically create syntactic trees of questions on which to train parsers
  • We present our results on Standard Arabic, a morphologically rich and relatively low-resource language

slide-3
SLIDE 3

Example of Question Parsing Errors

Question: إلى أين أتجه لأقدم الطلب؟ ("To where do I go to submit the application?")

[Figure: automatically parsed tree vs. human-parsed tree for this question]

slide-4
SLIDE 4

Example of Question Parsing Errors

Question: أي ساعة سيبدأ فيها الاحتفال؟ ("What time will the celebration start?")

[Figure: automatically parsed tree vs. human-parsed tree for this question]

slide-5
SLIDE 5

Research Questions

  • We explore two effective and low-cost techniques to add more annotated questions to the training corpus
      ○ Automatically Generating Questions from Existing Treebanks
      ○ Automatically Generating Questions from Question Templates
  • Research questions:
      ○ How do these techniques compare with manual annotation of additional questions?
      ○ Do combinations of synthetic and manual data improve accuracy?

slide-6
SLIDE 6

Technique #1: QGen

  • Automatically transform an annotated sentence into a number of annotated questions (4.75 on average)
      ○ (S (NP-SBJ the boy) (VP ate (NP-OBJ the apple)))
      ○ → (SBARQ (WHNP who) (S (VP ate (NP-OBJ the apple))))
      ○ → (SQ (VP did) (NP-SBJ the boy) (VP eat (NP-OBJ the apple)))
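The transformation on this slide can be sketched on the English toy example as follows. This is a minimal illustration and not the authors' QGen implementation: the nested-tuple tree encoding, the function names, and the `BASE_FORM` lookup are all assumptions made for the sketch, and the subject is written with the Penn Treebank style tag NP-SBJ.

```python
# Toy constituency trees as nested tuples: (label, child, child, ...),
# where each child is either a subtree or a word string.

def to_sexpr(tree):
    """Render a tree in bracketed treebank notation."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + " ".join([label] + [to_sexpr(c) for c in children]) + ")"

# Hypothetical lookup; a real system would need proper morphological generation.
BASE_FORM = {"ate": "eat"}

def who_question(sentence):
    """(S subject VP) -> (SBARQ (WHNP who) (S VP)): the subject is questioned."""
    _label, _subject, vp = sentence
    return ("SBARQ", ("WHNP", "who"), ("S", vp))

def yes_no_question(sentence):
    """(S subject VP) -> (SQ (VP did) subject VP'), main verb put in base form."""
    _label, subject, vp = sentence
    vp_label, verb, *rest = vp
    return ("SQ", ("VP", "did"), subject, (vp_label, BASE_FORM.get(verb, verb), *rest))

sentence = ("S", ("NP-SBJ", "the", "boy"), ("VP", "ate", ("NP-OBJ", "the", "apple")))
for question in (who_question(sentence), yes_no_question(sentence)):
    print(to_sexpr(question))
```

Each rule maps one annotated sentence to one annotated question, so a set of such rules applied to a treebank yields several synthetic question trees per input sentence.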

slide-7
SLIDE 7

Technique #1: QGen

  • Words of the input tree are modified morphologically depending on the type of generated question
      ○ Arabic Who questions
          ■ Sentences with gender- and number-specific verbs → Questions with masculine-singular verbs
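The agreement change for Who questions can be pictured at the level of morphological features. The sketch below is only an assumption about how such a rule might be encoded: the `Verb` record, its feature codes, and the function name are hypothetical, and producing the actual Arabic surface form would additionally require a morphological generator.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Verb:
    lemma: str
    gender: str  # "m" or "f" (hypothetical feature codes)
    number: str  # "s" (singular), "d" (dual), or "p" (plural)

def to_who_question_form(verb: Verb) -> Verb:
    """Reset agreement to masculine singular, keeping the lemma unchanged."""
    return replace(verb, gender="m", number="s")

# A feminine-plural verb becomes masculine singular in the generated Who question.
print(to_who_question_form(Verb(lemma="akal", gender="f", number="p")))
```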

slide-8
SLIDE 8

QGen Examples

[Figure panels: Original phrase structure; Simple SQ Structure]

slide-9
SLIDE 9

QGen Examples

[Figure panels: Original phrase structure; Simple SBARQ Structure]

slide-10
SLIDE 10

QGen Examples

[Figure panels: Original phrase structure; Modified SQ Structure]

slide-11
SLIDE 11

QGen Examples

[Figure panels: Original phrase structure; Modified SBARQ Structure]

slide-12
SLIDE 12

QGen Examples

[Figure panels: Original phrase structure; Modified SBARQ Structure (who)]

slide-13
SLIDE 13

Limitations of QGen

  • Errors in the resulting synthetic data due to overgeneration, which produces nonsensical synthetic questions
  • Limited coverage of modeled question structures
  • Input domain might be different from desired question domain

slide-14
SLIDE 14

Technique #2: QTemp

  • Generate question templates in a desired domain
      ○ Where is %place%?
  • Annotate question templates
  • Fill the templates by substituting values for the placeholder elements
      ○ Where is the bathroom?
      ○ Where is a bathroom?
      ○ Where is the dean’s office?
      ○ Where is the finance office?
      ○ Where is ...?
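At the string level, the filling step can be sketched as below. The `%name%` placeholder syntax follows the slide; the function name, the regular expression, and the filler lists are assumptions for illustration, and the sketch ignores the syntactic annotation that the actual technique carries along.

```python
import itertools
import re

SLOT = re.compile(r"%(\w+)%")  # matches placeholders such as %place%

def fill_template(template, fillers):
    """Return one question per combination of filler values for the template's slots."""
    names = list(dict.fromkeys(SLOT.findall(template)))  # unique slot names, in order
    questions = []
    for combo in itertools.product(*(fillers[name] for name in names)):
        values = dict(zip(names, combo))
        questions.append(SLOT.sub(lambda m: values[m.group(1)], template))
    return questions

places = ["the bathroom", "a bathroom", "the dean's office", "the finance office"]
for question in fill_template("Where is %place%?", {"place": places}):
    print(question)
```

Because every filler is combined with every template, a small number of annotated templates and fillers can expand into a much larger set of questions.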

slide-15
SLIDE 15

QTemp Examples

[Figure: Annotated question template + Annotated token = Annotated question]
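The "template + token = question" composition can also be sketched on trees. In this hypothetical example the nested-tuple encoding, the node labels of the template, and the `fill_tree` helper are assumptions; the slide's own annotation is shown only as a figure.

```python
def fill_tree(template, slot, filler):
    """Return a copy of `template` in which the leaf `slot` is replaced by the subtree `filler`."""
    if isinstance(template, str):
        return filler if template == slot else template
    label, *children = template
    return (label, *(fill_tree(child, slot, filler) for child in children))

# Hypothetical annotated template for "Where is %place%?"
template = ("SBARQ", ("WHADVP", "where"), ("SQ", ("VP", "is"), "%place%"))
# Hypothetical annotated token for the placeholder
token = ("NP", "the", "bathroom")

print(fill_tree(template, "%place%", token))
```

The template is annotated once, and every filled question inherits that annotation, so no per-question manual parsing is needed.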

slide-16
SLIDE 16

QTemp Examples

[Figure: Annotated question template + Annotated token = Annotated question]

slide-17
SLIDE 17

Experimental Setup

  • The baseline treebank is the Penn Arabic Treebank (PATB)
  • Two Synthetic Treebanks
      ○ QGen and QTemp
  • Two Manually Annotated Treebanks
      ○ TalkShow and Chatbot
  • Test accuracy of parser trained using:
      ○ Synthetic vs Manual
      ○ Combined vs Synthetic or Manual

slide-18
SLIDE 18

Data Sets

Treebank       Domain                       Train # Sentences (# Words)   Test # Sentences (# Words)
PATB (part3)   News articles                10,836 (320,998)              794 (12,884)
PATBQ          News articles                N/A                           67 (1,054)
TalkShow       Political talk show          544 (2,691)                   143 (692)
Chatbot        Conversational               239 (1,505)                   62 (441)
QGENPATB       News articles (Synthetic)    962 (8,140)                   N/A
QTemp          Conversational (Synthetic)   1,607 (13,099)                N/A
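One quantity that can be derived from the training-set counts above is the average sentence length. The snippet is a small arithmetic sketch using only the figures on this slide; the variable names are arbitrary.

```python
# (sentences, words) for each training set, copied from the Data Sets slide
train = {
    "PATB (part3)": (10836, 320998),
    "TalkShow": (544, 2691),
    "Chatbot": (239, 1505),
    "QGENPATB": (962, 8140),
    "QTemp": (1607, 13099),
}

avg_len = {name: words / sentences for name, (sentences, words) in train.items()}
for name, value in avg_len.items():
    print(f"{name:14s} {value:5.1f} words per sentence")
```

By this measure the PATB training sentences average roughly 30 words, while the manual and synthetic question sets all average fewer than 9.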

slide-19
SLIDE 19

Results

Corpus              Baseline   +Synthetic         Manual               All
Train               PATB       QGENPATB + QTemp   TalkShow + Chatbot
Test
  PATB              80.6       80.6               80.6                 80.9
  PATBQ             73.8       74.0               74.9                 75.9
  TalkShow          88.2       87.3               91.4                 92.9
  Chatbot           90.5       93.6               93.3                 94.1
  Macro Average Q   84.2       84.9               86.5                 87.6
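The "Macro Average Q" row appears to be the unweighted mean over the three question test sets (PATBQ, TalkShow, Chatbot). That reading is an assumption, but it can be checked against the table with a short sketch:

```python
# Scores on the three question test sets, copied from the Results table
scores = {
    "Baseline":   {"PATBQ": 73.8, "TalkShow": 88.2, "Chatbot": 90.5},
    "+Synthetic": {"PATBQ": 74.0, "TalkShow": 87.3, "Chatbot": 93.6},
    "Manual":     {"PATBQ": 74.9, "TalkShow": 91.4, "Chatbot": 93.3},
    "All":        {"PATBQ": 75.9, "TalkShow": 92.9, "Chatbot": 94.1},
}
reported = {"Baseline": 84.2, "+Synthetic": 84.9, "Manual": 86.5, "All": 87.6}

def macro_average(per_test_set):
    """Unweighted mean: every test set counts equally, regardless of its size."""
    return sum(per_test_set.values()) / len(per_test_set)

for system, per_test_set in scores.items():
    print(f"{system:11s} recomputed {macro_average(per_test_set):.2f}  reported {reported[system]}")
```

The recomputed values agree with the reported row to within 0.1. Small differences in the last digit are expected because the inputs here are already rounded to one decimal place.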

slide-24
SLIDE 24

Conclusions and Future Work

  • Synthetic question treebanks are useful for improving question parsing
  • The domain of the synthetic treebanks must match the desired domain of questions we are interested in parsing
  • We will investigate how applicable the synthetic techniques are to other languages
  • We will write more question-generating procedures
  • The Manual and Synthetic Treebanks will be published through the Linguistic Data Consortium

slide-25
SLIDE 25

Thank You