Better Arabic Parsing: Baselines, Evaluations, and Analysis
Spence Green and Christopher D. Manning
Stanford University
August 27, 2010


SLIDE 1

Better Arabic Parsing
Baselines, Evaluations, and Analysis

Spence Green and Christopher D. Manning
Stanford University
August 27, 2010

SLIDE 6

Outline: Motivation · Syntax and Annotation · Grammar Development · Experiments (Multilingual Parsing, Arabic)

Common Multilingual Parsing Questions...

Is language X “harder” to parse than language Y?
◮ Morphologically rich X

Is treebank X “better/worse” than treebank Y?

Does feature Z “help more” for language X than Y?
◮ Lexicalization
◮ Morphological annotations
◮ Markovization
◮ etc.

SLIDE 7

“Underperformance” Relative to English

Evalb F1, all sentence lengths (Petrov, 2009):

  English              90.1
  Chinese              83.7
  Bulgarian            81.6
  German               80.1
  French               77.9
  Italian              75.6
  Arabic               75.8
  Arabic (this paper)  81.1
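The figures above are labeled-bracket F1 scores. As a rough sketch of what Evalb computes (the official tool adds parameterized exceptions for punctuation, root labels, and preterminals, which are omitted here), assuming trees encoded as nested `(label, children...)` tuples with string leaves:

```python
from collections import Counter

def brackets(tree, start=0):
    """Return (end, spans): labeled spans (label, start, end) of a tuple tree."""
    if isinstance(tree, str):
        return start + 1, []
    end, spans = start, []
    for child in tree[1:]:
        end, sub = brackets(child, end)
        spans.extend(sub)
    spans.append((tree[0], start, end))
    return end, spans

def bracket_f1(gold, test):
    """Labeled bracket F1 over matched spans (multiset intersection)."""
    g = Counter(brackets(gold)[1])
    t = Counter(brackets(test)[1])
    match = sum((g & t).values())
    p, r = match / sum(t.values()), match / sum(g.values())
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ("S", ("NP", "I"), ("VP", "saw", ("NP", "it")))
test = ("S", ("NP", "I"), ("VP", "saw"), ("NP", "it"))  # wrong attachment
print(bracket_f1(gold, test))  # 3 of 4 brackets match on each side: 0.75
```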

SLIDE 8

Why Arabic / Penn Arabic Treebank (ATB)?

◮ Annotation style similar to PTB
◮ Relatively little segmentation (cf. Chinese)
◮ Richer morphology (cf. English)
◮ More syntactic ambiguity (unvocalized)

SLIDE 9

ATB Details

◮ Parts 1–3 (not including part 3, v3.2)
◮ Newswire only: Agence France Presse, Al-Hayat, Al-Nahar
◮ Corpus/experimental characteristics:
  ◮ 23k trees
  ◮ 740k tokens
  ◮ Shortened “Bies” POS tags
  ◮ Split: 2005 JHU workshop

SLIDE 15

Arabic Preliminaries

◮ Diglossia: “Arabic” → MSA
◮ Typology: VSO; VOS, SVO, VO also possible
◮ Devocalization
◮ Segmentation: an analyst’s choice!
  ◮ ATB uses clitic segmentation
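Clitic segmentation splits conjunction, preposition, and pronoun clitics off their host word into separate tokens. A deliberately naive illustration over Buckwalter transliteration; real ATB-style segmentation (e.g., with MADA) uses full morphological analysis plus context to decide whether a leading letter is actually a clitic:

```python
# Illustrative only: blind leading-letter matching over-splits badly in practice.
CONJ = {"w", "f"}        # wa- "and", fa- "so"
PREP = {"b", "l", "k"}   # bi- "with/by", li- "to", ka- "as"

def split_proclitics(token):
    """Greedily strip at most one conjunction and one preposition
    proclitic from a Buckwalter-transliterated token."""
    segments = []
    if token[:1] in CONJ and len(token) > 3:
        segments.append(token[0] + "+")
        token = token[1:]
    if token[:1] in PREP and len(token) > 3:
        segments.append(token[0] + "+")
        token = token[1:]
    segments.append(token)
    return segments

print(split_proclitics("wAlktAb"))  # ['w+', 'AlktAb'], "and the book"
```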

SLIDE 16

Syntactic Ambiguity in Arabic

This talk:
◮ Devocalization
◮ Discourse-level coordination ambiguity

Many other types:
◮ Adjectives / adjective phrases
◮ Process nominals (maSdar)
◮ Attachment in annexation constructs (Gabbard and Kulick, 2008)

SLIDE 17

Devocalization: Inna and her Sisters

  Particle   Gloss      POS   Head of
  inna       “indeed”   VBP   VP
  anna       “that”     IN    SBAR
  in         “if”       IN    SBAR
  an         “to”       IN    SBAR

All four particles are orthographically identical once devocalized.

SLIDE 19

Devocalization: Inna and her Sisters

[Figure: two parse trees for a sentence beginning ﺖﻓﺎﺿﺍ “she added ...”]

Reference: the complement of VBD “she added” is an S clause in which ﻥﺍ (“indeed”) is tagged VBP, with NP subject (NN “Saddam”) ...

Stanford: the complement is instead an SBAR in which the same devocalized ﻥﺍ is tagged IN.

SLIDE 20

Discourse-level Coordination Ambiguity

[Figure: two analyses of sentence-initial coordination with CC “and”: clauses coordinated as S dominating S, vs. coordination inside NP]

◮ S < S in 27.0% of dev set trees
◮ NP < CC in 38.7% of dev set trees
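In the Tgrep-style notation above, X < Y means a node labeled X immediately dominates a node labeled Y. A sketch of counting such configuration rates over dev-set trees, again using tuple-encoded trees (the helper names are illustrative, not from the paper):

```python
def has_config(tree, parent_label, child_label):
    """True if a node labeled parent_label immediately dominates a node
    labeled child_label anywhere in a (label, children...) tuple tree."""
    if isinstance(tree, str):
        return False
    if tree[0] == parent_label and any(
            not isinstance(c, str) and c[0] == child_label for c in tree[1:]):
        return True
    return any(has_config(c, parent_label, child_label)
               for c in tree[1:] if not isinstance(c, str))

def config_rate(trees, parent_label, child_label):
    """Fraction of trees containing the configuration at least once."""
    return sum(has_config(t, parent_label, child_label) for t in trees) / len(trees)

t1 = ("S", ("S", ("NP", "a"), ("VP", "b")), ("CC", "and"), ("S", ("VP", "c")))
t2 = ("S", ("NP", "a"), ("VP", "b"))
print(config_rate([t1, t2], "S", "S"))  # 0.5
```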

SLIDE 21

Discourse-level Coordination Ambiguity

Leaf Ancestor metric reference chains (Berkeley):
◮ score ∈ [0, 1]

  Score   # Gold   Chain
  0.696   34       S < S < VP < NP < PRP
  0.756   170      S < VP < NP < CC
  0.768   31       S < S < VP < S < VP < PP < IN
  0.796   86       S < S < VP < SBAR < IN
  0.804   52       S < S < NP < NN
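The Leaf Ancestor metric (Sampson, 2000) scores, for each terminal, the similarity of its root-to-leaf label chain (lineage) in the gold versus the test tree. A sketch over tuple-encoded trees, substituting `difflib`'s similarity ratio for the metric's exact edit-distance scheme:

```python
from difflib import SequenceMatcher

def lineages(tree, path=()):
    """Yield the root-to-leaf label chain for each terminal, left to right."""
    if isinstance(tree, str):
        yield path
    else:
        for child in tree[1:]:
            yield from lineages(child, path + (tree[0],))

def leaf_ancestor_scores(gold, test):
    """Per-leaf similarity in [0, 1] between gold and test lineages.
    Approximation: difflib ratio instead of the published edit distance."""
    return [SequenceMatcher(None, g, t).ratio()
            for g, t in zip(lineages(gold), lineages(test))]

gold = ("S", ("NP", "I"), ("VP", "saw"))
bad = ("S", ("NP", "I"), ("NP", "saw"))
print(leaf_ancestor_scores(gold, bad))  # [1.0, 0.5]
```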

SLIDE 23

Treebank Comparison

Compared ATB gross corpus statistics to:
◮ Chinese: CTB6
◮ English: WSJ sect. 2–23
◮ German: Negra

The ATB isn’t that unusual!

SLIDE 25

Corpus Features in Favor of the ATB

                                 ATB     CTB6    Negra   WSJ
  Non-terminal / terminal ratio  1.04    1.18    0.46    0.82
  OOV rate                       16.8%   22.2%   30.5%   13.2%
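Both statistics are cheap to compute from a treebank. A sketch over tuple-encoded trees; counting only phrasal nonterminals (excluding POS preterminals) is an assumption here, chosen because it is the only convention consistent with the sub-1.0 WSJ and Negra ratios above:

```python
def leaves(tree):
    """Yield the terminal strings of a (label, children...) tuple tree."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree[1:]:
            yield from leaves(child)

def count_nodes(tree):
    """Return (phrasal nonterminal count, terminal count);
    POS preterminals are counted in neither."""
    if isinstance(tree, str):
        return 0, 1
    if len(tree) == 2 and isinstance(tree[1], str):  # preterminal (POS tag)
        return 0, 1
    n, t = 1, 0
    for child in tree[1:]:
        cn, ct = count_nodes(child)
        n, t = n + cn, t + ct
    return n, t

def nt_terminal_ratio(trees):
    counts = [count_nodes(t) for t in trees]
    return sum(n for n, _ in counts) / sum(t for _, t in counts)

def oov_rate(train_trees, test_tokens):
    vocab = {w for tree in train_trees for w in leaves(tree)}
    return sum(w not in vocab for w in test_tokens) / len(test_tokens)
```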

SLIDE 27

Sentence Length Negatively Affects Parsing

Avg. sentence length:

  ATB    31.5
  CTB6   27.7
  Negra  17.2
  WSJ    23.8

A 40-word limit is not sufficient for evaluation!

SLIDE 30

Developing a Manually Annotated Grammar

Klein and Manning (2003)-style state splits:
◮ Human-interpretable
◮ Features can inform treebank revision

[Figure: NP rewrites over NN and DTNN daughters, with construct-state NPs relabeled NP-idafa]

Alternative: automatic splits (Berkeley parser)

SLIDE 32

Feature: markContainsVerb

[Figure: tree in which S, SBAR, and VP nodes dominating a verb are relabeled S-hasVerb, SBAR-hasVerb, VP-hasVerb]

◮ +1.18 F1 dev set improvement
◮ 16.1% of dev set trees lack verbs
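This feature can be implemented as a single bottom-up pass over each tree. A sketch over tuple-encoded trees; the verbal tag set is an illustrative subset of the shortened Bies tags, not necessarily the paper's exact equivalence class:

```python
VERB_TAGS = {"VB", "VBD", "VBP", "VBN"}  # illustrative subset of Bies tags

def mark_contains_verb(tree):
    """Bottom-up pass appending '-hasVerb' to phrasal nodes dominating a verb.
    Returns (new_tree, dominates_verb); preterminal labels are left unchanged."""
    if isinstance(tree, str):
        return tree, False
    if len(tree) == 2 and isinstance(tree[1], str):      # preterminal
        return tree, tree[0] in VERB_TAGS
    children, has_verb = [], False
    for child in tree[1:]:
        new_child, child_has = mark_contains_verb(child)
        children.append(new_child)
        has_verb = has_verb or child_has
    label = tree[0] + "-hasVerb" if has_verb else tree[0]
    return (label,) + tuple(children), has_verb

t = ("S", ("NP-SBJ", ("NN", "x")), ("VP", ("VBD", "y")))
print(mark_contains_verb(t)[0])
# ('S-hasVerb', ('NP-SBJ', ('NN', 'x')), ('VP-hasVerb', ('VBD', 'y')))
```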

SLIDE 34

Feature: splitCC

[Figure: CC under S coordination relabeled CC-S; CC under NP relabeled CC-noun]

◮ +0.21 F1 dev set improvement
◮ POS equivalence classes for verb, noun, adjective
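splitCC can likewise be written as a one-pass relabeling of CC preterminals by the category they coordinate. The parent-to-class mapping below is illustrative; the slide names verb/noun/adjective equivalence classes but does not list their members:

```python
CC_CLASS = {"S": "S", "NP": "noun", "ADJP": "adj", "VP": "verb"}  # illustrative

def split_cc(tree):
    """Relabel CC children by their parent's coarse category:
    CC under S becomes CC-S, CC under NP becomes CC-noun, etc."""
    if isinstance(tree, str) or (len(tree) == 2 and isinstance(tree[1], str)):
        return tree
    cls = CC_CLASS.get(tree[0])
    children = []
    for child in tree[1:]:
        if not isinstance(child, str) and child[0] == "CC" and cls:
            children.append(("CC-" + cls,) + tuple(child[1:]))
        else:
            children.append(split_cc(child))
    return (tree[0],) + tuple(children)

np = ("NP", ("NN", "a"), ("CC", "w"), ("NN", "b"))
print(split_cc(np))  # ('NP', ('NN', 'a'), ('CC-noun', 'w'), ('NN', 'b'))
```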

SLIDE 35

Gold Experimental Setup

◮ Evaluation on lengths ≤ 70
◮ Pre-processing makes a huge difference
  ◮ Maintained non-terminal / terminal ratio vs. prior work
◮ Models:
  ◮ Berkeley
  ◮ Bikel, with pre-tagged input
  ◮ Stanford, with the manual grammar

SLIDE 36

Learning Curves

[Figure: Evalb F1 on the development set vs. number of training trees (5,000 to 15,000) for the Berkeley, Stanford, and Bikel parsers; the F1 axis spans 75 to 85]

SLIDE 37

Model Comparison

Evalb F1:

             up to 70   All
  Berkeley   82.0       81.1
  Bikel      77.5       76.5
  Stanford   79.9       78.3

  Petrov (2009): 75.9

SLIDE 41

Raw Text Experimental Setup

◮ Pipeline: MADA + Stanford parser
◮ Lattice parsing, effective for Hebrew (Goldberg and Tsarfaty, 2008)
◮ Evaluation on gold (segmented) lengths ≤ 70
◮ Metric: Evalb without whitespace
  ◮ Requires exact character yield (Tsarfaty, 2006)
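When predicted segmentation differs from gold, token indices no longer align, so brackets have to be compared over character offsets of the whitespace-free yield, and the two trees must cover the same character string. A sketch of that idea (helper names are illustrative):

```python
def char_spans(tree, start=0):
    """Return (end, spans): spans are (label, char_start, char_end) over the
    whitespace-free character yield of a (label, children...) tuple tree."""
    if isinstance(tree, str):
        return start + len(tree), []
    end, spans = start, []
    for child in tree[1:]:
        end, sub = char_spans(child, end)
        spans.extend(sub)
    spans.append((tree[0], start, end))
    return end, spans

def same_yield(gold, test):
    """Evalb-style comparison is only defined when both trees cover the
    same character string, whitespace/segmentation ignored."""
    def chars(t):
        return t if isinstance(t, str) else "".join(chars(c) for c in t[1:])
    return chars(gold) == chars(test)

gold = ("NP", ("NN", "ktAb"), ("PRP", "hm"))   # segmented: "book" + "their"
test = ("NP", ("NN", "ktAbhm"))                # unsegmented
print(same_yield(gold, test))                  # True: both yield "ktAbhm"
# The NP span (0, 6) matches across trees despite different token counts.
```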

SLIDE 44

Lattice Parsing: “ABCD EFG”

[Figure: segmentation lattice over the input “ABCD EFG”; candidate arcs for the first token include A, BC, BCD, D, ABC, and ABCD, and for the second token EF, G, and EFG]
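A segmentation lattice packs all candidate analyses of each token into one graph over character positions, and the parser then searches over parse and segmentation jointly (Goldberg and Tsarfaty, 2008). A sketch of arc construction from hypothesized analyses; in a real system the analyses would come from a morphological analyzer, and the function name is illustrative:

```python
def lattice_arcs(token, analyses):
    """Return sorted (start, end, segment) arcs over character positions of
    `token`, given candidate segmentations that each cover the whole token."""
    arcs = set()
    for segmentation in analyses:
        pos = 0
        for seg in segmentation:
            arcs.add((pos, pos + len(seg), seg))
            pos += len(seg)
        if pos != len(token):
            raise ValueError("analysis does not cover the token")
    return sorted(arcs)

arcs = lattice_arcs("ABCD", [["ABCD"], ["A", "BCD"], ["ABC", "D"], ["A", "BC", "D"]])
print(arcs)
# [(0, 1, 'A'), (0, 3, 'ABC'), (0, 4, 'ABCD'), (1, 3, 'BC'), (1, 4, 'BCD'), (3, 4, 'D')]
```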

SLIDE 45

Parsing and Segmentation Results

Evalb F1 (dev set):

  Gold        81.1
  Pipeline    79.2
  Lattice     74.3
  Lattice+LM  76.0

SLIDE 47

Parsing and Segmentation Results

Segmentation accuracy (dev set):

  Gold        100.0
  Pipeline    97.7
  Lattice     94.1
  Lattice+LM  96.3

Comparable segmentation without the effort!

SLIDE 48

Conclusion

Evalb F1, all sentence lengths (Petrov, 2009):

  English              90.1
  Chinese              83.7
  Bulgarian            81.6
  German               80.1
  French               77.9
  Italian              75.6
  Arabic               75.8
  Arabic (this paper)  81.1

SLIDE 49

Thanks.

http://nlp.stanford.edu/projects/arabic.shtml

SLIDE 50

Mixture of Manual and Automatic Annotation

Evalb F1 (dev):

             up to 70   All
  Basic      82.4       80.9
  Annotated  82.2       80.8

SLIDE 51

Frequency Matched Strata

              Arabic               English
  Stratum     µ freq.   % nuclei   µ freq.   % nuclei
  2           2.00      46%        2.00      34%
  [3, 4]      3.37      26%        3.37      27%
  [5, 9]      6.43      16%        6.49      20%
  [10, 49]    18.6      10%        19.0      16%
  [50, 500]   110.3     2%         113       3%

SLIDE 52

Annotation Consistency Evaluation (Again!)

            Nuclei per tree   Sample n-grams   Error % (type)   Error % (n-gram)
  WSJ 2–21  0.565             750              16.0%            4.10%
  ATB       0.830             658              26.0%            4.10%

95% confidence intervals (type level):
◮ Arabic: [17.4%, 34.6%]
◮ English: [8.79%, 23.3%]
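The slide does not state how the intervals were constructed. A plain normal-approximation (Wald) binomial interval over a sample of 100 nuclei types, where both the interval choice and the sample size of 100 are assumptions here, reproduces the Arabic interval to rounding:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion;
    z = 1.96 gives a 95% interval."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = proportion_ci(26, 100)    # ATB type-level error rate of 26.0%
print(round(lo, 3), round(hi, 3))  # 0.174 0.346, i.e. [17.4%, 34.6%]
```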