Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing
Workshop on Statistical Parsing of Morphologically-Rich Languages, 12th International Conference on Parsing Technologies
Jinho D. Choi & Martha Palmer


SLIDE 1

Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing

Jinho D. Choi & Martha Palmer
University of Colorado at Boulder
choijd@colorado.edu

Workshop on Statistical Parsing of Morphologically-Rich Languages
12th International Conference on Parsing Technologies
October 6th, 2011

Thursday, October 6, 2011

SLIDE 2

Dependency Parsing in Korean

  • Why dependency parsing in Korean?
  • Korean is a flexible word order language.

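[Figure: phrase-structure tree (S, NP-SBJ, AP, NP-OBJ, VP) and dependency arcs (SBJ, ADV, OBJ) for the SOV construction "She still him loved", shown next to the scrambled order "Him she still *T* loved", where the fronted object NP-OBJ-1 leaves the trace *T* in the phrase-structure tree and the dependency labels remain SBJ, ADV, and OBJ.]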

SLIDE 3

Dependency Parsing in Korean

  • Why dependency parsing in Korean?
  • Korean is a flexible word order language.

  • Rich morphology makes dependency parsing easier.

[Figure: dependency arcs SBJ, ADV, and OBJ over "She still him loved".]

그녀 + 는: She + auxiliary particle
그 + 를: He + objective case marker

SLIDE 4

Dependency Parsing in Korean

  • Statistical dependency parsing in Korean
    • Sufficiently large training data is required.
    • Not much training data is available for Korean dependency parsing.
  • Constituent Treebanks in Korean
    • Penn Korean Treebank: 15K sentences.
    • KAIST Treebank: 30K sentences.
    • Sejong Treebank: 60K sentences.
      • The most recent and largest Treebank in Korean.
      • Contains Penn Treebank style constituent trees.

SLIDE 5

Sejong Treebank

  • Phrase structure
  • Including phrase tags, POS tags, and function tags.
  • Each token can be broken into several morphemes.

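[Figure: Sejong Treebank phrase-structure tree (S, NP-SBJ, AP, NP-OBJ, VP) for "She still him loved", with the morpheme-level analysis of each token (Korean surface forms omitted): She = NP+JX, still = MAG, him = NP+JKO, loved = NNG+XSV+EP+EF.]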

Tokens are mostly separated by white spaces.

SLIDE 6

Sejong Treebank

Phrase-level tags              Function tags
S     Sentence                 SBJ   Subject
Q     Quotative clause         OBJ   Object
NP    Noun phrase              CMP   Complement
VP    Verb phrase              MOD   Noun modifier
VNP   Copula phrase            AJT   Predicate modifier
AP    Adverb phrase            CNJ   Conjunctive
DP    Adnoun phrase            INT   Vocative
IP    Interjection phrase      PRN   Parenthetical

POS tags
NNG   General noun          MM    Adnoun               EP    Prefinal EM          JX   Auxiliary PR
NNP   Proper noun           MAG   General adverb       EF    Final EM             JC   Conjunctive PR
NNB   Bound noun            MAJ   Conjunctive adverb   EC    Conjunctive EM       IC   Interjection
NP    Pronoun               JKS   Subjective CP        ETN   Nominalizing EM      SN   Number
NR    Numeral               JKC   Complemental CP      ETM   Adnominalizing EM    SL   Foreign word
VV    Verb                  JKG   Adnominal CP         XPN   Noun prefix          SH   Chinese word
VA    Adjective             JKO   Objective CP         XSN   Noun DS              NF   Noun-like word
VX    Auxiliary predicate   JKB   Adverbial CP         XSV   Verb DS              NV   Predicate-like word
VCP   Copula                JKV   Vocative CP          XSA   Adjective DS         NA   Unknown word
VCN   Negation adjective    JKQ   Quotative CP         XR    Base morpheme        SF, SP, SS, SE, SO, SW

SLIDE 7

Dependency Conversion

  • Conversion steps
    • Find the head of each phrase using head-percolation rules.
      • All other nodes in the phrase become dependents of the head.
    • Re-direct dependencies for empty categories.
      • Empty categories are not annotated in the Sejong Treebank.
      • Skipping this step generates only projective dependency trees.
    • Label (automatically generated) dependencies.
  • Special cases
    • Coordination, nested function tags.

SLIDE 8

Dependency Conversion

  • Head-percolation rules
  • Achieved by analyzing each phrase in the Sejong Treebank.

S      r  VP;VNP;S;NP|AP;Q;*
Q      l  S|VP|VNP|NP;Q;*
NP     r  NP;S;VP;VNP;AP;*
VP     r  VP;VNP;NP;S;IP;*
VNP    r  VNP;NP;S;*
AP     r  AP;VP;NP;S;*
DP     r  DP;VP;*
IP     r  IP;VNP;*
X|L|R  r  *

Korean is a head-final language.
No rules to find the head morpheme of each token.
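
To illustrate how head-percolation rules can drive the conversion, here is a minimal Python sketch. It is not the authors' implementation. It assumes the usual reading of the rule table: `r` or `l` gives the search direction (rightmost or leftmost child first), `;` separates priority groups, `|` separates equally ranked tags, and `*` matches any tag. The tree encoding, the bracketing of the example sentence, and the names `HEAD_RULES`, `find_head`, and `to_dependencies` are hypothetical.

```python
# Head-percolation rules transcribed from the slide: tag -> (direction, priorities).
HEAD_RULES = {
    "S":   ("r", "VP;VNP;S;NP|AP;Q;*"),
    "Q":   ("l", "S|VP|VNP|NP;Q;*"),
    "NP":  ("r", "NP;S;VP;VNP;AP;*"),
    "VP":  ("r", "VP;VNP;NP;S;IP;*"),
    "VNP": ("r", "VNP;NP;S;*"),
    "AP":  ("r", "AP;VP;NP;S;*"),
    "DP":  ("r", "DP;VP;*"),
    "IP":  ("r", "IP;VNP;*"),
    "X":   ("r", "*"),
    "L":   ("r", "*"),
    "R":   ("r", "*"),
}


def find_head(phrase_tag, child_tags):
    """Return the index of the head child (assumed rule semantics, see above)."""
    direction, priorities = HEAD_RULES.get(phrase_tag, ("r", "*"))
    indices = range(len(child_tags))
    order = list(reversed(indices)) if direction == "r" else list(indices)
    for group in priorities.split(";"):
        wanted = set(group.split("|"))
        for i in order:
            if "*" in wanted or child_tags[i] in wanted:
                return i
    raise ValueError("phrase has no children")


def to_dependencies(node, arcs):
    """Collect (dependent, head) token pairs; return the head token of `node`.

    A node is (tag, token_index) for a token or (tag, [child nodes]) for a phrase.
    """
    tag, body = node
    if isinstance(body, int):
        return body
    heads = [to_dependencies(child, arcs) for child in body]
    # Function tags such as -SBJ are stripped before rule lookup (an assumption).
    child_tags = [child[0].split("-")[0] for child in body]
    head = heads[find_head(tag.split("-")[0], child_tags)]
    arcs.extend((dep, head) for dep in heads if dep != head)
    return head


# "She still him loved": tokens 0..3, with an assumed bracketing.
tree = ("S", [("NP-SBJ", 0),
              ("VP", [("AP", 1),
                      ("VP", [("NP-OBJ", 2), ("VP", 3)])])])
arcs = []
root = to_dependencies(tree, arcs)
print(root, sorted(arcs))
```

Because every rule except Q searches from the right, the verb at the end of the sentence ends up as the head of the other three tokens, which matches the head-final observation above.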

SLIDE 9

Dependency Conversion

  • Dependency labels
  • Labels retained from the function tags.
  • Labels inferred from constituent relations.

[Figure: phrase-structure tree (S, NP-SBJ, AP, NP-OBJ, VP) and dependency arcs (SBJ, ADV, OBJ) for "She still him loved".]

input : (c, p), where c is a dependent of p.
output: A dependency label l as c ←l− p.
begin
  if p = root then ROOT → l
  elif c.pos = AP then ADV → l
  elif p.pos = AP then AMOD → l
  elif p.pos = DP then DMOD → l
  elif p.pos = NP then NMOD → l
  elif p.pos = VP|VNP|IP then VMOD → l
  else DEP → l
end

Algorithm 1: Getting inferred labels.
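
Algorithm 1 translates almost line by line into Python. The sketch below is an illustration under stated assumptions: `inferred_label` follows the algorithm as shown, while `dependency_label` adds the "labels retained from the function tags" case by giving an existing function tag precedence over the inferred label. That precedence, the use of `None` for the root, and both function names are assumptions of this sketch.

```python
def inferred_label(c_pos, p_pos):
    """Algorithm 1: label for dependent c of head p; p_pos is None when p is the root."""
    if p_pos is None:
        return "ROOT"
    if c_pos == "AP":
        return "ADV"
    if p_pos == "AP":
        return "AMOD"
    if p_pos == "DP":
        return "DMOD"
    if p_pos == "NP":
        return "NMOD"
    if p_pos in ("VP", "VNP", "IP"):
        return "VMOD"
    return "DEP"


def dependency_label(c_tag, p_tag):
    """Retain the function tag of c if it has one, else fall back to Algorithm 1.

    Giving the function tag precedence is an assumption of this sketch.
    """
    c_pos, _, function_tag = c_tag.partition("-")
    if function_tag:
        return function_tag
    p_pos = p_tag.split("-")[0] if p_tag is not None else None
    return inferred_label(c_pos, p_pos)


# Labels for "She still him loved", where every token depends on "loved" (VP).
for c_tag in ("NP-SBJ", "AP", "NP-OBJ"):
    print(c_tag, dependency_label(c_tag, "VP"))
print("VP", dependency_label("VP", None))
```

For the running example this yields SBJ and OBJ from the function tags, ADV from the AP rule, and ROOT for the verb.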

SLIDE 10

Dependency Conversion

  • Coordination
    • Previous conjuncts as dependents of the following conjuncts.
  • Nested function tag
    • Nodes with nested f-tags become the heads of the phrases.

[Figure: phrase-structure tree and dependency arcs (CNJ, CNJ, SBJ, OBJ) for "I_and he_and she home left", where the coordinated subject is built from NP-CNJ and NP-SBJ nodes nested under NP-SBJ.]
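
The coordination rule can be sketched in a few lines of Python. This is only an illustration of "previous conjuncts as dependents of the following conjuncts"; the function name `chain_conjuncts` and the token indices are hypothetical.

```python
def chain_conjuncts(conjunct_heads):
    """Attach each conjunct to the conjunct that follows it, labeled CNJ."""
    return [(dep, head, "CNJ")
            for dep, head in zip(conjunct_heads, conjunct_heads[1:])]


# "I_and he_and she": tokens 0, 1, 2 are the heads of the three conjuncts.
print(chain_conjuncts([0, 1, 2]))
```

The last conjunct ("she") receives no CNJ arc here, so it stays free to attach to the verb as SBJ.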

SLIDE 11

Dependency Parsing

  • Dependency parsing algorithm
    • Transition-based, non-projective parsing algorithm (Choi & Palmer, 2011).
    • Performs transitions from both projective and non-projective dependency parsing algorithms selectively.
    • Linear time parsing speed in practice for non-projective trees.
  • Machine learning algorithm
    • Liblinear L2-regularized L1-loss support vector classification.

Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of ACL:HLT’11.

SLIDE 12

Dependency Parsing

  • Feature selection
    • Each token consists of multiple morphemes (up to 21).
    • POS tag feature of each token?
      • (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)
      • Sparse information vs. lack of information.

[Figure: example tokens with their morpheme analyses (Korean surface forms omitted): Nakrang + Princess + JX = NNP+NNG+JX; Hodong + Prince + JKO = NNP+NNG+JKO; Love + XSV + EP + EF + . = NNG+XSV+EP+EF+SF.]

  • Happy medium?
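
The trade-off between the two feature designs can be made concrete with a short Python sketch. It is illustrative only: the second token `other` is a hypothetical variant without the prefinal ending, and both function names are made up for this example.

```python
def joined_pos_feature(tags):
    """One feature for the whole tag sequence: precise but sparse."""
    return "&".join(tags)


def individual_pos_features(tags):
    """One feature per tag: dense, but order and co-occurrence are lost."""
    return set(tags)


loved = ["NNG", "XSV", "EP", "EF", "SF"]
other = ["NNG", "XSV", "EF", "SF"]  # hypothetical token without the prefinal ending

print(joined_pos_feature(loved), joined_pos_feature(other))
print(individual_pos_features(loved) & individual_pos_features(other))
```

The joined features of the two tokens do not match at all, while the individual features overlap in four tags yet no longer say which morphemes belong together.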

SLIDE 13

Dependency Parsing

  • Morpheme selection

FS  The first morpheme
LS  The last morpheme before JO|DS|EM
JK  Particles (J* in Table 1)
DS  Derivational suffixes (XS* in Table 1)
EM  Ending markers (E* in Table 1)
PY  The last punctuation, only if there is no other morpheme followed by the punctuation

[Figure: morphemes selected from each example token (Korean surface forms omitted).]

Token                                           Selected morphemes
Nakrang + Princess + JX  (NNP+NNG+JX)           NNP, NNG, JX
Hodong + Prince + JKO  (NNP+NNG+JKO)            NNP, NNG, JKO
Love + XSV + EP + EF + .  (NNG+XSV+EP+EF+SF)    NNG, XSV, EF, SF
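
The selection scheme can be sketched in Python as follows. This is a reading of the slide, not the authors' code, and it rests on several assumptions: the tags SF, SP, SS, SE, SO, and SW are treated as punctuation; "JO" is read as the particle class JK; when a token has several morphemes of one class, the last one wins (so EF is chosen over EP); and LS is left out when it would coincide with FS. The name `select_morphemes` is hypothetical.

```python
PUNCTUATION = {"SF", "SP", "SS", "SE", "SO", "SW"}  # assumed punctuation tags


def affix_class(tag):
    """Map a POS tag to JK, DS, or EM, or None for any other morpheme."""
    if tag.startswith("J"):
        return "JK"
    if tag.startswith("XS"):
        return "DS"
    if tag.startswith("E"):
        return "EM"
    return None


def select_morphemes(tags):
    """Pick the FS/LS/JK/DS/EM/PY morphemes from a token's POS tag sequence."""
    selected = {"FS": tags[0]}
    # Set a trailing punctuation mark aside so that it is never chosen as LS.
    has_final_punct = tags[-1] in PUNCTUATION
    body = tags[:-1] if has_final_punct and len(tags) > 1 else tags
    # Position of the first particle, derivational suffix, or ending marker.
    first_affix = next(
        (i for i, tag in enumerate(body) if affix_class(tag)), len(body))
    if first_affix > 1:
        selected["LS"] = body[first_affix - 1]
    for tag in body[first_affix:]:
        cls = affix_class(tag)
        if cls:
            selected[cls] = tag  # a later morpheme of the same class overrides
    if has_final_punct:
        selected["PY"] = tags[-1]
    return selected


print(select_morphemes(["NNP", "NNG", "JKO"]))
print(select_morphemes(["NNG", "XSV", "EP", "EF", "SF"]))
```

Under these assumptions the sketch reproduces the selections in the table above: NNP, NNG, JKO for the first call and NNG, XSV, EF, SF for the second.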
SLIDE 14

Dependency Parsing

  • Feature extraction
    • Extract features using only important morphemes.
    • Individual POS tag features of the 1st and 3rd tokens: NNP1, NNG1, JK1, NNG3, XSV3, EF3
    • Joined features of POS tags between the 1st and 3rd tokens: NNP1_NNG3, NNP1_XSV3, NNP1_EF3, JK1_NNG3, JK1_XSV3
    • Tokens used: w_i, w_j, w_{i±1}, w_{j±1}

[Figure: the three example tokens and their selected morphemes, repeated from the morpheme selection slide.]
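
The feature templates above can be sketched in Python. This is a hedged illustration: the slide lists only some of the joined features, so pairing every selected morpheme of one token with every selected morpheme of the other is an assumption, as is writing particles with the class name JK (following JK1 on the slide). The function names and the hand-typed input lists are hypothetical.

```python
from itertools import product


def individual_features(selected, position):
    """One feature per selected morpheme, suffixed with the token position."""
    return [f"{tag}{position}" for tag in selected]


def joined_features(selected_a, position_a, selected_b, position_b):
    """Pair every selected morpheme of one token with every one of the other."""
    return [f"{a}{position_a}_{b}{position_b}"
            for a, b in product(selected_a, selected_b)]


first = ["NNP", "NNG", "JK"]   # selected morphemes of the 1st token
third = ["NNG", "XSV", "EF"]   # selected morphemes of the 3rd token

print(individual_features(first, 1) + individual_features(third, 3))
print(joined_features(first, 1, third, 3))
```

The individual features match the six listed on the slide, and the five joined features shown there are a subset of the nine pairs this sketch generates.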
SLIDE 15

Experiments

  • Corpora
    • Dependency trees converted from the Sejong Treebank.
    • Consists of 20 sources in 6 genres.
      • Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME), Informative Book (IB), and Educational Cartoon (EC).
    • Evaluation sets are very diverse compared to training sets.
      • Ensures the robustness of our parsing models.

# of sentences in each set

     NP     MZ     FI      ME     IB     EC
T    8,060  6,713  15,646  5,053  7,983  1,548
D    2,048  -      2,174   -      1,307  -
E    2,048  -      2,175   -      1,308  -

SLIDE 16

Experiments

  • Morphological analysis
    • Two automatic morphological analyzers are used.
    • Intelligent Morphological Analyzer
      • Developed by the Sejong project.
      • Provides the same morphological analysis as their Treebank.
      • Considered as fine-grained morphological analysis.
    • Mach (Shim and Yang, 2002)
      • Analyzes 1.3M words per second.
      • Provides more coarse-grained morphological analysis.

Kwangseob Shim & Jaehyung Yang. 2002. A Supersonic Korean Morphological Analyzer. In Proceedings of COLING’02.

SLIDE 17

Experiments

  • Evaluations
    • Gold-standard vs. automatic morphological analysis.
      • Relatively low performance from the automatic system.
    • Fine vs. coarse-grained morphological analysis.
      • Differences are not too significant.
    • Robustness across different genres.

        Gold, fine-grained      Auto, fine-grained      Auto, coarse-grained
        LAS    UAS    LS        LAS    UAS    LS        LAS    UAS    LS
NP      82.58  84.32  94.05     79.61  82.35  91.49     79.00  81.68  91.50
FI      84.78  87.04  93.70     81.54  85.04  90.95     80.11  83.96  90.24
IB      84.21  85.50  95.82     80.45  82.14  92.73     81.43  83.38  93.89
Avg.    83.74  85.47  94.57     80.43  83.01  91.77     80.14  82.89  91.99
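
For readers unfamiliar with the three metrics, the sketch below shows how they are commonly computed: LAS (labeled attachment score) counts tokens whose head and label are both correct, UAS (unlabeled attachment score) counts correct heads, and LS (label accuracy score) counts correct labels. The slides do not spell out these definitions, so this is the standard reading rather than the authors' evaluation script; the function name and the toy predictions are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return LAS, UAS, and LS in percent for aligned lists of (head, label)."""
    total = len(gold)
    pairs = list(zip(gold, predicted))
    las = sum(g == p for g, p in pairs)
    uas = sum(g[0] == p[0] for g, p in pairs)
    ls = sum(g[1] == p[1] for g, p in pairs)
    return {"LAS": 100.0 * las / total,
            "UAS": 100.0 * uas / total,
            "LS": 100.0 * ls / total}


# "She still him loved": head index and label per token, -1 marks the root.
gold = [(3, "SBJ"), (3, "ADV"), (3, "OBJ"), (-1, "ROOT")]
predicted = [(3, "SBJ"), (2, "ADV"), (3, "SBJ"), (-1, "ROOT")]
print(attachment_scores(gold, predicted))
```

In this toy case one head and one label are wrong on different tokens, so UAS and LS are both 75 while LAS drops to 50.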

SLIDE 18

Conclusion

  • Contributions
    • Generating a Korean Dependency Treebank.
    • Selecting important morphemes for dependency parsing.
    • Evaluating the impact of fine vs. coarse-grained morphological analysis on dependency parsing.
    • Evaluating the robustness across different genres.
  • Future work
    • Increase the feature span beyond bigrams.
    • Find head morphemes of individual tokens.
    • Insert empty categories.

SLIDE 19

Acknowledgements

  • Special thanks are due to
    • Professor Kong Joo Lee of Chungnam National University.
    • Professor Kwangseob Shim of Sungshin Women’s University.
  • We gratefully acknowledge the support of the National Science Foundation Grants CISE-IIS-RI-0910992, Richer Representations for Machine Translation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
