CroMo - Morphological Analysis for Croatian



SLIDE 1

CroMo

  • D. Ćavar

Outline Introduction Model Evaluation Comments

CroMo - Morphological Analysis for Croatian

Damir Ćavar (1), Ivo-Pavao Jazbec (2) and Tomislav Stojanov (2)

(1) Linguistics Department, University of Zadar
(2) Institute of Croatian Language and Linguistics

FSMNLP 2008

SLIDE 2

1. Introduction
2. Model
3. Evaluation
4. Comments

SLIDE 3

Scenario

Synchronic and diachronic study of language change and acquisition models

Language data from a long period of time and from three major dialects in Croatia, implying:

  • Variation wrt. e.g. string-based morphology or feature bundles
  • Ongoing discovery wrt. string combinatorics and features

Research questions require quantitative and qualitative information:

  • on phonological, morphological, syntactic and semantic tokens and feature bundles, and their correlation and variation at various stages over time

SLIDE 4

Morphological segmentation and annotation

and lemmatization, and . . .

Segmenting words:

isponapijali su se “they got drunk a little bit to satisfaction”
is – po – napija – li

Annotating segments:

aspect prefix – aspect prefix – from stem-lemma napiti – plural participle

Extending the annotation:

to a certain saturation – a little bit – “get drunk” from root-lemma piti – past event
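The three annotation layers above can be held in one record per segment. The following is only a minimal sketch: the `Morph` class and its field names are hypothetical, while the segments and labels are copied from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morph:
    form: str      # surface segment
    category: str  # label from "Annotating segments"
    gloss: str     # label from "Extending the annotation"


# Segmentation of "isponapijali" exactly as given on the slide.
ISPONAPIJALI = (
    Morph("is", "aspect prefix", "to a certain saturation"),
    Morph("po", "aspect prefix", "a little bit"),
    Morph("napija", "from stem-lemma napiti", '"get drunk" from root-lemma piti'),
    Morph("li", "plural participle", "past event"),
)


def surface(morphs):
    """Concatenate the segments back into the word-form."""
    return "".join(m.form for m in morphs)


def segmented(morphs, sep=" – "):
    """Render the segmentation in the notation used on the slide."""
    return sep.join(m.form for m in morphs)
```

Keeping form, category and gloss side by side makes it easy to query any one layer without re-segmenting the word.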

SLIDE 5

FSA Architecture

Mapping of morph-groups to DFSAs (Mealy or Moore machine):

[Figure: three DFSAs, one per morph-group.
  Verb roots: transitions over č, p, š, i, e, t, a; emissions "v root (-index" and "v root )-index".
  Verb prefixes: transitions over n, a, p, o; emissions "v pref (-index" and "v pref )-index asp".
  Verb suffixes: transitions over m, š, ε, t, j, e, u, o; emissions "v suf (-index" and "v suf )-index" with one of: pres 1st sg, pres 2nd sg, pres 3rd sg, 2nd sg imper, pres 1st pl, pres 2nd pl, pres 3rd pl.]
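One way to picture the mapping from a morph-group to a DFSA is a trie whose accepting states carry the emission (a Moore-style reading). This is a sketch, not the CroMo code. The suffix table is an assumption: it uses the usual present-tense and imperative endings of a Croatian verb in -ati, which is consistent with the symbols in the figure but is not spelled out on the slide. `build_dfsa` and `run` are hypothetical names.

```python
# Assumed morph-group table: surface form -> feature label (labels as on the slide).
VERB_SUFFIXES = {
    "m": "pres 1st sg",
    "š": "pres 2nd sg",
    "": "pres 3rd sg",
    "mo": "pres 1st pl",
    "te": "pres 2nd pl",
    "ju": "pres 3rd pl",
    "j": "2nd sg imper",
}


def build_dfsa(table, group):
    """Compile a morph-group into a trie-shaped DFSA.

    Returns (transitions, finals): transitions maps (state, symbol) -> state,
    finals maps an accepting state -> its closing emission.
    """
    transitions, finals, next_state = {}, {}, 1
    for form, features in table.items():
        state = 0
        for symbol in form:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        finals[state] = f"{group} )-index {features}"
    return transitions, finals


def run(dfsa, group, text):
    """Return [opening emission, closing emission], or None if text is rejected."""
    transitions, finals = dfsa
    state = 0
    for symbol in text:
        state = transitions.get((state, symbol))
        if state is None:
            return None
    if state not in finals:
        return None
    return [f"{group} (-index", finals[state]]
```

For example, `run(build_dfsa(VERB_SUFFIXES, "v suf"), "v suf", "mo")` walks two transitions and ends in a state labelled "pres 1st pl".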

SLIDE 6

FSA Architecture

  • Mapping ambiguity on emission: emission tuple 1 to n
  • Label DFSAs with variable names
  • Use rules referring to variable names for modeling of morphotactic regularities:

verbAspectPrefs* . verbAtiRoots . verbInflSuf
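To illustrate how such a rule combines named groups, the sketch below compiles it into a Python regular expression. This stands in for the Ragel automaton and is not how CroMo is implemented. The group contents are hypothetical examples, `compile_rule` is a hypothetical helper, and a rule this simple may accept strings that are not actual Croatian words.

```python
import re

# Hypothetical morph-groups, keyed by the variable names used in the rule.
GROUPS = {
    "verbAspectPrefs": ["na", "po", "is"],
    "verbAtiRoots": ["čita", "pjeva"],
    "verbInflSuf": ["m", "š", "mo", "te", "ju", "j", ""],
}


def compile_rule(rule, groups):
    """Translate 'a* . b . c' into a regex: '.' is concatenation, '*' is repetition."""
    parts = []
    for term in rule.split("."):
        term = term.strip()
        starred = term.endswith("*")
        name = term.rstrip("*")
        # Longest alternatives first, so longer morphs are tried before shorter ones.
        alternatives = sorted(groups[name], key=len, reverse=True)
        body = "|".join(re.escape(a) for a in alternatives)
        parts.append(f"(?:{body}){'*' if starred else ''}")
    return re.compile("".join(parts))


VERB = compile_rule("verbAspectPrefs* . verbAtiRoots . verbInflSuf", GROUPS)
```

`VERB.fullmatch("čitamo")` succeeds with zero prefixes, while a string that starts with a suffix, such as "mčita", is rejected.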

SLIDE 7

FSA Architecture

Generating potentially cyclic DFSAs:

[Figure: the prefix, root and suffix DFSAs joined by ε-transitions into a single automaton with states 1 to 19.
  Prefix part: transitions over n, a, p, o; emissions "v pref (-index" and "v pref )-index asp".
  Root part: transitions over č, p, š, i, e, t, a; emissions "v root (-index" and "v root )-index".
  Suffix part: transitions over m, š, ε, t, j, e, u, o; emissions "v suf (-index" and "v suf )-index" with one of: pres 1st sg, pres 2nd sg, pres 3rd sg, 2nd sg imper, pres 1st pl, pres 2nd pl, pres 3rd pl.]
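The paths through a combined automaton of this shape (repeatable prefixes, then a root, then a suffix) can be enumerated with a small backtracking walk. This is a sketch of the path structure only, not a compiled DFSA. The stage contents are hypothetical and `analyses` is a hypothetical helper; the cyclic flag plays the role of the ε-transition back to the start of a group.

```python
# (group label, morphs, cyclic?) in the order the groups are chained.
STAGES = [
    ("v pref", ["na", "po", "is"], True),
    ("v root", ["čita"], False),
    ("v suf", ["m", "š", "", "mo", "te", "ju", "j"], False),
]


def analyses(word, stages, i=0):
    """Yield every labelled segmentation of word licensed by the stage sequence."""
    if i == len(stages):
        if word == "":
            yield []
        return
    label, morphs, cyclic = stages[i]
    for morph in morphs:
        if not word.startswith(morph):
            continue
        if cyclic and morph == "":
            continue  # an empty morph in a loop would never terminate
        rest = word[len(morph):]
        # A cyclic group stays in the same stage after a match.
        next_stage = i if cyclic else i + 1
        for tail in analyses(rest, stages, next_stage):
            yield [(label, morph)] + tail
    if cyclic:
        # Leaving the loop: zero further repetitions of this group.
        yield from analyses(word, stages, i + 1)
```

`list(analyses("ponačitaju", STAGES))` yields a single path with two prefix segments, the root and the suffix "ju".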

SLIDE 8

FSA Architecture

Ambiguity mapped on emission tuple:

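[Figure: DFSA over g, o, r, e (states 1 to 4) with emission tuples.
  Start of root: [ v root (-index; adj root (-index ]
  End of root: [ v root )-index; adj root )-index ]
  Start of suffix: [ v suff (-index; adj suff (-index ]
  End of suffix: [ v suff )-index; adj suff )-index ]]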
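The idea can be sketched as a merge step: when two morph-groups share a string, the automaton stays deterministic and the shared string simply carries a tuple of labels. In this sketch `merge_groups` is a hypothetical helper, "gor" as both verb and adjective root follows the figure, and "čita" is an added illustrative entry.

```python
def merge_groups(*groups):
    """Merge labelled morph lists into one table: form -> tuple of 1 to n labels."""
    table = {}
    for label, forms in groups:
        for form in forms:
            table[form] = table.get(form, ()) + (label,)
    return table


ROOT_TABLE = merge_groups(
    ("v root", ["gor", "čita"]),
    ("adj root", ["gor"]),
)
```

Here `ROOT_TABLE["gor"]` holds both readings, while `ROOT_TABLE["čita"]` is a tuple of length one.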

SLIDE 9

FSA Architecture

Lemmatization as a rule:

  • Rightmost root is the semantic head
  • Root-lemma: generate canonical word-form from the right-most root
      neprijatelja → ne + prijatelj + a → NEG + N-root + ACC
      “not friend” = “enemy” ⇒ ¬friend
      not compositional! but useful for semantic field analysis!
      root-lemma: neprijatelja → prijatelj
  • Stem/base-lemma: generate canonical word-form from the stem without inflectional suffixes
      base-lemma: neprijatelja → neprijatelj
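Both lemma types can be read off a segmented and tagged word. The sketch below works for this example under one simplifying assumption: the canonical form equals the bare root or stem, which holds for prijatelj but would need a generation step in general. The tag names are those on the slide; the function names are hypothetical.

```python
# Segmentation of "neprijatelja" as given on the slide: (form, tag) pairs.
NEPRIJATELJA = [("ne", "NEG"), ("prijatelj", "N-root"), ("a", "ACC")]


def root_lemma(segments):
    """Lemma taken from the right-most root, the semantic head."""
    roots = [form for form, tag in segments if tag.endswith("root")]
    return roots[-1]


def base_lemma(segments, inflectional=("ACC",)):
    """Lemma taken from the stem, i.e. the word without inflectional suffixes."""
    return "".join(form for form, tag in segments if tag not in inflectional)
```

The root-lemma groups neprijatelja with prijatelj for semantic field analysis, while the base-lemma keeps the negated stem neprijatelj.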

SLIDE 10

FSA Architecture

Lemmatization (Hack):

  • emission of byte-offset for suffix-elimination
  • pointer to suffix string

Clean solution:

1 g:(g, (FV, ...)) 2 o:(o, ()) 3 r:(r, ()) 4 e:(a, (FV, ...))
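Read as a transducer, each transition in the diagram pairs an input symbol with an output symbol and a feature tuple, so the output side spells the lemma. The sketch below follows that reading. The end state 5 is an assumption, since the slide does not number it, and the elided feature vector is kept as a literal `...` placeholder.

```python
# (state, input symbol) -> (next state, (output symbol, feature tuple))
TRANSITIONS = {
    (1, "g"): (2, ("g", ("FV", ...))),  # "..." stands for the elided features
    (2, "o"): (3, ("o", ())),
    (3, "r"): (4, ("r", ())),
    (4, "e"): (5, ("a", ("FV", ...))),
}
FINAL = {5}  # assumed: the slide does not number the end state


def transduce(word, start=1):
    """Return (output string, non-empty feature tuples), or None if rejected."""
    state, output, features = start, [], []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return None
        state, (out_symbol, feats) = TRANSITIONS[(state, symbol)]
        output.append(out_symbol)
        if feats:
            features.append(feats)
    if state not in FINAL:
        return None
    return "".join(output), features
```

With these transitions, `transduce("gore")` produces the output string "gora" together with the two feature tuples; no byte offsets or suffix pointers are needed.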

SLIDE 11

Implementation

  • C++ wrapper for final application
  • Ragel code (automaton definition) generated from morpheme DBs and rules, with associated feature bundles (extended version of Ragel, ≥ v. 6.1, that handles ambiguity by introducing multiple emission symbols, i.e. emission tuples)
  • Ragel-generated C code (jump-code)

[Figure: build pipeline. Morpheme tables and rules → Ragel code → C code and DOT → binary.]

SLIDE 12

Implementation

  • Emission (feature bundles): as one bit-vector
  • Features mapped from the General Ontology for Linguistic Description (GOLD, an upper ontology)
      possibility: reasoning over linguistic concepts and features
  • Optimization: mapping of concepts and their relations on a compressed bit-vector, maintaining inheritance and implicatures

[Figure labels: top-node, concept, sub-class, terminal-classes]
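How a bit-vector can preserve inheritance is easiest to see in an uncompressed form: give every concept its own bit and OR in the bits of its ancestors, so that "is-a" becomes a mask test. This is a simplified sketch, not the compressed encoding used in CroMo. The concept names are hypothetical and not taken from GOLD.

```python
# Hypothetical mini-ontology: concept -> parent (None marks the top node).
PARENT = {
    "Feature": None,
    "PartOfSpeech": "Feature",
    "Verb": "PartOfSpeech",
    "Noun": "PartOfSpeech",
    "Tense": "Feature",
    "Present": "Tense",
}


def encode(parent):
    """Give each concept its own bit plus all bits of its ancestors."""
    own = {concept: 1 << i for i, concept in enumerate(parent)}
    codes = {}

    def code(concept):
        if concept not in codes:
            p = parent[concept]
            codes[concept] = own[concept] | (code(p) if p else 0)
        return codes[concept]

    for concept in parent:
        code(concept)
    return codes


CODES = encode(PARENT)


def is_a(bundle, concept):
    """True if the feature bundle contains the concept (directly or by inheritance)."""
    return (bundle & CODES[concept]) == CODES[concept]
```

A bundle such as `CODES["Verb"] | CODES["Present"]` then answers both `is_a(bundle, "PartOfSpeech")` and `is_a(bundle, "Tense")` with a single AND and compare.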

SLIDE 13

Evaluation

  • Hardware: dual core 2.4 GHz
  • Lexical base: 120,000 morphemes (and allomorphs)
  • Speed: approx. 50,000 tokens per second with average morpheme count of 2.5 per token
  • Size: binary footprint approx. 5 MB
  • Compilation (tables → Ragel + C; Ragel → C + DOT; gcc → bin): approx. 5 minutes, min. 4 GB RAM for monolithic architecture
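The speed figure can be restated per morpheme and per token. The short calculation below uses only the two approximate numbers given above; the derived values are correspondingly approximate.

```python
TOKENS_PER_SECOND = 50_000   # approximate, from the slide
MORPHEMES_PER_TOKEN = 2.5    # average, from the slide

morphemes_per_second = TOKENS_PER_SECOND * MORPHEMES_PER_TOKEN
microseconds_per_token = 1_000_000 / TOKENS_PER_SECOND

print(f"{morphemes_per_second:,.0f} morphemes/s, {microseconds_per_token:.0f} µs per token")
```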

SLIDE 14

Comments

Interoperability issues addressed:

  • GOLD
  • platform independent code
  • code-page independence

Extensible (turnaround time of some minutes)
Minimally invasive and minimalistic
Open-source