CroMo - Morphological Analysis for Croatian

  1. CroMo - Morphological Analysis for Croatian
     Damir Ćavar (1), Ivo-Pavao Jazbec (2) and Tomislav Stojanov (2)
     (1) Linguistics Department, University of Zadar
     (2) Institute of Croatian Language and Linguistics
     FSMNLP 2008

  2. Outline
     1 Introduction
     2 Model
     3 Evaluation
     4 Comments

  3. Scenario
     Synchronic and diachronic study of language change and acquisition models
     Language data from a long period of time, and from three major dialects in Croatia, implying:
     - Variation with respect to, e.g., string-based morphology or feature bundles
     - Ongoing discovery with respect to string combinatorics and features
     Research questions require quantitative and qualitative information on phonological, morphological, syntactic and semantic tokens and feature bundles, and on their correlation and variation at various stages over time

  4. Morphological segmentation and annotation and lemmatization, and . . .
     Segmenting words:
       isponapijali su se “they got drunk a little bit to satisfaction”
       is – po – napija – li
     Annotating segments:
       aspect prefix – aspect prefix – from stem-lemma napiti – plural participle
     Extending the annotation:
       to a certain saturation – a little bit – “get drunk” from root-lemma piti – past event

  5. FSA Architecture
     Mapping of morph-groups to DFSAs (Mealy or Moore machine):
     [Figure: DFSAs for verbal aspect prefixes, verb roots and verbal inflectional suffixes. Emission labels shown: v pref (-index; v pref )-index asp; v root (-index; v root )-index; v suf (-index; v suf )-index with pres 1st sg, pres 2nd sg, pres 3rd sg, pres 1st pl, pres 2nd pl, pres 3rd pl, or 2nd sg imper.]

  6. FSA Architecture
     - Mapping ambiguity on emission: emission tuple 1 to n
     - Label DFSAs with variable names
     - Use rules referring to variable names for modeling of morphotactic regularities:
         verbAspectPrefs* . verbAtiRoots . verbInflSuf
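The rule `verbAspectPrefs* . verbAtiRoots . verbInflSuf` reads like a regular expression over named morph groups. As a minimal sketch of that idea (not CroMo's implementation, which compiles the groups and rules into Ragel automata), the following uses a plain backtracking search. The morph lists, the `RULE` encoding and the function name `analyze` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Toy morph groups; the entries are illustrative assumptions,
# not taken from CroMo's morpheme tables.
GROUPS: Dict[str, List[str]] = {
    "verbAspectPrefs": ["po", "za", "na"],
    "verbAtiRoots": ["pjev", "gled", "čit"],
    "verbInflSuf": ["am", "aš", "a", "amo", "ate", "aju"],
}

# The slide's rule as (group name, may repeat zero or more times).
RULE: List[Tuple[str, bool]] = [
    ("verbAspectPrefs", True),   # verbAspectPrefs*
    ("verbAtiRoots", False),     # . verbAtiRoots
    ("verbInflSuf", False),      # . verbInflSuf
]

Segment = Tuple[str, str]  # (morph, group name)


def analyze(word: str, pos: int = 0, step: int = 0) -> List[List[Segment]]:
    """Return every segmentation of word[pos:] licensed by RULE[step:]."""
    if step == len(RULE):
        # The rule is used up: succeed only if the whole word was consumed.
        return [[]] if pos == len(word) else []
    group, starred = RULE[step]
    results: List[List[Segment]] = []
    if starred:
        # Zero occurrences of a starred group: move on to the next element.
        results.extend(analyze(word, pos, step + 1))
    for morph in GROUPS[group]:
        if word.startswith(morph, pos):
            # A starred group may match again; a plain group matches once.
            next_step = step if starred else step + 1
            for rest in analyze(word, pos + len(morph), next_step):
                results.append([(morph, group)] + rest)
    return results


print(analyze("pogledam"))
```

Because every group is referred to by name, a rule stays readable while the morph lists grow; an ambiguous word simply yields more than one segmentation in the returned list.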

  7. FSA Architecture
     Generating potentially cyclic DFSAs:
     [Figure: a single automaton with states 0-19 in which the prefix, root and suffix automata are connected by ε-transitions. Emission labels shown: v pref (-index; v pref )-index asp; v root (-index; v root )-index; v suf (-index; v suf )-index with pres 1st sg, pres 2nd sg, pres 3rd sg, pres 1st pl, pres 2nd pl, pres 3rd pl, or 2nd sg imper.]

  8. FSA Architecture
     Ambiguity mapped on emission tuple:
     [Figure: automaton with states 0-4 reading g, o, r, e. Emission tuples shown: [v root (-index; adj root (-index], [v root )-index; adj root )-index], [v suff (-index; adj suff (-index], [v suff )-index; adj suff )-index].]
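A minimal sketch of emission tuples, using the g-o-r-e automaton from this slide: the automaton stays deterministic, and ambiguity lives only in what a state emits. Which state carries which tuple is an assumption here (the figure layout is not recoverable), as is the idea that the k-th entry of every tuple belongs to the same reading. The names `run` and `readings` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Emission = Tuple[str, ...]  # 1 to n alternative labels

# Deterministic transitions for "gore": (state, symbol) -> next state.
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "g"): 1, (1, "o"): 2, (2, "r"): 3, (3, "e"): 4,
}
ACCEPTING = {4}

# Assumed placement of the slide's emission tuples on states.
EMISSIONS: Dict[int, List[Emission]] = {
    0: [("v root (-index", "adj root (-index")],
    3: [("v root )-index", "adj root )-index"),
        ("v suff (-index", "adj suff (-index")],
    4: [("v suff )-index", "adj suff )-index")],
}


def run(word: str) -> Optional[List[Tuple[int, Emission]]]:
    """Return (offset, emission tuple) pairs, or None if the word is rejected."""
    state = 0
    trace = [(0, e) for e in EMISSIONS.get(state, [])]
    for offset, symbol in enumerate(word, start=1):
        nxt = TRANSITIONS.get((state, symbol))
        if nxt is None:
            return None
        state = nxt
        trace.extend((offset, e) for e in EMISSIONS.get(state, []))
    return trace if state in ACCEPTING else None


def readings(trace: List[Tuple[int, Emission]]) -> List[List[Tuple[int, str]]]:
    """Split a trace into one label sequence per tuple position (assumed aligned)."""
    width = len(trace[0][1])
    return [[(off, em[k]) for off, em in trace] for k in range(width)]


trace = run("gore")
if trace is not None:
    for reading in readings(trace):
        print(reading)
```

The input is traversed once, however many readings the word has; the cost of ambiguity is paid only when the tuples are unpacked.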

  9. FSA Architecture
     Lemmatization as a rule:
     - Rightmost root is the semantic head
     - Root-lemma: generate canonical word-form from the rightmost root
         neprijatelja → ne + prijatelj + a → NEG + N-root + ACC
         “not friend” = “enemy” ⇏ ¬friend: not compositional, but useful for semantic field analysis!
         root-lemma: neprijatelja → prijatelj
     - Stem/base-lemma: generate canonical word-form from the stem without inflectional suffixes
         base-lemma: neprijatelja → neprijatelj
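The two lemma types can be sketched on top of an already segmented word. This is an illustration under simplifying assumptions: the tag sets `ROOT_TAGS` and `INFLECTION_TAGS` are hypothetical, and the canonical form is taken to be the bare root or stem, which holds for the slide's example but not for words whose citation form needs its own suffix.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, str]  # (morph, category label)

# Assumed tag sets, covering only the labels used in the slide's example.
ROOT_TAGS = {"N-root"}
INFLECTION_TAGS = {"ACC"}


def root_lemma(segments: List[Segment]) -> Optional[str]:
    """Rightmost root, taken as the semantic head of the word."""
    roots = [morph for morph, tag in segments if tag in ROOT_TAGS]
    return roots[-1] if roots else None


def base_lemma(segments: List[Segment]) -> str:
    """Stem: everything except inflectional suffixes."""
    return "".join(morph for morph, tag in segments if tag not in INFLECTION_TAGS)


# neprijatelja → ne + prijatelj + a → NEG + N-root + ACC
segments = [("ne", "NEG"), ("prijatelj", "N-root"), ("a", "ACC")]
print(root_lemma(segments))  # prijatelj
print(base_lemma(segments))  # neprijatelj
```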

  10. FSA Architecture
      Lemmatization (Hack):
      - emission of byte-offset for suffix-elimination
      - pointer to suffix string
      Clean solution:
      [Figure: transducer over states 0-4 with transitions g:(g, (FV, ...)), o:(o, ()), r:(r, ()), e:(a, (FV, ...))]
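The "clean solution" can be read as a transducer in which every arc carries an input symbol, an output symbol and an emission tuple, so that the output side spells the lemma. A minimal sketch of that reading follows; the arc table mirrors the slide, the placeholder `("FV",)` stands in for the tuples whose contents are elided there (they are not reconstructed), and the function name `transduce` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Emission = Tuple[str, ...]

# (state, input symbol) -> (next state, output symbol, emission tuple).
# ("FV",) is a placeholder for the slide's "(FV, ...)"; contents unknown.
ARCS: Dict[Tuple[int, str], Tuple[int, str, Emission]] = {
    (0, "g"): (1, "g", ("FV",)),
    (1, "o"): (2, "o", ()),
    (2, "r"): (3, "r", ()),
    (3, "e"): (4, "a", ("FV",)),
}
FINAL_STATES = {4}


def transduce(word: str) -> Optional[Tuple[str, List[Emission]]]:
    """Return (output string, non-empty emissions), or None if rejected."""
    state = 0
    output: List[str] = []
    emissions: List[Emission] = []
    for symbol in word:
        arc = ARCS.get((state, symbol))
        if arc is None:
            return None
        state, out_symbol, emission = arc
        output.append(out_symbol)
        if emission:
            emissions.append(emission)
    if state not in FINAL_STATES:
        return None
    return "".join(output), emissions


print(transduce("gore"))  # ('gora', [('FV',), ('FV',)])
```

Compared with emitting a byte offset or a pointer to a suffix string, the lemma falls out of the traversal itself and no second lookup is needed.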

  11. Implementation
      - C++ wrapper for final application
      - Ragel code (automaton definition) generated from morpheme DBs and rules, with associated feature bundles (extended version of Ragel (≥ V. 6.1) for handling ambiguity via introduction of multiple emission symbols = emission tuples)
      - Ragel generated C code (jump-code)
      [Figure: build pipeline with boxes labelled Morpheme tables, Rules, Ragel code, Code, Code, DOT, Binary]

  12. Implementation
      - Emission (feature bundles): as one bit-vector
      - Features mapped from the General Ontology for Linguistic Description (upper ontology)
      - Possibility: reasoning over linguistic concepts and features
      - Optimization: mapping of concepts and their relations on a compressed bit-vector, maintaining inheritance and implicatures
      [Figure: bit-vector layout with fields labelled top-node concept, sub-class, terminal-classes]
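One simple way to keep inheritance visible in a bit-vector is to give every concept its own bit and to OR in the bits of all its ancestors; subsumption then becomes a mask test. The sketch below illustrates only that general idea. It is not the compressed encoding the slide refers to, and the concept names are illustrative rather than verified GOLD identifiers.

```python
from typing import Dict, Optional


class ConceptCodes:
    """Give each concept a mask that contains the bits of all its ancestors."""

    def __init__(self) -> None:
        self.mask: Dict[str, int] = {}
        self._next_bit = 0

    def add(self, name: str, parent: Optional[str] = None) -> int:
        own = 1 << self._next_bit          # one fresh bit per concept (uncompressed)
        self._next_bit += 1
        inherited = self.mask[parent] if parent is not None else 0
        self.mask[name] = inherited | own
        return self.mask[name]

    def is_a(self, name: str, ancestor: str) -> bool:
        """True if `name` is `ancestor` or one of its descendants."""
        a = self.mask[ancestor]
        return (self.mask[name] & a) == a

    def implies(self, bundle: int, concept: str) -> bool:
        """True if a feature bundle contains `concept`, directly or via a subclass."""
        c = self.mask[concept]
        return (bundle & c) == c


codes = ConceptCodes()
codes.add("Feature")                       # top-node concept
codes.add("Tense", parent="Feature")       # sub-class
codes.add("Number", parent="Feature")      # sub-class
codes.add("Present", parent="Tense")       # terminal class
codes.add("Plural", parent="Number")       # terminal class
codes.add("Singular", parent="Number")     # terminal class

# A feature bundle is the OR of its terminal concepts.
bundle = codes.mask["Present"] | codes.mask["Plural"]
print(codes.implies(bundle, "Tense"))      # True
print(codes.implies(bundle, "Singular"))   # False
```

With this layout, a query for a general concept such as Tense matches any bundle that carries one of its subclasses, at the cost of a single AND and a comparison.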

  13. Evaluation
      - Hardware: dual core 2.4 GHz
      - Lexical base: 120,000 morphemes (and allomorphs)
      - Speed: approx. 50,000 tokens per second with average morpheme count of 2.5 per token
      - Size: binary footprint approx. 5 MB
      - Compilation (tables → Ragel + C; Ragel → C + DOT; gcc → bin): approx. 5 minutes, min. 4 GB RAM for monolithic architecture

  14. Comments
      - Interoperability issues addressed:
          GOLD
          platform independent code
          code-page independence
      - Extensible (turnaround time of some minutes)
      - Minimally invasive and minimalistic
      - Open-source
