Arabic POS Tagging


  1. Arabic POS Tagging
     Emad Mohamed, Sandra Kübler
     Indiana University

  3. The Structure of Arabic Words
     ◮ An Arabic word may consist of several segments.
     ◮ Possible segments: inflectional affixes, the stem, clitics
     ◮ Example: wsyktbwnhA (Engl.: "and they will write it"):
        ◮ conjunction: w
        ◮ future particle: s
        ◮ 3rd person imperfect verb prefix: y
        ◮ imperfect verb: ktb
        ◮ 3rd person masculine plural imperfect verb suffix: wn
        ◮ 3rd person feminine singular object pronoun: hA
     ◮ POS tag: [CONJ + FUTURE PARTICLE + IMPERFECT VERB PREFIX + IMPERFECT VERB + IMPERFECT VERB SUFFIX MASC PLURAL 3RD PERSON + OBJECT PRONOUN FEM SINGULAR]

  6. Tagging Approaches
     ◮ whole word tagging: assign complex tag to complete word
        wsyktbwnhA: CONJ+FUT+IV3MS+IV+IVSUFF_SUBJ:MP_MOOD:I+IVSUFF_DO:3FS
        993 tags
     ◮ segment-based tagging: segment first; then assign tags to segments
        ◮ w: CONJ
        ◮ s: FUT
        ◮ y: IV3MS
        ◮ ktb: IV
        ◮ wn: SUBJ:MP_MOOD:I
        ◮ hA: IVSUFF_DO:3FS
        139 tags
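The relation between the two tag representations can be sketched in a few lines of Python. This is only an illustration built on the slide's example: the helper names are hypothetical, and the underscores inside the tags are assumed to follow the Penn Arabic Treebank convention, with `+` separating the tags of adjacent segments.

```python
def split_whole_word_tag(tag):
    """Split a complex whole-word tag into one tag per segment (hypothetical helper)."""
    return tag.split("+")


def join_segment_tags(segment_tags):
    """Rebuild the complex whole-word tag from per-segment tags (hypothetical helper)."""
    return "+".join(segment_tags)


# Example word from the slides, in Buckwalter transliteration.
segments = ["w", "s", "y", "ktb", "wn", "hA"]
complex_tag = "CONJ+FUT+IV3MS+IV+IVSUFF_SUBJ:MP_MOOD:I+IVSUFF_DO:3FS"

segment_tags = split_whole_word_tag(complex_tag)

# Pair every segment with its tag, as a segment-based tagger would output it.
pairs = list(zip(segments, segment_tags))
for segment, tag in pairs:
    print(f"{segment}\t{tag}")
```

A whole-word tagger predicts `complex_tag` in one step, which is why its tagset is much larger than that of a segment-based tagger, which only ever predicts the individual pieces.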

  7. Data Set & Experimental Setup
     ◮ Penn Arabic Treebank (after-treebank POS files)
     ◮ P1V3 + P3V1: ca. 500,000 words
     ◮ non-vocalized version
     ◮ reattached conjunctions, prepositions, pronouns, etc. to get text as written
     ◮ remove null elements: {i$otaraY+(null) / PV+PVSUFF_SUBJ:3MS ⇒ {i$otaraY / PV
     ◮ 5-fold cross validation
     ◮ evaluation: per-segment accuracy (SAR) + per-word accuracy (WAR)
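The two evaluation measures can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes gold and predicted analyses share the same segmentation, and that a word counts as correct only if every one of its segment tags is correct. The function name and the toy tags are hypothetical.

```python
def segment_and_word_accuracy(gold, predicted):
    """Return (SAR, WAR) for two parallel lists of words.

    Each word is a list of segment tags. Assumption: both analyses use the
    same segmentation, so the tags can be compared position by position.
    """
    seg_total = seg_correct = word_correct = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted segmentation differ")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        seg_total += len(gold_tags)
        seg_correct += matches
        # A word is correct only if all of its segments are tagged correctly.
        word_correct += matches == len(gold_tags)
    return seg_correct / seg_total, word_correct / len(gold)


# Toy data: three words, one wrong segment tag in the first word.
gold = [["CONJ", "FUT", "IV3MS", "IV"], ["NOUN"], ["PREP", "NOUN"]]
pred = [["CONJ", "FUT", "IV3MS", "PV"], ["NOUN"], ["PREP", "NOUN"]]

sar, war = segment_and_word_accuracy(gold, pred)
print(f"SAR = {sar:.4f}, WAR = {war:.4f}")
```

Because a single wrong segment invalidates the whole word, WAR is never higher than SAR when words contain more than one segment on average.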

  8. Memory-Based Segmentation
     ◮ per-character classification: segment-end, no-segment-end
     ◮ memory-based learning: TiMBL
     ◮ features: focus character, previous 5 characters, and following 5 characters, POS tag for word based on whole word tagging
     ◮ TiMBL parameters: IB1, overlap metric, gain ratio weighting, nearest neighbors k = 1
     ◮ two rounds: in second round include class from first round
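The per-character instances described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' feature extractor: the function name, the `_` padding symbol, and the feature order are hypothetical, and the code only builds instances; it does not call TiMBL.

```python
def char_instances(word, segments, word_tag, window=5, pad="_"):
    """Build one (features, label) instance per character of a word.

    features: `window` previous characters, the focus character, `window`
    following characters, and the whole-word POS tag.
    label: "segment-end" if the character closes a segment, else "no-segment-end".
    """
    if "".join(segments) != word:
        raise ValueError("segments do not spell the word")

    # Collect the index of the last character of every segment.
    ends = set()
    position = 0
    for segment in segments:
        position += len(segment)
        ends.add(position - 1)

    padded = pad * window + word + pad * window
    instances = []
    for i, focus in enumerate(word):
        j = i + window  # index of the focus character in the padded string
        left = list(padded[j - window:j])
        right = list(padded[j + 1:j + 1 + window])
        label = "segment-end" if i in ends else "no-segment-end"
        instances.append((left + [focus] + right + [word_tag], label))
    return instances


instances = char_instances(
    "wsyktbwnhA",
    ["w", "s", "y", "ktb", "wn", "hA"],
    "CONJ+FUT+IV3MS+IV+IVSUFF_SUBJ:MP_MOOD:I+IVSUFF_DO:3FS",
)
for features, label in instances:
    print(" ".join(features[:-1]), label)
```

In the second round mentioned on the slide, the class predicted in the first round would be appended as an extra feature; that step is left out here.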

  10. Segmentation Results
      all words:       98.23%
      known words:     99.75%
      unknown words:   82.22%
      proper noun errors: 33.87% of all errors
      unknown words in data: 8.5%

  11. POS Tagging
      ◮ memory-based tagger: MBT
      ◮ parameters: Modified Value Difference metric, k = 25
      ◮ for known words: IGTree, 2 words to left, their POS tags, focus word, its ambitag, 1 right context word, its ambitag
      ◮ for unknown words: IB1, focus word, first 5 + last 3 characters, 1 left context word + its POS tag, 1 right context word + its ambitag
      ◮ previous decisions are included
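The unknown-word feature set listed above can be sketched as follows. This is a rough illustration, not MBT's own feature extraction: the function names, the `_` padding, the hyphen-joined ambitag format, the toy lexicon, and the toy sentence are all assumptions made for the example. An ambitag here stands for the set of tags a word form was seen with in the training lexicon.

```python
def ambitag(word, lexicon, unknown="UNKNOWN"):
    """Encode all tags seen for a word as one symbol (assumed format: sorted, hyphen-joined)."""
    tags = lexicon.get(word)
    return "-".join(sorted(tags)) if tags else unknown


def unknown_word_features(words, i, left_tag, lexicon, pad="_"):
    """Features for an unknown focus word at position i.

    focus word, its first 5 and last 3 characters, the left context word with
    its already assigned POS tag, the right context word with its ambitag.
    """
    focus = words[i]
    prefix = list(focus[:5].ljust(5, pad))   # first 5 characters, padded if shorter
    suffix = list(focus[-3:].rjust(3, pad))  # last 3 characters, padded if shorter
    left_word = words[i - 1] if i > 0 else pad
    has_right = i + 1 < len(words)
    right_word = words[i + 1] if has_right else pad
    right_ambitag = ambitag(right_word, lexicon) if has_right else pad
    return [focus] + prefix + suffix + [left_word, left_tag, right_word, right_ambitag]


# Hypothetical toy lexicon and sentence in Buckwalter transliteration.
lexicon = {"fy": {"PREP"}, "Aljdydp": {"ADJ", "NOUN"}}
words = ["fy", "AlmdrsAt", "Aljdydp"]

features = unknown_word_features(words, 1, left_tag="PREP", lexicon=lexicon)
print(features)
```

The left context uses a single resolved tag because the tagger has already decided it, while the right context can only offer its ambitag since it has not been tagged yet.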

  12. POS Tagging Results
      gold standard seg.    segmentation-based    whole words
      SAR       WAR         SAR       WAR         WAR
      96.72%    94.91%      94.70%    93.47%      94.74%

  15. Discussion
      ◮ gold standard segmentation: upper bound
         ◮ gives best results
      ◮ no gold standard segmentation available: whole words better than automatic segmentation
         ◮ segmentation → more ambiguity per segment
         ◮ small percentage of unknown words
         ◮ in segmentation-based tagging, 28% of all errors result from wrong segmentation

  16. Known vs. Unknown Words
                       gold std. seg.    seg.-based    whole words
      known words      95.90%            95.57%        96.61%
      unknown words    84.25%            71.06%        74.64%

  21. Error Analysis
      confusion sets:
      gold           tagger         % of errors
      noun           adjective      7.88%
      adjective      noun           7.75%
      proper noun    noun           9.10%
      noun           proper noun    2.51%
      ◮ no clear distinction between nouns and adjectives in Arabic: adjectives behave morphologically like nouns and can be used as nouns
      ◮ proper nouns are normally standard nouns, and are not marked specifically

  22. Comparison to Habash & Rambow
      ◮ whole word tagging
      ◮ then convert to Habash & Rambow tokenization + reduced tagset: 15 tags
                     H&R ATB1    H&R ATB2    whole word tagger
      Token. acc.    99.1        –           99.33
      POS acc.       98.1        96.5        96.41

  23. Conclusion & Future Work
      ◮ whole word tagging has higher accuracy than segmentation-based tagging
         ◮ no preprocessing necessary
         ◮ but Penn Arabic Treebank has low percentage of unknown words
      ◮ segmentation quality is bottleneck for improving segmentation-based tagger
         ◮ need to find more reliable segmentation
         ◮ will integrate vocalization with segmentation
