
MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian

MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. 7 May 2018, The 13th Workshop on Asian Language Resources (ALR13). Hiroki Nomoto, Hannah Choi, David Moeljadi, Francis Bond. Tokyo University of Foreign


  1. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian
     7 May 2018, The 13th Workshop on Asian Language Resources (ALR13)
     Hiroki Nomoto ⋆, Hannah Choi ◦, David Moeljadi ◦, Francis Bond ◦
     ⋆ Tokyo University of Foreign Studies, ◦ Nanyang Technological University

  2. Morphological dictionaries in NLP
     Lemmatization is an important task for morphological analysis.
     A good dictionary with wide coverage is crucial to the success of a robust morphological analysis, which in turn becomes the basis for higher-level tasks such as syntactic parsing.
     Open dictionaries for Japanese:
     ▶ NAIST Japanese Dictionary (IPAL)
     ▶ UniDic
     Nothing comparable exists for Malay/Indonesian.
     So we created a morphological dictionary for Malay/Indonesian: MALINDO Morph

  3. Organization
     1 Malay and Indonesian
       ▶ Their relationship
       ▶ Morphology
     2 Existing tools and their problems
     3 MALINDO Morph and its creation
     4 Ways of using MALINDO Morph
     5 Future work

  4. Malay and Indonesian
     The “Malay” language (msa 1): official language of four countries in the Malay Archipelago.
     Two regional varieties:
     ▶ Malay in the narrow sense (zsm 1), used in Malaysia, Brunei and Singapore
     ▶ Indonesian (ind 1), used in Indonesia
     Many tools and resources have been independently developed in each region.
     But the languages are mutually intelligible (about 10% lexical difference (Asmah, 2001)) and share the same set of affixes.
     ⇒ A common morphological dictionary can be developed.
     1 ISO 639-3

  5. Malay/Indonesian Morphology
     Malay/Indonesian morphology involves the use of
     ▶ Affixation
     ▶ Reduplication
     ▶ Cliticization

  6. Affixation
     Productive: Prefixes, suffixes and circumfixes
     Non-productive: Infixes
     (1) a. Prefix: batas ‘limit’ + ter- → terbatas ‘limited’
         b. Suffix: batas ‘limit’ + -an → batasan ‘limitation’
         c. Circumfix: batas ‘limit’ + peN- -an → pembatasan ‘delimiting’
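
The three affixation patterns in (1) can be sketched in a few lines of Python. This is only an illustration, not the authors' analyser: the function names are made up here, and the rule for realizing the nasal N in peN-/meN- is a simplified textbook version that ignores monosyllabic bases, loanword initials such as f, v and z, and lexical exceptions.

```python
VOWELS = set("aeiou")

def attach_nasal_prefix(stem: str, base: str) -> str:
    """Attach peN-/meN- (stem is "pe" or "me"); N depends on the base-initial sound.

    Simplified assumption: common native initials only, no exceptions.
    """
    first = base[0]
    if first == "p":
        return stem + "m" + base[1:]    # p drops: perlu -> memerlu
    if first == "b":
        return stem + "m" + base        # batas -> pembatas
    if first == "t":
        return stem + "n" + base[1:]    # t drops
    if first in "dcj":
        return stem + "n" + base
    if first == "k":
        return stem + "ng" + base[1:]   # k drops: kirim -> mengirim
    if first in "gh" or first in VOWELS:
        return stem + "ng" + base
    if first == "s":
        return stem + "ny" + base[1:]   # s drops
    return stem + base                  # l, r, m, n, w, y: N is not realized

def add_prefix_ter(base: str) -> str:
    return "ter" + base

def add_suffix_an(base: str) -> str:
    return base + "an"

def add_circumfix_peN_an(base: str) -> str:
    # A circumfix is attached as one unit around the base.
    return attach_nasal_prefix("pe", base) + "an"

print(add_prefix_ter("batas"))        # terbatas
print(add_suffix_an("batas"))         # batasan
print(add_circumfix_peN_an("batas"))  # pembatasan
```

The circumfix is modelled as a single operation on purpose: it mirrors the point made on the later slides that peN- -an is one morpheme rather than a prefix plus a suffix.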

  7. Reduplication
     Productive: Full reduplication
     Semi-productive: Partial and rhythmic reduplication
     (2) a. Full reduplication: kucing ‘cat’ → kucing-kucing ‘cats’
         b. Rhythmic reduplication (vowel and/or consonant alternation): gunung ‘mountain’ → gunung-ganang ‘mountain range’
         c. Partial reduplication (base-initial consonant + e + base) (Malay): mula ‘to start’ → memula ‘at first’
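
Full and partial reduplication follow mechanical rules, so they are easy to sketch in Python. The function names below are illustrative only. Rhythmic reduplication is left out because its vowel and consonant alternations are not predictable from the base and would need to be listed per word.

```python
def full_reduplication(base: str) -> str:
    """Repeat the whole base, joined by a hyphen."""
    return f"{base}-{base}"

def partial_reduplication(base: str) -> str:
    """Base-initial consonant + e + base, as described in (2c)."""
    return base[0] + "e" + base

print(full_reduplication("kucing"))   # kucing-kucing
print(partial_reduplication("mula"))  # memula
```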

  8. Cliticization
     Proclitics
     Enclitics
     (3) a. Proclitic (before the base): terima ‘to receive’ + ku= ‘I’ → kuterima ‘I receive’
         b. Enclitic (after the base): buku ‘book’ + =ku ‘me/my’ → bukuku ‘my book’

  9. Interaction of different morphological processes
     batas ‘limit’
     ↓ +affixation: ter-
     terbatas ‘limited’
     ↓ +affixation: ke- -an
     keterbatasan ‘limitation’
     ↓ +reduplication
     keterbatasan-keterbatasan ‘limitations’
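
The derivation chain above can be read as a sequence of functions applied one after another. The following Python sketch uses hypothetical helper names and plain string concatenation (no sound changes are needed for this particular example) to reproduce each intermediate form.

```python
from typing import Callable, List

Step = Callable[[str], str]

def prefix(p: str) -> Step:
    return lambda base: p + base

def circumfix(left: str, right: str) -> Step:
    # Both halves are attached in a single step.
    return lambda base: left + base + right

def reduplicate_full(base: str) -> str:
    return f"{base}-{base}"

def derive(root: str, steps: List[Step]) -> List[str]:
    """Return the root followed by the form produced after each step."""
    forms = [root]
    for step in steps:
        forms.append(step(forms[-1]))
    return forms

steps = [prefix("ter"), circumfix("ke", "an"), reduplicate_full]
for form in derive("batas", steps):
    print(form)
```

The order of the steps matters: the circumfix ke- -an wraps the already prefixed form terbatas, and reduplication applies to the result as a whole.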

  10. Existing morphological dictionaries
      No large dictionary file is publicly available in an accessible format.
      Baldwin and Su’ad’s (2006) Malay tokenizer/lemmatizer: Word-lemma-POS triples for 2,499 words.
      One can create a larger dictionary by using the data from online dictionaries.
      However, no existing dictionary contains all the kinds of morphological information that MALINDO Morph offers: affixes, clitics and reduplication types.

  11. Existing morphological analysers
      Stemmers/lemmatizers
      Identify the stem/lemma.
      Much work has been done (Baldwin and Su’ad, 2006; Adriani et al., 2007; Larasati et al., 2011; Mohamad Nizam et al., 2016).
      Morphological analysers
      Also analyse the non-stem/lemma strings.
      MorphInd (Larasati et al., 2011) seems to be the most sophisticated morphological analyser.

  12. MorphInd (Larasati et al., 2011)
      MorphInd identifies morpheme boundaries and assigns two POS tags to a token:
      1 ‘Lemma tag’ (POS tag for the lemma)
      2 ‘Morphological tag’ (POS tag for the entire token)
      (4) a. Input: mengirim ‘to deliver’
          b. Output: meN+kirim<v>_VSA
      <v>: lemma tag for verbs
      _VSA: morphological tag indicating that the entire token is a singular active verb
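
To make the structure of such an output string concrete, here is a small Python sketch that splits it into its parts. It is based only on the two output strings shown in these slides and is not a parser for the full MorphInd format; the function name and the dictionary keys are assumptions made for illustration.

```python
import re

# Assumed shape: morphemes joined by "+", lemma marked as lemma<tag>,
# and the morphological tag after the final underscore.
TOKEN_RE = re.compile(r"^(?P<body>.+)_(?P<morph_tag>[A-Z]+)$")
LEMMA_RE = re.compile(r"^(?P<lemma>.+)<(?P<lemma_tag>\w+)>$")

def parse_morphind(output: str) -> dict:
    token = TOKEN_RE.match(output)
    if token is None:
        raise ValueError(f"unexpected format: {output!r}")
    before, after = [], []
    lemma = lemma_tag = None
    for part in token["body"].split("+"):
        m = LEMMA_RE.match(part)
        if m and lemma is None:
            lemma, lemma_tag = m["lemma"], m["lemma_tag"]
        elif lemma is None:
            before.append(part)   # strings to the left of the lemma
        else:
            after.append(part)    # strings to the right of the lemma
    return {
        "lemma": lemma,
        "lemma_tag": lemma_tag,
        "morph_tag": token["morph_tag"],
        "before": before,
        "after": after,
    }

print(parse_morphind("meN+kirim<v>_VSA"))
```

The result only records which strings precede or follow the lemma. Nothing in the string says whether those pieces are prefixes, suffixes, or the two halves of a circumfix.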

  13. A common misunderstanding among NLP researchers: Circumfix ≡ prefix + suffix
      Circumfixes are incorrectly thought of as a combination of a prefix and a suffix.
      MorphInd does not specify whether the non-lemma strings are a prefix, suffix or circumfix.
      (5) a. Input: pengiriman ‘delivery’ (= kirim + circumfix peN- -an)
          b. Output: peN+kirim<v>+an_NSD
      It is not obvious whether peN and an are a combination of two morphemes (prefix peN- and suffix -an) or a single morpheme (circumfix peN- -an).

  14. Circumfix or “prefix + suffix”?
      The correct identification of circumfixes presents a major challenge to morphological analysis in Malay/Indonesian.
      A correct circumfix cannot be identified by just looking at the two strings at the left and right edges of a token.
      (6) berakhiran ‘suffixed’
          NOT akhir + circumfix ber- -an
          BUT [akhir + suffix -an] + prefix ber-
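
The problem in (6) can be shown with a deliberately naive edge matcher in Python. The affix lists below are tiny hypothetical inventories chosen for this one example, and the function is a sketch of the flawed strategy, not of how MALINDO Morph or any existing analyser works.

```python
# Toy inventories, just enough for the example in (6).
PREFIXES = ["ber"]
SUFFIXES = ["an"]
CIRCUMFIXES = [("ber", "an")]

def edge_candidates(token: str) -> list:
    """List every analysis whose left and right edge strings match the token."""
    candidates = []

    def wraps(left: str, right: str) -> bool:
        return (token.startswith(left) and token.endswith(right)
                and len(token) > len(left) + len(right))

    for left, right in CIRCUMFIXES:
        if wraps(left, right):
            base = token[len(left):-len(right)]
            candidates.append((base, f"circumfix {left}- -{right}"))
    for p in PREFIXES:
        for s in SUFFIXES:
            if wraps(p, s):
                base = token[len(p):-len(s)]
                candidates.append((base, f"prefix {p}- + suffix -{s}"))
    return candidates

for base, analysis in edge_candidates("berakhiran"):
    print(base, "|", analysis)
```

Both analyses come back with the same base akhir, and the edge strings alone give no reason to prefer one. Choosing the correct one needs lexical information of the kind a checked dictionary provides.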

  15. MALINDO Morph and its format
      Available at https://github.com/matbahasa/MALINDO_Morph
      Licensed under a CC BY 4.0 license.
      Version 20180418 has 232,516 lines (case-sensitive).
      Each line is made up of:
      ▶ ID
      ▶ Root
      ▶ Surface form
      ▶ Prefix(es), proclitic
      ▶ Suffix(es), enclitic(s)
      ▶ Circumfix(es)
      ▶ Reduplication type
      Also includes the analyser: morph_analyzer.py
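
A reader for lines of this shape might look like the Python sketch below. Several things here are assumptions rather than facts from the slides: that the seven fields are tab-separated, the Python field names, and the sample line itself, whose ID value is invented. The use of 0 for an empty field follows the perlu example in these slides. Check the repository for the actual file layout before relying on this.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    entry_id: str
    root: str
    surface: str
    prefix: str         # prefix(es), proclitic
    suffix: str         # suffix(es), enclitic(s)
    circumfix: str
    reduplication: str

def parse_line(line: str) -> Entry:
    """Split one dictionary line into its seven fields (tab-separated: assumption)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    return Entry(*parts)

# Hypothetical sample line; the ID "x-0001" is made up.
sample = "x-0001\tperlu\tkeperluan\t0\t0\tke- -an\t0\n"
entry = parse_line(sample)
print(entry.root, entry.surface, entry.circumfix)
```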

  16. Example: perlu ‘necessary’ and its derivatives
      Root   Surface form      Prefix  Suffix  Circumfix  Reduplication
      perlu  perlu             0       0       0          0
      perlu  keperluan         0       0       ke- -an    0
      perlu  memerlukan        meN-    -kan    0          0
      perlu  perlu-memerlukan  meN-    -kan    0          R-full
      perlu  seperlunya        0       0       se- -nya   0

  17. Two steps in building MALINDO Morph
      1 Core dictionary
        Entries from the authoritative dictionaries in Malaysia and Indonesia (Kamus Dewan 4 (KD) and Kamus Besar Bahasa Indonesia 5 (KBBI))
        We would like to thank them for their cooperation.
      2 Expanded dictionary
        Other tokens found in the reclassified version of the Leipzig Corpora Collection for Malay and Indonesian (LCC; Goldhahn et al., 2012; Nomoto et al., under review)

  18. Sizes of the MALINDO Morph dictionaries (unit: line)
      Dictionary  Checked  Unchecked  Total
      Core        84,404   0          84,404
      Expanded    47,400   100,712    148,112
      Total       131,804  100,712    232,516
