
UNSUPERVISED MORPHOLOGICAL SEGMENTATION & CLUSTERING



  1. UNSUPERVISED MORPHOLOGICAL SEGMENTATION & CLUSTERING
     ICL UNI HEIDELBERG - HS CL4LRL - KATHARINA ALLGAIER - 08.06.2016

  2. OVERVIEW
     - Introduction
     - Morphological Segmentation (Creutz & Lagus 2005)
       - Aims
       - Models
       - Evaluation
       - Results
     - Affix Clustering (Moon et al. 2009)
       - Idea
       - Model
       - Results
     - Conclusion

  3. WHAT ARE WE DOING?
     - Morpheme segmentation
     - Morphemes = smallest meaning-bearing units = smallest elements of syntax
     - Meaning vs. form
     - Composition vs. perturbation
     Examples:
       reads = read + s
       machines = machine + s
       translation = translate + ion
       goalkeeper = goal + keeper
       joystick = joy + stick

  4. WHAT ARE WE DOING?
     - Stem vs. affixes (prefixes + suffixes)
     - Inflectional vs. derivational
     - Affix clustering

  5. WHY ARE WE DOING IT?
     - Morphology carries important information
     - Especially for highly inflected languages
       (e.g. Turkish, Finnish, Nahuatl, Japanese: agglutinative languages)
     - Used in other CL applications
       (language production, speech recognition, machine translation etc.)

  6. INDUCING THE MORPHOLOGICAL LEXICON OF A NATURAL LANGUAGE FROM UNANNOTATED TEXT - CREUTZ & LAGUS 2005
     "algorithm for the unsupervised learning […] of a simple morphology of a natural language"
     - Unsupervised morpheme segmentation with hierarchical representation
     - English and Finnish

  7. AIMS
     - Most accurate segmentation possible
     - Learn a representation of the language in the data + store it in a lexicon
     - Based on several models: Linguistica, Morfessor Baseline, Morfessor Categories-ML, Morfessor Categories-MAP

  8. BASELINE
     - Morfessor Baseline algorithm (Creutz & Lagus)
     - Similar to some unsupervised word segmentation algorithms
     - Construct a lexicon of morphs
     - Each word can be constructed out of those morphs
     - AIM: find an optimal + concise segmentation and lexicon
     - PROBLEM:
       - frequent words are stored as a whole
       - rare words are excessively split + stored in parts
       - no representation of a morph's inner structure
     [Figure: morph lexicon with the entries talk, teach, es, ed, ing, word, words, morf, es, sor]
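To make "each word can be constructed out of those morphs" concrete, here is a minimal sketch of the segmentation step only, assuming the lexicon is already fixed. It is not the Morfessor Baseline implementation: the counts in `MORPH_COUNTS` are invented, and the function `segment` is a hypothetical helper that picks the split with the highest product of unigram morph probabilities.

```python
import math

# Hypothetical morph lexicon with made-up counts (not taken from the slides' data).
MORPH_COUNTS = {"talk": 50, "teach": 30, "word": 40, "s": 120, "es": 25, "ed": 60, "ing": 70}


def segment(word, counts):
    """Return the lowest-cost split of `word` into lexicon morphs, or None."""
    total = sum(counts.values())
    # best[i] = (cost, morphs) for the prefix word[:i]; cost is a negative log-probability.
    best = [(0.0, [])] + [(math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            morph = word[start:end]
            if morph in counts and best[start][1] is not None:
                cost = best[start][0] - math.log(counts[morph] / total)
                if cost < best[end][0]:
                    best[end] = (cost, best[start][1] + [morph])
    return best[-1][1]


print(segment("teaches", MORPH_COUNTS))
print(segment("talking", MORPH_COUNTS))
```

The actual baseline also has to learn the lexicon itself while trading off lexicon size against corpus cost; that search is left out here.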

  9. LINGUISTICA (Goldsmith 2001)
     - Splits each word into a stem + one (possibly empty) prefix / affix
     - ADVANTAGE: models simple word-internal syntax (morphotactics: rules on the ordering of morphemes)
       by grouping sets of stems & suffixes into inflectional paradigms
     - DISADVANTAGE: handles highly inflecting + compounding languages poorly (alternating stems + affixes)
     Examples:
       word + s
       talk + ed
       talk + s
       dog + s
       walk + ed
       walk + s
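The grouping of stems and suffixes into paradigms can be sketched with the slide's own examples. This is only an illustration of the idea, not Linguistica's algorithm: it assumes the stem + suffix analyses are already given and simply groups stems that take exactly the same suffix set.

```python
from collections import defaultdict

# Stem + suffix analyses taken from the slide's examples.
analyses = [("word", "s"), ("talk", "ed"), ("talk", "s"),
            ("dog", "s"), ("walk", "ed"), ("walk", "s")]

suffixes_by_stem = defaultdict(set)
for stem, suffix in analyses:
    suffixes_by_stem[stem].add(suffix)

# A paradigm here is the set of stems that share exactly the same suffix set.
paradigms = defaultdict(list)
for stem, suffixes in suffixes_by_stem.items():
    paradigms[tuple(sorted(suffixes))].append(stem)

for suffixes, stems in sorted(paradigms.items()):
    print(suffixes, sorted(stems))
```

With these six analyses, "talk" and "walk" end up in one paradigm (suffixes ed, s) and "dog" and "word" in another (suffix s only).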

  10. IMPROVED MODEL
      - Morfessor Categories-ML (Creutz & Lagus)
      - Reanalyzes the segmentation produced by Morfessor Baseline
      - Maximum likelihood model
      - Words represented as HMMs
        - hidden states: categories (SUFF, PRE, …)
        - observable states: morphs
      - Stems, prefixes + suffixes can alternate (with some restrictions)
      - "Noise" category
        - split words whose morphs are present in the lexicon
        - join "noise" morphs with their neighbours to form proper morphs
      - CRITICISM: too ad hoc + information on word frequency is lost
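The HMM view (categories as hidden states, morphs as observations) can be illustrated with a small Viterbi decoder. This is a sketch under stated assumptions, not the Morfessor Categories-ML code: the category inventory is cut down to three states, and every probability in `START`, `TRANS` and `EMIT` is invented for the example.

```python
import math

CATEGORIES = ["PRE", "STM", "SUF"]

# All probabilities below are invented for illustration only.
START = {"PRE": 0.3, "STM": 0.7, "SUF": 0.0}
TRANS = {
    "PRE": {"PRE": 0.1, "STM": 0.9, "SUF": 0.0},
    "STM": {"PRE": 0.1, "STM": 0.3, "SUF": 0.6},
    "SUF": {"PRE": 0.0, "STM": 0.2, "SUF": 0.8},
}
EMIT = {
    "PRE": {"re": 0.6, "un": 0.4},
    "STM": {"present": 0.5, "talk": 0.4, "re": 0.1},
    "SUF": {"ness": 0.4, "ativ": 0.3, "ed": 0.3},
}


def log(p):
    """Log-probability that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else -math.inf


def tag(morphs):
    """Viterbi: most probable category sequence for a list of morphs."""
    scores = {c: log(START[c]) + log(EMIT[c].get(morphs[0], 0.0)) for c in CATEGORIES}
    backpointers = []
    for morph in morphs[1:]:
        new_scores, pointers = {}, {}
        for c in CATEGORIES:
            prev = max(CATEGORIES, key=lambda p: scores[p] + log(TRANS[p][c]))
            new_scores[c] = scores[prev] + log(TRANS[prev][c]) + log(EMIT[c].get(morph, 0.0))
            pointers[c] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state to recover the path.
    state = max(CATEGORIES, key=lambda c: scores[c])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(tag(["re", "present", "ativ", "ness"]))
```

The zero entries in `TRANS` stand in for the "some restrictions" on how categories may alternate, for example that a suffix cannot directly follow a prefix in this toy setup.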

  11. NEW MODEL
      - Morfessor Categories-MAP (Creutz & Lagus)
      - Induces a binary hierarchical lexicon
      - Retains the inner structure of words
      - Morphs represented as a concatenation of (sub)morphs of the lexicon
      - Word frequency (own entry vs. split into morphs)
      - Categories: prefix, stem, suffix, non-morpheme

  12. - Maximum a posteriori framework
      - Words represented as HMMs
      - Desired level of segmentation: "finest resolution that does not contain non-morphemes"

  13. SEARCH ALGORITHM (GREEDY SEARCH)
      1. Initialisation of segmentation: Representativ+ness (stem+SUFF)
      2. Splitting of morphs: [Re+[present+ativ]]+[n+ess] (PRE+stem+SUFF+non+SUFF)
      3. Joining of morphs: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)
      4. Splitting of morphs: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
      5. Resegmentation of corpus + re-estimation of probabilities: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
      6. Expansion to finest resolution: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)

  14. MODEL
      - AIM: finding the optimal lexicon + segmentation
      - Maximum a posteriori estimate to be maximized:
        [Equation not preserved; its annotated terms were:]
        - Form: string of letters vs. submorphs
        - Meaning: frequency, length, right + left perplexity
        - Morph emission probability
        - Transition probability
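The slide's equation did not survive extraction. As a reminder of what a maximum a posteriori estimate looks like in general, Bayes' rule gives the decomposition below. The symbols $L$ (lexicon) and $C$ (corpus) are naming assumptions made here, and this is the generic MAP form rather than a reconstruction of the slide's exact formula.

```latex
\hat{L}
  = \arg\max_{L} P(L \mid C)
  = \arg\max_{L} \frac{P(C \mid L)\, P(L)}{P(C)}
  = \arg\max_{L} P(C \mid L)\, P(L)
```

Read against the slide's labels, the prior $P(L)$ would plausibly be where the form and meaning of the morphs enter, and the likelihood $P(C \mid L)$ where the morph emission and transition probabilities enter; $P(C)$ drops out because it does not depend on $L$.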

  15. MORPH EMISSION PROBABILITIES
      - Probability that a morph is emitted by the category
      - Depend on the frequency of the morph in the training data
      - Prefix-/suffix-likeness (right + left perplexity)
      - Stem-likeness (length)
      - Non-morpheme probability
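To give a feel for the perplexity feature, the sketch below computes the perplexity of the empirical distribution of morphs that follow a given morph (its right context). The intuition assumed here is that a prefix-like morph is followed by many different morphs and so has a high right perplexity; the context lists are invented, and how the model turns perplexity into a probability is not shown.

```python
import math
from collections import Counter


def perplexity(context_morphs):
    """Perplexity of the empirical distribution over the observed context morphs."""
    counts = Counter(context_morphs)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)


# Hypothetical right contexts (the morphs that directly follow each morph).
right_contexts = {
    "re": ["present", "turn", "play", "write"],  # many different followers
    "teach": ["es", "es", "es", "es"],           # always the same follower
}

for morph, followers in right_contexts.items():
    print(morph, round(perplexity(followers), 2))
```

The same function applied to the morphs that precede a morph would give its left perplexity, the counterpart feature for suffix-likeness.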

  16. EVALUATION
      - Gold standard: Hutmegs
        - Linguistic morpheme segmentations
        - 1.4 million Finnish word forms
        - 120 000 English word forms
      - English data: prose + news + scientific text
        - Gutenberg Project
        - Gigaword Corpus
        - Brown Corpus
      - Finnish data: prose + news text
        - Finnish IT Centre of Science
        - Finnish National News Agency
      - Evaluation on 10 000, 50 000, 250 000 and 12/16 million words

  17. RESULTS

  18. UNSUPERVISED MORPHOLOGICAL SEGMENTATION AND CLUSTERING WITH DOCUMENT BOUNDARIES - MOON ET AL. 2009
      - Simple model without heuristics / thresholds / trained parameters
      - Word segmentation: constrain candidate stems + affixes by document boundaries
      - Cluster affixes of certain stems -> morphologically related words
      - USE: interlinearised glossed texts for LRL
      - English + Uspanteko

  19. IDEA
      - If two words in the same document are very similar in orthography, they are likely to be related morphologically
      - Use document boundaries to filter out noise
      - Constrain potential membership of word clusters
      Examples:
        He suddenly drew a sharp sword …
        The documentation of …

  20. MODEL
      - Candidate Generation
      - Conflation set: "Set of word types that are related through either inflectional or derivational morphology"

  21. CANDIDATE TRIE
      - Stems -> trunks
      - Affixes -> branches
      [Figure: trie with the node labels like, li, ness, ness, hood]
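The trunk/branch picture can be sketched as a character trie in which every branching point proposes a stem + affix split. This is a simplified illustration, not the procedure from Moon et al.: the example words are an assumption pieced together from the figure labels, and `build_trie` and `candidate_splits` are hypothetical helper names.

```python
def build_trie(words):
    """Character trie as nested dicts; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def candidate_splits(words):
    """Propose (stem, affix) pairs at every node where the trie branches."""
    root = build_trie(words)
    splits = set()
    for word in words:
        node = root
        for i, ch in enumerate(word[:-1]):
            node = node[ch]
            if len(node) > 1:  # more than one continuation: trunk ends, branches begin
                splits.add((word[: i + 1], word[i + 1:]))
    return splits


# Assumed example words, pieced together from the figure labels on the slide.
words = ["likeness", "likeliness", "likelihood"]
for stem, affix in sorted(candidate_splits(words)):
    print(stem, "+", affix)
```

Restricting `words` to the word types of a single document, rather than the whole corpus, is one way to picture how document boundaries would constrain the candidates.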

  22. MODEL
      1. Candidate Generation (D vs. G)
      2. Candidate Filtering: χ² testing (correlation between affixes)
      3. Affix Clustering
      4. Word Clustering (D vs. G)
      Conflation set: "Set of word types that are related through either inflectional or derivational morphology"
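The χ² filtering step can be illustrated with the standard Pearson statistic for a 2x2 table that counts how many candidate stems occur with affix A, with affix B, with both, or with neither. The counts and the 5% significance level below are assumptions for the example; the slides do not state which table layout or level the paper uses.

```python
def chi_square_2x2(both, only_a, only_b, neither):
    """Pearson chi-square statistic for a 2x2 contingency table (no continuity correction)."""
    n = both + only_a + only_b + neither
    numerator = n * (both * neither - only_a * only_b) ** 2
    denominator = (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    return numerator / denominator if denominator else 0.0


# Invented stem counts: how many candidate stems take affix "-ed" and/or affix "-s".
stat = chi_square_2x2(both=30, only_a=10, only_b=20, neither=140)
CRITICAL_95 = 3.841  # chi-square critical value for 1 degree of freedom at p = 0.05
print(round(stat, 2), stat > CRITICAL_95)
```

A statistic above the critical value suggests that the two affixes co-occur on the same stems more often than chance would predict, which is the kind of evidence that would keep them as candidates for the same affix cluster.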

  23. RESULTS

  24. Thank you for your attention!
