 
UNSUPERVISED MORPHOLOGICAL SEGMENTATION & CLUSTERING
ICL UNI HEIDELBERG - HS CL4LRL - KATHARINA ALLGAIER - 08.06.2016
OVERVIEW
- Introduction
- Morphological Segmentation (Creutz & Lagus 2005)
  - Aims
  - Models
  - Evaluation
  - Results
- Affix Clustering (Moon et al. 2009)
  - Idea
  - Model
  - Results
- Conclusion
WHAT ARE WE DOING?
- Morpheme Segmentation
- Morphemes = smallest meaning-bearing units = smallest elements of syntax
- Meaning vs. Form
- Composition vs. Perturbation
Examples:
  reads = read + s
  machines = machine + s
  translation = translate + ion
  goalkeeper = goal + keeper
  joystick = joy + stick
WHAT ARE WE DOING?
- Stem vs. Affixes (Prefixes + Suffixes)
- Inflectional vs. Derivational
- Affix Clustering
WHY ARE WE DOING IT?
- Important information, especially for highly inflected languages (agglutinative languages like Turkish, Finnish, Nahuatl, Japanese)
- Used in other CL applications (language production, speech recognition, machine translation etc.)
INDUCING THE MORPHOLOGICAL LEXICON OF A NATURAL LANGUAGE FROM UNANNOTATED TEXT - CREUTZ & LAGUS 2005
"algorithm for the unsupervised learning […] of a simple morphology of a natural language"
- Unsupervised morpheme segmentation with hierarchical representation
- English and Finnish
AIMS
- Most accurate segmentation possible
- Learn a representation of the language in the data + store it in a lexicon
- Based on several models: Linguistica, Morfessor Baseline, Morfessor ML, Morfessor MAP
BASELINE
- Morfessor Baseline Algorithm (Creutz & Lagus)
- Similar to some unsupervised word segmentation algorithms
- Construct lexicon of morphs
- Each word can be constructed out of those morphs
- AIM: find optimal + concise segmentation and lexicon
- PROBLEM:
  - frequent words stored as a whole
  - rare words excessively split + stored in parts
  - no representation of a morph's inner structure
[Figure: Morph Lexicon; recoverable labels: talk, teach, es, ed, ing, word, words, morf, es, sor]
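The idea that each word is built from lexicon morphs can be sketched as a small dynamic program. This is only an illustration of decoding a word against a fixed lexicon under a unigram morph model; it does not learn the lexicon, which is the actual job of Morfessor Baseline. The function name `segment` and the toy counts are assumptions made for this example.

```python
import math

def segment(word, morph_counts):
    """Return the most probable split of `word` into lexicon morphs, or None."""
    total = sum(morph_counts.values())
    # best[i] holds (cost, morphs) for the prefix word[:i]; cost is a negative log probability
    best = [(0.0, [])] + [(math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_counts and best[start][1] is not None:
                cost = best[start][0] - math.log(morph_counts[piece] / total)
                if cost < best[end][0]:
                    best[end] = (cost, best[start][1] + [piece])
    return best[len(word)][1]

# Hypothetical toy lexicon with made-up counts
lexicon = {"talk": 5, "teach": 3, "es": 4, "ed": 6, "ing": 6, "word": 4, "s": 8}
print(segment("talked", lexicon))
print(segment("teaches", lexicon))
```

A word that cannot be covered by lexicon morphs yields `None` here; a learning algorithm would instead add new morphs to the lexicon.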
- Linguistica (Goldsmith 2001)
- Splits word into stem + one (possibly empty) prefix / suffix
- ADVANTAGE: modeling of simple word-internal syntax (morphotactics: rules on ordering of morphemes), grouping sets of stems & suffixes into inflectional paradigms
  Examples: word + s, talk + ed, talk + s, dog + s, walk + ed, walk + s
- DISADVANTAGE: handles highly inflecting + compounding languages poorly (alternating stems + affixes)
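Grouping stems by the set of suffixes they take can be sketched as follows. This is a simplified illustration under the assumption that the suffix inventory is already given; Linguistica itself has to discover it. The function name `signatures` is hypothetical.

```python
from collections import defaultdict

def signatures(words, suffixes):
    """Map each suffix set (a paradigm) to the stems that occur with exactly that set."""
    stem_to_suffixes = defaultdict(set)
    for word in words:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem = word[: len(word) - len(suffix)]
                stem_to_suffixes[stem].add(suffix)
    grouped = defaultdict(list)
    for stem, sufs in stem_to_suffixes.items():
        grouped[tuple(sorted(sufs))].append(stem)
    return {sig: sorted(stems) for sig, stems in grouped.items()}

# Word forms taken from the slide examples
words = ["words", "talked", "talks", "dogs", "walked", "walks"]
print(signatures(words, ["s", "ed"]))
```

With these word forms, "talk" and "walk" share the paradigm (ed, s), while "dog" and "word" only occur with s.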
IMPROVED MODEL
- Morfessor Categories-ML (Creutz & Lagus)
- Reanalyzes segmentation of Morfessor Baseline
- Maximum Likelihood model
- Words represented as HMMs
  - hidden states: categories (SUFF, PRE, …)
  - observable states: morphs
- Stems, prefixes + suffixes can alternate (with some restrictions)
- "noise" category
  - split words whose morphs are present in the lexicon
  - join "noise" morphs with their neighbours to form proper morphs
- CRITICISM: too ad hoc + information on word frequency is lost
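Treating a word as an HMM with categories as hidden states and morphs as observations can be sketched with a small Viterbi decoder. All probabilities below are invented for illustration, the "noise" category is left out, and the restrictions encoded by the missing transitions (no suffix at the start of a word, no prefix at the end, no suffix directly after a prefix) are assumptions of this sketch.

```python
import math

CATS = ["PRE", "STM", "SUF"]
# Hypothetical transition probabilities; "#" marks the word boundary.
TRANS = {
    ("#", "PRE"): 0.3, ("#", "STM"): 0.7,
    ("PRE", "PRE"): 0.1, ("PRE", "STM"): 0.9,
    ("STM", "STM"): 0.2, ("STM", "SUF"): 0.4, ("STM", "PRE"): 0.1, ("STM", "#"): 0.3,
    ("SUF", "SUF"): 0.2, ("SUF", "STM"): 0.1, ("SUF", "PRE"): 0.1, ("SUF", "#"): 0.6,
}
# Hypothetical emission probabilities P(morph | category)
EMIT = {
    "PRE": {"re": 0.5, "un": 0.5},
    "STM": {"present": 0.4, "talk": 0.4, "re": 0.1, "ness": 0.1},
    "SUF": {"ness": 0.4, "ativ": 0.3, "ed": 0.3},
}

def logp(p):
    return math.log(p) if p > 0 else -math.inf

def tag(morphs):
    """Return the most probable category sequence for a list of morphs."""
    # scores[c] = (log probability, path) of the best path ending in category c
    scores = {
        c: (logp(TRANS.get(("#", c), 0)) + logp(EMIT[c].get(morphs[0], 0)), [c])
        for c in CATS
    }
    for morph in morphs[1:]:
        scores = {
            c: max(
                ((s + logp(TRANS.get((prev, c), 0)) + logp(EMIT[c].get(morph, 0)), path + [c])
                 for prev, (s, path) in scores.items()),
                key=lambda item: item[0],
            )
            for c in CATS
        }
    # Close the word with a transition back to the boundary symbol
    best = max(
        ((s + logp(TRANS.get((c, "#"), 0)), path) for c, (s, path) in scores.items()),
        key=lambda item: item[0],
    )
    return best[1]

print(tag(["re", "present", "ativ", "ness"]))
```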
NEW MODEL
- Morfessor Categories-MAP (Creutz & Lagus)
- Induces binary hierarchical lexicon
  - Retains inner structure of words
  - Morphs represented as concatenation of (sub)morphs of the lexicon
- Word frequency (own entry vs. split into morphs)
- Categories: Prefix, Stem, Suffix, Non-morpheme
- Maximum a posteriori framework
- Words represented as HMMs
- Desired level of segmentation: "finest resolution that does not contain non-morphemes"
SEARCH ALGORITHM (GREEDY SEARCH)
1. Initialisation of segmentation: Representativ+ness (stem+SUFF)
2. Splitting of morphs: [Re+[present+ativ]]+[n+ess] (PRE+stem+SUFF+non+SUFF)
3. Joining of morphs: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)
4. Splitting of morphs: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
5. Resegmentation of corpus + re-estimation of probabilities: [Re+[[pre+sent]+ativ]]+ness (PRE+non+stem+SUFF+SUFF)
6. Expansion to finest resolution: [Re+[present+ativ]]+ness (PRE+stem+SUFF+SUFF)
MODEL
- AIM: finding optimal lexicon + segmentation
- Maximum a posteriori estimate to be maximized:
  [Equation not recovered; annotations on the slide:]
  - Form: string of letters vs. submorphs
  - Meaning: frequency, length, right + left perplexity
  - Morph emission probability, transition probability
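The slide's own formula is not recoverable from the text, so the following is only the generic shape of a maximum a posteriori estimate obtained from Bayes' rule, not the exact expression shown. The symbol \mathcal{L} for the lexicon is an assumption of this sketch. On this reading, the prior over the lexicon would carry the form and meaning terms, and the corpus likelihood would carry the morph emission and transition probabilities.

```latex
\[
\hat{\mathcal{L}}
  = \arg\max_{\mathcal{L}} P(\mathcal{L} \mid \text{corpus})
  = \arg\max_{\mathcal{L}} P(\text{corpus} \mid \mathcal{L}) \cdot P(\mathcal{L})
\]
```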
- Morph Emission Probabilities
  - Probability that morph is emitted by the category
  - Depend on frequency of morph in training data
  - Prefix-/Suffix-Likeness (right + left perplexity)
  - Stem-Likeness (length)
  - Non-morpheme probability
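Right perplexity can be illustrated as the perplexity of the distribution of morphs that immediately follow a given morph: a prefix-like morph tends to be followed by many different morphs. The sketch below assumes the corpus is already segmented into morph lists and uses the exponential of the entropy as the perplexity; the function name and the toy data are hypothetical, and the exact definition in the paper may differ in detail.

```python
import math
from collections import Counter

def right_perplexity(morph, segmented_words):
    """Perplexity of the morphs directly following `morph`; None if it is never followed."""
    followers = Counter()
    for morphs in segmented_words:
        for left, right in zip(morphs, morphs[1:]):
            if left == morph:
                followers[right] += 1
    total = sum(followers.values())
    if total == 0:
        return None
    entropy = -sum((n / total) * math.log(n / total) for n in followers.values())
    return math.exp(entropy)

# Hypothetical segmented corpus
corpus = [["re", "present"], ["re", "view"], ["re", "turn"], ["talk", "ed"]]
print(right_perplexity("re", corpus))    # three equally likely followers
print(right_perplexity("talk", corpus))  # a single follower
```

Left perplexity would be computed the same way over the morphs that precede the given morph.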
EVALUATION
Goldstandard:
- Hutmegs: linguistic morpheme segmentations
- 1.4 million Finnish word forms
- 120,000 English word forms
English Data:
- Prose + news + scientific text
- Gutenberg Project, Gigaword Corpus, Brown Corpus
Finnish Data:
- Prose + news text
- Finnish IT Centre of Science, Finnish National News Agency
Evaluation on 10,000, 50,000, 250,000 and 12/16 million words
RESULTS
UNSUPERVISED MORPHOLOGICAL SEGMENTATION AND CLUSTERING WITH DOCUMENT BOUNDARIES - MOON ET AL. 2009
- Simple model without heuristics / thresholds / trained parameters
- Word segmentation: constrain candidate stems + affixes by document boundaries
- Cluster affixes of certain stems to find morphologically related words
- USE: interlinearised glossed texts for LRL (low-resource languages)
- English + Uspanteko
IDEA
- If two words in the same document are very similar in orthography, they are likely to be related morphologically
- Use document boundaries to filter out noise
- Constrain potential membership of word clusters
Example: "He suddenly drew a sharp sword …" / "The documentation of …"
MODEL
- Candidate Generation
- Conflation set: "Set of word types that are related through either inflectional or derivational morphology"
CANDIDATE TRIE
- Stems = trunks
- Affixes = branches
[Figure: candidate trie; recoverable node labels: like, li, ness, ness, hood]
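The trunk and branch idea can be sketched with an implicit prefix trie: every prefix after which the words diverge is a candidate stem (trunk), and the diverging continuations are candidate affixes (branches). The function name `candidate_stems` and the word list are assumptions for illustration; this is not the authors' implementation, and it applies no filtering.

```python
from collections import defaultdict

def candidate_stems(words):
    """Return {trunk: sorted branches} for every branching point in a prefix trie."""
    # Implicit trie: map each prefix to the remainders of the words that extend it.
    continuations = defaultdict(set)
    for word in set(words):
        for i in range(1, len(word) + 1):
            continuations[word[:i]].add(word[i:])
    return {
        trunk: sorted(rest)
        for trunk, rest in continuations.items()
        # A branching point has at least two different next symbols ("" = end of word)
        if len({r[:1] for r in rest}) > 1
    }

# Hypothetical vocabulary of a single document
document = ["likeness", "likeliness", "likelihood"]
print(candidate_stems(document))
```

Calling the function once per document restricts candidates to words that co-occur in that document, while calling it on the pooled vocabulary of all documents gives a corpus-wide variant.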
MODEL
1. Candidate Generation (D vs. G)
2. Candidate Filtering: χ² testing (correlation between affixes)
3. Affix Clustering
4. Word Clustering (D vs. G)
Conflation set: "Set of word types that are related through either inflectional or derivational morphology"
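A χ² test for the correlation between two affixes can be sketched on a 2x2 contingency table that counts, over all candidate stems, whether each stem occurs with affix a, with affix b, with both, or with neither. The slide does not say how the statistic is set up in the paper, so the table layout, the function names, and the counts below are assumptions; the formula is the standard Pearson χ² for a 2x2 table.

```python
def contingency(stem_affixes, a, b):
    """Count stems with both affixes, only a, only b, and neither."""
    n11 = n10 = n01 = n00 = 0
    for affixes in stem_affixes.values():
        has_a, has_b = a in affixes, b in affixes
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 table (1 degree of freedom)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

# Made-up counts: 30 stems take both affixes, 10 only the first,
# 10 only the second, 50 neither.
print(chi_square(30, 10, 10, 50))
```

A value above 3.841 rejects independence at the 0.05 level for one degree of freedom, which would mark the two affixes as correlated.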
RESULTS
Thank you for your attention!