Learning morphology and phonology
John Goldsmith
University of Chicago
MoDyCo/Paris X

All the particular properties that give a language its unique


  1. 2. MDL Effect of having fewer "words" altogether
     Define Δf_C = log( pr_2(C) / pr_1(C) ) for a symbol C; then
         Δ = log(Z_1/Z_2) + [t]·Δf_t + [h]·Δf_h + [th]·log( pr_2(th) / (pr_1(t)·pr_1(h)) )
     This is positive if Lexicon 2 is better.

  2. 2. MDL Effect of the frequency of /t/ and /h/ decreasing
     Define Δf_C = log( pr_2(C) / pr_1(C) ) for a symbol C; then
         Δ = log(Z_1/Z_2) + [t]·Δf_t + [h]·Δf_h + [th]·log( pr_2(th) / (pr_1(t)·pr_1(h)) )
     This is positive if Lexicon 2 is better.

  3. 2. MDL Effect of /th/ being treated as a unit rather than as separate pieces
     Define Δf_C = log( pr_2(C) / pr_1(C) ) for a symbol C; then
         Δ = log(Z_1/Z_2) + [t]·Δf_t + [h]·Δf_h + [th]·log( pr_2(th) / (pr_1(t)·pr_1(h)) )
     This is positive if Lexicon 2 is better.
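As a hedged illustration (a toy computation, not the slide's exact formula, which also carries the normalization term log(Z_1/Z_2)), the change in the data's code length when the chunk 'th' is added to the lexicon can be computed directly:

```python
import math

# Toy corpus. Lexicon 1 codes 't' and 'h' as separate symbols;
# Lexicon 2 adds the chunk 'th'. The code length of the data under a
# lexicon is -sum(count(w) * log2 pr(w)) over that lexicon's entries w.
corpus = "the cat with the hat sat on the path"

def code_length(tokens):
    """Total code length (bits) of a token sequence under its own
    maximum-likelihood unigram probabilities."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return sum(-c * math.log2(c / total) for c in counts.values())

# Lexicon 1: every character is its own entry.
tokens1 = list(corpus)

# Lexicon 2: the same text, with each 'th' parsed as a single entry.
tokens2 = []
i = 0
while i < len(corpus):
    if corpus[i:i + 2] == "th":
        tokens2.append("th")
        i += 2
    else:
        tokens2.append(corpus[i])
        i += 1

delta = code_length(tokens1) - code_length(tokens2)
print(f"bits saved on the data by adding 'th': {delta:.2f}")
```

On this corpus the saving is positive, i.e. Lexicon 2 codes the data more compactly; a full MDL comparison would also charge Lexicon 2 for the extra lexicon entry.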

  4. 2. MDL 2.3 Results
     • The Fulton County Grand Ju ry s aid Friday an investi gation of At l anta 's recent prim ary e lection produc ed no e videnc e that any ir regul ar it i e s took place .
     • Thejury further s aid in term - end present ment s thatthe City Ex ecutive Commit t e e ,which had over - all charg e ofthe e lection , d e serv e s the pra is e and than k softhe City of At l anta forthe man ner in whichthe e lection was conduc ted.
     Some chunks are too big (e.g. "thatthe"); some chunks are too small (e.g. "Ju ry").

  5. 2. MDL Summary
     1. Word segmentation is possible, using (1) variable-length strings (multigrams), (2) a probabilistic model of a corpus, and (3) a search for maximum likelihood, if (4) we use MDL to tell us when to stop adding to the lexicon.
     2. The results are interesting, but they suffer from being incapable of modeling real linguistic structure beyond simple chunks.


  7. 2. MDL Question: Will we find that types of linguistic structure correspond naturally to ways of improving our MDL model, either to increase the probability of the data, or to decrease the size of the grammar?

  8. 3. Morphology (primo) Problem: Given a set of words, find the best morphological structure for the words, where "best" means it maximally agrees with linguists (where they agree with each other!). Because we are going from larger units to smaller units (words to morphemes), the probability of the data is certain to decrease. The improvement will come from drastically shortening the grammar, i.e., from discovering regularities.

  9. 3. Morphology Naïve MDL
     Corpus: jump, jumps, jumping, laugh, laughed, laughing, sing, sang, singing, the, dog, dogs (total: 62 letters)
     Analysis:
        Stems: jump laugh sing sang dog (20 letters)
        Suffixes: s ing ed (6 letters)
        Unanalyzed: the (3 letters)
     (total: 29 letters)
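The slide's letter counts can be checked mechanically; a minimal sketch, with the corpus and the analysis taken from the slide (the exact total for the unanalyzed word list depends on how it is read off the slide):

```python
# Letter-count comparison: listing every word whole vs. factoring the
# lexicon into stems + suffixes + unanalyzed words.
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

whole = sum(len(w) for w in corpus)                       # every word spelled out
factored = sum(len(s) for s in stems + suffixes + unanalyzed)
print(whole, factored)
```

The factored lexicon comes to 29 letters, as on the slide, roughly half the cost of listing every word in full.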

  10. 3. Morphology Model/heuristic
      1st approximation: a morphology is:
      1. a list of stems,
      2. a list of affixes (prefixes, suffixes), and
      3. a list of pointers indicating which combinations are permissible.
      Unlike the word segmentation problem, we now have no obvious search heuristics. These are very important (for that reason), and I will not talk about them.

  11. 3. Morphology Size of model
      M[orphology] = { Stems T, Affixes F, Signatures Σ }
      |M| = |T| + |F| + |Σ|
      |T| = Σ_{t ∈ T} |t|, and likewise |F| = Σ_{f ∈ F} |f|, where the length of a string s is either
          length(s) · log 26
      or, using letter frequencies,
          Σ_{i=1}^{|s|} −log freq(s[i]).
      |Σ| = Σ_{σ ∈ Σ} |σ|. What is a signature, and what is its length?

  12. 3. Morphology What is a signature?
      { account, appeal, attack, 40 more ... } × { NULL, ed, ing }
      { élevé, équipé, étonnant, 78 more } × { NULL, e, s, es }

  13. What is the length (= information content) of a signature? A signature is an ordered pair of two sets of pointers: (i) a set of pointers to stems; and (ii) a set of pointers to affixes. The length of a pointer p is −log freq(p). So the total length of the signatures is:
          Σ_{σ ∈ Sigs} ( Σ_{t ∈ Stems(σ)} log( [W] / [t] )  +  Σ_{f ∈ Suffixes(σ)} log( [σ] / [f in σ] ) )
      (sum over signatures; within each, a sum over stem pointers and a sum over suffix pointers)
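A sketch of the signature-length computation. The counts below are invented for illustration (the signature is the slide's {jump, walk} × {NULL, ed, ing} example, but its token frequencies and the corpus size W are assumptions):

```python
import math

def signature_length(stem_counts, affix_counts, W):
    """Length in bits of one signature: a pointer to stem t costs
    log2([W]/[t]), and a pointer to affix f costs log2([sigma]/[f in sigma]),
    where [sigma] is the signature's total token count."""
    sigma = sum(stem_counts.values())          # tokens covered by the signature
    stem_bits = sum(math.log2(W / c) for c in stem_counts.values())
    affix_bits = sum(math.log2(sigma / c) for c in affix_counts.values())
    return stem_bits + affix_bits

# Hypothetical counts for the {jump, walk} x {NULL, ed, ing} signature:
W = 1000                                       # total word tokens in the corpus
stems = {"jump": 60, "walk": 40}               # token counts of each stem
affixes = {"NULL": 50, "ed": 30, "ing": 20}    # affix counts inside the signature
print(f"{signature_length(stems, affixes, W):.2f} bits")
```

Note how frequent stems and affixes get short pointers, so a signature over common material is cheap to state.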

  14. 3. Morphology Generation 1
      Linguistica http://linguistica.uchicago.edu
      Initial pass: assumes that words are composed of 1 or 2 morphemes; finds all cases where signatures exist with at least 2 stems and 2 affixes:
      { jump, walk } × { NULL, ed, ing }

  15. 3. Morphology Generation 1 Then it refines this initial approximation in a large number of ways, always trying to decrease the description length of the initial corpus.

  16. 3. Morphology Refinements
      1. Correct errors in segmentation:
         { affirmati, aggressi, attenti, 20 more } × { on, ve }  ⇒  { affirm, aggress, attent, 20 more } × { ion, ive }
      2. Create signatures with only one observed stem: we have NULL, ed, ion, s as suffixes, but only one stem (act) with exactly those suffixes.

  17. 3. Morphology 3. Find recursive structure: allow stems to be analyzed.
      Minilexicon 1: Words 1, Stems 1, Affixes 1, Signatures 1
      Minilexicon 2: Words 2 = Stems 1; Stems 2, Affixes 2, Signatures 2

  18. 3. Morphology French roots (figure)

  19. 3. Morphology 4. Detect allomorphy
      Signature: <e>ion . NULL
      composite concentrate corporate détente discriminate evacuate inflate opposite participate probate prosecute tense
      What is this? composite and composition: composite → composit → composit + ion. It infers that ion deletes a stem-final 'e' before attaching.
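The <e>ion inference can be sketched as a check over an attested word list. Everything here is an invented toy: the word set and the `ion_form` helper are illustrative assumptions, not Linguistica's actual procedure:

```python
# Sketch of the allomorphy check: apply the <e>ion rule to each word and
# keep the pairs whose derived form is also attested. Toy word list.
attested = {"composite", "composition", "concentrate", "concentration",
            "evacuate", "evacuation", "participate", "participation"}

def ion_form(word):
    """Apply the <e>ion rule: delete a stem-final 'e', then attach 'ion'."""
    stem = word[:-1] if word.endswith("e") else word
    return stem + "ion"

pairs = [(w, ion_form(w)) for w in sorted(attested) if ion_form(w) in attested]
print(pairs)
```

Each surviving pair (e.g. composite / composition) is evidence for the signature with the e-deleting allomorph of ion.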

  20. 3. Morphology 3. Summary Works very well on European languages. Challenges: 1. Works very poorly on languages with richer morphologies (average # morphemes/word >> 2). (Most languages have rich morphologies.) 2. Various other deficiencies.

  21. 4. Morphology ( secundo ) The initial bootstrap in the previous version does not even work on most languages, where the expected morphology contains sequences of 5 or more morphemes.

  22.-28. 4. Morphology Swahili verb (figure, built up across seven slides). The verb is a sequence of slots, labeled in turn:
      Subject marker - Tense marker - Object marker - Root - Voice (active/passive) - Final vowel
      Morphemes shown include subject and object markers ni, u, a, tu, m, wa, ku; tense markers li, ta, na, ka; roots such as pend, imb, fik, sem, som, taka, on, chaku; the passive marker w; and the final vowel a (or Ø).

  29. 4. Morphology Finite state automaton (FSA) (figure: states PF 1, SF 1, SF 2, SF 3)

  30. 4. Morphology Signature as FSA: reduces false positives
      { jump, walk } × { NULL, ed, ing }
      (figure: states PF 1, SF 1, SF 2, PF 3, SF 3)

  31. 4. Morphology Generalize the signature… (figure: morphemes M1-M9 arranged in three columns) Sequential FSA: each state has a unique successor.

  32. 4. Morphology Alignments (figure: two aligned FSAs, morphemes m1-m4 across states 1-4)

  33. Alignments: String edit distance algorithm
      n i l i m u p e n d a
      n i t a k a m u p e n d a

  34. 4. Morphology Alignments: make cuts
      n i | l i     | m u p e n d a
      n i | t a k a | m u p e n d a
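The alignment on these slides can be computed with the standard string edit distance (Levenshtein) dynamic program plus a traceback; a minimal sketch:

```python
def align(a, b):
    """Levenshtein alignment with traceback: returns the two strings padded
    with '-' so that matched positions line up, plus the edit distance."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                        # delete a[i-1]
                           dp[i][j - 1] + 1,                        # insert b[j-1]
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # match/sub
    # Trace back from the corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b)), dp[n][m]

x, y, d = align("nilimupenda", "nitakamupenda")
print(x)
print(y)
print("distance:", d)
```

The gap positions in the traceback are exactly where the "cuts" of the next slide fall: the shared ni- prefix and -mupenda suffix line up, isolating li vs. taka.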

  35. 4. Morphology Elementary alignment (figure: morphemes m1-m4 across states 1-4)

  36. Collapsing elementary alignments (figure: two elementary alignments with identical contexts, one with m2/m3 and one with m7/m8 between the shared context states)

  37. Two or more sequential FSAs with identical contexts are collapsed: (figure: the collapsed FSA offers m2/m7 and m3/m8 between the shared contexts)

  38. 3. Further collapsing FSAs
      { li, na } - a - { yesema }
      { li, na } - a - { mfuata }
      ⇒ { li, na } - a - { yesema, mfuata }

  39. 4.3 Top templates: 8,200 Swahili words

      State 1                                  State 2                  State 3
      a, wa (sg., pl. human subject markers)   246 stems
      ku, hu (infinitive, habitual markers)    51 stems
      wa (pl. subject marker)                  ka, li (tense markers)   25 stems
      a (sg. subject marker)                   ka, li (tense markers)   29 stems
      a (sg. subject marker)                   ka, na (tense markers)   28 stems
      37 strings                               w (passive marker) / Ø   a

  40. 4. Morphology Precision and recall

                              Precision   Recall   F-score
      String edit distance    0.77        0.57     0.65
      Stem-affix              0.54        0.14     0.22
      Affix-stem              0.68        0.20     0.31

  41. Collapsed templates

      #  One template             The other template         Collapsed template            % found on Yahoo search
      1  {a}-{ka,na}-{stems}      {a}-{ka,ki}-{stems}        {a}-{ka,ki,na}-{stems}        86 (37/43)
      2  {wa}-{ka,na}-{stems}     {wa}-{ka,ki}-{stems}       {wa}-{ka,ki,na}-{stems}       95 (21/22)
      3  {a}-{ka,ki,na}-{stems}   {wa}-{ka,ki,na}-{stems}    {a,wa}-{ka,ki,na}-{stems}     84 (154/183)
      4  {a}-{liye,me}-{stems}    {a}-{liye,li}-{stems}      {a}-{liye,li,me}-{stems}      100 (21/21)
      5  {a}-{ki,li}-{stems}      {wa}-{ki,li}-{stems}       {a,wa}-{ki,li}-{stems}        90 (36/40)
      6  {a}-{lipo,li}-{stems}    {wa}-{lipo,li}-{stems}     {a,wa}-{lipo,li}-{stems}      90 (27/30)
      7  {a,wa}-{ki,li}-{stems}   {a,wa}-{lipo,li}-{stems}   {a,wa}-{ki,lipo,li}-{stems}   74 (52/70)
      8  {a}-{na,naye}-{stems}    {a}-{na,ta}-{stems}        {a}-{na,ta,naye}-{stems}      80 (12/15)

  42. 4.1 Evaluating the robustness of these templates (sequential FSAs)
      • Measure: How many letters do we save by expressing words in a template rather than by writing each one out individually? Answer: 36 - 17 = 19.
      { li, na } - a - { yesema, mfuata }
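The 36 - 17 = 19 figure can be reproduced directly from the template on this slide; a minimal sketch:

```python
from itertools import product

# How many letters does the template save? Compare writing out every word
# in full against listing each slot's morphemes once.
template = [["li", "na"], ["a"], ["yesema", "mfuata"]]

words = ["".join(p) for p in product(*template)]         # all 2 x 1 x 2 words
listed = sum(len(w) for w in words)                      # every word written out
slots = sum(len(m) for slot in template for m in slot)   # template letters
print(listed, slots, listed - slots)
```

The saving grows multiplicatively with the number of morphemes per slot, which is why rich templates compress so well.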

  43. 4. Morphology Most edges are convergent
      (figure: Spanish adjectives car-, rubi-, negr-, pequeñ- + -o / -a + -s / Ø; verbs com-, cre- + -o, -es, -e, -emos)

  44. But some diverge (Spanish): Participle-forming suffix

  45. 4. Morphology English has much the same:

  46. 4. Morphology 4. Summary We need to enrich the heuristics and consider a broader set of possible grammars. With that, there seems at present to be no limit to the possible improvements. Focus: decrease the length of the analysis, especially the length of the substance (morphemes) described.

  47. 5. Phonology So far we have said little about phonology. We have assumed no interesting probabilistic model of segment (= phoneme) placement (a 0th- or 1st-order Markov model). But we can shorten the length of the grammar by taking this into consideration.

  48. 5. Phonology These slides present material done jointly with Aris Xanthos and with Jason Riggle.

  49. 5. Phonology A much more interesting model (figure: two states V and C, with transition probabilities x, 1-x, y, 1-y); and the same model for emissions: both states emit all of the symbols, but with different probabilities…

  50. 5. Phonology (figure: the two-state model again; each state has its own emission distribution over all symbols, v_1…v_8 for V and c_1…c_8 for C, with Σ_i v_i = 1 and Σ_i c_i = 1)

  51. 5. Phonology The question is… • How could we obtain the best probabilities for x and y (transition probabilities) , and all of the emission probabilities for the two states? • Bear in mind: each state generates all of the symbols. The only way to ensure that a state does not generate a symbol s is to assign a zero probability for the emission of the symbol s in that state.

  52. 5. Phonology Hidden Markov model With a well-understood training algorithm, an HMM will find the optimal parameters to generate the data so as to assign it the highest probability. How does it organize the phonological data?
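A sketch of such a two-state HMM, scored with the forward algorithm. The alphabet and all the probabilities here are invented (untrained) illustrations, not values from the slides; training them would be done with Baum-Welch, which is not shown:

```python
from itertools import product

# A two-state HMM over a toy alphabet: both states can emit every symbol,
# with different probabilities (illustrative numbers only).
states = ["C", "V"]
init = {"C": 0.5, "V": 0.5}
trans = {"C": {"C": 0.3, "V": 0.7},    # x, 1-x in the slides' notation
         "V": {"C": 0.8, "V": 0.2}}    # y, 1-y
emit = {"C": {"t": 0.6, "a": 0.1, "n": 0.3},
        "V": {"t": 0.1, "a": 0.8, "n": 0.1}}

def forward(word):
    """Probability that the HMM generates `word`, summed over state paths."""
    alpha = {s: init[s] * emit[s][word[0]] for s in states}
    for ch in word[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][ch]
                 for s in states}
    return sum(alpha.values())

print(forward("tan"))
# Sanity check: the probabilities of all strings of length 3 sum to 1.
print(sum(forward("".join(w)) for w in product("tan", repeat=3)))
```

Because each state emits every symbol, only the learned probabilities, not any hand-built structure, can push vowels toward one state and consonants toward the other.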

  53. 5. Phonology English FSA

  54. (figure: Pr(State 1 → State 1) and Pr(State 2 → State 2); rhythm, syllabification)

  55. (figure: the learned automaton, from start to end)

  56. English: Log ratios of the emission probabilities of the 2 states:
          log( p_1(φ) / p_2(φ) )
      (figure: symbols ranked from negative to positive)

  57. 5. Phonology

  58. French: Log ratios of the emission probabilities of the 2 states:
          log( p_1(φ) / p_2(φ) )
      (figure: symbols ranked from positive to negative)

  59. (figure: the learned automaton, from start to end)

  60. Finnish: Log ratios of the emission probabilities of the 2 states:
          log( p_1(φ) / p_2(φ) )
      (figure: symbols ranked from positive to negative)

  61. 5. Phonology Finnish vowels and their harmony (figure)
