Learning morphology and phonology
John Goldsmith University of Chicago MoDyCo/Paris X
All the particular properties that give a language its unique phonological character can be expressed in numbers.
- Nicolai Trubetzkoy, Grundzüge der Phonologie
Acknowledgments
My thanks for many conversations to Aris Xanthos, Yu Hu, Mark Johnson, Carl de Marcken, Bernard Laks, Partha Niyogi, Jason Riggle, Irina Matveeva, and others…
Roadmap
- 1. Unsupervised word segmentation
- 2. MDL: Minimum Description Length
- 3. Unsupervised morphological analysis
Model; heuristics.
- 4. Elaborating the morphological model
- 5. Improving the phonological model: categories (consonants/vowels); vowel harmony
- 6. What kind of linguistics is this?
- 0. Why mathematics? Why phonology?
One answer: mathematics provides an alternative to cognitivism, the view that linguistics is a cognitive science. Cognitivism is the latest form, in linguistics, of psychologism, a view that has faded in and out of favor in all of the social sciences for the last 150 years: the view that the way to understand x is to understand how people analyze x.
- This work provides an answer to the challenge: if linguistics is not a science of what goes on in a speaker's head, then what is it a science of?
- 1. Word segmentation
The inventory of words in a language is a major component of the language, and very little of it (if any) can be attributed to universal grammar, or be viewed as part of the essence of language. So how is it learned?
Reporting work by Michael Brent and by Carl de Marcken at MIT in the mid-1990s.
Okay, Ginger! I’ve had it! You stay out of the garbage! Understand, Ginger? Stay out of the garbage, or else! Blah blah, Ginger! Blah blah blah blah blah blah Ginger blah blah blah blah blah blah blah…
- Strategy: We assume that a speaker has a lexicon, with a probability distribution assigned to it, and that the parse assigned to a string is the parse with the greatest probability.
- That is already a (partial) hypothesis about word-parsing: given a lexicon, choose the parse with the greatest probability.
- It can also serve as part of a hypothesis about lexicon-selection.
Assume an alphabet A. An utterance is a string of letters, an element of A*; a corpus is a set of utterances. The language model used is a multigram model (variable-length words). A lexicon is a pair of objects (L, p_L): a set L ⊂ A*, and a probability distribution p_L defined on A* for which L is the support of p_L. We call the members of L words.

- We insist that A ⊂ L: all individual letters are words.
- We define a sentence as a member of L*.
- Each sentence can be uniquely associated with an utterance (an element of A*) by a mapping F:
F maps L* (all strings of words, built from the lexicon) onto A* (all strings of letters, built from the alphabet). For example, F maps the sentence "au début était le verbe" to the utterance "audébutétaitleverbe". If F(S) = U, then we say that S is a parse of U.
- The distribution p over L is extended to a distribution p* over L* in the natural way. We assume a probability distribution λ over sentence lengths, with $\sum_{i=1}^{\infty} \lambda(i) = 1$. If S is a sentence of length l = |S|, then

$$p^*(S) = \lambda(l) \prod_{i=1}^{l} p(S[i])$$
Now we can define the probability of a corpus, given a lexicon. U is an utterance; L, a lexicon:

$$p(U \mid L) = \max_{q \in parses(U)} pr(q)$$

You might think it should instead be the sum of the probabilities of the parses of U. That would also be reasonable:

$$p(U \mid L) = \sum_{q \in parses(U)} pr(q)$$

Calculating either the argmax or the sum requires dynamic programming techniques.
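To make the dynamic programming concrete, here is a minimal sketch of finding the most probable parse. The function name and the toy lexicon are invented for illustration, and the λ term over sentence lengths is omitted.

```python
import math

def best_parse(utterance, lexicon):
    """Dynamic programming (Viterbi-style): return the log probability and
    word sequence of the most probable parse of `utterance`, given a
    `lexicon` dict mapping words (including every single letter) to
    probabilities."""
    n = len(utterance)
    best = [(-math.inf, [])] * (n + 1)   # best[i]: best parse of utterance[:i]
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            word = utterance[j:i]
            if word in lexicon and best[j][0] > -math.inf:
                score = best[j][0] + math.log(lexicon[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n]

# A toy lexicon: all 26 letters plus two multi-letter words.
lexicon = {"le": 0.2, "verbe": 0.2,
           **{c: 0.6 / 26 for c in "abcdefghijklmnopqrstuvwxyz"}}
print(best_parse("leverbe", lexicon))   # parses as ['le', 'verbe']
```

Summing over parses instead of maximizing uses the same table, with log-sum-exp in place of max.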
Best lexicon for a corpus C?

You might expect that the best lexicon for a corpus would be the lexicon that assigns the highest probability to the joint object which is the corpus C:

$$\hat{L} = \arg\max_{L \subset A^*} pr(C \mid L)$$

But no: such a lexicon would simply consist of all the members of the corpus. A sentence is its own best probability model.
- 2. Minimum Description Length (MDL) analysis

MDL is an approach to statistical analysis that assumes that, prior to analyzing any data, we have a universe of possible models (= UG); each element G ∈ UG is a probabilistic model for the set of possible corpora; and a prior distribution π(·) has been defined over UG, based on the length of the shortest binary encoding of each G, where the encoding method has the prefix property: $\pi(G) = 2^{-length(Enc(G))}$.
2.1 Bayes’ rule
$$pr(G \mid C) = \frac{pr(C \mid G)\,\pi(G)}{pr(C)} = \frac{p^*_G(C)\,\pi(G)}{\int_{g \in UG} p^*_g(C)\,\pi(g)\,dg}$$
Taking logs:

$$\log pr(G \mid C) = \log p_G(C) - H(G) - K$$

where $\log p_G(C)$ is the log probability of the corpus under grammar G, $H(G) = length(Enc(G))$ is the length of G's encoding (so $\log \pi(G) = -H(G)$), and $K = \log pr(C)$ is a constant that does not depend on G.
We already figured out how to compute $\log p_G(C)$, given G = (L, p). The length of G's encoding can be approximated as

$$H(G) \approx \sum_{w \in G} |w| \cdot \log 26$$
How one talks in MDL…

It is sensible to call $-\log prob(X) = \log\frac{1}{prob(X)}$ the information content of an item X, and also to refer to that quantity as the optimal compressed length of X. In light of that, we can call the following quantity the description length of corpus C, given grammar G:

$$DL(C, G) = \underbrace{-\log prob(C \mid G)}_{\text{compressed length of corpus}} + \underbrace{length(Enc(G))}_{\text{compressed length of grammar}}$$

which equals $-\log prob(G \mid C)$ plus a constant. This is the evaluation metric of early generative grammar.
MDL dialect

- MDL analysis: find the grammar G for which the total description length is the smallest: the compressed length of the data, given G, plus the compressed length of G.
Essence of MDL [figure]
2.2 Search heuristic
Easy! Start small: the initial lexicon = A. If l1 and l2 are in L, and l1·l2 occurs in the corpus, add l1·l2 to the lexicon if that modification decreases the description length. Similarly, remove l3 from the lexicon if that decreases the description length.
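A minimal sketch of that heuristic, under simplifying assumptions: the corpus is kept in already-parsed form, word probabilities are the empirical frequencies of the current parse, the lexicon is a set of words, and the grammar cost is the log 26 bits-per-letter approximation from above. All names are invented.

```python
import math
from collections import Counter

def corpus_cost(parsed_corpus):
    """-log2 prob of the corpus when each word's probability is its
    empirical frequency in the current parse."""
    counts = Counter(w for sentence in parsed_corpus for w in sentence)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def grammar_cost(lexicon):
    """Crude encoding length of the lexicon: log2(26) bits per letter."""
    return sum(len(w) for w in lexicon) * math.log2(26)

def merge_adjacent(sentence, l1, l2):
    """Replace each adjacent pair (l1, l2) in a parsed sentence by l1+l2."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) == (l1, l2):
            out.append(l1 + l2)
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

def try_add(parsed_corpus, lexicon, l1, l2):
    """Add the word l1+l2 only if it decreases the description length."""
    before = corpus_cost(parsed_corpus) + grammar_cost(lexicon)
    candidate = [merge_adjacent(s, l1, l2) for s in parsed_corpus]
    new_lex = lexicon | {l1 + l2}
    after = corpus_cost(candidate) + grammar_cost(new_lex)
    return (candidate, new_lex) if after < before else (parsed_corpus, lexicon)
```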
MDL tells us when to stop growing the lexicon

If we search for words in a bottom-up fashion, we need a criterion for when to stop making bigger pieces. MDL plays that role in this approach.
A little example to fix ideas: how do these two multigram models of English compare, and why is Lexicon 2 better?

Lexicon 1: {a, b, …, s, t, u, …, z}
Lexicon 2: {a, b, …, s, t, th, u, …, z}
Notation: [t] = count of t; [h] = count of h; [th] = count of th; Z = total number of word tokens, $Z = \sum_{l \in lexicon} [l]$.

Log probability of the corpus:

$$\log pr(C) = \sum_{m \in lexicon} [m] \log \frac{[m]}{Z}$$
Under Lexicon 1, all letters are separate:

$$\log pr_1(C) = [t]_1 \log\frac{[t]_1}{Z_1} + [h]_1 \log\frac{[h]_1}{Z_1} + \sum_{m \neq t,h} [m]_1 \log\frac{[m]_1}{Z_1}$$

Under Lexicon 2, th is treated as a separate chunk:

$$\log pr_2(C) = [t]_2 \log\frac{[t]_2}{Z_2} + [h]_2 \log\frac{[h]_2}{Z_2} + [th]_2 \log\frac{[th]_2}{Z_2} + \sum_{m \neq t,h} [m]_2 \log\frac{[m]_2}{Z_2}$$

where the counts are related by

$$[t]_2 = [t]_1 - [th], \qquad [h]_2 = [h]_1 - [th], \qquad Z_2 = Z_1 - [th]$$
Define $\Delta f = \log\frac{f_2}{f_1}$. Then

$$\Delta \log pr(C) = \underbrace{Z \log\frac{Z_1}{Z_2}}_{\text{effect of having fewer "words" altogether}} + \underbrace{[t]\,\Delta pr(t) + [h]\,\Delta pr(h)}_{\text{effect of the frequencies of /t/ and /h/ decreasing}} + \underbrace{[th]\,\Delta pr(th)}_{\text{effect of /th/ being treated as a unit rather than separate pieces}}$$

This quantity is positive if Lexicon 2 is better.
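A worked numeric check of that comparison, with invented counts:

```python
import math
from collections import Counter

def log_prob(counts):
    """log2 probability of the corpus under empirical word frequencies."""
    z = sum(counts.values())
    return sum(c * math.log2(c / z) for c in counts.values())

# Invented letter counts for a small corpus.
lex1 = Counter({"t": 90, "h": 60, "e": 120, "x": 30})  # Lexicon 1: letters only
th = 50                                                 # occurrences of 'th'
lex2 = Counter(lex1)                                    # Lexicon 2: carve out 'th'
lex2["t"] -= th
lex2["h"] -= th
lex2["th"] = th

delta = log_prob(lex2) - log_prob(lex1)
print(f"gain from treating 'th' as a chunk: {delta:.1f} bits")  # positive here
# MDL then asks whether this gain exceeds the cost of adding 'th'
# to the lexicon (roughly 2 * log2(26), about 9.4 bits).
```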
2.3 Results
- The Fulton County Grand Ju ry s aid Friday an investi gation of At l anta 's recent prim ary e lection produc ed no e videnc e that any ir regul ar it i e s took place .
- Thejury further s aid in term - end present ment s thatthe City Ex ecutive Commit t e e ,which had over - all charg e ofthe e lection , d e serv e s the pra is e and than k softhe City of At l anta forthe man ner in whichthe e lection was conduc ted.

Some chunks are too big (thatthe, ofthe); some chunks are too small (s aid, e lection).
Summary

- 1. Word segmentation is possible, using (1) variable-length strings (multigrams), (2) a probabilistic model of a corpus, and (3) a search for maximum likelihood, if (4) we use MDL to tell us when to stop adding to the lexicon.
- 2. The results are interesting, but they suffer from being incapable of modeling real linguistic structure beyond simple chunks.
Question:

Will we find that types of linguistic structure correspond naturally to ways of improving our MDL model, either to increase the probability of the data, or to decrease the size of the grammar?
- 3. Morphology (primo)
Problem: Given a set of words, find the best morphological structure for the words, where "best" means it maximally agrees with linguists (where they agree with each other!). Because we are going from larger units to smaller units (words to morphemes), the probability of the data is certain to decrease. The improvement will come from drastically shortening the grammar, i.e., from discovering regularities.
Naïve MDL
Corpus: jump, jumps, jumping, laugh, laughed, laughing, sing, sang, singing, the, dog, dogs (61 letters in all). Analysis: stems jump, laugh, sing, sang, dog (20 letters); suffixes s, ing, ed (6 letters); unanalyzed: the (3 letters); total: 29 letters.
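A quick check of those letter counts:

```python
corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

count = lambda ws: sum(len(w) for w in ws)
print("corpus:  ", count(corpus), "letters")                          # 61
print("analysis:", count(stems + suffixes + unanalyzed), "letters")   # 29
```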
Model/heuristic

A first approximation: a morphology is

- 1. a list of stems,
- 2. a list of affixes (prefixes, suffixes), and
- 3. a list of pointers indicating which combinations are permissible.

Unlike the word segmentation problem, we now have no obvious search heuristics. These are very important (for that reason), and I will not talk about them.
Size of model

M[orphology] = { Stems T, Affixes F, Signatures Σ }

$$|M| = |T| + |F| + |\Sigma|$$

where $|T| = \sum_{t \in T} |t|$, $|F| = \sum_{f \in F} |f|$, and $|\Sigma| = \sum_{\sigma \in \Sigma} |\sigma|$. The length of a string s is $|s| \cdot \log 26$, or, using letter frequencies, $\sum_{i=1}^{|s|} -\log freq(s[i])$. Note the extensivity of the measure: it is a sum over stems, affixes, and signatures. What is a signature, and what is its length?
What is a signature?

- ing.ed.NULL: attack, appeal, account, … (40 stems)
- es.s.e.NULL: étonnant, équipé, élevé, … (78 stems)
What is the length (= information content) of a signature?

A signature is an ordered pair of two sets of pointers: (i) a set of pointers to stems, and (ii) a set of pointers to affixes. The length of a pointer p is $-\log freq(p)$, so the total length of the signatures is:

$$|\Sigma| = \sum_{\sigma \in Sigs} \left( \sum_{t \in Stems(\sigma)} \log\frac{[W]}{[t]} + \sum_{f \in Suffixes(\sigma)} \log\frac{[\sigma]}{[f\ in\ \sigma]} \right)$$

(the outer sum runs over signatures; the inner sums run over stem pointers and suffix pointers).
Generation 1: Linguistica

http://linguistica.uchicago.edu

Initial pass: assume that words are composed of 1 or 2 morphemes; find all cases where signatures exist with at least 2 stems and 2 affixes, e.g. the signature NULL.ed.ing with stems walk, jump.
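A toy sketch of such an initial pass (not Linguistica's actual code): cut every word at every position, then keep only suffix sets shared by at least two stems.

```python
from collections import defaultdict

def find_signatures(words):
    """Cut each word into stem + suffix at every position (empty suffix =
    NULL), group stems by the exact set of suffixes they occur with, and
    keep signatures with at least 2 stems and 2 affixes."""
    stem_sufs = defaultdict(set)
    for w in set(words):
        for i in range(1, len(w) + 1):
            stem, suf = w[:i], w[i:] or "NULL"
            stem_sufs[stem].add(suf)
    sig_stems = defaultdict(set)
    for stem, sufs in stem_sufs.items():
        if len(sufs) >= 2:
            sig_stems[frozenset(sufs)].add(stem)
    return {sig: stems for sig, stems in sig_stems.items() if len(stems) >= 2}

print(find_signatures(["walk", "walks", "walked", "jump", "jumps", "jumped"]))
# {frozenset({'NULL', 's', 'ed'}): {'walk', 'jump'}}
```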
Generation 1

Then it refines this initial approximation in a large number of ways, always trying to decrease the description length of the initial corpus.
Refinements

- 1. Correct errors in segmentation.
- 2. Create signatures with only one observed stem: we have NULL, ed, ion, s as suffixes, but only one stem (act) with exactly those suffixes.
- E.g. the signature ive.ion (stems attent, aggress, affirm, …) ⇒ ve.n (stems attenti, aggressi, affirmati, …), 20 stems each.
- 3. Find recursive structure: allow stems themselves to be analyzed.

[Figure: Minilexicon 1 (Words1 → Signatures1, Stems1, Affixes1) feeds Minilexicon 2, where Words2 = Stems1 (→ Signatures2, Stems2, Affixes2).]
French roots
- 4. Detect allomorphy.

Signature <e>ion.NULL: composite, concentrate, corporate, détente, discriminate, evacuate, inflate, opposite, participate, probate, prosecute, tense. What is this? composite and composition: composite → composit → composit + ion. The learner infers that ion deletes a stem-final 'e' before attaching.
- 3. Summary

Works very well on European languages. Challenges:

- 1. Works very poorly on languages with richer morphologies (average # morphemes/word >> 2). (Most languages have rich morphologies.)
- 2. Various other deficiencies.
- 4. Morphology (secundo)
The initial bootstrap in the previous version does not even work on most languages, where the expected morphology contains sequences of 5 or more morphemes.
[Figure: FSA template for the Swahili verb, position by position. Subject markers: ni, u, a, tu, wa; tense markers: li, ka, ta, taka, na; object markers: ni, ku, m, tu, wa; roots: pend, fik, sem, som, on, l, imb, chaku; voice (active/passive): w, Ø; final vowel: a.]
Finite state automaton (FSA)

[Figure: an FSA with prefix positions (PF1, PF3) and suffix positions (SF1, SF2, SF3).]

Signature: reduces false positives; e.g. the signature with stems walk, jump and suffixes NULL, ed, ing licenses only the attested stem-affix combinations.
Generalize the signature…

[Figure: morphemes M1 through M9 arranged in an FSA.]

Sequential FSA: each state has a unique successor.
Alignments

[Figure: an elementary alignment over FSA states 1, 2, 3, 4 with morphemes m1 through m4.]
Alignments: string edit distance algorithm

n i | l i     | m u p e n d a
n i | t a k a | m u p e n d a

Alignments tell us where to make cuts: the shared spans (ni, mupenda) separate from the differing spans (li vs. taka).
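A sketch of the alignment step: standard edit-distance dynamic programming, with a traceback that collects maximal runs of matching letters (the candidate shared morphemes). Names are invented.

```python
def align(s1, s2):
    """Edit distance with traceback; returns (distance, shared_runs),
    where shared_runs are maximal stretches of matched letters."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete from s1
                          d[i][j - 1] + 1,          # insert from s2
                          d[i - 1][j - 1] + cost)   # match or substitute
    i, j, runs, run = n, m, [], ""
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1] and d[i][j] == d[i - 1][j - 1]:
            run = s1[i - 1] + run                   # extend a matched run
            i, j = i - 1, j - 1
        else:
            if run:
                runs.append(run)
                run = ""
            if d[i][j] == d[i - 1][j - 1] + 1:      # substitution
                i, j = i - 1, j - 1
            elif d[i][j] == d[i - 1][j] + 1:        # deletion
                i -= 1
            else:                                   # insertion
                j -= 1
    if run:
        runs.append(run)
    return d[n][m], runs[::-1]

print(align("nilimupenda", "nitakamupenda"))
# -> (4, ['ni', 'mupenda']); the residues 'li' and 'taka' are the cuts
```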
Elementary alignment

[Figure: the alignment above as a sequential FSA over states 1, 2, 3, 4 with morphemes m1 through m4.]
Collapsing elementary alignments

Two or more sequential FSAs with identical contexts are collapsed. [Figure: two elementary alignments with identical contexts merged into a single FSA over morphemes m1 through m8.]
Further collapsing FSAs: the templates {a}-{na, li}-{yesema} and {a}-{na, li}-{mfuata} agree everywhere except in one position, so they collapse into {a}-{na, li}-{yesema, mfuata}.
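A sketch of the collapsing test, where a template is represented as a tuple of morpheme sets, one per position (names invented):

```python
def collapse(t1, t2):
    """Collapse two sequential FSAs (tuples of morpheme sets) that differ
    in at most one position; the differing position becomes the union.
    Returns None when they differ in two or more positions."""
    if len(t1) != len(t2):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(t1, t2)) if a != b]
    if len(diffs) > 1:
        return None
    merged = list(t1)
    for i in diffs:
        merged[i] = t1[i] | t2[i]
    return tuple(merged)

t1 = ({"a"}, {"na", "li"}, {"yesema"})
t2 = ({"a"}, {"na", "li"}, {"mfuata"})
print(collapse(t1, t2))   # ({'a'}, {'na', 'li'}, {'yesema', 'mfuata'})
```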
4.3 Top templates: 8,200 Swahili words
- {a, wa} (sg., pl. human subject markers) + 246 stems
- {ku, hu} (infinitive, habitual markers) + 51 stems
- {wa} (pl. subject marker) + {ka, li} (tense markers) + 25 stems
- {a} (sg. subject marker) + {ka, li} (tense markers) + 29 stems
- {a} (sg. subject marker) + {ka, na} (tense markers) + 28 stems
- 37 strings + {w, Ø} (passive marker) + {a}
Precision and recall
                      Precision  Recall  F-score
String edit distance    0.77      0.57    0.65
Stem-affix              0.54      0.14    0.22
Affix-stem              0.68      0.20    0.31
Collapsed templates (rightmost column: % of the collapsed template's predicted forms found in a Yahoo search):

1. {a}-{ka,na}-{stems} + {a}-{ka,ki}-{stems} → {a}-{ka,ki,na}-{stems}: 86% (37/43)
2. {wa}-{ka,na}-{stems} + {wa}-{ka,ki}-{stems} → {wa}-{ka,ki,na}-{stems}: 95% (21/22)
3. {a}-{ka,ki,na}-{stems} + {wa}-{ka,ki,na}-{stems} → {a,wa}-{ka,ki,na}-{stems}: 84% (154/183)
4. {a}-{liye,me}-{stems} + {a}-{liye,li}-{stems} → {a}-{liye,li,me}-{stems}: 100% (21/21)
5. {a}-{ki,li}-{stems} + {wa}-{ki,li}-{stems} → {a,wa}-{ki,li}-{stems}: 90% (36/40)
6. {a}-{lipo,li}-{stems} + {wa}-{lipo,li}-{stems} → {a,wa}-{lipo,li}-{stems}: 90% (27/30)
7. {a,wa}-{ki,li}-{stems} + {a,wa}-{lipo,li}-{stems} → {a,wa}-{ki,lipo,li}-{stems}: 74% (52/70)
8. {a}-{na,naye}-{stems} + {a}-{na,ta}-{stems} → {a}-{na,ta,naye}-{stems}: 80% (12/15)
4.1 Evaluating the robustness of these templates (sequential FSAs)

- Measure: how many letters do we save by expressing words in a template rather than by writing each one out individually?
- Answer, for the template {a}-{na, li}-{yesema, mfuata}: the four words it generates (anayesema, aliyesema, anamfuata, alimfuata) take 36 letters written out, while the template's morphemes take only 17, so we save 36 - 17 = 19 letters.
Most edges are convergent…

[Figure: a Spanish FSA. Verb stems com-, cre- converge on the suffixes -emos, -es, -e; adjective stems car-, pequeñ-, rubi-, negr- converge on -a-/-Ø- plus -s.]
But some diverge (Spanish): [Figure: a participle-forming suffix with diverging edges.] English has much the same.
- 4. Summary
We need to enrich the heuristics and consider a broader set of possible grammars. With that, the room for improvement seems unlimited at this point. Focus: decrease the length of the analysis, especially the length of the substance (the morphemes) described.
- 5. Phonology
So far we have said little about phonology. We have assumed no interesting probabilistic model of segment (= phoneme) placement: just a 0th- or 1st-order Markov model. But we can shorten the length of the grammar by taking this into consideration.

These slides present material developed jointly with Aris Xanthos and with Jason Riggle.
Much more interesting model:

[Figure: a two-state automaton with states C and V; transition probabilities x and 1-x out of C, y and 1-y out of V.]

The same model is used for emissions: both states emit all of the symbols, but with different probabilities. State C emits symbol i with probability $c_i$ and state V with probability $v_i$, where

$$\sum_i c_i = 1, \qquad \sum_i v_i = 1$$
The question is…

- How could we obtain the best probabilities for x and y (the transition probabilities), and all of the emission probabilities for the two states?
- Bear in mind: each state generates all of the symbols. The only way to ensure that a state does not generate a symbol s is to assign a zero probability to the emission of the symbol s in that state.
Hidden Markov model
With a well-understood training algorithm, an HMM will find the optimal parameters to generate the data so as to assign it the highest probability. How does it organize the phonological data?
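A compact sketch of that training procedure (Baum-Welch / EM for a two-state HMM over letters). The toy text and all names are invented; a real experiment would train on a corpus.

```python
import numpy as np

def baum_welch(obs, n_states=2, n_symbols=26, n_iter=50, seed=0):
    """EM for a categorical HMM. `obs`: array of symbol codes. Returns
    transitions A (A[i, j] = p(j | i)), emissions B (B[i, s] = p(s | i)),
    and the initial distribution pi."""
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(n_states), n_states)
    B = rng.dirichlet(np.ones(n_symbols), n_states)
    pi = np.full(n_states, 1.0 / n_states)
    T = len(obs)
    for _ in range(n_iter):
        # scaled forward-backward, to avoid numerical underflow
        alpha = np.zeros((T, n_states))
        scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
        beta = np.zeros((T, n_states))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        # re-estimate transitions, emissions, initial distribution
        xi = np.zeros((n_states, n_states))
        for t in range(T - 1):
            x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi += x / x.sum()
        A = xi / xi.sum(axis=1, keepdims=True)
        B = np.zeros((n_states, n_symbols))
        for t in range(T):
            B[:, obs[t]] += gamma[t]
        B /= B.sum(axis=1, keepdims=True)
        pi = gamma[0]
    return A, B, pi

text = "banakupendasomalima" * 30            # invented letter sequence
obs = np.array([ord(c) - ord("a") for c in text])
A, B, pi = baum_welch(obs)
# The diagnostic used in the slides that follow: the log ratio of the two
# states' emission probabilities; its sign tends to separate the symbols
# into two classes (here, consonants vs. vowels).
with np.errstate(divide="ignore"):
    ratio = np.log(B[0]) - np.log(B[1])
for c in sorted(set(text)):
    print(c, round(float(ratio[ord(c) - ord("a")]), 2))
```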
English FSA

[Figure: the two-state automaton trained on English, with self-transition probabilities Pr(State 1 → State 1) and Pr(State 2 → State 2), plus start and end states. The two states capture rhythm and syllabification.]

English: log ratios of the emission probabilities of the 2 states, $\log\frac{p_1(\phi)}{p_2(\phi)}$. [Figure: segments ranked from negative to positive log ratio.]
French: log ratios of the emission probabilities of the 2 states, $\log\frac{p_1(\phi)}{p_2(\phi)}$. [Figure: segments ranked from positive to negative log ratio, with start and end states.]

Finnish: log ratios of the emission probabilities of the 2 states, $\log\frac{p_1(\phi)}{p_2(\phi)}$. [Figure: segments ranked from positive to negative log ratio.]
Finnish vowels and their harmony
- 6. What kind of linguistics is this?

It is an approach to linguistic analysis which is non-cognitivist: it makes no claims about hidden or occult properties of the human system (for which linguistic tools are not designed to provide answers). It welcomes psychologists, without claiming to replace them, or to do their job.
It asks linguists to study language as a natural phenomenon, and to evaluate their success like any other natural science.
I have not addressed two important areas of phonology: automatic morphophonology, and the geometry of phonological representations. Those will have to wait until next time.
Facts about a language L may be divided into (type 1) those facts that are particular to L, and (type 2) those that are shared by all languages. In all likelihood, type 1 information is vastly larger than type 2 information.
Type 2 information is: universal; in all likelihood, not learned, and not even learnable in a short time period; innate; not influenced by historical or cultural concerns.
It seems clear to me that linguistics is the study of both Type 1 and Type 2 information. Much of the focus in linguistic theory has been on Type 2 information (what is common to all acquisition paths). This work focuses on Type 1 information.
- Linguistics seeks the essence common to all languages. This essence can exist nowhere other than in the biological nature of the human being. This essence does not need to be learned. This essence can probably not be learned (in a reasonable time). This essence is UG.
- Linguistics seeks to analyze each human language. Languages vary, due to their history, to their speakers' history, and to the ends to which they are put. Finding ways to characterize each language adequately is the primary goal of linguistics; it is best accomplished by analyzing linguistic data in the same way that other sciences proceed, ceteris paribus.