SLIDE 1 Finite state morphology and phonology
Mans Hulden
mans.hulden@colorado.edu Natural Language Processing LING/CSCI 5832 Jan 22 2014
SLIDE 2 Composition
in+possible+ity → im+possible+ity → im+possibility → impossibility
[Figure: state diagrams of the lexicon transducer, the <n:m> rule, the ble+ity rule (<l:i> <e:l> <+:i> <i:t> <t:y> <y:0>), the <+:0> rule, and the single transducer obtained by composing them, which relates impossibility to in+possible+ity.]
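The derivation on this slide can be imitated in a few lines of plain Python. This is only a sketch: ordinary regular-expression substitutions stand in for the rule transducers, the rule contexts are read off the diagram labels and are assumptions, and the helper names (rule, compose, trace) are hypothetical. It shows how a cascade of rules collapses into a single mapping once composed.

```python
import re
from functools import reduce

def rule(pattern, replacement):
    """Wrap a regex substitution as a string -> string function."""
    return lambda s: re.sub(pattern, replacement, s)

# Assumed rule contexts, read off the arc labels in the figure.
nasal_assimilation = rule(r"n(?=\+p)", "m")   # in+possible -> im+possible
ble_ity = rule(r"ble\+ity", "bility")         # possible+ity -> possibility
delete_boundary = rule(r"\+", "")             # remove remaining "+"

RULES = [nasal_assimilation, ble_ity, delete_boundary]

def compose(*functions):
    """Return one function that applies the given functions left to right."""
    return lambda s: reduce(lambda acc, f: f(acc), functions, s)

def trace(word, rules):
    """List the intermediate forms produced by each rule in turn."""
    stages = [word]
    for r in rules:
        stages.append(r(stages[-1]))
    return stages

grammar = compose(*RULES)

if __name__ == "__main__":
    print(" > ".join(trace("in+possible+ity", RULES)))
    print(grammar("in+possible+ity"))
```

Unlike a composed transducer, these Python functions run in one direction only; a real FST can also be applied in reverse to recover in+possible+ity from impossibility.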
SLIDE 3 Composition
in+possible+ity+s → im+possible+ity+s → im+possibility+s → impossibilities
[Figure: the same lexicon, rule, and composed transducer diagrams as on the previous slide, with the composed transducer labelled impossibility and NEG+possible+ity+NOUN+PLURAL.]
SLIDE 4 Compilers
Several finite-state compilers available to do the hard work
- Xerox xfst (http://www.fsmbook.com)
- SFST (https://code.google.com/p/cistern/wiki/SFST)
- HFST (http://hfst.sf.net)
- OpenFST (http://www.openfst.org)
- Foma (http://foma.googlecode.com)
Demo with foma
SLIDE 5
Toy grammar of English
Toy lexicon: kiss, hire, spy
Possible suffixes: ed, ing, s
Generate kiss+s/kisses, spy+ed/spied, hire+ing/hiring, hire+ed/hired, etc.
More advanced version of this in tutorial form on: https://code.google.com/p/foma/wiki/MorphologicalAnalysisTutorial
SLIDE 6
Some derivations
Lexical form:  hire+ing   hire+ed   kiss+s
EInsert:       hire+ing   hire+ed   kisses
Edelete:       hir+ing    hir+ed    kisses
Delete +:      hiring     hired     kisses
SLIDE 7 Code
def Stems s p y | k i s s | h i r e ;
def Suffixes "+" [ 0 | s | e d | i n g ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;      # spy+s > spie+s
def YRule2 y -> i || _ "+" e d ;      # spy+ed > spi+ed
def Einsert "+" -> e || s _ s ;       # kiss+s > kisses
def Edelete e -> 0 || _ "+" [e|i];    # hire+ed > hir+ed, hire+ing > hir+ing
def Cleanup "+" -> 0 ;                # hir+ing > hiring, etc.
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Einsert .o. Edelete .o. Cleanup;
regex Grammar;
analyzer1.foma
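For readers without foma installed, the rule cascade in analyzer1.foma can be approximated in plain Python. This is a hedged sketch and not how foma works internally: each replace rule is rewritten as a regular-expression substitution with lookaround contexts, and the function name generate is hypothetical. For this toy lexicon the substitutions should give the same surface forms as the composed grammar.

```python
import re

STEMS = ["spy", "kiss", "hire"]
SUFFIXES = ["", "s", "ed", "ing"]

# Each pair mirrors one "def" in analyzer1.foma, in the same order.
RULES = [
    (r"y(?=\+s)", "ie"),        # YRule1:  spy+s   > spie+s
    (r"y(?=\+ed)", "i"),        # YRule2:  spy+ed  > spi+ed
    (r"(?<=s)\+(?=s)", "e"),    # Einsert: kiss+s  > kisses
    (r"e(?=\+[ei])", ""),       # Edelete: hire+ed > hir+ed
    (r"\+", ""),                # Cleanup: hir+ing > hiring
]

def generate(stem, suffix):
    """Apply the rules in order to stem+suffix and return the surface form."""
    form = stem + "+" + suffix
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form

if __name__ == "__main__":
    for stem in STEMS:
        print(stem, [generate(stem, suffix) for suffix in SUFFIXES])
```

The order of the list matters in the same way as the order of the .o. operands: Cleanup has to come last because the other rules use "+" as context.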
SLIDE 8 Code
def Stems s p y | k i s s | h i r e ;
def Suffixes 0:"+" [ "[INF]":0 | "[NOUN][SINGULAR]":0 | "[PRES]":s | "[NOUN][PLURAL]":s | "[PASTPART]":[e d] | "[PRESPART]":[i n g] ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;
def YRule2 y -> i || _ "+" e d ;
def Einsert "+" -> "+" e || s _ s ;
def Edelete e -> 0 || _ "+" [e|i];
def Cleanup "+" -> 0;
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Einsert .o. Edelete .o. Cleanup;
regex Grammar;
analyzer2.foma
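Because analyzer2.foma pairs tags on one side with suffixes on the other, the same transducer both generates and analyzes. A rough Python imitation, under the assumption that the lexicon is small enough to enumerate, is to generate every tagged form and invert the resulting table. The names TAGGED_SUFFIXES, surface_form and analyze are hypothetical, and regex substitutions again stand in for the rule transducers.

```python
import re
from collections import defaultdict

STEMS = ["spy", "kiss", "hire"]

# Tag (analysis side) -> suffix (surface side), as in "def Suffixes".
TAGGED_SUFFIXES = {
    "[INF]": "",
    "[NOUN][SINGULAR]": "",
    "[PRES]": "s",
    "[NOUN][PLURAL]": "s",
    "[PASTPART]": "ed",
    "[PRESPART]": "ing",
}

RULES = [
    (r"y(?=\+s)", "ie"),        # YRule1
    (r"y(?=\+ed)", "i"),        # YRule2
    (r"(?<=s)\+(?=s)", "+e"),   # Einsert, as written in analyzer2.foma
    (r"e(?=\+[ei])", ""),       # Edelete
    (r"\+", ""),                # Cleanup
]

def surface_form(stem, suffix):
    """Run the rule cascade on stem+suffix."""
    form = stem + "+" + suffix
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form

# Invert generation: surface form -> list of analyses.
ANALYSES = defaultdict(list)
for stem in STEMS:
    for tag, suffix in TAGGED_SUFFIXES.items():
        ANALYSES[surface_form(stem, suffix)].append(stem + tag)

def analyze(word):
    """Return every analysis whose surface form is word."""
    return list(ANALYSES.get(word, []))

if __name__ == "__main__":
    for word in ["kisses", "hiring", "spy", "spyed"]:
        print(word, analyze(word))
```

Enumerating and inverting only works for a finite toy lexicon; the transducer gives the inverse direction without listing any forms.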
SLIDE 9 The 2 second spell checker
[Figure: the composed transducer from the Composition slides, labelled impossibility and NEG+possible+ity+NOUN+PLURAL.]
(1) Extract the possible words from the "Grammar" transducer by converting it to an automaton (output-side projection)
(2) Test a word against the automaton
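The two steps can be pictured with a small Python sketch. It assumes the grammar's relation is available as a finite set of (analysis, surface) pairs, which is only true of a toy grammar; the pairs listed here and the names output_projection and is_known are hypothetical illustrations.

```python
# A hypothetical fragment of the relation encoded by a toy grammar:
# (analysis side, surface side).
RELATION = {
    ("kiss[INF]", "kiss"),
    ("kiss[PRES]", "kisses"),
    ("kiss[NOUN][PLURAL]", "kisses"),
    ("hire[PASTPART]", "hired"),
    ("hire[PRESPART]", "hiring"),
    ("spy[NOUN][PLURAL]", "spies"),
}

def output_projection(relation):
    """Step 1: keep only the surface side of every pair."""
    return {surface for _analysis, surface in relation}

WORDS = output_projection(RELATION)

def is_known(word):
    """Step 2: test a word against the projected language."""
    return word in WORDS

if __name__ == "__main__":
    for word in ["kisses", "hiring", "hireing"]:
        print(word, is_known(word))
```

With a real transducer the projection is an automaton, so the word set can be infinite and is never listed explicitly.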
SLIDE 10
The 5 second spelling corrector [med]
Assume we have a list of words as a repeating FST W, as before: hired → W → hired
SLIDE 11 The 5 second spelling corrector
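Assume we have a list of words as a repeating FST as before
Now, create a transducer C1 that makes one change in a word (one deletion, one change, or one insertion):
abc → C1 → ab, bc, ac, aba, aac, abca, ...
[Figure: C1 as a two-state transducer, with identity loops (@) on both states and arcs <?:0>, <0:?>, <?:?> leading from the first state to the second.]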
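What C1 does to a single word can be sketched in Python. This is an illustration rather than the transducer itself: the function name one_edit is hypothetical, and the alphabet has to be passed in explicitly, whereas the transducer's ? symbol covers any symbol.

```python
def one_edit(word, alphabet):
    """Return all strings exactly one deletion, insertion, or substitution
    away from word, using symbols from alphabet."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # <?:0>: delete one symbol
    deletions = {left + right[1:] for left, right in splits if right}
    # <0:?>: insert one symbol
    insertions = {left + c + right for left, right in splits for c in alphabet}
    # <?:?> minus identity: change one symbol into a different one
    substitutions = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    return deletions | insertions | substitutions

if __name__ == "__main__":
    print(sorted(one_edit("abc", "abc")))
```

The input word itself is never in the result, since every path through C1 makes exactly one change.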
SLIDE 12
The 5 second spelling corrector
Compose: hired → W → hired → C1 → xire, hird, hird, hiredx, ired, hied, ...
SLIDE 13
The 5 second spelling corrector
Compose: hired → W .o. C1 → xire, hird, hird, hiredx, ired, hied, ...
SLIDE 14 Code
def Stems s p y | k i s s | h i r e ;
def Suffixes 0:"+" [ "[INF]":0 | "[NOUN][SINGULAR]":0 | "[PRES]":s | "[NOUN][PLURAL]":s | "[PASTPART]":[e d] | "[PRESPART]":[i n g] ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;
def YRule2 y -> i || _ "+" e d ;
def Epenthesis "+" -> "+" e || s _ s ;
def Erule e -> 0 || _ "+" [e|i];
def Cleanup "+" -> 0;
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Epenthesis .o. Erule .o. Cleanup;
def C1 ?* [?:0|0:?|?:?-?] ?* ;
regex Grammar.2 .o. C1;
analyzer3.foma
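Looking up a misspelling in Grammar.2 .o. C1 in the inverse direction returns the words that are one change away from it. A minimal Python sketch of that lookup follows; it assumes a small explicit word list (the LEXICON below is a hypothetical sample of the toy grammar's output), and the names exactly_one_edit and correct are not part of foma.

```python
LEXICON = {
    "spy", "spies", "spied", "spying",
    "kiss", "kisses", "kissed", "kissing",
    "hire", "hires", "hired", "hiring",
}

def exactly_one_edit(source, target):
    """True if target is source with one deletion, insertion, or substitution."""
    if source == target or abs(len(source) - len(target)) > 1:
        return False
    # Skip the longest common prefix.
    i = 0
    while i < min(len(source), len(target)) and source[i] == target[i]:
        i += 1
    if len(source) == len(target):      # substitution at position i
        return source[i + 1:] == target[i + 1:]
    if len(source) > len(target):       # deletion of source[i]
        return source[i + 1:] == target[i:]
    return source[i:] == target[i + 1:]  # insertion of target[i]

def correct(misspelling, lexicon=LEXICON):
    """Return the lexicon words that C1 could have turned into misspelling."""
    return sorted(w for w in lexicon if exactly_one_edit(w, misspelling))

if __name__ == "__main__":
    for word in ["hird", "kisss", "spyed"]:
        print(word, correct(word))
```

This scans the whole word list for every query, which is fine for a toy lexicon; the composed transducer does the same job without enumerating the words.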
SLIDE 15 Entirely non-orthographic grammar
def Stems s p aɪ | k ɪ s | h aɪ r ;
def Suffixes 0:"+" [ "[INF]":0 | "[PRES]":z | "[PASTPART]":[d] | "[PRESPART]":[ɪ ŋ] ];
def Sib [s|z];         # Sibilants
def Unvoiced [h|s];    # Unvoiced phonemes
define ObsAssimilation d -> t || Unvoiced "+" _ ;
define Epenthesis [..] -> ɪ || Sib "+" _ Sib ;
define Cleanup "+" -> 0;
def Lexicon Stems Suffixes ;
def Grammar Lexicon .o. ObsAssimilation .o. Epenthesis .o. Cleanup;
regex Grammar;
SLIDE 16
Applications
- Tokenization
- POS tagging
- Shallow parsing (chunking)
- Syntactic parsing
- Information extraction
- Text-to-speech
- Spell checking/correction
- Electronic dictionaries
- Machine translation
- …
SLIDE 17
Syntactic parsing
SLIDE 18 Wrapup
- The above are standard techniques - morphological/phonological grammars have been written for hundreds of languages in this way
- The calculus is crucial - thinking about states and transitions is counterproductive
- A well-designed grammar should be very accurate, barring misspellings (easily >99% recall)
- There are also probabilistic extensions to all of the above (to combine with language models, to handle noisy data, etc.)
- These grammars are also used to improve POS taggers, parsers, chunkers, named entity recognition, etc.