Finite state morphology and phonology (Natural Language Processing)



SLIDE 1

Finite state morphology and phonology

Mans Hulden
Dept. of Linguistics
mans.hulden@colorado.edu

Natural Language Processing LING/CSCI 5832
Jan 22 2014

SLIDE 2

Composition

in+possible+ity → im+possible+ity → im+possibility → impossibility

[Figure: a lexicon automaton composed with three rule transducers (<n:m> before p; l e + i t y rewritten through <l:i> <e:l> <+:i> <i:t> <t:y> <y:0>; boundary deletion <+:0>) into a single transducer relating in+possible+ity and impossibility]
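The derivation chain on this slide can be imitated in plain Python as a rough sketch. Ordinary string functions stand in for the rule transducers, and chaining them stands in for composition. The names `n_to_m`, `ble_ity`, `drop_boundary` and `compose` are illustrative and do not come from the slides. Unlike a real transducer, this version only runs in the generation direction.

```python
import re
from functools import reduce

# Each "rule" is a str -> str function standing in for one rule transducer.
def n_to_m(s):
    # n becomes m before a morpheme boundary followed by p: in+p... -> im+p...
    return re.sub(r"n(?=\+p)", "m", s)

def ble_ity(s):
    # le+ity is realised as ility: possible+ity -> possibility
    return s.replace("le+ity", "ility")

def drop_boundary(s):
    # Delete the remaining morpheme boundaries.
    return s.replace("+", "")

def compose(*rules):
    """Return one function that applies the rules from left to right."""
    return lambda s: reduce(lambda acc, rule: rule(acc), rules, s)

grammar = compose(n_to_m, ble_ity, drop_boundary)
print(grammar("in+possible+ity"))
```

A finite-state compiler goes further than this: it merges the rules into one machine ahead of time, so no intermediate strings are built at lookup time and the result can be applied in either direction.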

SLIDE 3

Composition

in+possible+ity+s → im+possible+ity+s → im+possibility+s → impossibilities

[Figure: the same lexicon automaton and rule transducers as on the previous slide, composed into a single transducer; labels shown: impossibility, NEG+possible+ity+NOUN+PLURAL]

SLIDE 4

Compilers

Several finite-state compilers available to do the hard work

  • Xerox xfst (http://www.fsmbook.com)
  • SFST (https://code.google.com/p/cistern/wiki/SFST)

  • HFST (http://hfst.sf.net)
  • OpenFST (http://www.openfst.org)
  • Foma (http://foma.googlecode.com)

Demo with foma

SLIDE 5

Toy grammar of English

Toy lexicon: kiss, hire, spy
Possible suffixes: ed, ing, s
Generate kiss+s/kisses, spy+ed/spied, hire+ing/hiring, hire+ed/hired, etc.

More advanced version of this in tutorial form on: https://code.google.com/p/foma/wiki/MorphologicalAnalysisTutorial

SLIDE 6

Some derivations

Lexical form    hire+ing    hire+ed    kiss+s
Edelete         hir+ing     hir+ed     kiss+s
EInsert         hir+ing     hir+ed     kisses
Delete +        hiring      hired      kisses
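The derivations above can be traced step by step with a short Python sketch. The regular expressions are approximations of the three rules (they are an assumption, not the foma rules themselves), and `STAGES` and `derive` are names made up for this illustration.

```python
import re

# (rule name, regex pattern, replacement), in the order of the derivation.
STAGES = [
    ("Edelete", r"e(?=\+[ei])", ""),      # drop e before +e / +i
    ("EInsert", r"(?<=s)\+(?=s)", "e"),   # s+s -> ses
    ("Delete +", r"\+", ""),              # remove remaining boundaries
]

def derive(lexical):
    """Return [(rule name, form after that rule), ...] for one lexical form."""
    steps, form = [], lexical
    for name, pattern, repl in STAGES:
        form = re.sub(pattern, repl, form)
        steps.append((name, form))
    return steps

for word in ["hire+ing", "hire+ed", "kiss+s"]:
    print(word, "->", " -> ".join(form for _, form in derive(word)))
```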

SLIDE 7

Code

def Stems s p y | k i s s | h i r e ;
def Suffixes "+" [ 0 | s | e d | i n g ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;       # spy+s > spie+s
def YRule2 y -> i || _ "+" e d ;       # spy+ed > spi+ed
def Einsert "+" -> e || s _ s ;        # kiss+s > kisses
def Edelete e -> 0 || _ "+" [e|i];     # hire+ed > hir+ed, hire+ing > hir+ing
def Cleanup "+" -> 0 ;                 # hir+ing > hiring, etc.
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Einsert .o. Edelete .o. Cleanup;
regex Grammar;

analyzer1.foma

SLIDE 8

Code

def Stems s p y | k i s s | h i r e ;
def Suffixes 0:"+" [ "[INF]":0 | "[NOUN][SINGULAR]":0 | "[PRES]":s |
                     "[NOUN][PLURAL]":s | "[PASTPART]":[e d] | "[PRESPART]":[i n g] ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;
def YRule2 y -> i || _ "+" e d ;
def Einsert "+" -> "+" e || s _ s ;
def Edelete e -> 0 || _ "+" [e|i];
def Cleanup "+" -> 0;
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Einsert .o. Edelete .o. Cleanup;
regex Grammar;

analyzer2.foma
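Because the tagged grammar is a relation between analyses and surface forms, it can be read in both directions. The Python sketch below imitates that for the toy lexicon by generating every surface form once and indexing the analyses by surface form. The regexes are approximations of the rewrite rules, and `SUFFIXES`, `surface`, `ANALYSES` and `analyze` are hypothetical names, not part of foma.

```python
import re
from collections import defaultdict

STEMS = ["spy", "kiss", "hire"]
# (tag on the analysis side, suffix on the surface side)
SUFFIXES = [
    ("[INF]", ""), ("[NOUN][SINGULAR]", ""),
    ("[PRES]", "s"), ("[NOUN][PLURAL]", "s"),
    ("[PASTPART]", "ed"), ("[PRESPART]", "ing"),
]
# Regex stand-ins for YRule1, YRule2, Einsert, Edelete, Cleanup.
RULES = [
    (r"y(?=\+s)", "ie"),
    (r"y(?=\+ed)", "i"),
    (r"(?<=s)\+(?=s)", "+e"),
    (r"e(?=\+[ei])", ""),
    (r"\+", ""),
]

def surface(intermediate):
    """Apply the rule cascade to a form such as 'kiss+s'."""
    for pattern, repl in RULES:
        intermediate = re.sub(pattern, repl, intermediate)
    return intermediate

# Index analyses by surface form: the "inverse" direction of the grammar.
ANALYSES = defaultdict(list)
for stem in STEMS:
    for tag, suffix in SUFFIXES:
        ANALYSES[surface(stem + "+" + suffix)].append(stem + tag)

def analyze(word):
    """Return all analyses of a surface word (empty list if unknown)."""
    return ANALYSES.get(word, [])

print(analyze("kisses"))
print(analyze("spied"))
```

Enumerating the whole lexicon only works because the toy grammar is finite and tiny; a compiled transducer does the same lookup without listing every form.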

SLIDE 9

The 2 second spell checker

[Figure: the composed transducer; labels shown: impossibility, NEG+possible+ity+NOUN+PLURAL]

(1) Extract the possible outputs of the “Grammar” transducer, and convert to an automaton (output-side projection)
(2) Test a word against the automaton
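The two steps can be sketched in Python with a set of pairs standing in for the transducer. The pairs in `GRAMMAR` are a small hand-picked sample chosen for illustration, and `lower_projection` and `is_spelled_correctly` are hypothetical helper names.

```python
# A relation as a set of (lexical, surface) pairs; a stand-in for the
# "Grammar" transducer, listing only a few sample entries.
GRAMMAR = {
    ("hire+ed", "hired"),
    ("hire+ing", "hiring"),
    ("kiss+s", "kisses"),
    ("spy+ed", "spied"),
}

def lower_projection(relation):
    """Step (1): keep only the output side of every pair."""
    return {surface for _, surface in relation}

WORDS = lower_projection(GRAMMAR)

def is_spelled_correctly(word):
    """Step (2): test a word against the projected word set."""
    return word in WORDS

print(is_spelled_correctly("hired"), is_spelled_correctly("hird"))
```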

SLIDE 10

The 5 second spelling corrector [med]

Assume we have a list of words as a repeating FST W, as before:

hired → W → hired

SLIDE 11

The 5 second spelling corrector

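Assume we have a list of words as a repeating FST as before.
Now, create a transducer C1 that makes one change in a word (one deletion, one change, one insertion):

abc → ab, bc, ac, aba, aac, abca, ...

[Figure: the two-state C1 transducer, looping on any symbol (@) in both states, with a single transition between them labeled <?:0>, <0:?> or <?:?>]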
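What C1 produces for a single input word can be enumerated directly in Python. This is only a sketch: the function name `one_edit` and the explicit `alphabet` argument are assumptions made for the example, whereas the transducer uses the "any symbol" wildcard instead of a fixed alphabet.

```python
def one_edit(word, alphabet):
    """Return every string exactly one deletion, change or insertion away."""
    positions = range(len(word))
    # <?:0>: delete one symbol
    deletions = {word[:i] + word[i + 1:] for i in positions}
    # <?:?> (non-identity): replace one symbol with a different one
    changes = {
        word[:i] + c + word[i + 1:]
        for i in positions for c in alphabet if c != word[i]
    }
    # <0:?>: insert one symbol at any position, including the end
    insertions = {
        word[:i] + c + word[i:]
        for i in range(len(word) + 1) for c in alphabet
    }
    return deletions | changes | insertions

variants = one_edit("abc", "abc")
print(sorted(variants))
```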

SLIDE 12

The 5 second spelling corrector

Compose:

hired → W → hired → C1 → xire, hird, hird, hiredx, ired, hied, ...

SLIDE 13

The 5 second spelling corrector

Compose:

hired → W ∘ C1 → xire, hird, hird, hiredx, ired, hied, ...

SLIDE 14

Code

def Stems s p y | k i s s | h i r e ;
def Suffixes 0:"+" [ "[INF]":0 | "[NOUN][SINGULAR]":0 | "[PRES]":s |
                     "[NOUN][PLURAL]":s | "[PASTPART]":[e d] | "[PRESPART]":[i n g] ];
def Lexicon Stems Suffixes ;
def YRule1 y -> i e || _ "+" s ;
def YRule2 y -> i || _ "+" e d ;
def Epenthesis "+" -> "+" e || s _ s ;
def Erule e -> 0 || _ "+" [e|i];
def Cleanup "+" -> 0;
def Grammar Lexicon .o. YRule1 .o. YRule2 .o. Epenthesis .o. Erule .o. Cleanup;
def C1 ?* [?:0|0:?|?:?-?] ?* ;
regex Grammar.2 .o. C1;

analyzer3.foma
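Read in the inverse direction, the composed machine maps a misspelling back to every valid word one change away. A minimal Python sketch of that lookup follows. It filters a word list rather than composing transducers, and the word list and the names `within_one_edit` and `suggest` are assumptions made for the example.

```python
def within_one_edit(a, b):
    """True if b is exactly one deletion, insertion or change away from a."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    # Skip the common prefix.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # one symbol changed
    if len(a) > len(b):
        return a[i + 1:] == b[i:]       # one symbol deleted from a
    return a[i:] == b[i + 1:]           # one symbol inserted into a

def suggest(misspelling, words):
    """Return the valid words that are one edit away from the input."""
    return sorted(w for w in words if within_one_edit(w, misspelling))

# Hypothetical sample of surface forms accepted by the grammar.
WORDS = ["hire", "hired", "hires", "hiring", "kisses", "spied"]
print(suggest("hird", WORDS))
```

The filter scans the whole word list for each query; the composed transducer avoids that by exploring only paths compatible with the input.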

SLIDE 15

Entirely non-orthographic grammar

def Stems s p aɪ | k ɪ s | h aɪ r ;
def Suffixes 0:"+" [ "[INF]":0 | "[PRES]":z | "[PASTPART]":[d] | "[PRESPART]":[ɪ ŋ] ];
def Sib [s|z];         # Sibilants
def Unvoiced [h|s];    # Unvoiced phonemes
define ObsAssimilation d -> t || Unvoiced "+" _ ;
define Epenthesis [..] -> ɪ || Sib "+" _ Sib ;
define Cleanup "+" -> 0;
def Lexicon Stems Suffixes ;
def Grammar Lexicon .o. ObsAssimilation .o. Epenthesis .o. Cleanup;
regex Grammar;
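The effect of the two phonological rules on the stem k ɪ s can be checked with a short Python sketch. The regexes are approximations of ObsAssimilation, Epenthesis and Cleanup (an assumption, not the compiled rules), and `pronounce` is a name made up for this example.

```python
import re

RULES = [
    (r"(?<=[hs]\+)d", "t"),          # ObsAssimilation: d -> t after unvoiced + boundary
    (r"(?<=[sz]\+)(?=[sz])", "ɪ"),   # Epenthesis: insert ɪ between sibilant+ and sibilant
    (r"\+", ""),                     # Cleanup: remove the boundary
]

def pronounce(intermediate):
    """Apply the rule cascade to a form such as 'kɪs+z'."""
    for pattern, repl in RULES:
        intermediate = re.sub(pattern, repl, intermediate)
    return intermediate

for form in ["kɪs+z", "kɪs+d", "kɪs+"]:
    print(form, "->", pronounce(form))
```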

SLIDE 16

Applications

  • Tokenization
  • POS tagging
  • Shallow parsing (chunking)
  • Syntactic parsing
  • Information extraction
  • Text-to-speech
  • Spell checking/correction
  • Electronic dictionaries
  • Machine translation
  • …

SLIDE 17

Syntactic parsing

SLIDE 18

Wrapup

  • The above are standard techniques - morphological/phonological grammars have been written for hundreds of languages in this way
  • The calculus is crucial - thinking about states and transitions is counterproductive
  • A well-designed grammar should be very accurate, barring misspellings (easily >99% recall)
  • There are also probabilistic extensions to all of the above (to combine with language models, to handle noisy data, etc.)
  • These grammars are also used to improve POS taggers, parsers, chunkers, named entity recognition, etc.