Finite state morphology and phonology: Natural Language Processing

Finite state morphology and phonology Natural Language Processing LING/CSCI 5832 Mans Hulden Dept. of Linguistics mans.hulden@colorado.edu Jan 20 2014 FSMs for practical NLP tasks (1) How FSMs are used in modeling sound systems (phonology)


slide-1
SLIDE 1

Finite state morphology and phonology

Mans Hulden
Dept. of Linguistics
mans.hulden@colorado.edu

Natural Language Processing LING/CSCI 5832
Jan 20 2014

slide-2
SLIDE 2

FSMs for practical NLP tasks

(1) How FSMs are used in modeling sound systems (phonology)
(2) For modeling word-formation
(3) Derivative products of the above (spell checkers, lemmatizers, grammar checkers, components of larger systems)

slide-3
SLIDE 3

Plan

(1) Recap finite automata and transducers + basic algorithms
(2) Look at an extended calculus for manipulating FSMs (automata + transducers) suitable for NLP
(3) See how these are used in natural language applications

slide-4
SLIDE 4

Recap: anatomy of a FSA

Regular expression: L = a b* c

Graph representation: [Figure: three-state automaton with arcs labeled a, b, c]

Formal definition:
Q = {0,1,2} (set of states)
Σ = {a,b,c} (alphabet)
q0 = 0 (initial state)
F = {2} (set of final states)
δ(0,a) = 1, δ(1,b) = 1, δ(1,c) = 2 (transition function)

slide-5
SLIDE 5

Recap: anatomy of a FSA

Regular expression: L = a b* c

Graph representation: [Figure: three-state automaton with arcs labeled a, b, c]

Interpretation

  • An FSA defines a set of strings
  • In this case L={ac,abc,abbc,...}
  • These sets are the regular sets
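
A minimal sketch of how the formal definition of the automaton for L = a b* c can be run on input strings. The values of Q0, F and DELTA are copied from the recap slide; the function name `accepts` and the dictionary encoding of δ are illustrative choices, not part of the slides.

```python
# Automaton for L = a b* c, taken from the formal definition on the recap slide.
Q0 = 0                      # initial state
F = {2}                     # set of final states
DELTA = {                   # transition function, encoded as (state, symbol) -> state
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}


def accepts(word):
    """Return True if the automaton ends in a final state after reading word."""
    state = Q0
    for symbol in word:
        if (state, symbol) not in DELTA:
            return False    # no transition defined: reject
        state = DELTA[(state, symbol)]
    return state in F


if __name__ == "__main__":
    for w in ["ac", "abc", "abbc", "ab", ""]:
        print(repr(w), accepts(w))
```

The strings ac, abc and abbc from the slide are accepted, while prefixes such as ab are rejected because they stop in a non-final state.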
slide-6
SLIDE 6

Recap: Kleene’s Theorem

A language is regular iff it is accepted by some FA

[Figure: a larger automaton with arcs labeled a, b, c]

(a|b* c)* a b a* | (a b* a | a a*)

Proof is constructive: can convert between representations

slide-7
SLIDE 7

Recap: Kleene’s Theorem

Expression   Definition
ϵ            The empty string
∅            The empty language
a            A single symbol
A*           Kleene star of a language
AB           Concatenation of two languages
A | B        Union of two languages

[Figures: the FSM construction for each expression]

Kleene’s Theorem:
regexp → FA
FA → regexp: done with “state elimination algorithm” (easier, but let’s skip it)

slide-8
SLIDE 8

The Thompson construction

a (a|b)* b

slide-9
SLIDE 9

The Thompson construction

a (a|b)* b

[Figure: automata for a and b joined by four ϵ-arcs to form (a|b)]

slide-10
SLIDE 10

The Thompson construction

a (a|b)* b

[Figure: two more ϵ-arcs added around (a|b) to form (a|b)*]

slide-11
SLIDE 11

The Thompson construction

(a|b)*

[Figure: the ϵ-automaton for (a|b)*, and the single state with a loop on a,b that remains after the determinization & minimization algorithm]
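
The construction can be sketched in a few lines of Python. This is a common textbook variant of the Thompson construction; the exact arrangement of ϵ-arcs in the slide figures may differ slightly. All names (`NFA`, `symbol`, `union`, `star`, `concat`, `accepts`) are illustrative, and determinization and minimization are not shown.

```python
import itertools

_ids = itertools.count()    # fresh state numbers


class NFA:
    """An epsilon-NFA with one start and one final state.

    arcs is a list of (source, label, target); label None means epsilon.
    """
    def __init__(self, start, final, arcs):
        self.start, self.final, self.arcs = start, final, arcs


def symbol(ch):
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, ch, f)])


def concat(x, y):
    return NFA(x.start, y.final, x.arcs + y.arcs + [(x.final, None, y.start)])


def union(x, y):
    s, f = next(_ids), next(_ids)
    eps = [(s, None, x.start), (s, None, y.start),
           (x.final, None, f), (y.final, None, f)]
    return NFA(s, f, x.arcs + y.arcs + eps)


def star(x):
    s, f = next(_ids), next(_ids)
    eps = [(s, None, x.start), (x.final, None, f),
           (s, None, f), (x.final, None, x.start)]
    return NFA(s, f, x.arcs + eps)


def eps_closure(nfa, states):
    """All states reachable from states using only epsilon arcs."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in nfa.arcs:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, word):
    current = eps_closure(nfa, {nfa.start})
    for ch in word:
        moved = {dst for src, label, dst in nfa.arcs
                 if src in current and label == ch}
        current = eps_closure(nfa, moved)
    return nfa.final in current


# The running example from the slides: a (a|b)* b
EXAMPLE = concat(symbol("a"),
                 concat(star(union(symbol("a"), symbol("b"))), symbol("b")))
```

Simulating the ϵ-NFA directly with ϵ-closures avoids the subset construction, which keeps the sketch short at the cost of slower matching.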

slide-12
SLIDE 12

Recap: Kleene’s Theorem

  • Kleene’s Theorem only uses one Boolean operation on sets, union
  • But FSA are closed under other set operations: complement, intersection, set subtraction
  • It’s difficult to appreciate the power of finite-state models without a richer calculus...
  • In fact, the most fruitful approach is to forget about states and transitions and tapes and reason in terms of sets and relations

slide-13
SLIDE 13

Reasoning about automata

[Figure: automaton with arcs labeled a, b, c]

Σ = {a,b,c}

What language does the FSA represent?

slide-14
SLIDE 14

Reasoning about automata

[Figure: automaton with arcs labeled a, b, c]

Σ = {a,b,c}

Equivalent regular expression with {|,•,*}:
(b|c|aa*c)*aa*b(aa*b|(b|aa*c)(b|c|aa*c)*aa*b)*|(b|c)* a((a|ba)|(c|bb)(b|c)*a)*|(b|c|a(a|ba)*(c|bb))*

slide-15
SLIDE 15

Reasoning about automata

[Figure: automaton with arcs labeled a, b, c]

Σ = {a,b,c}

Equivalent regular expression with {|,•,*}:
(b|c|aa*c)*aa*b(aa*b|(b|aa*c)(b|c|aa*c)*aa*b)*|(b|c)* a((a|ba)|(c|bb)(b|c)*a)*|(b|c|a(a|ba)*(c|bb))*

Equivalent regular expression with {•,¬,*}:
¬(Σ*abcΣ*)

slide-16
SLIDE 16

Reasoning about automata

[Figure: automaton with arcs labeled a, b, c]

Σ = {a,b,c}

Equivalent regular expression with {|,•,*}:
(b|c|aa*c)*aa*b(aa*b|(b|aa*c)(b|c|aa*c)*aa*b)*|(b|c)* a((a|ba)|(c|bb)(b|c)*a)*|(b|c|a(a|ba)*(c|bb))*

Equivalent regular expression with {|,•,¬}:
¬(Σ*abcΣ*), i.e. not “contains abc”
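
The claim that a small automaton represents ¬(Σ*abcΣ*) can be checked by brute force. The sketch below assumes a three-state automaton in which every state is final and the missing transition (2, c) leads to rejection; this transition table is a reconstruction for illustration, not read off the slide figure.

```python
from itertools import product

# State 0: no progress toward "abc"; state 1: just read "a"; state 2: just read "ab".
DELTA = {
    (0, "a"): 1, (0, "b"): 0, (0, "c"): 0,
    (1, "a"): 1, (1, "b"): 2, (1, "c"): 0,
    (2, "a"): 1, (2, "b"): 0,
    # (2, "c") is deliberately absent: reading it would complete "abc".
}


def accepts(word):
    state = 0
    for ch in word:
        if (state, ch) not in DELTA:
            return False
        state = DELTA[(state, ch)]
    return True             # all three states are final


def all_words(max_len, alphabet="abc"):
    """Every string over the alphabet up to the given length."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


# Words on which the automaton disagrees with the definition "does not contain abc".
MISMATCHES = [w for w in all_words(7) if accepts(w) != ("abc" not in w)]
```

An empty `MISMATCHES` list means the automaton and the one-line expression agree on every string up to length 7, which illustrates why the complement-based description is easier to reason about than the long {|,•,*} expression.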

slide-17
SLIDE 17

Reasoning about automata

The common data structures that our programs manipulate are clearly states, transitions, labels, and label pairs—the building blocks of finite automata and

transducers. But many of our initial mistakes and failures arose from attempting also to think in terms of these objects. The automata required to implement even the simplest examples are large and involve considerable subtlety for their construction. To view them from the perspective of states and transitions is much like predicting weather patterns by studying the movements of atoms and molecules or inverting a matrix with a Turing machine. The only hope of success in this domain lies in developing an appropriate set of high-level algebraic operators for reasoning about languages and relations and for justifying a corresponding set of operators and automata for computation. (Kaplan and Kay, 1994, p.376)

From “Regular models of phonological rule systems”

slide-18
SLIDE 18

Toward “high-level” algebraic operators

  • Add Booleans to regular expression calculus: at least complement (¬), intersection (∩), set subtraction (-)
  • Add “useful” operators/shortcuts, e.g.
  • contains(X) = (Σ* X Σ*)
  • Example: the language that fulfills the constraint “i before e except after c”:

¬contains(cie) & ¬(¬(Σ*c)ei)
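
A rough Python check of the same constraint, using the standard `re` module rather than a finite-state toolkit. One assumption is made loudly here: the second conjunct is applied anywhere in the word (as if followed by Σ*), whereas the expression as written on the slide only restricts a word-final ei. The function names are illustrative.

```python
import re


def contains(pattern, word):
    """Counterpart of contains(X) = Σ* X Σ*: does X occur anywhere in word?"""
    return re.search(pattern, word) is not None


def obeys_rule(word):
    # ¬contains(cie): "ie" must not follow c.
    no_cie = not contains(r"cie", word)
    # No "ei" unless the preceding letter is c (assumed to hold anywhere in the word).
    no_bare_ei = not contains(r"(?<!c)ei", word)
    return no_cie and no_bare_ei


if __name__ == "__main__":
    for w in ["believe", "receive", "science", "weird"]:
        print(w, obeys_rule(w))
```

Real English has many exceptions (science, weird), so the constraint is a toy example of combining complement and intersection, not a usable spelling rule.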

slide-19
SLIDE 19

The product construction

[Figure: automata for L1 and L2]

L1 = a b* c
L2 = a b c*
L3 = L1 ∩ L2

slide-20
SLIDE 20

The product construction

[Figure: automata for L1 and L2]

L1 = a b* c
L2 = a b c*
L3 = L1 ∩ L2

Product automaton so far: (0,0)

slide-21
SLIDE 21

The product construction

[Figure: automata for L1 and L2]

L1 = a b* c
L2 = a b c*
L3 = L1 ∩ L2

Product automaton so far: (0,0) -a-> (1,1)

slide-22
SLIDE 22

The product construction

[Figure: automata for L1 and L2]

L1 = a b* c
L2 = a b c*
L3 = L1 ∩ L2

Product automaton so far: (0,0) -a-> (1,1) -b-> (1,2)

slide-23
SLIDE 23

The product construction

[Figure: automata for L1 and L2]

L1 = a b* c
L2 = a b c*
L3 = L1 ∩ L2

Product automaton: (0,0) -a-> (1,1) -b-> (1,2) -c-> (2,2)

slide-24
SLIDE 24

The product construction

Algorithm 3.2: PRODUCTCONSTRUCTION
Input: FSM1 = (Q1, Σ, δ1, s0, F1), FSM2 = (Q2, Σ, δ2, t0, F2), OP ∈ {∪, ∩, −}
Output: FSM3 = (Q3, Σ, δ3, u0, F3)
begin
    Agenda ← (s0, t0)
    Q3 ← (s0, t0)
    u0 ← (s0, t0)
    index (s0, t0)
    while Agenda ≠ ∅ do
        Choose a state pair (p, q) from Agenda
        foreach pair of transitions δ1(p, x, p′), δ2(q, x, q′) do
            Add δ3((p, q), x, (p′, q′))
            if (p′, q′) is not indexed then
                Index (p′, q′) and add to Agenda and Q3
            end
        end
    end
    foreach state s = (p, q) in Q3 do
        Add s to F3 iff p ∈ F1 OP q ∈ F2
    end
end
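
A sketch of the agenda loop above in Python, run on the example L1 = a b* c and L2 = a b c* from the preceding slides. The tuple encoding of an FSM as (delta, start, finals) and the function name are illustrative assumptions. Only intersection is exercised: because the loop follows a symbol only when both machines have a transition for it, union and subtraction would additionally require complete automata with an explicit dead state.

```python
from collections import deque


def product_construction(fsm1, fsm2, op):
    """Build the product of two deterministic FSMs.

    Each fsm is (delta, start, finals), where delta maps (state, symbol) -> state.
    op combines the two booleans "p is final in FSM1" and "q is final in FSM2".
    """
    delta1, s0, f1 = fsm1
    delta2, t0, f2 = fsm2
    start = (s0, t0)
    agenda = deque([start])
    states = {start}                        # the indexed state pairs (Q3)
    delta3 = {}
    while agenda:
        p, q = agenda.popleft()             # choose and remove a pair
        for (src, x), p_next in delta1.items():
            if src != p or (q, x) not in delta2:
                continue                    # both machines must read x
            target = (p_next, delta2[(q, x)])
            delta3[((p, q), x)] = target
            if target not in states:
                states.add(target)
                agenda.append(target)
    finals = {(p, q) for (p, q) in states if op(p in f1, q in f2)}
    return delta3, start, finals


# L1 = a b* c and L2 = a b c*, with state numbers as on the slides.
L1 = ({(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}, 0, {2})
L2 = ({(0, "a"): 1, (1, "b"): 2, (2, "c"): 2}, 0, {2})

DELTA3, START3, FINALS3 = product_construction(L1, L2, lambda a, b: a and b)
```

The result contains exactly the path (0,0) -a-> (1,1) -b-> (1,2) -c-> (2,2) built up step by step on the slides, so L3 = L1 ∩ L2 contains only the string abc.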

slide-25
SLIDE 25

Finite state transducers

slide-26
SLIDE 26

Recap: anatomy of an FST

Formal definition:
Q = {0,1,2,3} (set of states)
Σ = {a,b,c,d} (alphabet)
q0 = 0 (initial state)
F = {0,1,2} (set of final states)
δ (transition function)

Graph representation: [Figure: four-state transducer with arcs labeled a, b, c, d and one arc <a:b>]

slide-27
SLIDE 27

Recap: anatomy of an FST

Graph representation: [Figure: four-state transducer with arcs labeled a, b, c, d and one arc <a:b>]

Interpretation

  • An FST defines a set of string pairs (a relation)
  • In this case T={(a,a),(b,b),(c,c),(cad,cdb),...}
  • These sets are the regular relations
  • Trivially bidirectional devices
slide-28
SLIDE 28

Algebraic operations on transducers

T U (concatenation)
T | U (union)
T* (Kleene closure)
rev(T) (reversal)
L1 x L2 (cross-product)
T o U (composition)

slide-29
SLIDE 29

Algebraic operations on transducers

T U (concatenation)
T | U (union)
T* (Kleene closure)
rev(T) (reversal)
L1 x L2 (cross-product)
T o U (composition)

Cross-product of regular languages: (ab|ac) x (c|d)

[Figure: automata for ab|ac and c|d, and the resulting transducer with arcs <a:c>, <a:d>, <b:0>, <c:0>]
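
To make the cross-product concrete, the sketch below enumerates the string pairs of a small transducer for (ab|ac) x (c|d). The arc labels come from the slide; the state numbering (0 to 1 to 2) and the encoding of 0 (the empty string) as "" are assumptions for illustration.

```python
# Arcs are (source, input, output, target); "" stands for the slide's 0 (empty string).
ARCS = [
    (0, "a", "c", 1),
    (0, "a", "d", 1),
    (1, "b", "", 2),
    (1, "c", "", 2),
]
START, FINALS = 0, {2}


def pairs(state=START, upper="", lower=""):
    """Yield every (input, output) pair; terminates because the machine is acyclic."""
    if state in FINALS:
        yield (upper, lower)
    for src, inp, out, dst in ARCS:
        if src == state:
            yield from pairs(dst, upper + inp, lower + out)


RELATION = set(pairs())

# The cross-product pairs every string of the first language with every string of the second.
EXPECTED = {(u, l) for u in ("ab", "ac") for l in ("c", "d")}
```

Both sets contain the four pairs (ab, c), (ab, d), (ac, c) and (ac, d).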

slide-30
SLIDE 30

Algebraic operations on transducers

T U (concatenation)
T | U (union)
T* (Kleene closure)
rev(T) (reversal)
L1 x L2 (cross-product)
T o U (composition)

Composition: if T maps x to y and U maps y to z, then T ○ U maps x to z.

slide-31
SLIDE 31

Composition: product construction

T1: [Figure: states 0, 1, 2 with arcs a:b and c:d]
T2: [Figure: arcs b:x and d:d]

T3 = T1 o T2: (0,0) -a:x-> (1,0) -c:d-> (2,0)
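
Composition can reuse the same agenda idea as the product construction: an arc x:y in T1 is paired with an arc y:z in T2 to give x:z. The sketch below handles only transducers without epsilon arcs. The shape of T2 (a single state 0 with two loops) and the final-state sets are inferred from the product state names on the slide and should be read as assumptions.

```python
def compose(t1, t2):
    """Compose two epsilon-free transducers.

    Each transducer is (arcs, start, finals) with arcs (source, input, output, target).
    """
    arcs1, s1, f1 = t1
    arcs2, s2, f2 = t2
    start = (s1, s2)
    agenda, seen, arcs3 = [start], {start}, []
    while agenda:
        p, q = agenda.pop()
        for src1, x, y, dst1 in arcs1:
            if src1 != p:
                continue
            for src2, y2, z, dst2 in arcs2:
                if src2 == q and y2 == y:   # output of T1 must match input of T2
                    target = (dst1, dst2)
                    arcs3.append(((p, q), x, z, target))
                    if target not in seen:
                        seen.add(target)
                        agenda.append(target)
    finals = {(p, q) for (p, q) in seen if p in f1 and q in f2}
    return arcs3, start, finals


T1 = ([(0, "a", "b", 1), (1, "c", "d", 2)], 0, {2})
T2 = ([(0, "b", "x", 0), (0, "d", "d", 0)], 0, {0})

ARCS3, START3, FINALS3 = compose(T1, T2)
```

The composed machine has the two arcs (0,0) -a:x-> (1,0) and (1,0) -c:d-> (2,0), so it maps ac to xd.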

slide-32
SLIDE 32

String rewriting operators

A → B / C _ D
“Rewrite strings A as B when occurring between C and D”

Example: (a|e|i|o|u) → 0 / _ #
delete vowels at the ends of words

[Figure: transducer with deletion arcs <a:0>, <e:0>, <i:0>, <o:0>, <u:0> and identity arcs on a e i o u b c d f p t k]

Difficult to implement correctly in the general case

slide-33
SLIDE 33

Modeling morphology and phonology

epäjärjestelmällistyttämättömyydelläänsäkäänköhän

Actual single Finnish word (not a compound!)

‘perhaps even because of his/her/it not having an ability to not generalize herself/himself/itself’ (maybe)

Grammatically correct, semantics is elusive, akin to ‘colorless green ideas sleep furiously’

Highly agglutinative languages like this have an astronomical number of “possible words”, even without considering neologisms

slide-34
SLIDE 34

Linguistics: a model of word production

UNDERLYING REPRESENTATION
↓ Lexical rules
LEXICAL REPRESENTATION
↓ Postlexical rules
SURFACE REPRESENTATION

Modeled by a step-by-step generative process:

epä+järjestelmä+lis+... (put morphemes together: ‘un’+‘system’+‘ize’)
↓
epäjärjestelmällistyttämättömyydelläänsäkäänköhän

Phonemes and morphemes change when they are conjoined, modeled by phonological rules.

slide-35
SLIDE 35

“Generative” word model

(1) Pick morphemes from lexicon in right order and combinations (dictated by morphotactics):
    in+possible+ity

slide-36
SLIDE 36

“Generative” word model

(1) Pick morphemes from lexicon in right order and combinations (dictated by morphotactics):
    in+possible+ity
(2) Apply sound change rules + orthographic rules:
    im+possible+ity    (change n to m before p: nasal assimilation)
slide-37
SLIDE 37

“Generative” word model

(1) Pick morphemes from lexicon in right order and combinations (dictated by morphotactics):
    in+possible+ity
(2) Apply sound change rules + orthographic rules:
    im+possible+ity    (change n to m before p: nasal assimilation)
    im+possibility     (ble+ity > bility)
    impossibility      (remove boundaries)
slide-38
SLIDE 38

Four tricks to model this

(1) Extended operators (Booleans, replacements)
(2) Use alphabet independent algorithms
(3) Treat automata as “repeating transducers” (“everything is a transducer”)
(4) Model lexicon as an FST (which may just repeat words)

Example: Σ* a Σ* (repeat every word that contains at least one a)

[Figure: the automaton for Σ* a Σ* and the identical repeating transducer, with arcs labeled a and @]
slide-39
SLIDE 39

“Generative” word model

(1) Pick morphemes from lexicon in right order and combinations (dictated by morphotactics):
    in+possible+ity
(2) Apply sound change rules + orthographic rules:
    im+possible+ity    (change n to m before p: nasal assimilation)
    im+possibility     (ble+ity → bility)
    impossibility      (remove boundaries)

Lexicon + morphology

slide-40
SLIDE 40

“Generative” word model

(1) Pick morphemes from lexicon in right order and combinations (dictated by morphotactics):
    in+possible+ity
(2) Apply sound change rules + orthographic rules:
    im+possible+ity    (n → m / _ + p)
    im+possibility     (ble+ity → bility)
    impossibility      (+ → 0)

Lexicon + morphology
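
The three rules can be imitated as an ordered cascade of string rewrites. This sketch uses Python's `re` and `str.replace` instead of compiled transducers, so it only runs in the generation direction and does not show the bidirectionality that composed FSTs provide. The function names are illustrative.

```python
import re


def nasal_assimilation(s):
    """n → m / _ + p : rewrite n as m when followed by a boundary and p."""
    return re.sub(r"n(?=\+p)", "m", s)


def ble_ity(s):
    """ble+ity → bility"""
    return s.replace("ble+ity", "bility")


def remove_boundaries(s):
    """+ → 0 : delete the morpheme boundaries."""
    return s.replace("+", "")


def generate(lexical):
    """Apply the rules in order and return every intermediate form."""
    steps = [lexical]
    for rule in (nasal_assimilation, ble_ity, remove_boundaries):
        steps.append(rule(steps[-1]))
    return steps


if __name__ == "__main__":
    print(" → ".join(generate("in+possible+ity")))
```

The order matters: boundary removal has to come last, because both earlier rules refer to the + symbol in their contexts.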

slide-41
SLIDE 41

“Generative” word model

(1) Pick morphemes from lexicon in right order and combinations (dictated by morphotactics):
    in+possible+ity
(2) Apply sound change rules + orthographical rules:
    im+possible+ity
    im+possibility
    impossibility

[Figures: the lexicon automaton; a transducer for n → m / _ + p; a transducer for ble+ity → bility; a transducer for + → 0]

...then compose

slide-42
SLIDE 42

Composition

in+possible+ity → im+possible+ity → im+possibility → impossibility

[Figures: the lexicon automaton; transducers for n → m / _ + p, ble+ity → bility, and + → 0; and the single transducer obtained by composing them]

The composed transducer relates in+possible+ity and impossibility directly.

slide-43
SLIDE 43

Adding grammatical information

We’d like to be able to get parses with grammatical information:

impossibilities => NEG+possible+ity+NOUN+PLURAL
vs. in+possible+ity+s

slide-44
SLIDE 44

Adding grammatical information

[Figure: the lexicon automaton]

We’d like to be able to get parses with grammatical information:

impossibilities => NEG+possible+ity+NOUN+PLURAL
vs. in+possible+ity+s

Solution: make lexicon a transduction (Lex. transducer):
IN: NEG+possible+ity+NOUN+PLURAL
OUT: in+possible+ity+s
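
A toy illustration of the lexicon as a transduction: each analysis symbol is paired with the lexical form it is realized as. The table below is hypothetical and covers only the one example from the slide; in particular, mapping NOUN to the empty string is an assumption read off the IN/OUT pair, and a real lexical transducer would be compiled with a finite-state toolkit rather than written as a dictionary.

```python
# Hypothetical toy lexicon: analysis symbol -> lexical form ("" means no overt form).
LEXICON = {
    "NEG": "in",
    "possible": "possible",
    "ity": "ity",
    "NOUN": "",
    "PLURAL": "s",
}


def to_lexical(analysis):
    """Map an analysis string such as NEG+possible+ity+NOUN+PLURAL to its lexical form."""
    forms = [LEXICON[symbol] for symbol in analysis.split("+")]
    return "+".join(form for form in forms if form)


# Running the mapping "backwards" for a known list of analyses (illustrative only).
KNOWN_ANALYSES = ["NEG+possible+ity+NOUN+PLURAL"]
TO_ANALYSIS = {to_lexical(a): a for a in KNOWN_ANALYSES}
```

Here `to_lexical("NEG+possible+ity+NOUN+PLURAL")` gives in+possible+ity+s, which can then be fed to the sound change and orthographic rules.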

slide-45
SLIDE 45

Composition

NEG+possible+ity+NOUN+PLURAL → in+possible+ity+s → im+possible+ity+s → im+possibility+s → impossibilities

[Figures: the lexicon transducer; transducers for n → m / _ + p, ble+ity → bility, and + → 0]

slide-46
SLIDE 46

Composition

NEG+possible+ity+NOUN+PLURAL → in+possible+ity+s → im+possible+ity+s → im+possibility+s → impossibilities

[Figures: the lexicon transducer; transducers for n → m / _ + p, ble+ity → bility, and + → 0; and the single composed transducer, labeled impossibility / NEG+possible+ity+NOUN+PLURAL]

slide-47
SLIDE 47

Compilers

Several finite-state compilers available to do the hard work

  • Xerox xfst (http://www.fsmbook.com)
  • SFST (https://code.google.com/p/cistern/wiki/SFST)

  • HFST (http://hfst.sf.net)
  • OpenFST (http://www.openfst.org)
  • Foma (http://foma.googlecode.com)

Demo with foma