SLIDE 1

Maximum Likelihood Estimation of Factored Regular Deterministic Stochastic Languages

Chihiro Shibata and Jeffrey Heinz
University of Toronto, July 19, 2019

We thank JSPS KAKENHI #JP18K11449 (CS) and NIH #R01HD87133-01 (JH)

SLIDE 6

The Problem in General

Stochastic languages are probability distributions over strings: f : Σ∗ → [0, 1] with Σ_{w∈Σ∗} f(w) = 1.

A class C of stochastic languages is often defined parametrically: an assignment of values to parameters Θ uniquely determines some stochastic language fΘ ∈ C.

An important learning criterion: for any data sequence D drawn i.i.d. from any stochastic language, a Maximum-Likelihood Estimator finds parameter values Θ̂ which maximize P(D) with respect to C.

For a class of stochastic languages C, is there an algorithm which reliably returns a Maximum-Likelihood Estimate (MLE) of an observed data sample D?
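A concrete instance (our illustration, not from the slides): over Σ = {a}, the function f(aⁿ) = (1/2)^(n+1) is a stochastic language, since Σ_{n≥0} (1/2)^(n+1) = 1. A single parameter θ = 1/2, the probability of emitting another a rather than stopping, determines it; varying θ ∈ [0, 1) gives a parametrically defined class of such languages.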

SLIDE 13

In Pictures

(Figure, reconstructed from the slide build: the space of all stochastic languages, with the class C drawn as a region inside it. f1 lies outside C and f2 lies inside C; data samples D1 and D2 are drawn from f1 and f2 respectively, and the estimates fMLE(D1) and fMLE(D2) both lie within C.)

SLIDE 18

Classes defined by Single DFAs

Example: Bigram model

(Diagram: a DFA over Σ = {a, b} with states λ (start), a, and b; each transition on σ leads to state σ.)

Parameters: θ⋊a, θ⋊b, θ⋊⋉, θaa, θab, θa⋉, θba, θbb, θb⋉

D = ab, aabb

Passing D through the DFA gives the counts θ⋊a: 2, θaa: 1, θab: 2, θbb: 1, θb⋉: 2 (all other counts are 0). Normalizing per state yields the MLE: θ⋊a = 1; θaa = 1/3, θab = 2/3; θbb = 1/3, θb⋉ = 2/3.

MLE is obtained by passing D through the DFA and normalizing. (Vidal et al. 2005)
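To make the count-and-normalize recipe concrete, here is a minimal Python sketch (ours, not from the paper; the function name and the use of "<" and ">" for the boundary markers ⋊ and ⋉ are our own conventions):

    from collections import defaultdict

    def bigram_mle(data):
        """MLE for a bigram model: count transitions, then normalize per state."""
        counts = defaultdict(lambda: defaultdict(int))
        for word in data:
            symbols = list(word) + [">"]       # ">" is the end marker (⋉)
            state = "<"                        # "<" is the start state (⋊)
            for sigma in symbols:
                counts[state][sigma] += 1
                state = sigma                  # bigram DFA: next state = last symbol
        # Normalize each state's outgoing counts into probabilities.
        return {q: {s: c / sum(out.values()) for s, c in out.items()}
                for q, out in counts.items()}

    print(bigram_mle(["ab", "aabb"]))
    # '<': a -> 1; 'a': a -> 1/3, b -> 2/3; 'b': b -> 1/3, '>' -> 2/3

Each state's outgoing counts form a multinomial, so the per-state relative frequencies are exactly the likelihood-maximizing parameters.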

SLIDE 23

Classes defined by Single DFAs

Example: Strictly 2-Local Languages

(Diagram: the same bigram DFA over Σ = {a, b}, now read as a Boolean acceptor.)

Parameters: θ⋊a, θ⋊b, θ⋊⋉, θaa, θab, θa⋉, θba, θbb, θb⋉

D = ab, aabb

Passing D through the DFA 'activates' the parsed transitions: θ⋊a = 1, θaa = 1, θab = 1, θbb = 1, θb⋉ = 1 (all others remain 0).

Smallest language consistent with D in C is obtained by passing D through the DFA and 'activating' parsed transitions. (Heinz and Rogers 2013)
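The Boolean analogue replaces normalization with activation. A minimal sketch under the same encoding assumptions as above (ours, not the authors' code):

    def sl2_smallest_language(data):
        """'Activate' the bigram transitions observed in the data; the activated
        set defines the smallest Strictly 2-Local language containing the data."""
        active = set()
        for word in data:
            state = "<"                    # "<" for the start marker
            for sigma in list(word) + [">"]:   # ">" for the end marker
                active.add((state, sigma))
                state = sigma
        return active

    def accepts(active, word):
        """Membership test: every bigram of the word must be activated."""
        state = "<"
        for sigma in list(word) + [">"]:
            if (state, sigma) not in active:
                return False
            state = sigma
        return True

    grammar = sl2_smallest_language(["ab", "aabb"])
    print(accepts(grammar, "aab"), accepts(grammar, "ba"))  # True False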

SLIDE 28

Overview of Related Results

  type of language                  Class C defined with
                                 single DFA    finitely many DFA
  Boolean:    f : Σ∗ → {0, 1}       ✓                 ✓
  stochastic: f : Σ∗ → [0, 1]       ✓            this paper

1. For Boolean languages, the learning algorithms return the smallest language in C which includes D.

2. For Stochastic languages, the MLE returns the language in C which maximizes the likelihood of D.

(Vidal et al. 2005, Heinz and Rogers 2013)

SLIDE 34

Overview of Related Results (part 2)

1. The class of all DFAs is not identifiable in the limit from positive data (Gold 1967).

2. It is NP-hard to find the minimal DFA consistent with a finite sample of positive and negative examples (Gold 1978).

3. Each DFA admits a characteristic sample D of positive and negative examples such that RPNI identifies the DFA from any superset of D in cubic time (Oncina and Garcia 1992, Dupont 1996).

4. ALERGIA/RLIPS (based on RPNI) (Carrasco and Oncina 1994, 1999) learns the class of PDFAs in polynomial time with probability one (de la Higuera and Thollard 2001).

5. Clark and Thollard (2004) present an algorithm which learns the class of PDFAs in a modified PAC setting. (See also Parekh and Honavar 2001.)

6. Expectation-Maximization techniques are used to learn the class of PNFAs, but there is no guarantee of finding a global optimum (Rabiner 1989).

SLIDE 38

Defining C with finitely many DFA

How do you define a class C with finitely many DFA?

(Diagram: three atomic DFAs over Σ = {a, b, c}, one per letter; each has two states, λ (start) and x, moves from λ to x upon reading x, and loops otherwise.)

Product Operations

1. For Boolean languages, use the acceptor product (yields intersection).

2. For Stochastic languages, use the co-emission product (yields a joint distribution).
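The acceptor product is the standard product construction on DFAs. A minimal sketch in Python (our own code; the 0/1 state encoding and the function names are assumptions, not the paper's), building the product of the three atomic DFAs above:

    from itertools import product

    SIGMA = "abc"

    def atomic_dfa(letter):
        """Two-state DFA remembering whether `letter` has occurred yet."""
        # states: 0 = not seen, 1 = seen; only the transition structure matters here
        return {(q, s): (1 if s == letter else q) for q in (0, 1) for s in SIGMA}

    def dfa_product(dfas):
        """Acceptor product: states are tuples, transitions act componentwise."""
        states = list(product(*[(0, 1)] * len(dfas)))
        return {(qs, s): tuple(d[(q, s)] for d, q in zip(dfas, qs))
                for qs in states for s in SIGMA}

    delta = dfa_product([atomic_dfa(x) for x in SIGMA])
    print(len({qs for (qs, _) in delta}))  # 8 product states, as on the next slide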

SLIDE 41

The product of those three acceptors

(Diagram: the product DFA with eight states λ, a, b, c, ab, bc, ac, abc, each recording the set of letters seen so far; the exit/accepting arrow at each state is not shown.)

  • If C is defined by this DFA, then C = Piecewise 2-Testable.
  • If C is defined by the 3 atomic DFAs, then C = Strictly 2-Piecewise.

SLIDE 44

Cause . . .

(Diagram: the three atomic DFAs again, now with one atomic transition crossed out, ✗.)

The parameters of the model are set at the level of the individual DFA.

SLIDE 49

. . . and Effect

(Diagram: the product DFA, with each of the four product transitions that depend on the crossed-out atomic parameter marked ✗; the exit/accepting arrow at each state is not shown.)

SLIDE 54

Comparing the Representations

The Product DFA

1. In the worst case, it has ∏_i |Qi| states and (|Σ| + 1) ∏_i |Qi| parameters.

2. Transitions/parameters are independent of one another.

The Atomic DFAs

1. The atomic DFAs have a total of Σ_i |Qi| states and (|Σ| + 1) Σ_i |Qi| parameters.

2. The transitions in the product are NOT independent.
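A worked instance (our arithmetic, not on the slide): for the three atomic DFAs over Σ = {a, b, c}, each with |Qi| = 2, the atomic representation has Σ_i |Qi| = 6 states and (3 + 1) · 6 = 24 parameters, while the product DFA has ∏_i |Qi| = 8 states and (3 + 1) · 8 = 32 parameters. With K such factors the gap is 2K states versus 2^K.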

SLIDE 59

Pluses and Minuses

+ Fewer parameters means more accurate estimation of model parameters with less data.

− Fewer parameters means the model is less expressive.

  • Heinz and Rogers (2013, MoL) extend the method of 'activating' data-parsed transitions from classes of Boolean languages defined with a single DFA to classes of Boolean languages defined with finitely many DFA.

  • They show it always returns the smallest Boolean language in the class consistent with the data, and thus identifies the class in the limit from positive data.

SLIDE 64

The Co-emission Product

  • The co-emission product defines how PDFA-definable stochastic languages can be multiplied together to yield a well-defined stochastic language.

  • Heinz and Rogers 2010 defined stochastic Strictly k-Piecewise languages using a variant of the co-emission product.

  • They claimed they could find the MLE, but nobody seemed convinced.

  Pr(x | P≤1(y))    x = s       ts        S        tS
  y = s            0.0335    0.0051    0.0011    0.0002
      ts           0.0218    0.0113    0.0009    0.
      S            0.0009    0.        0.0671    0.0353
      tS           0.0006    0.        0.0455    0.0313

Table: Results of SP2 estimation on the Samala corpus. Only sibilants are shown. (Heinz and Rogers 2010, p. 894)

SLIDE 67

The Co-Emission Product (definition)

(Diagram: two PDFA transitions being multiplied. In machine 1, state q1 goes to r1 emitting a, b, c with probabilities q1a, q1b, q1c; in machine 2, state q2 goes to r2 with probabilities q2a, q2b, q2c. In the product, state ⟨q1, q2⟩ goes to ⟨r1, r2⟩ on a, where:)

    P(a | ⟨q1, q2⟩) =def (∏_i qia) / (Σ_{σ∈Σ} ∏_i qiσ)

For fixed σ, the co-emission product treats the parameters qiσ as independent.
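A minimal sketch of this definition in Python (our code; the dict-of-floats encoding is an assumption): given each machine's emission probabilities at its current state, the co-emission probability renormalizes their product.

    from math import prod

    def coemission(emissions, sigma):
        """Co-emission probability of `sigma`, given each factor machine's
        emission distribution at its current state.

        emissions: list of dicts, one per machine, mapping symbol -> probability.
        """
        numer = prod(e[sigma] for e in emissions)
        denom = sum(prod(e[s] for e in emissions) for s in emissions[0])
        return numer / denom

    # Two machines over {a, b}: the product favors symbols both machines favor.
    m1 = {"a": 0.9, "b": 0.1}
    m2 = {"a": 0.5, "b": 0.5}
    print(coemission([m1, m2], "a"))  # 0.45 / (0.45 + 0.05) = 0.9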

SLIDE 71

Contributions

1. We extend the Heinz and Rogers 2010 analysis to classes defined with

   1. the standard co-emission product (not the variant introduced by Heinz and Rogers)

   2. of arbitrary sets of finitely many PDFAs (not just the ones which define stochastic SPk languages)

2. Essentially, we prove that parameters which maximize the probability of the data with respect to such models are found by running the corpus through each of the individual factor PDFAs and calculating the relative frequencies.

SLIDE 72

Some details of the analysis

1. Probability of Words
2. Relative Frequency of Emissions
3. Empirical Mean of co-emission probabilities
4. Main Theorems

SLIDE 80

Probability of words

  • Consider a class C defined with the co-emission product of K machines M1 . . . MK.

  • Suppose that w = σ1 · · · σN.

  • Let q(j, i) denote the state in Qj that is reached after Mj reads the prefix σ1 · · · σi−1.

  • If i = 1 then q(j, i) is the initial state of Mj.

  • Let Tj(q, σ) denote a parameter (transition probability) in PDFA Mj.

  • Then the probability that σ is emitted after the product machine ⊗_{1≤j≤K} Mj reads the prefix σ1 · · · σi−1 is the following:

    Coemit(σ, i) = (∏_{j=1}^{K} Tj(q(j, i), σ)) / (Σ_{σ′∈Σ} ∏_{j=1}^{K} Tj(q(j, i), σ′))    (1)

  • We assume that there is an end marker ⋉ ∈ Σ which uniquely occurs at the end of words. Then

    P(w⋉) = ∏_{i=1}^{N+1} Coemit(σi, i)    (2)

    where σN+1 = ⋉.
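Putting (1) and (2) together, a self-contained Python sketch (ours; the triple encoding of each factor PDFA and the use of ">" for ⋉ are our assumptions):

    from math import prod

    def string_probability(machines, word, alphabet=("a", "b", ">")):
        """P(w⋉) under the co-emission product of factor PDFAs, eqs. (1)-(2).

        Each machine is (initial_state, T, delta):
          T[(q, sigma)]     emission probability of sigma at state q,
          delta[(q, sigma)] deterministic successor state.
        ">" stands in for the end marker ⋉.
        """
        states = [m[0] for m in machines]      # q(j, 1): each Mj's initial state
        p = 1.0
        for sigma in list(word) + [">"]:
            numer = prod(T[(q, sigma)] for (_, T, _), q in zip(machines, states))
            denom = sum(prod(T[(q, s)] for (_, T, _), q in zip(machines, states))
                        for s in alphabet)
            p *= numer / denom                 # eq. (1) folded into eq. (2)
            if sigma != ">":                   # advance every factor on sigma
                states = [d[(q, sigma)] for (_, _, d), q in zip(machines, states)]
        return p

    # Single one-state machine over {a, b, >}: reduces to an ordinary PDFA.
    T = {("q", "a"): 0.3, ("q", "b"): 0.5, ("q", ">"): 0.2}
    d = {("q", "a"): "q", ("q", "b"): "q"}
    print(string_probability([("q", T, d)], "ab"))  # 0.3 * 0.5 * 0.2 = 0.03

With K = 1 and a well-formed PDFA the denominator is 1 at every step, so the formula reduces to the ordinary PDFA string probability, as the usage example shows.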

SLIDE 85

Relative Frequency of Emission

  • Let mw(Mj, q, σ) ∈ Z+ denote how many times σ is emitted at the state q while the machine Mj emits w.

  • Let nw(Mj, q) ∈ Z+ denote how many times the state q is visited while the machine Mj emits w.

Then

    freqw(σ | Mj, q) = mw(Mj, q, σ) / nw(Mj, q)    (3)

represents the relative frequency with which Mj emits σ at q during emission of w.

It is straightforward to lift this definition to data sequences D = w1⋉, w2⋉, . . . , w|D|⋉ by letting w = w1 ⋉ w2 ⋉ . . . w|D|⋉.
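A sketch of these counts in Python (our code and encodings, not the paper's): run the data through one factor PDFA, tallying m and n. With the single-state machine Mλ from the later example, it reproduces the Mλ row of the frequency table there.

    from collections import Counter

    def relative_frequencies(initial, delta, data):
        """m, n, and freq (eq. (3)) for one factor PDFA over a data sequence D.

        delta[(q, sigma)] is the deterministic successor state; ">" encodes the
        end marker ⋉ and resets the machine to its initial state."""
        m, n = Counter(), Counter()
        state = initial
        for word in data:
            for sigma in list(word) + [">"]:
                m[(state, sigma)] += 1      # sigma emitted at `state`
                n[state] += 1               # `state` visited
                state = initial if sigma == ">" else delta[(state, sigma)]
        return {k: m[k] / n[k[0]] for k in m}

    # M_lambda: the single-state machine, where every symbol loops.
    delta = {("λ", s): "λ" for s in "ab"}
    print(relative_frequencies("λ", delta, ["abb", "bbb"]))
    # {('λ', 'a'): 0.125, ('λ', 'b'): 0.625, ('λ', '>'): 0.25}, i.e. 1/8, 5/8, 2/8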

SLIDE 89

Empirical Mean of co-emission probabilities

    sumCoemitw(σ, Mj, q) = Σ_{i : q(j,i)=q} Coemit(σ, i)

The empirical mean of a co-emission probability is defined as follows:

    Coemitw(σ | Mj, q) = sumCoemitw(σ, Mj, q) / nw(Mj, q)    (4)

This is the sample average of the co-emission probability when q ∈ Qj is visited.
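A sketch of eq. (4) in Python (ours; the list-of-dicts encoding of the per-position co-emission distributions is an assumption): average Coemit(σ, i) over the positions where Mj sits in state q.

    def empirical_mean_coemit(coemits, trajectory, sigma, q):
        """Eq. (4): average of Coemit(sigma, i) over positions i with q(j, i) = q.

        coemits:    list of dicts, coemits[i][sigma] = Coemit(sigma, i+1)
        trajectory: list of Mj's states, trajectory[i] = q(j, i+1)
        """
        hits = [c[sigma] for c, state in zip(coemits, trajectory) if state == q]
        return sum(hits) / len(hits)

    # Toy usage: two visits to state "λ" with different co-emission values.
    coemits = [{"a": 0.4, "b": 0.6}, {"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}]
    trajectory = ["λ", "x", "λ"]
    print(empirical_mean_coemit(coemits, trajectory, "a", "λ"))  # (0.4 + 0.5) / 2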

SLIDE 91

Main Theorem

Consider any parameter Tj(q, σ) in PDFA Mj.

Theorem. ∂P(D)/∂Tj(q, σ) = 0 holds for all j if and only if the following equation is satisfied for all 1 ≤ j ≤ K:

    freqw(σ | Mj, q) = Coemitw(σ | Mj, q)

That is, at a stationary point the relative frequency (3) equals the empirical mean of the co-emission probability (4).

SLIDE 93

Example

(Figure: the 2-set of SD-PDFAs with Σ = {a, b}: the single-state machine Mλ, plus Ma and Mb, where Mx has states λ and x and moves from λ to x upon reading x. There are 15 parameters.)

Suppose D = abb⋉ bbb⋉. Then:

    freqD(a | Mλ, λ) = 1/8    freqD(b | Mλ, λ) = 5/8    freqD(⋉ | Mλ, λ) = 2/8
    freqD(a | Ma, λ) = 1/5    freqD(b | Ma, λ) = 3/5    freqD(⋉ | Ma, λ) = 1/5
    freqD(a | Ma, a) = 0/3    freqD(b | Ma, a) = 2/3    freqD(⋉ | Ma, a) = 1/3
    freqD(a | Mb, λ) = 1/3    freqD(b | Mb, λ) = 2/3    freqD(⋉ | Mb, λ) = 0/3
    freqD(a | Mb, b) = 0/5    freqD(b | Mb, b) = 3/5    freqD(⋉ | Mb, b) = 2/5

Figure: Frequency computations with D = abb⋉ bbb⋉ and the 2-set of SD-PDFAs above.

SLIDE 98

Convexity of the Negative Log Likelihood

Let τj,q,σ denote log Tj(q, σ), i.e., the log of a parameter of C defined with ⊗_j Mj.

Then τ can be thought of as a vector in Rⁿ where n is the number of parameters.

Theorem. − log P(w⋉) is convex with respect to τ ∈ Rⁿ.

Thus the solution obtained by the previous theorem is an MLE.
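For intuition (our illustration, not the paper's proof): in the log-parameterization each position contributes logsumexp(τ) minus a linear term, the classic convex form. A tiny numeric midpoint-convexity check for a one-state machine over {a, b, ⋉}, where the co-emission product with K = 1 reduces to an ordinary PDFA:

    import math, random

    def nll(tau, word="ab"):
        """-log P(w⋉) for a one-state PDFA with log-parameters tau = (τa, τb, τ⋉).
        Each position contributes logsumexp(tau) - tau[sigma]: convex in tau."""
        idx = {"a": 0, "b": 1, ">": 2}
        lse = math.log(sum(math.exp(t) for t in tau))
        return sum(lse - tau[idx[s]] for s in list(word) + [">"])

    random.seed(0)
    u = [random.gauss(0, 1) for _ in range(3)]
    v = [random.gauss(0, 1) for _ in range(3)]
    mid = [(a + b) / 2 for a, b in zip(u, v)]
    # Midpoint convexity: f(mid) <= (f(u) + f(v)) / 2
    print(nll(mid) <= (nll(u) + nll(v)) / 2)  # True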

SLIDE 102

Discussion

At a high level, the problem we considered is the decomposition of complex probability distributions into simpler factors.

This has also been studied in the context of Bayesian networks, Markov random fields, and probabilistic graphical models more generally (Bishop, 2006; Koller and Friedman, 2009).

A reviewer points out that this literature may simplify our proofs.

SLIDE 103

Future Work

1. Language Modeling with various sets of specific factors and various corpora, such as . . .

   1. SLk + SPk
   2. SLPk,ℓ (Rogers and Lambert 2019, MoL)
   3. Atomic PDFA based on phonological features (Chandlee et al. 2019, MoL)

2. . . . and compare to NNs, ALERGIA, and other algorithms on various benchmarks.

3. Connections to probabilistic graphical models.

4. Extend results to weighted deterministic automata.

SLIDE 104

Thanks

We acknowledge Canaan Breiss, Morgan Cassels, Huteng Dai, Danny DeSantiago, Anton Kukhto, Jon Rawski, Yang Wang, and Yuhong Zhu for valuable feedback on a draft presentation.

Questions?