Maximum Likelihood Estimation of Factored Regular Deterministic Stochastic Languages

Chihiro Shibata and Jeffrey Heinz
University of Toronto, July 19, 2019
We thank JSPS KAKENHI #JP18K11449 (CS) and NIH #R01HD87133-01 (JH).


Overview of Related Results (part 2)
1. The class of all DFAs is not identifiable in the limit from positive data (Gold 1967).
2. It is NP-hard to find the minimal DFA consistent with a finite sample of positive and negative examples (Gold 1978).
3. Each DFA admits a characteristic sample D of positive and negative examples such that RPNI identifies the DFA from any superset of D in cubic time (Oncina and Garcia 1992, Dupont 1996).
4. ALERGIA/RLIPS (based on RPNI) (Carrasco and Oncina 1994, 1999) learns the class of PDFAs in polynomial time with probability one (de la Higuera and Thollard 2001).
5. Clark and Thollard (2004) present an algorithm which learns the class of PDFAs in a modified PAC setting. (See also Parekh and Honavar 2001.)
6. Expectation-Maximization techniques are used to learn the class of PNFAs, but there is no guarantee of finding a global optimum (Rabiner 1989).

Defining C with finitely many DFAs
How do you define a class C with finitely many DFAs?
[Figure: three atomic DFAs over Σ = {a, b, c}, each with start state λ.]
Product Operations
1. For Boolean languages, use the acceptor product (yields intersection).
2. For stochastic languages, use the co-emission product (yields a joint distribution).

The product of those three acceptors
[Figure: the product DFA, with states λ, a, b, c, ab, ac, bc, abc; the exit/accepting arrow at each state is not shown.]
• If C is defined by this DFA, then C = Piecewise 2-Testable.
• If C is defined by the 3 atomic DFAs, then C = Strictly 2-Piecewise.
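Below is a minimal, illustrative sketch (ours, not from the slides) of the acceptor product referred to above, under an assumed representation in which each factor DFA is a dict mapping (state, symbol) to the next state. It only builds the reachable product states and transitions; acceptance is omitted, just as the exit arrows are omitted in the figure.

```python
def product_dfa(dfas, starts, sigma):
    """Build the reachable part of the product of several DFAs run in lockstep."""
    start = tuple(starts)
    trans, frontier, seen = {}, [start], {start}
    while frontier:
        qs = frontier.pop()
        for s in sigma:
            nxt = tuple(d[(q, s)] for d, q in zip(dfas, qs))
            trans[(qs, s)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return trans, start, seen

# Hypothetical SP-2 style factors over {a, b, c}: M_x remembers whether x
# has been seen (state '' = not yet seen, state x = seen).
def factor(x, sigma):
    return {(q, s): (x if s == x else q) for q in ('', x) for s in sigma}

sigma = 'abc'
dfas = [factor(x, sigma) for x in sigma]
trans, start, states = product_dfa(dfas, ['', '', ''], sigma)
print(len(states))   # 8 product states, matching the figure above
```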

Cause . . .
[Figure: the three atomic DFAs again, now with one transition in one of them marked ✗.]
The parameters of the model are set at the level of the individual DFA.

. . . and Effect
[Figure: the product DFA, with several of its transitions marked ✗ (those affected by the single ✗ in the atomic DFA); the exit/accepting arrow at each state is not shown.]

Comparing the Representations
The Product DFA
1. In the worst case, it has $\prod_i |Q_i|$ states and $(|\Sigma|+1) \prod_i |Q_i|$ parameters.
2. Transitions/parameters are independent of one another.
The Atomic DFAs
1. The atomic DFAs have a total of $\sum_i |Q_i|$ states and $(|\Sigma|+1) \sum_i |Q_i|$ parameters.
2. The transitions in the product are NOT independent.
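As a concrete count for the running example (three atomic DFAs of 2 states each over Σ = {a, b, c}; this arithmetic is ours, not from the slides):

$$ \text{product DFA: } \prod_i |Q_i| = 2^3 = 8 \text{ states}, \quad (|\Sigma|+1)\prod_i |Q_i| = 4 \cdot 8 = 32 \text{ parameters}; $$
$$ \text{atomic DFAs: } \sum_i |Q_i| = 2+2+2 = 6 \text{ states}, \quad (|\Sigma|+1)\sum_i |Q_i| = 4 \cdot 6 = 24 \text{ parameters}. $$

With more factors the gap grows exponentially, since a product of state counts is compared against a sum.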

Pluses and Minuses
+ Fewer parameters means more accurate estimation of model parameters with less data.
− Fewer parameters means the model is less expressive.
• Heinz and Rogers (2013, MoL) extend the method of 'activating' data-parsed transitions from learning classes of Boolean languages defined with a single DFA to learning classes of Boolean languages defined with finitely many DFAs (see the sketch below).
• They show it always returns the smallest Boolean language in the class consistent with the data, and thus identifies the class in the limit from positive data.
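The following is a rough sketch (ours, not Heinz and Rogers' code) of the 'activate data-parsed transitions' idea for a single factor, assuming each factor DFA is a dict mapping (state, symbol) to the next state: run the positive data through the machine and record which transitions it uses; roughly, the hypothesis then permits only the recorded transitions in each factor.

```python
def activate_transitions(dfa, start, data):
    """Return the transitions of one factor DFA exercised by the positive sample."""
    active = set()
    for word in data:
        q = start
        for sym in word:
            active.add((q, sym))
            q = dfa[(q, sym)]
    return active

# Hypothetical factor M_a over {a, b}: stays in 'λ' until an 'a' is read.
M_a = {('λ', 'a'): 'a', ('λ', 'b'): 'λ',
       ('a', 'a'): 'a', ('a', 'b'): 'a'}
print(activate_transitions(M_a, 'λ', ['abb', 'bbb']))
# {('λ', 'a'), ('a', 'b'), ('λ', 'b')}
```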

The Co-emission Product
• The co-emission product defines how PDFA-definable stochastic languages can be multiplied together to yield a well-defined stochastic language.
• Heinz and Rogers (2010) defined stochastic Strictly k-Piecewise languages using a variant of the co-emission product.
• They claimed they could find the MLE, but nobody seemed convinced.

Pr(x | P≤1(y))      x = s     ts       ʃ        tʃ
  y = s             0.0335   0.0051   0.0011   0.0002
  y = ts            0.0218   0.0113   0.0009   0.
  y = ʃ             0.0009   0.       0.0671   0.0353
  y = tʃ            0.0006   0.       0.0455   0.0313

Table: Results of SP2 estimation on the Samala corpus. Only sibilants are shown. (Heinz and Rogers 2010, p. 894)

The Co-Emission Product (definition)
[Figure: two PDFAs in states q_1 and q_2, where machine i emits σ with probability q_{iσ} (e.g. a with probability q_{ia}) and then moves to state r_i. Their co-emission product ⊗ moves from state (q_1, q_2) to (r_1, r_2) on a, where]

$$ P(a \mid q_1, q_2) \;\overset{\text{def}}{=}\; \frac{\prod_i q_{ia}}{\sum_{\sigma} \prod_i q_{i\sigma}} $$

For fixed σ, the co-emission product treats the parameters q_{iσ} as independent.
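A one-step illustration of this definition (a sketch with made-up numbers, not from the talk): given each factor's emission probabilities at its current state, the product's emission distribution is the renormalized pointwise product.

```python
def coemission(dists):
    """Co-emission distribution from per-factor emission dicts (symbol -> prob)."""
    symbols = dists[0].keys()
    unnorm = {s: 1.0 for s in symbols}
    for d in dists:                     # ∏_i q_{iσ}
        for s in symbols:
            unnorm[s] *= d[s]
    total = sum(unnorm.values())        # ∑_σ ∏_i q_{iσ}
    return {s: p / total for s, p in unnorm.items()}

# Two hypothetical factor states over {a, b, c}:
q1 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
q2 = {'a': 0.1, 'b': 0.6, 'c': 0.3}
print(coemission([q1, q2]))   # {'a': 0.05/0.29, 'b': 0.18/0.29, 'c': 0.06/0.29}
```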

Contributions
1. We extend the Heinz and Rogers (2010) analysis to classes defined with
   1. the standard co-emission product (not the variant introduced by Heinz and Rogers),
   2. of arbitrary sets of finitely many PDFAs (not just the ones which define stochastic SP_k languages).
2. Essentially, we prove that the parameters which maximize the probability of the data with respect to such models are found by running the corpus through each of the individual factor PDFAs and calculating the relative frequencies (sketched below).
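A minimal sketch of the estimation procedure just described (our illustration, with an assumed representation of each factor PDFA as a transition dict and '#' standing in for ⋉): run the corpus through each factor and take relative frequencies of the symbols emitted at each state.

```python
from collections import Counter, defaultdict

def estimate_factor(dfa, start, corpus, end='#'):
    """Relative frequency of each symbol emitted at each state of one factor PDFA."""
    counts = defaultdict(Counter)             # state -> Counter over symbols
    for word in corpus:
        q = start
        for sym in word + end:
            counts[q][sym] += 1
            if sym != end:
                q = dfa[(q, sym)]
    return {q: {s: c / sum(cnt.values()) for s, c in cnt.items()}
            for q, cnt in counts.items()}

# Hypothetical factor M_a over {a, b}: stays in 'λ' until an 'a' is read.
M_a = {('λ', 'a'): 'a', ('λ', 'b'): 'λ',
       ('a', 'a'): 'a', ('a', 'b'): 'a'}
print(estimate_factor(M_a, 'λ', ['abb', 'bbb']))
# {'λ': {'a': 1/5, 'b': 3/5, '#': 1/5}, 'a': {'b': 2/3, '#': 1/3}}
```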

Some details of the analysis
1. Probability of Words
2. Relative Frequency of Emissions
3. Empirical Mean of co-emission probabilities
4. Main Theorems

Probability of words
• Consider a class C defined with the co-emission product of K machines M_1, …, M_K.
• Suppose that w = σ_1 ⋯ σ_N.
• Let q(j, i) denote the state in Q_j that is reached after M_j reads the prefix σ_1 ⋯ σ_{i−1}.
• If i = 1 then q(j, i) is the initial state of M_j.
• Let T_j(q, σ) denote a parameter (transition probability) of PDFA M_j.
• Then the probability that σ is emitted after the product machine ⊗_{1≤j≤K} M_j reads the prefix σ_1 ⋯ σ_{i−1} is the following:

$$ \mathrm{Coemit}(\sigma, i) = \frac{\prod_{j=1}^{K} T_j(q(j,i), \sigma)}{\sum_{\sigma' \in \Sigma} \prod_{j=1}^{K} T_j(q(j,i), \sigma')} \qquad (1) $$

• We assume that there is an end marker ⋉ ∈ Σ which occurs only at the end of words (so σ_{N+1} = ⋉). Then

$$ P(w\ltimes) = \prod_{i=1}^{N+1} \mathrm{Coemit}(\sigma_i, i) \qquad (2) $$
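A runnable sketch of equations (1) and (2) (our illustration; the machines and probabilities below are made up, and '#' plays the role of ⋉):

```python
def coemit(machines, states, sym, sigma):
    """Equation (1): probability that `sym` is co-emitted given each factor's state."""
    def joint(s):
        p = 1.0
        for q, m in zip(states, machines):
            p *= m['T'][(q, s)]
        return p
    return joint(sym) / sum(joint(s) for s in sigma)

def prob_word(machines, word, sigma, end='#'):
    """Equation (2): P(w⋉) under the co-emission product of the factors."""
    states = [m['start'] for m in machines]
    p = 1.0
    for sym in word + end:
        p *= coemit(machines, states, sym, sigma)
        if sym != end:
            states = [m['trans'][(q, sym)] for q, m in zip(states, machines)]
    return p

# Two hypothetical factors over {a, b}: M_λ (one state) and M_a
# (tracks whether an 'a' has been seen); the probabilities are made up.
M_lam = {'start': 'λ', 'trans': {('λ', 'a'): 'λ', ('λ', 'b'): 'λ'},
         'T': {('λ', 'a'): 0.2, ('λ', 'b'): 0.5, ('λ', '#'): 0.3}}
M_a = {'start': 'λ',
       'trans': {('λ', 'a'): 'a', ('λ', 'b'): 'λ',
                 ('a', 'a'): 'a', ('a', 'b'): 'a'},
       'T': {('λ', 'a'): 0.3, ('λ', 'b'): 0.4, ('λ', '#'): 0.3,
             ('a', 'a'): 0.1, ('a', 'b'): 0.6, ('a', '#'): 0.3}}

print(prob_word([M_lam, M_a], 'ab', {'a', 'b', '#'}))   # ≈ 0.0275
```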

Relative Frequency of Emission
• Let m_w(M_j, q, σ) ∈ $\mathbb{Z}_+$ denote how many times σ is emitted at state q while machine M_j emits w.
• Let n_w(M_j, q) ∈ $\mathbb{Z}_+$ denote how many times state q is visited while machine M_j emits w.
Then

$$ \mathrm{freq}_w(\sigma \mid M_j, q) = \frac{m_w(M_j, q, \sigma)}{n_w(M_j, q)}, \qquad (3) $$

represents the relative frequency with which M_j emits σ at q during the emission of w.
It is straightforward to lift this definition to data sequences D = ⟨w_1⋉, w_2⋉, …, w_{|D|}⋉⟩ by letting w = w_1⋉ w_2⋉ ⋯ w_{|D|}⋉.

Empirical Mean of co-emission probabilities

$$ \mathrm{sumCoemit}_w(\sigma, M_j, q) = \sum_{i \;\text{s.t.}\; q(j,i) = q} \mathrm{Coemit}(\sigma, i). $$

The empirical mean of a co-emission probability is defined as follows:

$$ \overline{\mathrm{Coemit}}_w(\sigma \mid M_j, q) = \frac{\mathrm{sumCoemit}_w(\sigma, M_j, q)}{n_w(M_j, q)}. \qquad (4) $$

This is the sample average of the co-emission probability when q ∈ Q_j is visited.
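A short sketch of equation (4) (ours), assuming the per-position co-emission distributions Coemit(·, i) and the state sequence q(j, i) of one factor M_j have already been computed (e.g., alongside the prob_word sketch above):

```python
from collections import defaultdict

def empirical_mean_coemit(coemit_dists, state_seq, symbols):
    """coemit_dists[i]: dict symbol -> Coemit(symbol, i);  state_seq[i]: q(j, i)."""
    total = defaultdict(float)    # (q, σ) -> sumCoemit_w(σ, M_j, q)
    visits = defaultdict(int)     # q      -> n_w(M_j, q)
    for dist, q in zip(coemit_dists, state_seq):
        visits[q] += 1
        for s in symbols:
            total[(q, s)] += dist[s]
    return {(q, s): total[(q, s)] / visits[q] for (q, s) in total}

# Two made-up positions at which M_j sat in state 'λ':
dists = [{'a': 0.2, 'b': 0.5, '#': 0.3},
         {'a': 0.1, 'b': 0.6, '#': 0.3}]
print(empirical_mean_coemit(dists, ['λ', 'λ'], {'a', 'b', '#'}))
# ≈ {('λ', 'a'): 0.15, ('λ', 'b'): 0.55, ('λ', '#'): 0.3}
```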

Main Theorem
Consider any parameter T_j(q, σ) in PDFA M_j.
Theorem. ∂P(D)/∂T_j(q, σ) = 0 holds for all j if and only if the following equation is satisfied for all 1 ≤ j ≤ K:

$$ \mathrm{freq}_w(\sigma \mid M_j, q) = \overline{\mathrm{Coemit}}_w(\sigma \mid M_j, q). $$

Example
[Figure: the 2-set of SD-PDFAs with Σ = {a, b}: M_λ (a single state λ), M_a (states λ and a), and M_b (states λ and b), each with start state λ. There are 15 parameters.]
Suppose D = abb⋉ bbb⋉.

Example (continued)

freq_D(a | M_λ, λ) = 1/8    freq_D(a | M_a, λ) = 1/5    freq_D(a | M_a, a) = 0/3
freq_D(b | M_λ, λ) = 5/8    freq_D(b | M_a, λ) = 3/5    freq_D(b | M_a, a) = 2/3
freq_D(⋉ | M_λ, λ) = 2/8    freq_D(⋉ | M_a, λ) = 1/5    freq_D(⋉ | M_a, a) = 1/3

freq_D(a | M_b, λ) = 1/3    freq_D(a | M_b, b) = 0/5
freq_D(b | M_b, λ) = 2/3    freq_D(b | M_b, b) = 3/5
freq_D(⋉ | M_b, λ) = 0/3    freq_D(⋉ | M_b, b) = 2/5

(At state b of M_b, the emissions in D are b, ⋉, b, b, ⋉: b three times and ⋉ twice out of five visits.)

Figure: Frequency computations with D = abb⋉ bbb⋉ and the 2-set of SD-PDFAs on the previous slide.

Convexity of the Negative Log Likelihood
Let τ_{j,q,σ} denote log T_j(q, σ), i.e. the log of a parameter of C defined with ⊗_j M_j.
Then τ can be thought of as a vector in R^n, where n is the number of parameters.
Theorem. −log P(w⋉) is convex with respect to τ ∈ R^n.
Thus the solution obtained by the previous theorem is an MLE.
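A small numerical illustration of this theorem (ours, with a made-up two-factor model over {a, b} and '#' standing in for ⋉): writing −log P(w⋉) as a function of τ = log T, each position contributes a linear term plus a log-sum-exp of linear terms, so the whole objective is convex in τ; the snippet spot-checks the midpoint inequality.

```python
import numpy as np

SIGMA = ['a', 'b', '#']
# Factor machines as (transition dict, start state).  M1 has one state; M2 tracks 'a'.
M1 = ({('λ', 'a'): 'λ', ('λ', 'b'): 'λ'}, 'λ')
M2 = ({('λ', 'a'): 'a', ('λ', 'b'): 'λ',
       ('a', 'a'): 'a', ('a', 'b'): 'a'}, 'λ')
STATES = [['λ'], ['λ', 'a']]                       # state inventory of each factor
PARAMS = [(j, q, s) for j, sts in enumerate(STATES) for q in sts for s in SIGMA]
IDX = {p: k for k, p in enumerate(PARAMS)}         # flat index of τ_{j,q,σ}

def nll(tau, word='abb'):
    """-log P(word + '#') under the co-emission product, with tau = log T."""
    qs = [M1[1], M2[1]]
    total = 0.0
    for sym in word + '#':
        scores = np.array([sum(tau[IDX[(j, q, s)]] for j, q in enumerate(qs))
                           for s in SIGMA])         # Σ_j τ_{j, q(j,i), σ}
        total += np.log(np.exp(scores).sum()) - scores[SIGMA.index(sym)]
        if sym != '#':
            qs = [m[0][(q, sym)] for m, q in zip((M1, M2), qs)]
    return total

rng = np.random.default_rng(0)
t1, t2 = rng.normal(size=len(IDX)), rng.normal(size=len(IDX))
assert nll((t1 + t2) / 2) <= (nll(t1) + nll(t2)) / 2 + 1e-12   # midpoint convexity
```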

Discussion
At a high level, the problem we considered is a decomposition of complex probability distributions into simpler factors.
