
The EM Algorithm, Part I: Maximum-Likelihood Estimation and Models with Hidden Variables - PowerPoint PPT Presentation



  1. 6.864 (Fall 2007): The EM Algorithm, Part I

  Overview
  • Maximum-Likelihood Estimation
  • Models with hidden variables
  • The EM algorithm for a simple example (3 coins)
  • The general form of the EM algorithm
  • Hidden Markov models

  An Experiment/Some Intuition
  • I have three coins in my pocket:
    Coin 0 has probability λ of heads;
    Coin 1 has probability p_1 of heads;
    Coin 2 has probability p_2 of heads
  • For each trial I do the following:
    First I toss Coin 0.
    If Coin 0 turns up heads, I toss Coin 1 three times.
    If Coin 0 turns up tails, I toss Coin 2 three times.
  • I don't tell you whether Coin 0 came up heads or tails, or whether Coin 1 or Coin 2 was tossed three times, but I do tell you how many heads/tails are seen at each trial
  • You see the following sequence:
    ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
    What would you estimate as the values for λ, p_1 and p_2?

  Maximum Likelihood Estimation
  • We have data points x_1, x_2, ..., x_n drawn from some set X
  • We have a parameter vector Θ
  • We have a parameter space Ω
  • We have a distribution P(x | Θ) for any Θ ∈ Ω, such that
    Σ_{x ∈ X} P(x | Θ) = 1  and  P(x | Θ) ≥ 0 for all x
  • We assume that our data points x_1, x_2, ..., x_n are drawn at random (independently, identically distributed) from a distribution P(x | Θ*) for some Θ* ∈ Ω
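
  To make the generative story concrete, here is a minimal Python sketch (not part of the slides) of one trial: toss Coin 0, then toss Coin 1 or Coin 2 three times and reveal only the heads/tails pattern. The function name sample_trial and the parameter values in the example call are illustrative assumptions.

    import random

    def sample_trial(lam, p1, p2):
        """One trial: toss Coin 0, then toss Coin 1 or Coin 2 three times."""
        y = 'H' if random.random() < lam else 'T'   # hidden outcome of Coin 0
        p = p1 if y == 'H' else p2                  # coin used for the three visible tosses
        x = ''.join('H' if random.random() < p else 'T' for _ in range(3))
        return x, y                                 # only x is revealed to the observer

    # Example: generate five partially observed trials (the y's stay hidden)
    random.seed(0)
    trials = [sample_trial(lam=0.5, p1=0.9, p2=0.1)[0] for _ in range(5)]
    print(trials)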

  2. Log-Likelihood
  • We have data points x_1, x_2, ..., x_n drawn from some set X
  • We have a parameter vector Θ, and a parameter space Ω
  • We have a distribution P(x | Θ) for any Θ ∈ Ω
  • The likelihood is
    Likelihood(Θ) = P(x_1, x_2, ..., x_n | Θ) = ∏_{i=1}^{n} P(x_i | Θ)
  • The log-likelihood is
    L(Θ) = log Likelihood(Θ) = Σ_{i=1}^{n} log P(x_i | Θ)

  A First Example: Coin Tossing
  • X = {H, T}. Our data points x_1, x_2, ..., x_n are a sequence of heads and tails, e.g.
    HHTTHHHTHH
  • The parameter vector Θ is a single parameter, i.e., the probability of the coin coming up heads
  • The parameter space is Ω = [0, 1]
  • The distribution P(x | Θ) is defined as
    P(x | Θ) = Θ if x = H,  1 − Θ if x = T

  Maximum Likelihood Estimation
  • Given a sample x_1, x_2, ..., x_n, choose
    Θ_ML = argmax_{Θ ∈ Ω} L(Θ) = argmax_{Θ ∈ Ω} Σ_i log P(x_i | Θ)
  • For example, take the coin example: say x_1 ... x_n has Count(H) heads and (n − Count(H)) tails
    ⇒ L(Θ) = log ( Θ^{Count(H)} × (1 − Θ)^{n − Count(H)} )
            = Count(H) log Θ + (n − Count(H)) log(1 − Θ)
  • We now have
    Θ_ML = Count(H) / n

  A Second Example: Probabilistic Context-Free Grammars
  • X is the set of all parse trees generated by the underlying context-free grammar. Our sample is n trees T_1 ... T_n such that each T_i ∈ X.
  • R is the set of rules in the context-free grammar;
    N is the set of non-terminals in the grammar
  • Θ_r for r ∈ R is the parameter for rule r
  • Let R(α) ⊂ R be the rules of the form α → β for some α
  • The parameter space Ω is the set of Θ ∈ [0, 1]^{|R|} such that
    Σ_{r ∈ R(α)} Θ_r = 1  for all α ∈ N
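
  As a small illustration of the coin-tossing case above, the following Python sketch (not from the slides) evaluates L(Θ) for the example sequence HHTTHHHTHH and checks numerically that Θ_ML = Count(H)/n maximizes it; the grid search is just a sanity check.

    import math

    sample = "HHTTHHHTHH"
    n = len(sample)
    count_h = sample.count("H")

    def log_likelihood(theta):
        # L(theta) = Count(H) log(theta) + (n - Count(H)) log(1 - theta)
        return count_h * math.log(theta) + (n - count_h) * math.log(1 - theta)

    theta_ml = count_h / n                        # closed form from the slide
    print(theta_ml, log_likelihood(theta_ml))     # 0.7 and its log-likelihood

    # sanity check: no theta on a fine grid beats Count(H)/n
    grid = [i / 1000 for i in range(1, 1000)]
    print(max(grid, key=log_likelihood))          # approximately 0.7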

  3. A Second Example: Probabilistic Context-Free Grammars (continued)
  • We have
    P(T | Θ) = ∏_{r ∈ R} Θ_r^{Count(T, r)}
    where Count(T, r) is the number of times rule r is seen in the tree T
    ⇒ log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r

  Maximum Likelihood Estimation for PCFGs
  • We have
    log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r
    where Count(T, r) is the number of times rule r is seen in the tree T
  • And,
    L(Θ) = Σ_i log P(T_i | Θ) = Σ_i Σ_{r ∈ R} Count(T_i, r) log Θ_r
  • Solving Θ_ML = argmax_{Θ ∈ Ω} L(Θ) gives
    Θ_r = Σ_i Count(T_i, r) / Σ_i Σ_{s ∈ R(α)} Count(T_i, s)
    where r is of the form α → β for some β

  Multinomial Distributions
  • X is a finite set, e.g., X = {dog, cat, the, saw}
  • Our sample x_1, x_2, ..., x_n is drawn from X,
    e.g., x_1, x_2, x_3 = dog, the, saw
  • The parameter Θ is a vector in R^m where m = |X|,
    e.g., Θ_1 = P(dog), Θ_2 = P(cat), Θ_3 = P(the), Θ_4 = P(saw)
  • The parameter space is
    Ω = { Θ : Σ_{i=1}^{m} Θ_i = 1 and Θ_i ≥ 0 for all i }
  • If our sample is x_1, x_2, x_3 = dog, the, saw, then
    L(Θ) = log P(x_1, x_2, x_3 = dog, the, saw | Θ) = log Θ_1 + log Θ_3 + log Θ_4

  Overview
  • Maximum-Likelihood Estimation
  • Models with hidden variables
  • The EM algorithm for a simple example (3 coins)
  • The general form of the EM algorithm
  • Hidden Markov models
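
  The PCFG estimate Θ_r above is a ratio of rule counts. The Python sketch below shows that computation on two made-up toy trees, each represented only by its rule counts; the rule inventory and the dictionary representation are illustrative assumptions, not something given in the slides.

    from collections import defaultdict

    # Each "tree" is represented by its rule counts Count(T, r); a rule is (lhs, rhs).
    trees = [
        {("S", "NP VP"): 1, ("NP", "dog"): 1, ("VP", "saw NP"): 1, ("NP", "cat"): 1},
        {("S", "NP VP"): 1, ("NP", "cat"): 1, ("VP", "slept"): 1},
    ]

    rule_counts = defaultdict(float)    # sum_i Count(T_i, r)
    lhs_counts = defaultdict(float)     # sum_i sum_{s in R(alpha)} Count(T_i, s)
    for tree in trees:
        for (lhs, rhs), c in tree.items():
            rule_counts[(lhs, rhs)] += c
            lhs_counts[lhs] += c

    # theta_r = sum_i Count(T_i, r) / sum_i sum_{s in R(alpha)} Count(T_i, s)
    theta = {r: rule_counts[r] / lhs_counts[r[0]] for r in rule_counts}
    print(theta[("NP", "cat")])   # 2/3: "cat" accounts for 2 of the 3 NP rewrites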

  4. Models with Hidden Variables
  • Now say we have two sets X and Y, and a joint distribution P(x, y | Θ)
  • If we had fully observed data, (x_i, y_i) pairs, then
    L(Θ) = Σ_i log P(x_i, y_i | Θ)
  • If we have partially observed data, x_i examples only, then
    L(Θ) = Σ_i log P(x_i | Θ) = Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)
  • The EM (Expectation Maximization) algorithm is a method for finding
    Θ_ML = argmax_Θ Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)

  Overview
  • Maximum-Likelihood Estimation
  • Models with hidden variables
  • The EM algorithm for a simple example (3 coins)
  • The general form of the EM algorithm
  • Hidden Markov models

  The Three Coins Example
  • e.g., in the three coins example:
    Y = {H, T}
    X = {HHH, TTT, HTT, THH, HHT, TTH, HTH, THT}
    Θ = {λ, p_1, p_2}
  • and
    P(x, y | Θ) = P(y | Θ) P(x | y, Θ)
    where
    P(y | Θ) = λ if y = H,  1 − λ if y = T
    and
    P(x | y, Θ) = p_1^h (1 − p_1)^t if y = H,  p_2^h (1 − p_2)^t if y = T
    where h = number of heads in x, and t = number of tails in x
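
  The joint distribution P(x, y | Θ) and the marginal log-likelihood with y summed out can be coded directly from the definitions above. In the Python sketch below, the data is the sequence from the earlier slide; the parameter values in the example call are illustrative.

    import math

    def p_y(y, lam):
        return lam if y == "H" else 1 - lam

    def p_x_given_y(x, y, p1, p2):
        h, t = x.count("H"), x.count("T")
        p = p1 if y == "H" else p2
        return p ** h * (1 - p) ** t

    def joint(x, y, lam, p1, p2):
        # P(x, y | Theta) = P(y | Theta) P(x | y, Theta)
        return p_y(y, lam) * p_x_given_y(x, y, p1, p2)

    def log_likelihood(xs, lam, p1, p2):
        # L(Theta) = sum_i log sum_y P(x_i, y | Theta)
        return sum(math.log(sum(joint(x, y, lam, p1, p2) for y in "HT")) for x in xs)

    data = ["HHH", "TTT", "HHH", "TTT", "HHH"]
    print(log_likelihood(data, lam=0.3, p1=0.3, p2=0.6))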

  5. The Three Coins Example
  • Various probabilities can be calculated, for example:
    P(x = THT, y = H | Θ) = λ p_1 (1 − p_1)^2
    P(x = THT, y = T | Θ) = (1 − λ) p_2 (1 − p_2)^2
    P(x = THT | Θ) = P(x = THT, y = H | Θ) + P(x = THT, y = T | Θ)
                   = λ p_1 (1 − p_1)^2 + (1 − λ) p_2 (1 − p_2)^2
    P(y = H | x = THT, Θ) = P(x = THT, y = H | Θ) / P(x = THT | Θ)
                          = λ p_1 (1 − p_1)^2 / ( λ p_1 (1 − p_1)^2 + (1 − λ) p_2 (1 − p_2)^2 )
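
  The posterior on this slide is the joint probability for y = H divided by the marginal probability of x. A short Python sketch of that calculation, with illustrative parameter values:

    def posterior_heads(x, lam, p1, p2):
        h, t = x.count("H"), x.count("T")
        joint_h = lam * p1 ** h * (1 - p1) ** t          # P(x, y = H | Theta)
        joint_t = (1 - lam) * p2 ** h * (1 - p2) ** t    # P(x, y = T | Theta)
        return joint_h / (joint_h + joint_t)             # P(y = H | x, Theta)

    print(posterior_heads("THT", lam=0.3, p1=0.3, p2=0.6))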

  6. The Three Coins Example
  • Fully observed data might look like:
    (⟨HHH⟩, H), (⟨TTT⟩, T), (⟨HHH⟩, H), (⟨TTT⟩, T), (⟨HHH⟩, H)
  • In this case the maximum likelihood estimates are:
    λ = 3/5,  p_1 = 9/9,  p_2 = 0/6

  The Three Coins Example
  • Partially observed data might look like:
    ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
  • How do we find the maximum likelihood parameters?

  The Three Coins Example
  • Partially observed data might look like:
    ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
  • If the current parameters are λ, p_1, p_2:
    P(y = H | x = ⟨HHH⟩) = P(⟨HHH⟩, H) / ( P(⟨HHH⟩, H) + P(⟨HHH⟩, T) )
                         = λ p_1^3 / ( λ p_1^3 + (1 − λ) p_2^3 )
    P(y = H | x = ⟨TTT⟩) = P(⟨TTT⟩, H) / ( P(⟨TTT⟩, H) + P(⟨TTT⟩, T) )
                         = λ (1 − p_1)^3 / ( λ (1 − p_1)^3 + (1 − λ) (1 − p_2)^3 )

  The Three Coins Example
  • If the current parameters are λ, p_1, p_2:
    P(y = H | x = ⟨HHH⟩) = λ p_1^3 / ( λ p_1^3 + (1 − λ) p_2^3 )
    P(y = H | x = ⟨TTT⟩) = λ (1 − p_1)^3 / ( λ (1 − p_1)^3 + (1 − λ) (1 − p_2)^3 )
  • If λ = 0.3, p_1 = 0.3, p_2 = 0.6:
    P(y = H | x = ⟨HHH⟩) = 0.0508
    P(y = H | x = ⟨TTT⟩) = 0.6967
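
  For the fully observed case above, the maximum-likelihood estimates are simple ratios of counts; the Python sketch below recomputes λ = 3/5, p_1 = 9/9 and p_2 = 0/6 from the labeled pairs.

    pairs = [("HHH", "H"), ("TTT", "T"), ("HHH", "H"), ("TTT", "T"), ("HHH", "H")]

    lam = sum(1 for _, y in pairs if y == "H") / len(pairs)       # fraction of trials with Coin 0 = H

    heads1 = sum(x.count("H") for x, y in pairs if y == "H")      # heads seen from Coin 1
    tosses1 = sum(len(x) for x, y in pairs if y == "H")           # total tosses of Coin 1
    heads2 = sum(x.count("H") for x, y in pairs if y == "T")      # heads seen from Coin 2
    tosses2 = sum(len(x) for x, y in pairs if y == "T")           # total tosses of Coin 2

    p1 = heads1 / tosses1                                         # 9/9 = 1.0
    p2 = heads2 / tosses2                                         # 0/6 = 0.0
    print(lam, p1, p2)                                            # 0.6 1.0 0.0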
