

  1. Expectation maximization Subhransu Maji CMPSCI 689: Machine Learning 14 April 2015

  2. Motivation
  Suppose you are building a naive Bayes spam classifier. After you are done, your boss tells you that there is no money to label the data.
  ‣ You have a probabilistic model that assumes labelled data, but you don't have any labels. Can you still do something?
  Amazingly, you can!
  ‣ Treat the labels as hidden variables and try to learn them simultaneously along with the parameters of the model
  Expectation Maximization (EM)
  ‣ A broad family of algorithms for solving hidden variable problems
  ‣ In today's lecture we will derive EM algorithms for clustering and naive Bayes classification and learn why EM works

  3. Gaussian mixture model for clustering
  Suppose the data comes from a Gaussian Mixture Model (GMM): there are K clusters, and the data from cluster k is drawn from a Gaussian with mean μ_k and variance σ_k². We will assume that the data comes with labels (we will soon remove this assumption).
  Generative story of the data:
  ‣ For each example n = 1, 2, ..., N:
  ➡ Choose a label y_n ∼ Mult(θ_1, θ_2, ..., θ_K)
  ➡ Choose an example x_n ∼ N(μ_{y_n}, σ²_{y_n})
  Likelihood of the data:
  p(D) = \prod_{n=1}^{N} p(y_n)\, p(x_n \mid y_n) = \prod_{n=1}^{N} \theta_{y_n}\, \mathcal{N}(x_n; \mu_{y_n}, \sigma^2_{y_n})
  p(D) = \prod_{n=1}^{N} \theta_{y_n}\, (2\pi\sigma^2_{y_n})^{-D/2} \exp\left( -\frac{\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}} \right)
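  To make the generative story concrete, here is a minimal Python sketch (not from the slides); the number of clusters, the parameter values, and the spherical-Gaussian assumption are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) GMM parameters: K = 2 clusters in D = 2 dimensions.
theta = np.array([0.6, 0.4])              # mixing weights theta_k
mu = np.array([[0.0, 0.0], [5.0, 5.0]])   # cluster means mu_k
sigma2 = np.array([1.0, 0.5])             # cluster variances sigma_k^2 (spherical)

N = 200
y = rng.choice(len(theta), size=N, p=theta)                          # y_n ~ Mult(theta_1, ..., theta_K)
x = mu[y] + rng.normal(size=(N, 2)) * np.sqrt(sigma2[y])[:, None]    # x_n ~ N(mu_{y_n}, sigma_{y_n}^2 I)
```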

  4. GMM: known labels
  Likelihood of the data:
  p(D) = \prod_{n=1}^{N} \theta_{y_n}\, (2\pi\sigma^2_{y_n})^{-D/2} \exp\left( -\frac{\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}} \right)
  If you knew the labels y_n, then the maximum-likelihood estimates of the parameters are easy:
  \theta_k = \frac{1}{N} \sum_n [y_n = k]   // fraction of examples with label k
  \mu_k = \frac{\sum_n [y_n = k]\, x_n}{\sum_n [y_n = k]}   // mean of all the examples with label k
  \sigma^2_k = \frac{\sum_n [y_n = k]\, \|x_n - \mu_k\|^2}{\sum_n [y_n = k]}   // variance of all the examples with label k
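  A minimal NumPy sketch of these closed-form estimates, assuming `x` is an N×D array, `y` holds integer labels in {0, ..., K-1}, and every cluster is non-empty; the function name is mine, not the lecture's.

```python
import numpy as np

def gmm_mle_known_labels(x, y, K):
    """Closed-form ML estimates of (theta_k, mu_k, sigma_k^2) when the labels y_n are observed."""
    N, D = x.shape
    theta = np.zeros(K)
    mu = np.zeros((K, D))
    sigma2 = np.zeros(K)
    for k in range(K):
        mask = (y == k)                                           # the indicator [y_n = k]
        theta[k] = mask.mean()                                    # fraction of examples with label k
        mu[k] = x[mask].mean(axis=0)                              # mean of the examples with label k
        sigma2[k] = ((x[mask] - mu[k]) ** 2).sum(axis=1).mean()   # variance, as defined on the slide
    return theta, mu, sigma2
```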

  5. GMM: unknown labels
  Now suppose you didn't have the labels y_n. Analogous to k-means, one solution is to iterate: start by guessing the parameters and then repeat two steps:
  ‣ Estimate labels given the parameters
  ‣ Estimate parameters given the labels
  In k-means we assigned each point to a single cluster, also called a hard assignment (point 10 goes to cluster 2).
  In expectation maximization (EM) we will use a soft assignment (point 10 goes half to cluster 2 and half to cluster 5); see the small example after this slide.
  Let's define a random variable z_n = [z_{n,1}, z_{n,2}, ..., z_{n,K}] to denote the assignment vector for the n-th point:
  ‣ Hard assignment: only one of the z_{n,k} is 1, the rest are 0
  ‣ Soft assignment: the z_{n,k} are positive and sum to 1
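  A tiny illustration of the slide's example (the slide's cluster ids are 1-indexed, so cluster 2 is index 1 of the array):

```python
import numpy as np

# Assignment vectors for one point with K = 5 clusters.
z_hard = np.array([0.0, 1.0, 0.0, 0.0, 0.0])   # k-means: point 10 goes entirely to cluster 2
z_soft = np.array([0.0, 0.5, 0.0, 0.0, 0.5])   # EM: point 10 goes half to cluster 2, half to cluster 5
assert z_hard.sum() == 1.0 and z_soft.sum() == 1.0 and (z_soft >= 0).all()
```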

  6. GMM: parameter estimation
  Formally, z_{n,k} is the probability that the n-th point goes to cluster k:
  z_{n,k} = p(y_n = k \mid x_n) = \frac{p(y_n = k, x_n)}{p(x_n)} \propto p(y_n = k)\, p(x_n \mid y_n = k) = \theta_k\, \mathcal{N}(x_n; \mu_k, \sigma^2_k)
  Given a set of parameters (θ_k, μ_k, σ_k²), z_{n,k} is easy to compute.
  Given z_{n,k}, we can update the parameters (θ_k, μ_k, σ_k²) as:
  \theta_k = \frac{1}{N} \sum_n z_{n,k}   // fraction of examples with label k
  \mu_k = \frac{\sum_n z_{n,k}\, x_n}{\sum_n z_{n,k}}   // mean of all the fractional examples with label k
  \sigma^2_k = \frac{\sum_n z_{n,k}\, \|x_n - \mu_k\|^2}{\sum_n z_{n,k}}   // variance of all the fractional examples with label k
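  A minimal sketch of these two updates, using the same array conventions as the earlier sketches (x is N×D; theta and sigma2 have length K; mu is K×D). The log-space normalization is an implementation detail I added for numerical stability, not something on the slide.

```python
import numpy as np

def gmm_e_step(x, theta, mu, sigma2):
    """Soft assignments z_{n,k} proportional to theta_k * N(x_n; mu_k, sigma_k^2 I)."""
    N, D = x.shape
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # ||x_n - mu_k||^2, shape (N, K)
    logp = np.log(theta) - 0.5 * D * np.log(2 * np.pi * sigma2) - d2 / (2 * sigma2)
    logp -= logp.max(axis=1, keepdims=True)                           # subtract the row max before exponentiating
    z = np.exp(logp)
    return z / z.sum(axis=1, keepdims=True)                           # each row sums to 1

def gmm_m_step(x, z):
    """The known-label estimates with indicators replaced by the fractional counts z_{n,k}."""
    N, D = x.shape
    Nk = z.sum(axis=0)                                                # sum_n z_{n,k}
    theta = Nk / N
    mu = (z.T @ x) / Nk[:, None]
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    sigma2 = (z * d2).sum(axis=0) / Nk                                # matches the slide's variance update
    return theta, mu, sigma2
```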

  7. GMM: example
  We have replaced the indicator variable [y_n = k] with p(y_n = k), which is the expectation of [y_n = k]. This is our guess of the labels.
  Just like k-means, EM is susceptible to local minima.
  Clustering example (k-means vs. GMM): http://nbviewer.ipython.org/github/NICTA/MLSS/tree/master/clustering/

  8. The EM framework
  We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ.
  The likelihood of the data and hidden variables:
  p(D) = \prod_n p(x_n, y_n \mid \theta)
  Only the x_n are known, so we compute the data likelihood by marginalizing out the y_n:
  p(X \mid \theta) = \prod_n \sum_{y_n} p(x_n, y_n \mid \theta)
  Parameter estimation by maximizing the log-likelihood:
  \theta_{ML} \leftarrow \arg\max_\theta \sum_n \log\left( \sum_{y_n} p(x_n, y_n \mid \theta) \right)
  This is hard to maximize since the sum is inside the log.

  9. Jensen's inequality
  Given a concave function f and a set of weights λ_i ≥ 0 with ∑_i λ_i = 1,
  Jensen's inequality states that f(∑_i λ_i x_i) ≥ ∑_i λ_i f(x_i).
  This is a direct consequence of concavity:
  ‣ f(ax + by) ≥ a f(x) + b f(y) when a ≥ 0, b ≥ 0, a + b = 1
  [Figure: a concave curve between f(x) and f(y); the point f(ax + by) on the curve lies above the point a f(x) + b f(y) on the chord.]
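  A quick numeric check with f = log, which is concave; the weights and points below are arbitrary choices of mine.

```python
import numpy as np

lam = np.array([0.3, 0.5, 0.2])   # lambda_i >= 0, summing to 1
x = np.array([1.0, 4.0, 9.0])
lhs = np.log(np.dot(lam, x))      # f(sum_i lambda_i x_i) ~= 1.41
rhs = np.dot(lam, np.log(x))      # sum_i lambda_i f(x_i) ~= 1.13
assert lhs >= rhs                 # Jensen's inequality for the concave log
```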

  10. The EM framework
  Construct a lower bound on the log-likelihood using Jensen's inequality:
  L(X \mid \theta) = \sum_n \log\left( \sum_{y_n} p(x_n, y_n \mid \theta) \right)
  = \sum_n \log\left( \sum_{y_n} q(y_n)\, \frac{p(x_n, y_n \mid \theta)}{q(y_n)} \right)
  \geq \sum_n \sum_{y_n} q(y_n) \log\left( \frac{p(x_n, y_n \mid \theta)}{q(y_n)} \right)   // Jensen's inequality
  = \sum_n \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right]
  \triangleq \hat{L}(X \mid \theta)
  Maximize the lower bound (the second term is independent of θ):
  \theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)
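  One standard way (not spelled out on the slide) to see how tight this bound is: for each n, the gap between the log-likelihood term and the bound is a KL divergence, which is non-negative and vanishes exactly when q(y_n) = p(y_n | x_n, θ). This anticipates the choice of q on the next slides.

```latex
\log \sum_{y_n} p(x_n, y_n \mid \theta)
  \;-\; \sum_{y_n} q(y_n) \log \frac{p(x_n, y_n \mid \theta)}{q(y_n)}
  \;=\; \sum_{y_n} q(y_n) \log \frac{q(y_n)}{p(y_n \mid x_n, \theta)}
  \;=\; \mathrm{KL}\!\left( q(y_n) \,\|\, p(y_n \mid x_n, \theta) \right) \;\ge\; 0
```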

  11. Lower bound illustrated
  Maximizing the lower bound increases the value of the original function if the lower bound touches the function at the current value.
  [Figure: the log-likelihood L(X | θ) with the lower bound L̂(X | θ) touching it at θ_t; maximizing the bound moves the estimate from θ_t to θ_{t+1}.]

  12. An optimal lower bound
  Any choice of the probability distribution q(y_n) is valid as long as the lower bound touches the function at the current estimate of θ:
  L(X \mid \theta_t) = \hat{L}(X \mid \theta_t)
  We can then pick the optimal q(y_n) by maximizing the lower bound:
  \arg\max_q \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right]
  This gives us q(y_n) \leftarrow p(y_n \mid x_n, \theta_t).
  ‣ Proof: use Lagrange multipliers with the "sum to one" constraint; see the sketch below.
  This is the distribution of the hidden variables conditioned on the data and the current estimate of the parameters.
  ‣ This is exactly what we computed in the GMM example.
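  A brief sketch of that Lagrange-multiplier argument (my expansion of the one-line proof hint): maximize the bound subject to the sum-to-one constraint,

```latex
\max_{q} \; \sum_{y_n} \big[ q(y_n) \log p(x_n, y_n \mid \theta_t) - q(y_n) \log q(y_n) \big]
          \;+\; \lambda \Big( \sum_{y_n} q(y_n) - 1 \Big)
```

  Setting the derivative with respect to each q(y_n) to zero gives log p(x_n, y_n | θ_t) − log q(y_n) − 1 + λ = 0, so q(y_n) ∝ p(x_n, y_n | θ_t); normalizing yields q(y_n) = p(x_n, y_n | θ_t) / ∑_{y'} p(x_n, y' | θ_t) = p(y_n | x_n, θ_t).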

  13. The EM algorithm
  We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ of the distribution p(x | θ).
  EM algorithm:
  ‣ Initialize the parameters θ randomly
  ‣ Iterate between the following two steps:
  ➡ E step: compute the probability distribution over the hidden variables: q(y_n) \leftarrow p(y_n \mid x_n, \theta)
  ➡ M step: maximize the lower bound: \theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)
  The EM algorithm is a great candidate when the M step can be done easily but p(x | θ) cannot be easily optimized over θ.
  ‣ For example, for GMMs it was easy to compute the means and variances given the memberships.
  A generic version of this loop is sketched below.
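  A minimal Python sketch of that loop. The callables `e_step`, `m_step`, and `log_likelihood` are hypothetical problem-specific pieces (for instance, the GMM versions sketched earlier), and the convergence test is a common choice rather than anything prescribed in the lecture.

```python
def em(x, init_params, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    """Alternate E and M steps until the data log-likelihood stops improving."""
    params = init_params
    prev_ll = float("-inf")
    for _ in range(max_iter):
        q = e_step(x, *params)              # E step: q(y_n) <- p(y_n | x_n, theta)
        params = m_step(x, q)               # M step: maximize the lower bound over theta
        ll = log_likelihood(x, *params)
        if ll - prev_ll < tol:              # EM never decreases the likelihood, so this check is safe
            break
        prev_ll = ll
    return params
```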

  14. Naive Bayes: revisited
  Consider the binary prediction problem.
  Let the data be distributed according to a probability distribution:
  p_\theta(y, x) = p_\theta(y, x_1, x_2, \ldots, x_D)
  We can simplify this using the chain rule of probability:
  p_\theta(y, x) = p_\theta(y)\, p_\theta(x_1 \mid y)\, p_\theta(x_2 \mid x_1, y) \cdots p_\theta(x_D \mid x_1, x_2, \ldots, x_{D-1}, y) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid x_1, x_2, \ldots, x_{d-1}, y)
  Naive Bayes assumption:
  p_\theta(x_d \mid x_{d'}, y) = p_\theta(x_d \mid y), \quad \forall\, d' \neq d
  E.g., the words "free" and "money" are independent given spam.

  15. Naive Bayes: a simple case
  Case: binary labels and binary features.
  p_\theta(y) = \mathrm{Bernoulli}(\theta_0)
  p_\theta(x_d \mid y = +1) = \mathrm{Bernoulli}(\theta_d^+)
  p_\theta(x_d \mid y = -1) = \mathrm{Bernoulli}(\theta_d^-)
  This gives 1 + 2D parameters in total.
  Probability of the data:
  p_\theta(y, x) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid y)
  = \theta_0^{[y=+1]} (1 - \theta_0)^{[y=-1]} \times \prod_{d=1}^{D} (\theta_d^+)^{[x_d=1,\, y=+1]} (1 - \theta_d^+)^{[x_d=0,\, y=+1]} \times \prod_{d=1}^{D} (\theta_d^-)^{[x_d=1,\, y=-1]} (1 - \theta_d^-)^{[x_d=0,\, y=-1]}   // first product: label +1; second product: label -1

  16. Naive Bayes: parameter estimation
  Given data, we can estimate the parameters by maximizing the data likelihood.
  The maximum-likelihood estimates are:
  \hat{\theta}_0 = \frac{\sum_n [y_n = +1]}{N}   // fraction of the data with label +1
  \hat{\theta}_d^+ = \frac{\sum_n [x_{d,n} = 1,\, y_n = +1]}{\sum_n [y_n = +1]}   // fraction of the instances with x_d = 1 among label +1
  \hat{\theta}_d^- = \frac{\sum_n [x_{d,n} = 1,\, y_n = -1]}{\sum_n [y_n = -1]}   // fraction of the instances with x_d = 1 among label -1
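  A minimal NumPy sketch of these estimates, assuming `x` is an N×D array of 0/1 features and `y` is a length-N array of ±1 labels; the function name is mine.

```python
import numpy as np

def nb_mle(x, y):
    """ML estimates for binary naive Bayes from fully labelled data."""
    pos = (y == +1)
    neg = (y == -1)
    theta0 = pos.mean()                 # fraction of the data with label +1
    theta_pos = x[pos].mean(axis=0)     # per-feature fraction of x_d = 1 among label +1
    theta_neg = x[neg].mean(axis=0)     # per-feature fraction of x_d = 1 among label -1
    return theta0, theta_pos, theta_neg
```

  In practice one usually adds Laplace smoothing so that no estimate is exactly 0 or 1, which would otherwise make the log-probabilities in the next slide's E step undefined; the slide's estimates are the unsmoothed counts.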

  17. Naive Bayes: EM
  Now suppose you don't have the labels y_n.
  Initialize the parameters θ randomly.
  E step: compute the distribution over the hidden variables q(y_n):
  q(y_n = +1) = p(y_n = +1 \mid x_n, \theta) \propto \theta_0 \prod_{d=1}^{D} (\theta_d^+)^{[x_{d,n}=1]} (1 - \theta_d^+)^{[x_{d,n}=0]}
  M step: estimate θ given the guesses:
  \theta_0 = \frac{\sum_n q(y_n = +1)}{N}   // fraction of the data with label +1
  \theta_d^+ = \frac{\sum_n [x_{d,n} = 1]\, q(y_n = +1)}{\sum_n q(y_n = +1)}   // fraction of the instances with x_d = 1 among +1
  \theta_d^- = \frac{\sum_n [x_{d,n} = 1]\, q(y_n = -1)}{\sum_n q(y_n = -1)}   // fraction of the instances with x_d = 1 among -1
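  A minimal sketch of this E/M pair with the same array conventions as above. The symmetric computation of the −1 class and the log-space normalization are details I filled in (the slide writes only the +1 case, up to proportionality); it assumes all parameters lie strictly in (0, 1).

```python
import numpy as np

def nb_e_step(x, theta0, theta_pos, theta_neg):
    """q(y_n = +1), obtained by normalizing the two unnormalized class scores in log space."""
    log_pos = np.log(theta0) + x @ np.log(theta_pos) + (1 - x) @ np.log(1 - theta_pos)
    log_neg = np.log(1 - theta0) + x @ np.log(theta_neg) + (1 - x) @ np.log(1 - theta_neg)
    m = np.maximum(log_pos, log_neg)                     # for numerical stability
    p_pos, p_neg = np.exp(log_pos - m), np.exp(log_neg - m)
    return p_pos / (p_pos + p_neg)                       # q(y_n = -1) = 1 - q(y_n = +1)

def nb_m_step(x, q_pos):
    """The fully-labelled estimates with indicator counts replaced by the expected counts q(y_n)."""
    q_neg = 1.0 - q_pos
    theta0 = q_pos.mean()                                # expected fraction of the data with label +1
    theta_pos = (q_pos @ x) / q_pos.sum()                # expected fraction of x_d = 1 among +1
    theta_neg = (q_neg @ x) / q_neg.sum()                # expected fraction of x_d = 1 among -1
    return theta0, theta_pos, theta_neg
```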

