

  1. COMS 4721: Machine Learning for Data Science Lecture 15, 3/23/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

  2. MAXIMUM LIKELIHOOD

  3. APPROACHES TO DATA MODELING Our approaches to modeling data thus far have been either probabilistic or non-probabilistic in motivation.
  ◮ Probabilistic models: probability distributions defined on data, e.g.,
    1. Bayes classifiers
    2. Logistic regression
    3. Least squares and ridge regression (using the ML and MAP interpretations)
    4. Bayesian linear regression
  ◮ Non-probabilistic models: no probability distributions involved, e.g.,
    1. Perceptron
    2. Support vector machine
    3. Decision trees
    4. K-means
  In every case, we have some objective function we are trying to optimize (greedily vs. non-greedily, locally vs. globally).

  4. MAXIMUM LIKELIHOOD As we’ve seen, one probabilistic objective function is maximum likelihood.
  Setup: In the most basic scenario, we start with
    1. some set of model parameters $\theta$
    2. a set of data $\{x_1, \dots, x_n\}$
    3. a probability distribution $p(x \mid \theta)$
    4. an i.i.d. assumption, $x_i \overset{iid}{\sim} p(x \mid \theta)$
  Maximum likelihood seeks the $\theta$ that maximizes the likelihood,
  $$\theta_{ML} = \arg\max_\theta\, p(x_1, \dots, x_n \mid \theta) \overset{(a)}{=} \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta) \overset{(b)}{=} \arg\max_\theta \sum_{i=1}^n \ln p(x_i \mid \theta)$$
  (a) follows from the i.i.d. assumption. (b) follows since $f(y) > f(x) \Rightarrow \ln f(y) > \ln f(x)$.
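The equivalence in step (b) above is easy to confirm numerically: maximizing the likelihood product and maximizing the log-likelihood sum pick out the same parameter. Below is a small sketch (the Bernoulli model and grid search are illustrative choices, not from the slides):

```python
import numpy as np

x = np.array([1, 1, 0, 1, 0])        # i.i.d. Bernoulli(theta) observations
grid = np.linspace(0.01, 0.99, 99)   # candidate values of theta

# likelihood as a product over the data ...
lik = np.array([np.prod(t**x * (1 - t)**(1 - x)) for t in grid])
# ... and log-likelihood as a sum over the data
loglik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in grid])
```

Both curves peak at the same $\theta$ (here the sample mean 3/5), so working with the sum of logs loses nothing while being far better behaved numerically.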

  5. MAXIMUM LIKELIHOOD We’ve discussed maximum likelihood for a few models, e.g., least squares linear regression and the Bayes classifier. Both of these models were “nice” because we could find their respective $\theta_{ML}$ analytically by writing an equation and plugging in data to solve.
  Gaussian with unknown mean and covariance
  In the first lecture, we saw that if $x_i \overset{iid}{\sim} N(\mu, \Sigma)$, where $\theta = \{\mu, \Sigma\}$, then solving
  $$\sum_{i=1}^n \nabla_\theta \ln p(x_i \mid \theta) = 0$$
  gives the following maximum likelihood values for $\mu$ and $\Sigma$:
  $$\mu_{ML} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \Sigma_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_{ML})(x_i - \mu_{ML})^T$$
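These two closed-form estimates are a few lines of numpy. One detail worth noticing: $\Sigma_{ML}$ divides by $n$, not $n-1$, so it is the biased sample covariance (the data below is synthetic, just to exercise the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))       # n x d data matrix, rows are the x_i

mu_ml = X.mean(axis=0)              # mu_ML = (1/n) sum_i x_i

# Sigma_ML = (1/n) sum_i (x_i - mu_ML)(x_i - mu_ML)^T, note the 1/n
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)
```

`np.cov(X, rowvar=False, bias=True)` computes the same $1/n$ covariance, which makes a convenient cross-check.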

  6. COORDINATE ASCENT AND MAXIMUM LIKELIHOOD In more complicated models, we might split the parameters into groups $\theta_1, \theta_2$ and try to maximize the likelihood over both of them,
  $$\theta_{1,ML},\, \theta_{2,ML} = \arg\max_{\theta_1, \theta_2} \sum_{i=1}^n \ln p(x_i \mid \theta_1, \theta_2).$$
  Often we can solve for one group given the other, but we can’t solve for both simultaneously.
  Coordinate ascent (probabilistic version)
  We saw how K-means presented a similar situation, and that we could optimize using coordinate ascent. This technique is generalizable.
  Algorithm: For iteration $t = 1, 2, \dots$,
    1. Optimize $\theta_1^{(t)} = \arg\max_{\theta_1} \sum_{i=1}^n \ln p(x_i \mid \theta_1, \theta_2^{(t-1)})$
    2. Optimize $\theta_2^{(t)} = \arg\max_{\theta_2} \sum_{i=1}^n \ln p(x_i \mid \theta_1^{(t)}, \theta_2)$
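The two-step algorithm above can be sketched on a toy objective where each per-coordinate maximization has a closed form (the quadratic below is a stand-in for a log-likelihood, not a model from the lecture):

```python
import numpy as np

def f(t1, t2):
    # toy concave objective playing the role of sum_i ln p(x_i | theta_1, theta_2)
    return -t1**2 - t2**2 - t1 * t2 + t1 + t2

t1, t2 = 0.0, 0.0
vals = [f(t1, t2)]
for _ in range(50):
    t1 = (1 - t2) / 2    # step 1: arg max over theta_1 with theta_2 fixed
    t2 = (1 - t1) / 2    # step 2: arg max over theta_2 with theta_1 fixed
    vals.append(f(t1, t2))
```

Each sweep can only increase the objective, and the iterates converge to the joint maximizer $(1/3, 1/3)$, mirroring what the two alternating arg-max steps do in the probabilistic setting.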

  7. COORDINATE ASCENT AND MAXIMUM LIKELIHOOD There is a third, subtly different situation, where what we really want to find is
  $$\theta_{1,ML} = \arg\max_{\theta_1} \sum_{i=1}^n \ln p(x_i \mid \theta_1),$$
  except this function is “tricky” to optimize directly. However, we figure out that we can add a second variable $\theta_2$ such that
  $$\sum_{i=1}^n \ln p(x_i, \theta_2 \mid \theta_1) \qquad (\text{Function 2})$$
  is easier to work with. We’ll make this clearer later.
  ◮ Notice in this second case that $\theta_2$ is on the left side of the conditioning bar. This implies a prior on $\theta_2$ (whatever “$\theta_2$” turns out to be).
  ◮ We will next discuss a fundamental technique called the EM algorithm for finding $\theta_{1,ML}$ by using Function 2 instead.

  8. EXPECTATION-MAXIMIZATION ALGORITHM

  9. A MOTIVATING EXAMPLE Let $x_i \in \mathbb{R}^d$ be a vector with missing data. Split this vector into two parts:
    1. $x_i^o$ – observed portion (the sub-vector of $x_i$ that is measured)
    2. $x_i^m$ – missing portion (the sub-vector of $x_i$ that is still unknown)
  The missing dimensions can be different for different $x_i$.
  We assume that $x_i \overset{iid}{\sim} N(\mu, \Sigma)$, and want to solve
  $$\mu_{ML},\, \Sigma_{ML} = \arg\max_{\mu, \Sigma} \sum_{i=1}^n \ln p(x_i^o \mid \mu, \Sigma).$$
  This is tricky. However, if we knew $x_i^m$ (and therefore $x_i$), then
  $$\mu_{ML},\, \Sigma_{ML} = \arg\max_{\mu, \Sigma} \sum_{i=1}^n \ln \underbrace{p(x_i^o, x_i^m \mid \mu, \Sigma)}_{=\, p(x_i \mid \mu, \Sigma)}$$
  is very easy to optimize (we just did it on a previous slide).

  10. CONNECTING TO A MORE GENERAL SETUP We will discuss a method for optimizing $\sum_{i=1}^n \ln p(x_i^o \mid \mu, \Sigma)$ while imputing the missing values $\{x_1^m, \dots, x_n^m\}$. This is a very general technique.
  General setup
  Imagine we have two parameter sets $\theta_1, \theta_2$, where
  $$p(x \mid \theta_1) = \int p(x, \theta_2 \mid \theta_1)\, d\theta_2 \qquad (\text{marginal distribution})$$
  Example: For the previous example we can show that
  $$p(x_i^o \mid \mu, \Sigma) = \int p(x_i^o, x_i^m \mid \mu, \Sigma)\, dx_i^m = N(x_i^o \mid \mu_i^o, \Sigma_i^o),$$
  where $\mu_i^o$ and $\Sigma_i^o$ are the sub-vector/sub-matrix of $\mu$ and $\Sigma$ defined by the dimensions in $x_i^o$.
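The sub-vector/sub-matrix claim can be checked numerically: the Gaussian marginal over the observed dimensions equals the full joint integrated over the missing one. This sketch uses scipy (an added dependency, not mentioned in the slides) with an arbitrary 3-d example where dimension 1 is missing:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.integrate import quad

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

obs = [0, 2]                          # indices of the observed dimensions
mu_o = mu[obs]                        # sub-vector mu_i^o
Sigma_o = Sigma[np.ix_(obs, obs)]     # sub-matrix Sigma_i^o

x_o = np.array([0.4, -0.2])
p_marg = multivariate_normal(mu_o, Sigma_o).pdf(x_o)

# integrate the full joint density over the missing coordinate x^m
p_int, _ = quad(lambda xm: multivariate_normal(mu, Sigma).pdf([x_o[0], xm, x_o[1]]),
                -15, 15)
```

The two numbers agree, confirming that marginalizing a Gaussian just means dropping the missing rows and columns of $\mu$ and $\Sigma$.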

  11. THE EM OBJECTIVE FUNCTION We need to define a general objective function that gives us what we want:
    1. It lets us optimize the marginal $p(x \mid \theta_1)$ over $\theta_1$.
    2. It uses $p(x, \theta_2 \mid \theta_1)$ in doing so purely for computational convenience.
  The EM objective function
  Before picking it apart, we claim that this objective function is
  $$\ln p(x \mid \theta_1) = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2 + \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2$$
  Some immediate comments:
  ◮ $q(\theta_2)$ is any probability distribution (assumed continuous for now).
  ◮ We assume we know $p(\theta_2 \mid x, \theta_1)$. That is, given the data $x$ and fixed values for $\theta_1$, we can solve for the conditional posterior distribution of $\theta_2$.

  12. DERIVING THE EM OBJECTIVE FUNCTION Let’s show that this equality is actually true. Combining the two integrals,
  $$\int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2 + \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2 = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2$$
  Remember some rules of probability:
  $$p(a, b \mid c) = p(a \mid b, c)\, p(b \mid c) \;\Rightarrow\; \frac{p(a, b \mid c)}{p(a \mid b, c)} = p(b \mid c).$$
  Letting $a = \theta_2$, $b = x$ and $c = \theta_1$, we conclude
  $$\int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2 = \int q(\theta_2) \ln p(x \mid \theta_1)\, d\theta_2 = \ln p(x \mid \theta_1).$$

  13. THE EM OBJECTIVE FUNCTION The EM objective function splits our desired objective into two terms:
  $$\ln p(x \mid \theta_1) = \underbrace{\int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2}_{\text{a function only of } \theta_1 \text{, we'll call it } \mathcal{L}} + \underbrace{\int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2}_{\text{Kullback–Leibler divergence}}$$
  Some more observations about the right-hand side:
    1. The KL divergence is always $\ge 0$, and $= 0$ only when $q = p$.
    2. We are assuming that the integral in $\mathcal{L}$ can be calculated, leaving a function only of $\theta_1$ (for a particular setting of the distribution $q$).
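The decomposition above is an exact identity for any $q$, which is easy to verify when $\theta_2$ is discrete and the integrals become sums. This sketch uses a two-component Gaussian mixture (an illustrative choice; the cluster label $z$ plays the role of $\theta_2$):

```python
import numpy as np

def joint(x, z, pi, mu):
    # p(x, z | theta_1) for a 2-component unit-variance Gaussian mixture
    return pi[z] * np.exp(-0.5 * (x - mu[z])**2) / np.sqrt(2 * np.pi)

pi, mu, x = np.array([0.3, 0.7]), np.array([-1.0, 2.0]), 0.5
pj = np.array([joint(x, z, pi, mu) for z in (0, 1)])
marg = pj.sum()              # p(x | theta_1), the marginal
post = pj / marg             # p(z | x, theta_1), the conditional posterior

q = np.array([0.6, 0.4])     # any distribution over z
L  = np.sum(q * np.log(pj / q))       # first term of the EM objective
KL = np.sum(q * np.log(q / post))     # second term, KL(q || posterior)
```

For this (or any) choice of `q`, `L + KL` reproduces `np.log(marg)` exactly, and `KL` is nonnegative, matching observation 1.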

  14. BIGGER PICTURE
  Q: What does it mean to iteratively optimize $\ln p(x \mid \theta_1)$ w.r.t. $\theta_1$?
  A: One way to think about it is that we want a method for generating:
    1. a sequence of values for $\theta_1$ such that $\ln p(x \mid \theta_1^{(t)}) \ge \ln p(x \mid \theta_1^{(t-1)})$;
    2. a sequence $\theta_1^{(t)}$ that converges to a local maximum of $\ln p(x \mid \theta_1)$.
  It doesn’t matter how we generate the sequence $\theta_1^{(1)}, \theta_1^{(2)}, \theta_1^{(3)}, \dots$ We will show how EM achieves #1 and just mention that EM satisfies #2.

  15. THE EM ALGORITHM The EM objective function:
  $$\ln p(x \mid \theta_1) = \underbrace{\int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2}_{\text{define this to be } \mathcal{L}(x, \theta_1)} + \underbrace{\int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2}_{\text{Kullback–Leibler divergence}}$$
  Definition: The EM algorithm
  Given the value $\theta_1^{(t)}$, find the value $\theta_1^{(t+1)}$ as follows:
  E-step: Set $q_t(\theta_2) = p(\theta_2 \mid x, \theta_1^{(t)})$ and calculate
  $$\mathcal{L}_{q_t}(x, \theta_1) = \int q_t(\theta_2) \ln p(x, \theta_2 \mid \theta_1)\, d\theta_2 - \underbrace{\int q_t(\theta_2) \ln q_t(\theta_2)\, d\theta_2}_{\text{can ignore this term}}.$$
  M-step: Set $\theta_1^{(t+1)} = \arg\max_{\theta_1} \mathcal{L}_{q_t}(x, \theta_1)$.
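The E-step/M-step recipe can be made concrete on a two-component Gaussian mixture with known unit variances, where the hidden cluster labels play the role of $\theta_2$ and $\theta_1 = (\pi, \mu)$. This is a sketch on synthetic data (the model choice is illustrative; the lecture introduces mixtures later):

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic data from two unit-variance Gaussians
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # initial theta_1

def loglik(pi, mu):
    # ln p(x | theta_1): marginalize the hidden label out of each term
    comp = pi * np.exp(-0.5 * (x[:, None] - mu)**2) / np.sqrt(2 * np.pi)
    return np.log(comp.sum(axis=1)).sum()

history = [loglik(pi, mu)]
for _ in range(30):
    # E-step: q_t(z_i) = p(z_i | x_i, theta_1^(t)), the posterior responsibilities
    comp = pi * np.exp(-0.5 * (x[:, None] - mu)**2) / np.sqrt(2 * np.pi)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize L_{q_t} over pi and mu (closed form for this model)
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    history.append(loglik(pi, mu))
```

Tracking `history` shows the marginal log-likelihood never decreases across iterations, which is exactly the monotonicity property proved on the next slide.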

  16. PROOF OF MONOTONIC IMPROVEMENT Once we’re comfortable with the moving parts, the proof that the sequence $\theta_1^{(t)}$ monotonically improves $\ln p(x \mid \theta_1)$ just requires analysis:
  $$\begin{aligned}
  \ln p(x \mid \theta_1^{(t)}) &= \mathcal{L}_{q_t}(x, \theta_1^{(t)}) + \underbrace{KL\big(q_t(\theta_2)\,\|\, p(\theta_2 \mid x, \theta_1^{(t)})\big)}_{=\,0 \text{ by setting } q_t = p} && \leftarrow \text{E-step} \\
  &= \mathcal{L}_{q_t}(x, \theta_1^{(t)}) \\
  &\le \mathcal{L}_{q_t}(x, \theta_1^{(t+1)}) && \leftarrow \text{M-step} \\
  &\le \mathcal{L}_{q_t}(x, \theta_1^{(t+1)}) + \underbrace{KL\big(q_t(\theta_2)\,\|\, p(\theta_2 \mid x, \theta_1^{(t+1)})\big)}_{\ge\, 0 \text{ since in general } q_t \ne p} \\
  &= \ln p(x \mid \theta_1^{(t+1)})
  \end{aligned}$$

  17. ONE ITERATION OF EM Start: current setting of $\theta_1$ and $q(\theta_2)$.
  For reference:
  $$\ln p(x \mid \theta_1) = \mathcal{L} + KL, \quad \mathcal{L} = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2, \quad KL = \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2$$
  [Figure: at an arbitrary starting point, $\ln p(x \mid \theta_1)$ is split into a lower bar $\mathcal{L}$ and a gap $KL(q \| p)$ above it.]

  18. ONE ITERATION OF EM E-step: Set $q(\theta_2) = p(\theta_2 \mid x, \theta_1)$ and update $\mathcal{L}$.
  For reference: the same decomposition $\ln p(x \mid \theta_1) = \mathcal{L} + KL$; now $KL(q \| p) = 0$.
  [Figure: after the E-step the $KL$ gap shrinks to zero, so $\mathcal{L}$ rises to meet $\ln p(x \mid \theta_1)$, which itself is unchanged.]

  19. ONE ITERATION OF EM M-step: Maximize $\mathcal{L}$ w.r.t. $\theta_1$. Now $q \ne p$ again.
  For reference: the same decomposition $\ln p(x \mid \theta_1) = \mathcal{L} + KL$ at the updated $\theta_1$.
  [Figure: after the M-step both $\mathcal{L}$ and $\ln p(x \mid \theta_1)$ increase, with a new nonzero $KL(q \| p)$ gap between them.]

  20. EM FOR MISSING DATA
