

  1. COMS 4721: Machine Learning for Data Science, Lecture 9, 2/16/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University.

  2. LOGISTIC REGRESSION

  3. BINARY CLASSIFICATION
Linear classifiers
Given: Data $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$.
A linear classifier takes a vector $w \in \mathbb{R}^d$ and a scalar $w_0 \in \mathbb{R}$ and predicts
$$y_i = f(x_i; w, w_0) = \mathrm{sign}(x_i^T w + w_0).$$
We discussed two methods last time:
◮ Least squares: sensitive to outliers
◮ Perceptron: convergence issues, assumes linear separability
Can we combine the separating hyperplane idea with probability to fix this?
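As a quick illustration of the prediction rule above, here is a minimal sketch in Python/NumPy (not part of the original slides); the data points, $w$, and $w_0$ are made-up values.

```python
import numpy as np

def predict(X, w, w0):
    """Linear classifier: y_hat = sign(x^T w + w0) for each row x of X."""
    return np.sign(X @ w + w0)

# Hypothetical 2-D data and weights, for illustration only.
X = np.array([[1.0, 2.0], [-1.5, 0.5], [3.0, -1.0]])
w = np.array([0.8, -0.3])
w0 = 0.1
print(predict(X, w, w0))   # expected: [ 1. -1.  1.]
```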

  4. BAYES LINEAR CLASSIFICATION
Linear discriminant analysis
We saw an example of a linear classification rule using a Bayes classifier. For the model $y \sim \mathrm{Bern}(\pi)$ and $x \mid y \sim N(\mu_y, \Sigma)$, declare $y = 1$ given $x$ if
$$\ln \frac{p(x \mid y=1)\, p(y=1)}{p(x \mid y=0)\, p(y=0)} > 0.$$
In this case, the log odds is equal to
$$\ln \frac{p(x \mid y=1)\, p(y=1)}{p(x \mid y=0)\, p(y=0)} = \underbrace{\ln \frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a constant } w_0} + \; x^T \underbrace{\Sigma^{-1} (\mu_1 - \mu_0)}_{\text{a vector } w}.$$
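The two labeled quantities translate directly into code. Below is a minimal sketch, assuming hypothetical class means mu1 and mu0, a shared covariance Sigma, and class priors pi1 and pi0, that computes the LDA weight vector $w$ and offset $w_0$.

```python
import numpy as np

def lda_weights(mu1, mu0, Sigma, pi1, pi0):
    """Shared-covariance Gaussian class conditionals give a linear rule:
    declare y = 1 when x^T w + w0 > 0."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                                            # the vector w
    w0 = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0)   # the constant w0
    return w, w0

# Hypothetical class statistics.
mu1, mu0 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
w, w0 = lda_weights(mu1, mu0, Sigma, pi1=0.5, pi0=0.5)
print(w, w0)
```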

  5. LOG ODDS AND BAYES CLASSIFICATION
Original formulation
Recall that originally we wanted to declare $y = 1$ given $x$ if
$$\ln \frac{p(y=1 \mid x)}{p(y=0 \mid x)} > 0.$$
We didn't have a way to define $p(y \mid x)$ directly, so we used Bayes rule:
◮ Use $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$ and let the $p(x)$ cancel in the fraction
◮ Define $p(y)$ to be a Bernoulli distribution (coin flip distribution)
◮ Define $p(x \mid y)$ however we want (e.g., a single Gaussian)
Now, we want to directly define $p(y \mid x)$. We'll use the log odds to do this.

  6. LOG ODDS AND BAYES CLASSIFICATION
Log odds and hyperplanes
[Figure: a separating hyperplane H in the $(x_1, x_2)$ plane with normal vector $w$ and offset $-w_0 / \|w\|_2$.]
Classifying $x$ based on the log odds
$$L = \ln \frac{p(y=+1 \mid x)}{p(y=-1 \mid x)},$$
we notice that
1. $L \gg 0$: more confident $y = +1$
2. $L \ll 0$: more confident $y = -1$
3. $L = 0$: can go either way
The linear function $x^T w + w_0$ captures these three objectives:
◮ The distance of $x$ to a hyperplane $H$ defined by $(w, w_0)$ is $\left| \frac{x^T w}{\|w\|_2} + \frac{w_0}{\|w\|_2} \right|$.
◮ The sign of the function captures which side of $H$ the point $x$ is on.
◮ As $x$ moves away from / towards $H$, we become more / less confident.
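A small sketch of the signed distance to H described in the first bullet, with arbitrary illustrative values for $w$ and $w_0$:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed Euclidean distance from x to the hyperplane {x : x^T w + w0 = 0}.
    The sign tells us which side of H we are on; the magnitude grows as x moves away."""
    return (x @ w + w0) / np.linalg.norm(w)

w, w0 = np.array([2.0, -1.0]), 0.5
for x in [np.array([1.0, 1.0]), np.array([3.0, 0.0]), np.array([-2.0, 2.0])]:
    print(x, signed_distance(x, w, w0))
```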

  7. LOG ODDS AND HYPERPLANES
Logistic link function
We can directly plug in the hyperplane representation for the log odds:
$$\ln \frac{p(y=+1 \mid x)}{p(y=-1 \mid x)} = x^T w + w_0$$
Question: What is different from the previous Bayes classifier?
Answer: Previously there was a formula for calculating $w$ and $w_0$ based on the prior model and the data $x$. Now, we put no restrictions on these values.
Setting $p(y=-1 \mid x) = 1 - p(y=+1 \mid x)$, solve for $p(y=+1 \mid x)$ to find
$$p(y=+1 \mid x) = \frac{\exp\{x^T w + w_0\}}{1 + \exp\{x^T w + w_0\}} = \sigma(x^T w + w_0).$$
◮ This is called the sigmoid function.
◮ We have chosen $x^T w + w_0$ as the link function for the log odds.
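A minimal sketch of the sigmoid and of $p(y=+1 \mid x)$; the values of $w$ and $w_0$ are placeholders.

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid: sigma(t) = e^t / (1 + e^t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + np.exp(-t))

def p_y_plus1(x, w, w0):
    """p(y = +1 | x) = sigma(x^T w + w0)."""
    return sigmoid(x @ w + w0)

w, w0 = np.array([0.8, -0.3]), 0.1
print(p_y_plus1(np.array([1.0, 2.0]), w, w0))   # about 0.574: slightly on the +1 side
```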

  8. LOGISTIC SIGMOID FUNCTION
[Figure: plot of the sigmoid over roughly $[-5, 5]$, ranging from 0 to 1.]
◮ Red line: Sigmoid function $\sigma(x^T w + w_0)$, which maps $x$ to $p(y=+1 \mid x)$.
◮ The function $\sigma(\cdot)$ captures our desire to be more confident as we move away from the separating hyperplane, defined by the $x$-axis.
◮ (Blue dashed line: Not discussed.)

  9. LOGISTIC REGRESSION
As with regression, absorb the offset: $w \leftarrow \begin{bmatrix} w_0 \\ w \end{bmatrix}$ and $x \leftarrow \begin{bmatrix} 1 \\ x \end{bmatrix}$.
Definition
Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a set of binary labeled data with $y \in \{-1, +1\}$. Logistic regression models each $y_i$ as independently generated, with
$$P(y_i = +1 \mid x_i, w) = \sigma(x_i^T w), \qquad \sigma(x_i; w) = \frac{e^{x_i^T w}}{1 + e^{x_i^T w}}.$$
Discriminative vs generative classifiers
◮ This is a discriminative classifier because $x$ is not directly modeled.
◮ Bayes classifiers are known as generative because $x$ is modeled.
Discriminative: $p(y \mid x)$. Generative: $p(x \mid y)\, p(y)$.
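In code, absorbing the offset simply means prepending a 1 to every $x_i$; a minimal sketch with made-up data:

```python
import numpy as np

# Hypothetical data: three points in R^2 with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [-1.5, 0.5], [3.0, -1.0]])
y = np.array([1, -1, 1])

# Absorb the offset: prepend a 1 to every x_i, so w[0] plays the role of w0.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (n, d+1)
print(X_aug)
```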

  10. LOGISTIC REGRESSION LIKELIHOOD
Data likelihood
Define $\sigma_i(w) = \sigma(x_i^T w)$. The joint likelihood of $y_1, \ldots, y_n$ is
$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w) = \prod_{i=1}^n p(y_i \mid x_i, w) = \prod_{i=1}^n \sigma_i(w)^{\mathbb{1}(y_i = +1)} \big(1 - \sigma_i(w)\big)^{\mathbb{1}(y_i = -1)}$$
◮ Notice that each $x_i$ modifies the probability of a '+1' for its respective $y_i$.
◮ Predicting new data is the same as before:
◮ If $x^T w > 0$, then $\sigma(x^T w) > 1/2$ and we predict $y = +1$, and vice versa.
◮ We now also get a confidence in our prediction via the probability $\sigma(x^T w)$.
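A short sketch of the likelihood in the indicator-exponent form above, assuming $X$ already carries the leading column of ones and using a hypothetical weight vector $w$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def likelihood_indicator_form(X, y, w):
    """prod_i sigma_i(w)^{1(y_i=+1)} * (1 - sigma_i(w))^{1(y_i=-1)}."""
    s = sigmoid(X @ w)                       # sigma_i(w) = P(y_i = +1 | x_i, w)
    return np.prod(np.where(y == 1, s, 1.0 - s))

X = np.array([[1.0, 1.0, 2.0], [1.0, -1.5, 0.5], [1.0, 3.0, -1.0]])
y = np.array([1, -1, 1])
w = np.array([0.1, 0.8, -0.3])
print(likelihood_indicator_form(X, y, w))
```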

  11. LOGISTIC REGRESSION AND MAXIMUM LIKELIHOOD
More notation changes
Use the following fact to condense the notation:
$$\sigma_i(w)^{\mathbb{1}(y_i = +1)} \big(1 - \sigma_i(w)\big)^{\mathbb{1}(y_i = -1)} = \frac{e^{y_i x_i^T w}}{1 + e^{y_i x_i^T w}} \equiv \sigma_i(y_i \cdot w), \qquad \text{where } \sigma_i(w) = \frac{e^{x_i^T w}}{1 + e^{x_i^T w}}.$$
Therefore, the data likelihood can be written compactly as
$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w) = \prod_{i=1}^n \sigma_i(y_i \cdot w).$$
We want to maximize this over $w$.
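The condensed form is one line of code; with the same hypothetical $X$, $y$, and $w$ as in the previous sketch it should return exactly the same value.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def likelihood_condensed(X, y, w):
    """prod_i sigma(y_i * x_i^T w), i.e. prod_i sigma_i(y_i . w)."""
    return np.prod(sigmoid(y * (X @ w)))

X = np.array([[1.0, 1.0, 2.0], [1.0, -1.5, 0.5], [1.0, 3.0, -1.0]])
y = np.array([1, -1, 1])
w = np.array([0.1, 0.8, -0.3])
print(likelihood_condensed(X, y, w))   # matches the indicator-form likelihood
```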

  12. LOGISTIC REGRESSION AND MAXIMUM LIKELIHOOD
Maximum likelihood
The maximum likelihood solution for $w$ can be written
$$w_{\mathrm{ML}} = \arg\max_w \sum_{i=1}^n \ln \sigma_i(y_i \cdot w) = \arg\max_w \mathcal{L}.$$
As with the Perceptron, we can't directly set $\nabla_w \mathcal{L} = 0$, so we need an iterative algorithm. Since we want to maximize $\mathcal{L}$, at step $t$ we can update
$$w^{(t+1)} = w^{(t)} + \eta \nabla_w \mathcal{L}, \qquad \nabla_w \mathcal{L} = \sum_{i=1}^n \big(1 - \sigma_i(y_i \cdot w)\big)\, y_i x_i.$$
We will see that this results in an algorithm similar to the Perceptron.

  13. LOGISTIC REGRESSION ALGORITHM (STEEPEST ASCENT)
Input: Training data $(x_1, y_1), \ldots, (x_n, y_n)$ and step size $\eta > 0$
1. Set $w^{(1)} = \vec{0}$
2. For iteration $t = 1, 2, \ldots$ do
 • Update $w^{(t+1)} = w^{(t)} + \eta \sum_{i=1}^n \big(1 - \sigma_i(y_i \cdot w^{(t)})\big)\, y_i x_i$
Perceptron: Search for a misclassified $(x_i, y_i)$, update $w^{(t+1)} = w^{(t)} + \eta\, y_i x_i$.
Logistic regression: Something similar, except we sum over all data.
◮ Recall that $\sigma_i(y_i \cdot w)$ picks out the probability the model gives to the observed $y_i$.
◮ Therefore $1 - \sigma_i(y_i \cdot w)$ is the probability the model assigns to the wrong value.
◮ Perceptron is "all-or-nothing": a point is either correctly or incorrectly classified.
◮ Logistic regression has a probabilistic "fudge factor."
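A minimal sketch of the steepest ascent algorithm above. The toy data, step size, and iteration count are arbitrary choices for illustration; $X$ is assumed to carry the leading column of ones.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_ml(X, y, eta=0.1, iters=1000):
    """Steepest ascent on L(w) = sum_i ln sigma(y_i x_i^T w)."""
    w = np.zeros(X.shape[1])                               # step 1: w^(1) = 0
    for _ in range(iters):                                 # step 2: iterate
        grad = X.T @ ((1.0 - sigmoid(y * (X @ w))) * y)    # sum_i (1 - sigma_i(y_i . w)) y_i x_i
        w = w + eta * grad
    return w

# Hypothetical, roughly separable toy data.
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(+1.0, 1.0, (20, 2)), rng.normal(-1.0, 1.0, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
X = np.hstack([np.ones((40, 1)), X_raw])
w_ml = logistic_regression_ml(X, y)
print("training accuracy:", np.mean(np.sign(X @ w_ml) == y))
```

On perfectly separable data the norm of $w$ keeps growing as the iterations continue, which is exactly the problem the next slide addresses.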

  14. BAYESIAN LOGISTIC REGRESSION
Problem: If a hyperplane can separate all training data, then $\|w_{\mathrm{ML}}\|_2 \to \infty$. This drives $\sigma_i(y_i \cdot w) \to 1$ for each $(x_i, y_i)$. Even for nearly separable data, it might get a few points very wrong in order to be more confident about the rest. This is a case of "over-fitting."
A solution: Regularize $w$ with $\lambda w^T w$:
$$w_{\mathrm{MAP}} = \arg\max_w \sum_{i=1}^n \ln \sigma_i(y_i \cdot w) - \lambda w^T w$$
We've seen how this corresponds to a Gaussian prior distribution on $w$.
[Figure: a nearly separable two-class data set with the learned separating hyperplane.]
How about the posterior $p(w \mid x, y)$?
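A sketch of the MAP version, obtained by adding the gradient of the penalty term, $-2\lambda w$, to the update in the earlier sketch. The value of lambda is a placeholder, and the exact constant in front of the penalty depends on how the Gaussian prior is parameterized.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_map(X, y, lam=1.0, eta=0.1, iters=1000):
    """Steepest ascent on sum_i ln sigma(y_i x_i^T w) - lam * w^T w."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ ((1.0 - sigmoid(y * (X @ w))) * y) - 2.0 * lam * w
        w = w + eta * grad
    return w

# Usage (with the same X, y as in the earlier sketch): w_map = logistic_regression_map(X, y, lam=1.0)
```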

  15. LAPLACE APPROXIMATION

  16. BAYESIAN LOGISTIC REGRESSION
Posterior calculation
Define the prior distribution on $w$ to be $w \sim N(0, \lambda^{-1} I)$. The posterior is
$$p(w \mid x, y) = \frac{p(w) \prod_{i=1}^n \sigma_i(y_i \cdot w)}{\int p(w) \prod_{i=1}^n \sigma_i(y_i \cdot w)\, dw}$$
This is not a "standard" distribution and we can't calculate the denominator. Therefore we can't actually say what $p(w \mid x, y)$ is.
Can we approximate $p(w \mid x, y)$?

  17. LAPLACE APPROXIMATION
One strategy
Pick a distribution to approximate $p(w \mid x, y)$. We will say
$$p(w \mid x, y) \approx \mathrm{Normal}(\mu, \Sigma).$$
Now we need a method for setting $\mu$ and $\Sigma$.
Laplace approximations
Using a condensed notation, notice from Bayes rule that
$$p(w \mid x, y) = \frac{e^{\ln p(y, w \mid x)}}{\int e^{\ln p(y, w \mid x)}\, dw}.$$
We will approximate $\ln p(y, w \mid x)$ in the numerator and denominator.

  18. LAPLACE APPROXIMATION
Let's define $f(w) = \ln p(y, w \mid x)$.
Taylor expansions
We can approximate $f(w)$ with a second order Taylor expansion. Recall that $w \in \mathbb{R}^{d+1}$. For any point $z \in \mathbb{R}^{d+1}$,
$$f(w) \approx f(z) + (w - z)^T \nabla f(z) + \frac{1}{2} (w - z)^T \big( \nabla^2 f(z) \big) (w - z).$$
The notation $\nabla f(z)$ is short for $\nabla_w f(w) \big|_{z}$, and similarly for the matrix of second derivatives. We just need to pick $z$. The Laplace approximation defines $z = w_{\mathrm{MAP}}$.

  19. LAPLACE APPROXIMATION (SOLVING)
Recall $f(w) = \ln p(y, w \mid x)$ and $z = w_{\mathrm{MAP}}$. From Bayes rule and the Laplace approximation we now have
$$p(w \mid x, y) = \frac{e^{f(w)}}{\int e^{f(w)}\, dw} \approx \frac{e^{f(z) + (w-z)^T \nabla f(z) + \frac{1}{2}(w-z)^T (\nabla^2 f(z))(w-z)}}{\int e^{f(z) + (w-z)^T \nabla f(z) + \frac{1}{2}(w-z)^T (\nabla^2 f(z))(w-z)}\, dw}$$
This can be simplified in two ways:
1. The term $e^{f(w_{\mathrm{MAP}})}$ in the numerator and denominator can be viewed as a multiplicative constant since it doesn't vary in $w$. It therefore cancels.
2. By definition of how we find $w_{\mathrm{MAP}}$, the vector $\nabla_w \ln p(y, w \mid x) \big|_{w_{\mathrm{MAP}}} = 0$, so the linear term vanishes.
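After these two simplifications, only the quadratic term remains, and the standard Laplace result is a Gaussian approximation with mean $w_{\mathrm{MAP}}$ and covariance $(-\nabla^2 f(w_{\mathrm{MAP}}))^{-1}$. Below is a sketch of that computation, assuming the prior $w \sim N(0, \lambda^{-1} I)$ from slide 16 and a $w_{\mathrm{MAP}}$ already found (e.g. with the MAP sketch above); as before, the exact precision term depends on the chosen prior parameterization.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def laplace_approximation(X, y, w_map, lam=1.0):
    """Laplace approximation p(w | x, y) ~= N(mu, Sigma) with mu = w_MAP and
    Sigma = (-grad^2 f(w_MAP))^{-1}, where f(w) = ln p(y, w | x) under the
    prior w ~ N(0, lam^{-1} I)."""
    s = sigmoid(y * (X @ w_map))                     # sigma_i(y_i . w_MAP)
    # Negative Hessian of f at w_MAP: sum_i s_i (1 - s_i) x_i x_i^T + lam * I
    neg_hess = (X * (s * (1.0 - s))[:, None]).T @ X + lam * np.eye(X.shape[1])
    return w_map, np.linalg.inv(neg_hess)

# Usage (with X, y, and a precomputed w_map): mu, Sigma = laplace_approximation(X, y, w_map, lam=1.0)
```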
