Machine Learning: Chenhao Tan, University of Colorado Boulder (Lecture 5)


  1. Machine Learning: Chenhao Tan, University of Colorado Boulder. Lecture 5. Slides adapted from Jordan Boyd-Graber, Tom Mitchell, and Ziv Bar-Joseph.

  2. Quiz question
     For a test instance $(x, y)$ and a naïve Bayes classifier $\hat{P}$, which of the following statements is true?
     (A) $\sum_c \hat{P}(y = c \mid x) = 1$
     (B) $\sum_c \hat{P}(x \mid c) = 1$
     (C) $\sum_c \hat{P}(x \mid c)\,\hat{P}(c) = 1$

  3. Overview: Objective function; Gradient Descent; Stochastic Gradient Descent.

  4. Reminder: Logistic Regression
     $$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left[\beta_0 + \sum_i \beta_i X_i\right]} \quad (1)$$
     $$P(Y = 1 \mid X) = \frac{\exp\left[\beta_0 + \sum_i \beta_i X_i\right]}{1 + \exp\left[\beta_0 + \sum_i \beta_i X_i\right]} \quad (2)$$
     • Discriminative prediction: $P(y \mid x)$
     • Classification uses: sentiment analysis, spam detection
     • What we didn't talk about is how to learn $\beta$ from data
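A minimal NumPy sketch of these two probabilities (not from the slides; the names predict_proba, beta0, beta, and x are illustrative placeholders):

    import numpy as np

    def predict_proba(beta0, beta, x):
        """Return (P(Y=0|x), P(Y=1|x)) for a single feature vector x."""
        z = beta0 + np.dot(beta, x)            # linear score beta_0 + sum_i beta_i x_i
        p1 = np.exp(z) / (1.0 + np.exp(z))     # equation (2)
        p0 = 1.0 / (1.0 + np.exp(z))           # equation (1)
        return p0, p1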

  5. Outline: Objective function; Gradient Descent; Stochastic Gradient Descent.

  6. Logistic Regression: Objective Function
     Maximize the likelihood:
     $$\text{Obj} \equiv \log P(Y \mid X, \beta) = \sum_j \log P(y^{(j)} \mid x^{(j)}, \beta)$$
     $$= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \log\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]$$

  7. Logistic Regression: Objective Function
     Minimize the negative log likelihood (the loss):
     $$\mathcal{L} \equiv -\log P(Y \mid X, \beta) = -\sum_j \log P(y^{(j)} \mid x^{(j)}, \beta)$$
     $$= \sum_j \left[ -y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) + \log\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]$$
     The training data $\{(x, y)\}$ are fixed, so the objective is a function of $\beta$. What values of $\beta$ give a good value?
     $$\beta^* = \arg\min_\beta \mathcal{L}(\beta)$$
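As a sketch, the same loss in NumPy, assuming a design matrix X whose rows are the x^{(j)}, a 0/1 label vector y, and parameters beta0, beta (all names are placeholders):

    import numpy as np

    def neg_log_likelihood(beta0, beta, X, y):
        # z_j = beta_0 + sum_i beta_i x_i^{(j)} for every training example j
        z = beta0 + X @ beta
        # L = sum_j [ -y^{(j)} * z_j + log(1 + exp(z_j)) ]
        return np.sum(-y * z + np.log1p(np.exp(z)))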

  8. Convexity
     $\mathcal{L}(\beta)$ is convex for logistic regression. Proof:
     • The logistic loss $-yv + \log(1 + \exp(v))$ is convex in $v$.
     • Composition with a linear function maintains convexity.
     • A sum of convex functions is convex.
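A quick numerical sanity check (illustrative, not part of the proof): the second derivative of $-yv + \log(1 + \exp(v))$ with respect to $v$ is $\sigma(v)(1 - \sigma(v))$, which is nonnegative everywhere, consistent with convexity in $v$.

    import numpy as np

    v = np.linspace(-10.0, 10.0, 1001)
    sigma = 1.0 / (1.0 + np.exp(-v))
    second_derivative = sigma * (1.0 - sigma)   # d^2/dv^2 [ -y*v + log(1 + exp(v)) ]
    assert np.all(second_derivative >= 0.0)     # nonnegative on the whole grid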

  9. Outline: Objective function; Gradient Descent; Stochastic Gradient Descent.

  10. Convexity
     • Convex function
     • It doesn't matter where you start, as long as you go down along the gradient.
     • Gradient!

  11. Convexity
     • It would have been much harder if this were not convex.

  12. Gradient Descent (non-convex)
     Goal: Optimize the loss function with respect to the variables $\beta$.
     [Figure: a non-convex objective plotted against a parameter; successive gradient steps (0, 1, 2, 3) move downhill, with an unexplored region labeled "Undiscovered Country".]

  13. Gradient Descent (non-convex)
     Goal: Optimize the loss function with respect to the variables $\beta$:
     $$\beta_j^{\,l+1} = \beta_j^{\,l} - \eta \frac{\partial \mathcal{L}}{\partial \beta_j}$$
     Luckily, (vanilla) logistic regression is convex.
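The update rule, written as a plain loop; the function gradient_descent, the step size eta, and the fixed iteration count are illustrative choices rather than anything prescribed by the slides:

    import numpy as np

    def gradient_descent(grad, beta_init, eta=0.1, n_steps=1000):
        # Repeatedly apply beta_j <- beta_j - eta * dL/dbeta_j.
        beta = np.asarray(beta_init, dtype=float)
        for _ in range(n_steps):
            beta = beta - eta * grad(beta)
        return beta

    # Toy example: L(beta) = (beta - 3)^2 has gradient 2 * (beta - 3),
    # so the iterates should approach 3.
    print(gradient_descent(lambda b: 2 * (b - 3), beta_init=[0.0]))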

  14. Gradient for Logistic Regression
     To ease notation, define
     $$\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \quad (3)$$
     Our objective function is
     $$\mathcal{L} = \sum_i -\log p(y_i \mid x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} -\log \pi_i & \text{if } y_i = 1 \\ -\log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \quad (4)$$

  15. Taking the Derivative
     Apply the chain rule:
     $$\frac{\partial \mathcal{L}(\beta)}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\beta)}{\partial \beta_j} = \sum_i \begin{cases} -\frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases} \quad (5)$$
     If we plug in the derivative
     $$\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i)\, x_{i,j}, \quad (6)$$
     we can merge the two cases:
     $$\frac{\partial \mathcal{L}_i}{\partial \beta_j} = -(y_i - \pi_i)\, x_{i,j}. \quad (7)$$
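A sketch of equations (3), (4), and the merged gradient (7), plus a finite-difference spot check on made-up data (the intercept $\beta_0$ is omitted for brevity; all names and values are illustrative):

    import numpy as np

    def pi(beta, X):
        z = X @ beta
        return np.exp(z) / (1.0 + np.exp(z))                       # equation (3)

    def loss(beta, X, y):
        p = pi(beta, X)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))    # equation (4)

    def grad(beta, X, y):
        return X.T @ (pi(beta, X) - y)                             # component j: sum_i -(y_i - pi_i) x_{i,j}

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
    beta = rng.normal(size=3)
    eps = 1e-6
    bump = np.array([eps, 0.0, 0.0])
    numeric = (loss(beta + bump, X, y) - loss(beta - bump, X, y)) / (2 * eps)
    print(np.isclose(numeric, grad(beta, X, y)[0]))                # True: (7) matches the numerical derivative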

  16. Gradient for Logistic Regression
     Gradient:
     $$\nabla_\beta \mathcal{L}(\beta) = \left[ \frac{\partial \mathcal{L}(\beta)}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\beta)}{\partial \beta_n} \right] \quad (8)$$
     Update:
     $$\Delta\beta \equiv \eta \nabla_\beta \mathcal{L}(\beta) \quad (9)$$
     $$\beta_i' \leftarrow \beta_i - \eta \frac{\partial \mathcal{L}(\beta)}{\partial \beta_i} \quad (10)$$
     Why are we subtracting? What would we do if we wanted to do ascent?
     $\eta$: step size, must be greater than zero.
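Putting equations (8)-(10) together, a minimal unregularized training loop under the same illustrative conventions as the sketches above (no intercept; eta and the iteration count are arbitrary choices):

    import numpy as np

    def fit_logistic_regression(X, y, eta=0.1, n_iters=2000):
        beta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # pi_i for every training example
            beta = beta - eta * (X.T @ (p - y))     # beta_i <- beta_i - eta * dL/dbeta_i
        return beta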

  17. Choosing Step Size
     [Figure: objective plotted against a parameter, repeated across several slides to illustrate different step-size choices.]
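To see concretely why the step size matters, a tiny illustrative experiment (not from the slides) on $L(\beta) = \beta^2$, whose gradient is $2\beta$: a modest eta shrinks the parameter toward the minimum at 0, while a too-large eta makes the iterates overshoot and grow.

    for eta in (0.1, 1.1):                  # modest vs. too-large step size
        beta = 1.0
        for _ in range(20):
            beta = beta - eta * 2 * beta    # gradient of beta^2 is 2 * beta
        print(eta, beta)                    # 0.1 -> close to 0; 1.1 -> magnitude keeps growing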

  18. Remaining issues
     • When to stop?
     • What if $\beta$ keeps getting bigger?

  19. Regularized Conditional Log Likelihood
     Unregularized:
     $$\beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \right] \quad (11)$$
     Regularized:
     $$\beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) + \frac{1}{2}\mu \sum_i \beta_i^2 \right] \quad (12)$$
     $\mu$ is a "regularization" parameter that trades off between likelihood and having small parameters.
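Relative to the earlier unregularized sketches, the L2 term in equation (12) only adds 0.5 * mu * sum(beta**2) to the loss and mu * beta to the gradient; the default value of mu below is a placeholder:

    import numpy as np

    def regularized_loss(beta, X, y, mu=1.0):
        z = X @ beta
        nll = np.sum(-y * z + np.log1p(np.exp(z)))     # negative log likelihood
        return nll + 0.5 * mu * np.sum(beta ** 2)      # + (1/2) * mu * sum_i beta_i^2

    def regularized_grad(beta, X, y, mu=1.0):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        return X.T @ (p - y) + mu * beta               # gradient gains a mu * beta term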
