

  1. Learning From Data, Lecture 9: Logistic Regression and Gradient Descent. Topics: Logistic Regression; Gradient Descent. M. Magdon-Ismail, CSCI 4100/6100.

  2. Recap: Linear Classification and Regression. The linear signal: $s = w^t x$. Good features are important: before looking at the data, we can reason from our knowledge of the problem that symmetry ($x_1$) and intensity ($x_2$) should be good features. Algorithms: Linear classification, where the pocket algorithm can tolerate errors and is simple and efficient; Linear regression, with single-step learning $w = X^\dagger y = (X^t X)^{-1} X^t y$, a very efficient $O(Nd^2)$ exact algorithm.
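
To make the recalled regression formula concrete, here is a minimal NumPy sketch of the single-step solution $w = X^\dagger y$; the toy data matrix and targets are placeholders, not from the lecture.

```python
import numpy as np

def linear_regression(X, y):
    """Single-step linear regression: w = X^dagger y = (X^t X)^(-1) X^t y."""
    # np.linalg.pinv computes the pseudoinverse X^dagger (stable even if X^t X is singular)
    return np.linalg.pinv(X) @ y

# Hypothetical toy data: N examples, d features, plus a leading bias column of ones
rng = np.random.default_rng(0)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
w = linear_regression(X, y)   # weight vector of length d + 1
```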

  3. Predicting a Probability. Will someone have a heart attack over the next year? Example inputs: age 62 years, gender male, blood sugar 120 mg/dL, HDL 50, LDL 120, mass 190 lbs, height 5'10'', ... Classification gives a yes/no answer; logistic regression gives the likelihood of a heart attack, $y \in [0, 1]$. The logistic regression hypothesis is $h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^t x)$.

  4. Predicting a Probability (continued). What is $\theta$? The logistic function: $\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$, which rises smoothly from 0 to 1 as $s$ goes from $-\infty$ to $+\infty$. It satisfies $\theta(-s) = \frac{e^{-s}}{1 + e^{-s}} = \frac{1}{1 + e^{s}} = 1 - \theta(s)$. As before, $h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^t x)$, with output $y \in [0, 1]$.
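
A small sketch of the logistic function and the hypothesis $h(x) = \theta(w^t x)$ defined on this slide, including a check of the identity $\theta(-s) = 1 - \theta(s)$; written in NumPy as an illustration, not code from the lecture.

```python
import numpy as np

def theta(s):
    """Logistic function from the slide: theta(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

def h(x, w):
    """Logistic regression hypothesis h(x) = theta(w^t x), an estimate of P[y = +1 | x]."""
    return theta(np.dot(w, x))

# Check the identity theta(-s) = 1 - theta(s) on a grid of points
s = np.linspace(-5.0, 5.0, 11)
assert np.allclose(theta(-s), 1.0 - theta(s))
```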

  5. The Data is Still Binary, $\pm 1$. $\mathcal{D} = (x_1, y_1 = \pm 1), \cdots, (x_N, y_N = \pm 1)$, where $x_n$ is a person's health information and $y_n = \pm 1$ records whether or not they had a heart attack. We cannot measure a probability; we can only observe the occurrence of an event and try to infer a probability.

  6. The Target Function is Inherently Noisy. $f(x) = P[y = +1 \mid x]$. The data is generated from a noisy target: $P(y \mid x) = f(x)$ for $y = +1$, and $P(y \mid x) = 1 - f(x)$ for $y = -1$.

  7. What Makes an $h$ Good? 'Fitting' the data means finding a good $h$: $h$ is good if $h(x_n) \approx 1$ whenever $y_n = +1$ and $h(x_n) \approx 0$ whenever $y_n = -1$. A simple error measure that captures this is $E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N}\left(h(x_n) - \frac{1}{2}(1 + y_n)\right)^2$, but it is not very convenient (hard to minimize).
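
For concreteness, a sketch of this simple squared error measure, which maps the labels $y_n = \pm 1$ to targets 1/0 via $\frac{1}{2}(1 + y_n)$; the arguments X (data matrix with a bias column) and y (label vector) are assumed placeholders.

```python
import numpy as np

def squared_error(w, X, y):
    """E_in(h) = (1/N) * sum_n (h(x_n) - (1 + y_n)/2)^2, with h(x) = theta(w^t x) and y_n in {-1, +1}."""
    preds = 1.0 / (1.0 + np.exp(-(X @ w)))   # h(x_n) = theta(w^t x_n)
    targets = (1.0 + y) / 2.0                # map labels: +1 -> 1, -1 -> 0
    return np.mean((preds - targets) ** 2)
```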

  8. The Cross Entropy Error Measure. $E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N}\ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$. It looks complicated and ugly ($\ln$, $e^{(\cdot)}$, ...), but it is based on an intuitive probabilistic interpretation of $h$, and it is very convenient and mathematically friendly ('easy' to minimize). Verify: $y_n = +1$ encourages $w^t x_n \gg 0$, so $\theta(w^t x_n) \approx 1$; $y_n = -1$ encourages $w^t x_n \ll 0$, so $\theta(w^t x_n) \approx 0$.
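
A minimal sketch of the cross-entropy in-sample error as written on this slide, assuming X is an N x (d+1) data matrix (with a leading column of ones) and y a vector of +1/-1 labels.

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^t x_n))."""
    signals = y * (X @ w)                  # y_n * w^t x_n for every example
    # np.logaddexp(0, -s) = ln(1 + e^{-s}), computed without overflow for very negative s
    return np.mean(np.logaddexp(0.0, -signals))
```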

  9. The Probabilistic Interpretation. Suppose that $h(x) = \theta(w^t x)$ closely captures $P[+1 \mid x]$: $P(y \mid x) = \theta(w^t x)$ for $y = +1$, and $P(y \mid x) = 1 - \theta(w^t x)$ for $y = -1$.

  10. The Probabilistic Interpretation (continued). Using $1 - \theta(s) = \theta(-s)$: if $h(x) = \theta(w^t x)$ closely captures $P[+1 \mid x]$, then $P(y \mid x) = \theta(w^t x)$ for $y = +1$, and $P(y \mid x) = \theta(-w^t x)$ for $y = -1$.

  11. The Probabilistic Interpretation (continued). The two cases combine into one equation: $P(y \mid x) = \theta(y \cdot w^t x)$.
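
A quick numerical check, for illustration only, that the compact form $\theta(y \cdot w^t x)$ reproduces the two-case definition from the previous slides; the weights and input below are arbitrary placeholders.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.3, -1.2, 0.5])   # arbitrary weights (placeholder)
x = np.array([1.0, 0.7, -0.4])   # arbitrary input (placeholder)
s = w @ x

# y = +1: theta(y * w^t x) equals theta(w^t x)
assert np.isclose(theta(+1 * s), theta(s))
# y = -1: theta(y * w^t x) equals 1 - theta(w^t x) = theta(-w^t x)
assert np.isclose(theta(-1 * s), 1.0 - theta(s))
```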

  12. The Likelihood. $P(y \mid x) = \theta(y \cdot w^t x)$. Recall that $(x_1, y_1), \ldots, (x_N, y_N)$ are independently generated. Likelihood: the probability of getting the $y_1, \ldots, y_N$ in $\mathcal{D}$ from the corresponding $x_1, \ldots, x_N$ is $P(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \prod_{n=1}^{N} P(y_n \mid x_n)$. The likelihood measures the probability that the data would be generated if $f$ were $h$.

  13. Maximizing the Likelihood (why? because the most plausible weights are those that make the observed data most probable).
      $\max_w \prod_{n=1}^{N} P(y_n \mid x_n)$
      $\iff \max_w \ln \prod_{n=1}^{N} P(y_n \mid x_n)$
      $\equiv \max_w \sum_{n=1}^{N} \ln P(y_n \mid x_n)$
      $\iff \min_w -\frac{1}{N}\sum_{n=1}^{N} \ln P(y_n \mid x_n)$
      $\equiv \min_w \frac{1}{N}\sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)}$
      $\equiv \min_w \frac{1}{N}\sum_{n=1}^{N} \ln \frac{1}{\theta(y_n \cdot w^t x_n)}$  (here we specialize to our "model")
      $\equiv \min_w \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$
      That is, maximizing the likelihood is equivalent to minimizing $E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N}\ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$.
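
A numerical sanity check of the final step of this derivation: for any $w$, the average negative log-likelihood $-\frac{1}{N}\sum_n \ln\theta(y_n \cdot w^t x_n)$ equals the cross-entropy error $\frac{1}{N}\sum_n \ln(1 + e^{-y_n \cdot w^t x_n})$; the random data below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # placeholder data
y = rng.choice([-1.0, 1.0], size=N)                          # placeholder labels
w = rng.normal(size=d + 1)                                   # arbitrary weights

theta = lambda s: 1.0 / (1.0 + np.exp(-s))
s = y * (X @ w)                                              # y_n * w^t x_n

neg_log_likelihood = -np.mean(np.log(theta(s)))    # -(1/N) sum_n ln theta(y_n w^t x_n)
cross_entropy = np.mean(np.log(1.0 + np.exp(-s)))  # (1/N) sum_n ln(1 + e^{-y_n w^t x_n})
assert np.isclose(neg_log_likelihood, cross_entropy)
```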

  14. How To Minimize $E_{\text{in}}(w)$. Classification: PLA/Pocket (iterative). Regression: pseudoinverse (analytic), from solving $\nabla_w E_{\text{in}}(w) = 0$. Logistic regression: an analytic solution won't work; instead we numerically/iteratively drive $\nabla_w E_{\text{in}}(w) \to 0$.

  15. Finding the Best Weights: Hill Descent. A ball on a complicated hilly terrain rolls down to a local valley; this is called a local minimum. Questions: How do we get to the bottom of the deepest valley? How do we do this when we don't have gravity?

  16. Our $E_{\text{in}}$ Has Only One Valley. [Plot: in-sample error $E_{\text{in}}$ versus weights $w$, a single bowl-shaped valley.] This is because $E_{\text{in}}(w)$ is a convex function of $w$. (So who cares if it looks ugly!)

  17. How to "Roll Down"? Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction $\hat{v}$: $w(t+1) = w(t) + \eta\,\hat{v}$. We get to pick $\hat{v}$: what is the best direction in which to take the step? Pick $\hat{v}$ to make $E_{\text{in}}(w(t+1))$ as small as possible.

  18. The Gradient is the Fastest Way to Roll Down. Approximating the change in $E_{\text{in}}$:
      $\Delta E_{\text{in}} = E_{\text{in}}(w(t+1)) - E_{\text{in}}(w(t))$
      $= E_{\text{in}}(w(t) + \eta\,\hat{v}) - E_{\text{in}}(w(t))$
      $= \eta\,\nabla E_{\text{in}}(w(t))^t\,\hat{v} + O(\eta^2)$  (Taylor approximation)
      $\geq -\eta\,\|\nabla E_{\text{in}}(w(t))\|$ (to first order), with equality attained at $\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}$.
      The best (steepest) direction in which to move is the negative gradient: $\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}$.
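
To illustrate the claim numerically, the sketch below checks that a small step along the normalized negative gradient decreases $E_{\text{in}}$ by approximately $\eta\,\|\nabla E_{\text{in}}(w(t))\|$. It uses the standard gradient of the cross-entropy error, $\nabla E_{\text{in}}(w) = -\frac{1}{N}\sum_n y_n x_n\,\theta(-y_n w^t x_n)$, which is not derived on this slide, together with placeholder data.

```python
import numpy as np

def E_in(w, X, y):
    """Cross-entropy error: (1/N) sum_n ln(1 + exp(-y_n w^t x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def grad_E_in(w, X, y):
    """Standard cross-entropy gradient: -(1/N) sum_n y_n x_n theta(-y_n w^t x_n)."""
    theta = lambda s: 1.0 / (1.0 + np.exp(-s))
    return -(X * (y * theta(-y * (X @ w)))[:, None]).mean(axis=0)

rng = np.random.default_rng(2)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # placeholder data
y = np.sign(X[:, 1] + 0.5 * rng.normal(size=N))             # placeholder labels

w = np.zeros(d + 1)
g = grad_E_in(w, X, y)
v_hat = -g / np.linalg.norm(g)                              # steepest-descent direction

# First-order prediction from the slide: Delta E_in ~ -eta * ||grad E_in(w(t))||
eta = 0.01
delta = E_in(w + eta * v_hat, X, y) - E_in(w, X, y)
assert np.isclose(delta, -eta * np.linalg.norm(g), atol=1e-4)
```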

  19. "Rolling Down" $\equiv$ Iterating the Negative Gradient. $w(0) \to w(1) \to w(2) \to w(3) \to \cdots$, where each arrow is a step along the negative gradient. [Illustration: $\eta = 0.5$; 15 steps.]

  20. The 'Goldilocks' Step Size. [Three plots of in-sample error $E_{\text{in}}$ versus weights $w$:] a small fixed step size is too small ($\eta = 0.1$; 75 steps), a large fixed step size is too large ($\eta = 2$; 10 steps), and a variable $\eta_t$ is just right (10 steps).

  21. Fixed Learning Rate Gradient Descent. Choose $\eta_t = \eta \cdot \|\nabla E_{\text{in}}(w(t))\|$, so that the step $v = -\eta_t \cdot \frac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|} = -\eta \cdot \nabla E_{\text{in}}(w(t))$; since $\|\nabla E_{\text{in}}(w(t))\| \to 0$ when closer to the minimum, the steps automatically shrink (Ex. 3.7 in LFD). The algorithm:
      1: Initialize at step $t = 0$ to $w(0)$.
      2: for $t = 0, 1, 2, \ldots$ do
      3:   Compute the gradient $g_t = \nabla E_{\text{in}}(w(t))$.
      4:   Move in the direction $v_t = -g_t$.
      5:   Update the weights: $w(t+1) = w(t) + \eta\, v_t$.
      6:   Iterate 'until it is time to stop'.
      7: end for
      8: Return the final weights.
      Gradient descent can minimize any smooth function, for example the logistic regression error $E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N}\ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$. A code sketch of this loop follows below.
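
A compact sketch of the fixed-learning-rate loop above, applied to the logistic regression $E_{\text{in}}$. The gradient formula and the synthetic data are assumptions not spelled out on the slide, and the stopping rule ('until it is time to stop') is taken here to be a fixed number of iterations.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) sum_n ln(1 + exp(-y_n w^t x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w, X, y):
    """Gradient of the cross-entropy error (not derived on this slide)."""
    return -(X * (y * theta(-y * (X @ w)))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.1, max_iters=1000):
    w = np.zeros(X.shape[1])                 # 1: initialize w(0)
    for t in range(max_iters):               # 2: for t = 0, 1, 2, ...
        g_t = gradient(w, X, y)              # 3: compute the gradient g_t
        v_t = -g_t                           # 4: move in the direction v_t = -g_t
        w = w + eta * v_t                    # 5: update w(t+1) = w(t) + eta * v_t
    return w                                 # 8: return the final weights

# Usage on hypothetical data (placeholders):
rng = np.random.default_rng(0)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=N))
w = gradient_descent(X, y)
print(cross_entropy_error(w, X, y))
```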
